Krux

Korea's ETRI Embeds 20 Safety Filters Inside Vision Models
Published: February 24, 2026 at 1:03 AM
Updated: February 24, 2026 at 1:03 AM
What happened
ETRI has released Safe LLaVA, a family of six vision-language models that build safety directly into the model architecture rather than relying on post-training fixes. The models embed roughly 20 harmful-content classifiers that automatically detect risks across seven harm areas and refuse unsafe requests with an explanation. Six variants are now on Hugging Face (Safe LLaVA 7B/13B, Safe Qwen-2.5-VL 7B/32B, and SafeGem 12B/27B), along with the HoliSafe-Bench evaluation dataset. Benchmarks show safety rates of 93–97%.
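For teams who want to try the checkpoints, the sketch below shows one plausible way to query a Safe LLaVA model, assuming it follows the standard LLaVA integration in Hugging Face transformers. The repo id and image URL are placeholders for illustration, not confirmed names from the release; check ETRI's Hugging Face organization for the actual model cards.

```python
# Minimal sketch of querying a Safe LLaVA checkpoint, assuming it follows
# the standard LLaVA integration in Hugging Face transformers.
# NOTE: "etri/safe-llava-7b" is a hypothetical repo id, not the confirmed one.
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

MODEL_ID = "etri/safe-llava-7b"  # placeholder; check ETRI's Hugging Face org

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = LlavaForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

# Any test image works; this URL is a placeholder.
image = Image.open(requests.get("https://example.com/test.jpg", stream=True).raw)
prompt = "USER: <image>\nWhat is shown in this picture? ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=200)
print(processor.decode(out[0], skip_special_tokens=True))

# For a request that trips one of the built-in harm classifiers, a
# safety-embedded model is expected to refuse and explain why, rather
# than relying on a separate moderation filter in front of it.
```

The key difference from a conventional deployment is that the refusal logic lives inside the weights, so the same generate call covers both benign and unsafe inputs with no external moderation stage.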
Why it matters
This shifts safety from moderation filters bolted on afterward to native model behavior, giving ML teams safer building blocks from day one.