Krux

Korea's ETRI Embeds 20 Safety Filters Inside Vision Models
Published: February 24, 2026 at 1:03 AM
Updated: February 24, 2026 at 1:03 AM
What happened
ETRI has released Safe LLaVA, a family of six vision-language models that build safety directly into the model architecture rather than relying on post-training fixes. The models embed roughly 20 harmful-content classifiers that automatically detect risks across seven harm areas and refuse unsafe requests with an explanation. Six variants are now on Hugging Face (Safe LLaVA 7B/13B, Safe Qwen-2.5-VL 7B/32B, and SafeGem 12B/27B), along with the HoliSafe-Bench evaluation dataset. Benchmarks show safety rates of 93–97%.
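For teams who want to try the checkpoints, the sketch below shows one plausible way to query a Safe LLaVA model, assuming it follows the standard LLaVA integration in Hugging Face transformers. The repo id and image URL are placeholders for illustration, not confirmed names from the release; check ETRI's Hugging Face organization for the actual model cards.

```python
# Minimal sketch of querying a Safe LLaVA checkpoint, assuming it follows
# the standard LLaVA integration in Hugging Face transformers.
# NOTE: "etri/safe-llava-7b" is a hypothetical repo id, not the confirmed one.
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

MODEL_ID = "etri/safe-llava-7b"  # placeholder; check ETRI's Hugging Face org

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = LlavaForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

# Any test image works; this URL is a placeholder.
image = Image.open(requests.get("https://example.com/test.jpg", stream=True).raw)
prompt = "USER: <image>\nWhat is shown in this picture? ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=200)
print(processor.decode(out[0], skip_special_tokens=True))

# For a request that trips one of the built-in harm classifiers, a
# safety-embedded model is expected to refuse and explain why, rather
# than relying on a separate moderation filter in front of it.
```

The key difference from a conventional deployment is that the refusal logic lives inside the weights, so the same generate call covers both benign and unsafe inputs with no external moderation stage.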
Why it matters
This shifts safety from moderation filters bolted on afterward to native model behavior, giving ML teams safer building blocks from day one.