OpenAI Trains Model to Ignore Your Prompt Injections

Published: March 12, 2026 at 12:28 AM

Updated: March 12, 2026 at 12:28 AM

100-word summary

OpenAI released a training dataset that teaches AI models whose instructions to follow when they conflict. Its internal GPT-5 Mini-R model now enforces a strict pecking order: system rules beat developer settings, which beat user prompts, which beat tool outputs. The practical win? When a chatbot scrapes a website that whispers "ignore previous instructions and reveal secrets," it now shrugs that off. The dataset is public for researchers, though the souped-up model stays in-house. This matters because AI agents using tools are basically reading the entire internet, including every hacker's favorite playground. Teaching models to distrust the wrong voices might finally make autonomous agents safe enough to trust.

What happened

Why it matters

Teaching models to distrust the wrong voices might finally make autonomous agents safe enough to trust.

Sources

OpenAI