Shipping smarter agents with every new model: SafetyKit uses GPT-5 for safer, smarter moderation

Sources: OpenAI, https://openai.com/index/safetykit

TL;DR

  • SafetyKit uses multimodal AI agents powered by GPT‑5 and GPT‑4.1 to detect and act on fraud and prohibited activity across text, images, financial transactions, and product listings ([OpenAI SafetyKit](https://openai.com/index/safetykit)).
  • It reviews 100% of customer content with over 95% accuracy based on SafetyKit’s evals.
  • The platform now spans payments risk, fraud, anti-child exploitation, and anti-money laundering, and serves hundreds of millions of end users.
  • A model-matching approach routes content to the best agent and the optimal OpenAI model for each violation, delivering nuanced enforcement across modalities.
  • SafetyKit now processes over 16 billion tokens daily, up from 200 million six months ago, enabling faster, broader risk coverage.

Context and background

SafetyKit builds multimodal AI agents to help marketplaces, payment platforms, and fintechs detect and act on fraud and prohibited activity across text, images, financial transactions, product listings, and more. Recent breakthroughs in model reasoning and multimodal understanding make this work more effective than ever, setting a new bar for risk, compliance, and safety operations.

SafetyKit's agents use GPT‑5, GPT‑4.1, deep research, and Computer Using Agent (CUA) to review 100% of customer content with over 95% accuracy, based on SafetyKit's evals. They help platforms protect users, prevent fraud, avoid regulatory fines, and enforce complex policies that legacy systems may miss, such as region-specific rules, phone numbers embedded in scam images, or explicit content. Automation also protects human moderators from exposure to offensive material and frees them to handle nuanced policy decisions.

SafetyKit's agents are each built to handle a specific risk category, from scams to illegal products. Every piece of content is routed to the agent best suited for that violation, using the optimal OpenAI model. This model-matching approach lets SafetyKit scale content review across modalities with more nuance and accuracy than legacy solutions can.

The Scam Detection agent, for example, goes beyond scanning text: it analyzes visuals such as QR codes or phone numbers embedded in product images. GPT‑4.1 helps it parse the image, understand the layout, and decide whether the content violates policy. The Policy Disclosure agent checks listings or landing pages for required language, such as legal disclaimers or region-specific compliance warnings. GPT‑4.1 extracts the relevant sections, GPT‑5 evaluates compliance, and the agent flags violations.

"We think of our agents as purpose-built workflows," says Graunke. "Some tasks require deep reasoning, others need multimodal context. OpenAI is the only stack that delivers reliable performance across both."

Policy decisions often hinge on subtle distinctions. Take a marketplace requiring sellers to include a disclaimer for wellness products, with requirements that vary based on product claims and regional rules. Legacy providers use keyword triggers or rigid rulesets, which can miss the deeper judgment calls these decisions require, leading to missed or incorrect enforcement. SafetyKit's Policy Disclosure agent first references policies from SafetyKit's internal library; GPT‑5 then evaluates the content: does it mention treatment or prevention? Is it being sold in a region where disclosure is mandatory? And if so, is the required language actually included in the listing? If anything falls short, GPT‑5 returns a structured output the agent uses to flag the issue.

"The power of GPT‑5 is in how precisely it can reason when grounded in real policy," notes Graunke. "It lets us make accurate, defensible decisions even in the edge cases where other systems fail."

SafetyKit benchmarks each new OpenAI model against its hardest cases, often deploying top performers the same day. Rigorous internal evaluations let the team quickly identify how new models improve performance and integrate them into core infrastructure. When OpenAI o3 launched, SafetyKit used it to boost edge-case performance across key policy areas. GPT‑5 followed, and within days it was deployed across SafetyKit's most demanding agents, improving benchmark scores by more than 10 points on their toughest vision tasks.

"OpenAI moves fast, and we've designed our system to keep up. Every new release gives us an operational edge, unlocking new capabilities and domains we couldn't support before and increasing the coverage and accuracy we deliver to customers," says Graunke.

SafetyKit also feeds improvements back into the ecosystem, sharing eval results, edge-case failures, and policy-specific insights directly with OpenAI to help shape future model performance for safety-critical workloads.

SafetyKit's architecture enforces policy at scale, delivering speed, precision, and comprehensive risk coverage. Behind the scenes, it now handles over 16 billion tokens daily, up from 200 million six months ago, analyzing more content without sacrificing accuracy. Over the same period, SafetyKit has expanded into payments risk, fraud, anti-child exploitation, and anti-money laundering, and new customers have brought hundreds of millions of end users under SafetyKit's protection. This foundation empowers customers to respond swiftly and confidently to emerging risks.

"We've created a loop where every OpenAI release directly strengthens our capabilities," says Graunke. "That's why the system continually improves, always staying ahead of evolving risks."
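SafetyKit has not published its routing code, but the model-matching idea is easy to picture. The minimal Python sketch below shows one way such a router could look; the agent names, risk categories, and model assignments are illustrative assumptions, not SafetyKit's implementation.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical content item; field names are illustrative, not SafetyKit's schema.
@dataclass
class ContentItem:
    text: str
    image_url: str | None = None
    risk_category: str = "unknown"  # e.g., assigned by an upstream classifier

# Each agent pairs a risk category with the model assumed to suit it best.
@dataclass
class Agent:
    name: str
    model: str  # e.g., "gpt-5" for deep policy reasoning, "gpt-4.1" for vision-heavy parsing
    handle: Callable[[ContentItem], dict]

def scam_check(item: ContentItem) -> dict:
    # Stub: a production agent would call the routed model with text + image inputs.
    return {"agent": "scam_detection", "violation": "qr code" in item.text.lower()}

def disclosure_check(item: ContentItem) -> dict:
    return {"agent": "policy_disclosure", "violation": "disclaimer" not in item.text.lower()}

# Model-matching table: route each violation type to the best-suited agent/model.
ROUTES: dict[str, Agent] = {
    "scam": Agent("scam_detection", "gpt-4.1", scam_check),
    "disclosure": Agent("policy_disclosure", "gpt-5", disclosure_check),
}

def route(item: ContentItem) -> dict:
    agent = ROUTES.get(item.risk_category)
    if agent is None:
        return {"agent": None, "violation": False, "note": "no route; escalate to human review"}
    return agent.handle(item)

if __name__ == "__main__":
    listing = ContentItem(text="Scan this QR code to pay off-platform", risk_category="scam")
    print(route(listing))  # {'agent': 'scam_detection', 'violation': True}
```

The handlers are stubs so the sketch stays self-contained; the point is the routing table that binds each risk category to a dedicated workflow and model choice.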

What’s new

  • From prototyping with early vision model previews to scaling with GPT‑5, SafetyKit's multimodal agents have expanded into new domains and improved accuracy. SafetyKit's agents review content across text, images, and financial data to detect and act on prohibited activity.
  • The platform now covers a broader set of risk areas, including payments risk, fraud, anti-child exploitation, and anti-money laundering, reaching hundreds of millions of end users.
  • A model-matching workflow assigns each item to the best agent and model, enabling nuanced, cross-modal enforcement that outpaces legacy systems. The Scam Detection agent, for example, analyzes visuals such as QR codes or phone numbers embedded in images, with GPT‑4.1 assisting image parsing and layout understanding (see the vision-call sketch after this list).
  • Policy decisions are grounded in SafetyKit’s internal policy library, with GPT‑5 providing final compliance evaluation and flagging when language or disclosures are missing or regionally mandated. This grounded reasoning enables precise, defensible enforcement, even in edge cases.
  • Internal benchmarking now drives rapid adoption of top performers after new OpenAI releases, with improvements often available the same day.
  • The system architecture continues to scale, processing an increasing share of content without sacrificing accuracy, while continuously feeding improvements back to OpenAI to shape future model performance for safety-critical workloads.
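The article describes the Scam Detection agent's image analysis but includes no code. Below is a minimal sketch of what a GPT‑4.1 vision check could look like with the OpenAI Python SDK; the prompt wording, function name, and example URL are assumptions for illustration, not SafetyKit's pipeline.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def check_image_for_scam_signals(image_url: str) -> str:
    """Ask a vision-capable model whether a product image embeds scam vectors
    such as QR codes or phone numbers. Prompt wording is illustrative."""
    response = client.chat.completions.create(
        model="gpt-4.1",  # vision-capable; the article pairs GPT-4.1 with image parsing
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": (
                            "Does this product image contain a QR code, phone number, "
                            "or other off-platform contact vector? Answer briefly."
                        ),
                    },
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    )
    return response.choices[0].message.content

print(check_image_for_scam_signals("https://example.com/listing.jpg"))
```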

Why it matters (impact for developers/enterprises)

For developers and enterprises building digital marketplaces, payment platforms, or fintech services, SafetyKit offers a scalable approach to policy enforcement and risk management. By leveraging a unified stack that combines GPT‑5, GPT‑4.1, and CUA, SafetyKit delivers accurate, defensible decisions across modalities and jurisdictions. The ability to review 100% of user content with high accuracy helps platforms protect users, reduce fraud, and mitigate regulatory risks such as region-specific disclosure requirements. Automation also reduces moderator exposure to offensive material and frees human teams to focus on nuanced policy interpretation and complex exceptions.

The system's model-matching capability means platforms can evolve enforcement rules without rebuilding pipelines around rigid keywords or static triggers. Instead, risk categories such as scams or illegal products each have dedicated workflows that bring together multimodal analysis, policy grounding, and model-based reasoning. The result is faster, more comprehensive risk coverage and a clearer path to compliance as rules evolve and new regulatory requirements emerge.

SafetyKit's approach also supports scale. By routing content to the most suitable agent and model, platforms can maintain high accuracy while expanding coverage to new domains, new content types, and large end-user populations. The ongoing collaboration with OpenAI, sharing evals, edge-case data, and policy insights, helps shape improvements to models used for safety-critical workloads, reinforcing a feedback loop that strengthens risk controls over time.

Technical details or Implementation

  • Architecture comprises specialized, risk-focused agents, each handling a defined category (e.g., scams, illegal products) and coordinated by a model-matching layer that selects the best agent and OpenAI model for each violation.
  • The Scam Detection agent extends beyond text by analyzing visuals embedded in product images, such as QR codes or phone numbers. GPT‑4.1 assists image parsing and layout understanding to decide if a policy violation exists.
  • The Policy Disclosure agent first references SafetyKit's internal policy library, then GPT‑5 evaluates whether content complies with requirements such as mandatory language and region-specific disclosures. If gaps exist, GPT‑5 returns a structured output the agent uses to flag the issue (a hedged structured-output sketch follows this list).
  • The system grounds policy decisions in real rules, enabling precise enforcement even when regional or contextual nuances matter. This grounding helps avoid misses or false positives that simpler keyword-based approaches might produce.
  • SafetyKit benchmarks new OpenAI models against the hardest cases in their evaluation suite and tends to deploy top performers in production quickly, sometimes on the same day as model releases.
  • The architecture scales to billions of tokens daily: SafetyKit now analyzes over 16 billion tokens per day, up from 200 million six months earlier, without sacrificing accuracy.
  • The SafetyKit loop also provides a continuous improvement pathway: as OpenAI releases new capabilities, SafetyKit integrates them to extend coverage, accuracy, and speed in safety-critical workloads.
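As a rough illustration of the structured-output step described above, here is a minimal sketch using the OpenAI SDK's JSON-schema response format. The schema fields, prompts, and helper name are hypothetical; SafetyKit's actual schema and policy library are not public.

```python
import json
from openai import OpenAI

client = OpenAI()

# Hypothetical schema for a compliance verdict; SafetyKit's real fields are not public.
VERDICT_SCHEMA = {
    "name": "disclosure_verdict",
    "strict": True,
    "schema": {
        "type": "object",
        "properties": {
            "mentions_treatment_or_prevention": {"type": "boolean"},
            "disclosure_required_in_region": {"type": "boolean"},
            "required_language_present": {"type": "boolean"},
            "violation": {"type": "boolean"},
            "rationale": {"type": "string"},
        },
        "required": [
            "mentions_treatment_or_prevention",
            "disclosure_required_in_region",
            "required_language_present",
            "violation",
            "rationale",
        ],
        "additionalProperties": False,
    },
}

def evaluate_disclosure(listing_text: str, policy_text: str, region: str) -> dict:
    """Ground the model in the relevant policy text, then ask for a structured verdict."""
    response = client.chat.completions.create(
        model="gpt-5",  # the article credits GPT-5 with the compliance-evaluation step
        messages=[
            {"role": "system", "content": f"Policy for region {region}:\n{policy_text}"},
            {"role": "user", "content": f"Evaluate this listing for compliance:\n{listing_text}"},
        ],
        response_format={"type": "json_schema", "json_schema": VERDICT_SCHEMA},
    )
    return json.loads(response.choices[0].message.content)
```

Constraining the model to a fixed schema is what lets a downstream agent act on the verdict mechanically, flagging, blocking, or escalating, without parsing free-form prose.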

Key takeaways

  • SafetyKit combines multimodal agents with GPT‑5 and GPT‑4.1 to enforce policies across text, images, and transactions.
  • It reviews 100% of content with >95% accuracy and has expanded into payments risk, fraud, anti-money laundering (AML), and anti-child exploitation.
  • A model-matching approach assigns each violation to the best agent and model for nuanced enforcement.
  • The platform processes billions of tokens daily, enabling broad coverage without compromising precision.
  • The system maintains a feedback loop with OpenAI to shape future model performance for safety-critical workloads.

FAQ

  • What is SafetyKit?

    A set of multimodal AI agents built to detect and act on fraud and prohibited activity across text, images, financial transactions, and product listings. [OpenAI SafetyKit](https://openai.com/index/safetykit)

  • Which models are used?

    SafetyKit uses GPT‑5 and GPT‑4.1, along with Computer Using Agent (CUA) components, to review content and enforce policy.

  • How much content does it process?

    The system handles over 16 billion tokens daily, up from 200 million six months earlier.

  • How accurate is the review?

    SafetyKit reviews 100% of content with over 95% accuracy based on internal evals.

  • How does it adapt to new regulations?

    A policy-disclosure workflow with an internal policy library and GPT‑5 evaluation supports region-specific and disclosure-compliant enforcement.
