Skip to content

OpenAI and Anthropic share findings from joint safety evaluation

Sources: https://openai.com/index/openai-anthropic-safety-evaluation, OpenAI

TL;DR

  • OpenAI and Anthropic conducted a joint safety evaluation, testing each other’s models.
  • The assessment covered misalignment, instruction following, hallucinations, jailbreaking, and related safety areas.
  • The teams shared findings publicly, underscoring progress and challenges and the value of cross-lab collaboration.
  • The effort aims to inform safety practices for developers and enterprises deploying large-language-models.

Context and background

OpenAI and Anthropic undertook a first-of-its-kind joint safety evaluation, designed to test the behavior of each other’s models in safety-critical scenarios. By coordinating a cross-lab assessment, the teams aimed to explore how models respond to prompts that probe alignment with user intent, adherence to safeguards, and potential exploitation attempts. This kind of collaboration is presented as a way to accelerate learning about model behavior and to surface practical considerations for safety governance. The joint effort reflects a broader industry push toward shared safety practices and transparent disclosure of results that can inform developers, researchers, and enterprises deploying large-language-models. While the exact methods and findings are detailed in the published results, the central idea is to systematically examine how models perform under challenging conditions and where safeguards may need reinforcement.

What’s new

This release marks the first public instance of two leading AI labs sharing findings from a mutual safety evaluation. OpenAI and Anthropic describe progress toward safer model behavior while acknowledging ongoing challenges. The disclosures emphasize the value of cross-lab collaboration to identify blind spots, validate safety assumptions, and drive improvements that can be adopted across the industry. In practical terms, the joint results illustrate how coordinated testing can illuminate how models handle misalignment risks, respect instruction boundaries, resist jailbreaking attempts, and mitigate hallucinations, among other safety-relevant dimensions. The emphasis is on learning and improvement rather than attribution of fault to any single system.

Why it matters (impact for developers/enterprises)

For developers and enterprises, the findings offer a clearer view of safety considerations when deploying large-language-models. Cross-lab collaboration of this kind helps establish more robust safety practices, informs governance and risk management strategies, and supports decisions about model-in-use policies, monitoring, and escalation paths. Transparent safety assessments can also guide the development of tooling and guardrails that reduce the likelihood of unsafe outputs in real-world applications. By sharing progress and challenges, OpenAI and Anthropic underscore that safety is an ongoing, collaborative endeavor. The reported lessons can inspire broader industry standards and encourage organizations to adopt proactive safety reviews as part of their deployment lifecycle.

Technical details or Implementation

The joint evaluation focused on several primary domains:

  • Misalignment: assessing how model behavior aligns with user intent and safety constraints.
  • Instruction following: evaluating adherence to given user instructions while maintaining safeguards.
  • Hallucinations: identifying instances where models produce fabricated or incorrect information.
  • Jailbreaking: examining attempts to bypass safety constraints or extend model capabilities beyond intended safeguards.
  • Other safety-relevant areas: additional dimensions of model reliability and safety surfaced during testing. To summarize how these domains were approached, a compact table below outlines the core focus of each category:
CategoryFocus
MisalignmentSafety alignment with user prompts and safety constraints
Instruction followingAdherence to user instructions while respecting safeguards
HallucinationsOutputs that are fabricated or incorrect
JailbreakingAttempts to bypass safety constraints
OtherAdditional safety evaluation domains

Key takeaways

  • Cross-lab collaboration can accelerate safety improvements and the adoption of best practices.
  • The joint evaluation demonstrates practical testing across multiple safety domains, highlighting both progress and ongoing challenges.
  • Public sharing of findings supports better safety practices for developers and enterprises deploying language models.
  • The effort contributes to a broader conversation about safety standards and governance in AI—encouraging continued joint learning.

FAQ

  • What was evaluated in the joint safety assessment?

    They tested misalignment, instruction following, hallucinations, jailbreaking, and related safety areas in each other’s models.

  • Who conducted this evaluation?

    OpenAI and Anthropic conducted a joint safety evaluation and shared findings publicly.

  • Why is cross-lab collaboration important?

    It helps advance safety practices, surface challenges, and inform governance across labs and deployments.

  • Where can I read the findings?

    Details are available on OpenAI’s site at the linked page: https://openai.com/index/openai-anthropic-safety-evaluation.

References

More news