OpenAI and Anthropic share findings from joint safety evaluation
Sources: https://openai.com/index/openai-anthropic-safety-evaluation, OpenAI
TL;DR
- OpenAI and Anthropic conducted a joint safety evaluation, testing each other’s models.
- The assessment covered misalignment, instruction following, hallucinations, jailbreaking, and related safety areas.
- The teams shared findings publicly, underscoring progress and challenges and the value of cross-lab collaboration.
- The effort aims to inform safety practices for developers and enterprises deploying large-language-models.
Context and background
OpenAI and Anthropic undertook a first-of-its-kind joint safety evaluation, designed to test the behavior of each other’s models in safety-critical scenarios. By coordinating a cross-lab assessment, the teams aimed to explore how models respond to prompts that probe alignment with user intent, adherence to safeguards, and potential exploitation attempts. This kind of collaboration is presented as a way to accelerate learning about model behavior and to surface practical considerations for safety governance. The joint effort reflects a broader industry push toward shared safety practices and transparent disclosure of results that can inform developers, researchers, and enterprises deploying large-language-models. While the exact methods and findings are detailed in the published results, the central idea is to systematically examine how models perform under challenging conditions and where safeguards may need reinforcement.
What’s new
This release marks the first public instance of two leading AI labs sharing findings from a mutual safety evaluation. OpenAI and Anthropic describe progress toward safer model behavior while acknowledging ongoing challenges. The disclosures emphasize the value of cross-lab collaboration to identify blind spots, validate safety assumptions, and drive improvements that can be adopted across the industry. In practical terms, the joint results illustrate how coordinated testing can illuminate how models handle misalignment risks, respect instruction boundaries, resist jailbreaking attempts, and mitigate hallucinations, among other safety-relevant dimensions. The emphasis is on learning and improvement rather than attribution of fault to any single system.
Why it matters (impact for developers/enterprises)
For developers and enterprises, the findings offer a clearer view of safety considerations when deploying large-language-models. Cross-lab collaboration of this kind helps establish more robust safety practices, informs governance and risk management strategies, and supports decisions about model-in-use policies, monitoring, and escalation paths. Transparent safety assessments can also guide the development of tooling and guardrails that reduce the likelihood of unsafe outputs in real-world applications. By sharing progress and challenges, OpenAI and Anthropic underscore that safety is an ongoing, collaborative endeavor. The reported lessons can inspire broader industry standards and encourage organizations to adopt proactive safety reviews as part of their deployment lifecycle.
Technical details or Implementation
The joint evaluation focused on several primary domains:
- Misalignment: assessing how model behavior aligns with user intent and safety constraints.
- Instruction following: evaluating adherence to given user instructions while maintaining safeguards.
- Hallucinations: identifying instances where models produce fabricated or incorrect information.
- Jailbreaking: examining attempts to bypass safety constraints or extend model capabilities beyond intended safeguards.
- Other safety-relevant areas: additional dimensions of model reliability and safety surfaced during testing. To summarize how these domains were approached, a compact table below outlines the core focus of each category:
| Category | Focus |
|---|---|
| Misalignment | Safety alignment with user prompts and safety constraints |
| Instruction following | Adherence to user instructions while respecting safeguards |
| Hallucinations | Outputs that are fabricated or incorrect |
| Jailbreaking | Attempts to bypass safety constraints |
| Other | Additional safety evaluation domains |
Key takeaways
- Cross-lab collaboration can accelerate safety improvements and the adoption of best practices.
- The joint evaluation demonstrates practical testing across multiple safety domains, highlighting both progress and ongoing challenges.
- Public sharing of findings supports better safety practices for developers and enterprises deploying language models.
- The effort contributes to a broader conversation about safety standards and governance in AI—encouraging continued joint learning.
FAQ
-
What was evaluated in the joint safety assessment?
They tested misalignment, instruction following, hallucinations, jailbreaking, and related safety areas in each other’s models.
-
Who conducted this evaluation?
OpenAI and Anthropic conducted a joint safety evaluation and shared findings publicly.
-
Why is cross-lab collaboration important?
It helps advance safety practices, surface challenges, and inform governance across labs and deployments.
-
Where can I read the findings?
Details are available on OpenAI’s site at the linked page: https://openai.com/index/openai-anthropic-safety-evaluation.
References
More news
First look at the Google Home app powered by Gemini
The Verge reports Google is updating the Google Home app to bring Gemini features, including an Ask Home search bar, a redesigned UI, and Gemini-driven controls for the home.
Shadow Leak shows how ChatGPT agents can exfiltrate Gmail data via prompt injection
Security researchers demonstrated a prompt-injection attack called Shadow Leak that leveraged ChatGPT’s Deep Research to covertly extract data from a Gmail inbox. OpenAI patched the flaw; the case highlights risks of agentic AI.
Predict Extreme Weather in Minutes Without a Supercomputer: Huge Ensembles (HENS)
NVIDIA and Berkeley Lab unveil Huge Ensembles (HENS), an open-source AI tool that forecasts low-likelihood, high-impact weather events using 27,000 years of data, with ready-to-run options.
Scaleway Joins Hugging Face Inference Providers for Serverless, Low-Latency Inference
Scaleway is now a supported Inference Provider on the Hugging Face Hub, enabling serverless inference directly on model pages with JS and Python SDKs. Access popular open-weight models and enjoy scalable, low-latency AI workflows.
Google expands Gemini in Chrome with cross-platform rollout and no membership fee
Gemini AI in Chrome gains access to tabs, history, and Google properties, rolling out to Mac and Windows in the US without a fee, and enabling task automation and Workspace integrations.
Kaggle Grandmasters Playbook: 7 Battle-Tested Techniques for Tabular Data Modeling
A detailed look at seven battle-tested techniques used by Kaggle Grandmasters to solve large tabular datasets fast with GPU acceleration, from diversified baselines to advanced ensembling and pseudo-labeling.