Estimating Worst-Case Frontier Risks of Open-Weight LLMs
Source: https://openai.com/index/estimating-worst-case-frontier-risks-of-open-weight-llms (openai.com)
TL;DR
- The paper studies the worst-case frontier risks of releasing gpt-oss, an open-weight LLM.
- It introduces Malicious Fine-Tuning (MFT) to elicit maximum capabilities in two domains: biology and cybersecurity.
- To maximize biological risk (biorisk), the authors curate threat-creation tasks and train gpt-oss in a reinforcement learning (RL) environment with web browsing.
- To maximize cybersecurity risk, they train gpt-oss in an agentic coding environment to solve capture-the-flag (CTF) challenges.
- MFT models are compared against open- and closed-weight LLMs on frontier risk evaluations. Against frontier closed-weight models, MFT gpt-oss underperforms OpenAI o3, a model described as below Preparedness High for biorisk and cybersecurity; against open-weight models, gpt-oss may marginally increase biological capabilities but does not substantially advance the frontier.
- Taken together, these results contributed to the decision to release the model, and the authors hope the MFT approach can guide harm estimation for future open-weight releases.
Context and background
This analysis examines the risks associated with releasing open-weight large language models (LLMs) at the frontier of capability. The study introduces Malicious Fine-Tuning (MFT) as a methodology to push an open-weight model toward higher capabilities in selected high-risk domains. The two domains explored are biology and cybersecurity. In biology, the approach focuses on tasks related to threat creation, with the model trained in an RL setting that includes web browsing. In cybersecurity, the model is trained in an agentic coding environment designed to tackle Capture-the-Flag (CTF) style challenges. The work then benchmarks these MFT variants against both open-weight and closed-weight LLMs using frontier-risk evaluation metrics. The authors explicitly compare MFT gpt-oss with a frontier closed-weight model (OpenAI o3) and with open-weight baselines to gauge relative risk profiles.
What’s new
The core novelty is the Malicious Fine-Tuning (MFT) framework applied to an open-weight LLM, targeting two high-risk domains with distinct training regimes:
- Biology: Task curation around threat creation, trained in an RL environment with integrated web browsing to probe advanced capabilities.
- Cybersecurity: Training in an agentic coding environment to solve CTF-style problems (a minimal reward-signal sketch follows below).

The study then situates MFT gpt-oss within a landscape of frontier models by comparing it to both open-weight and closed-weight baselines on frontier-risk evaluations. The results indicate that, relative to closed-weight frontier models, MFT gpt-oss underperforms OpenAI o3, which is described as below Preparedness High in biorisk and cybersecurity. Relative to open-weight models, gpt-oss shows only a marginal increase in biological capabilities and does not meaningfully advance the frontier.
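The paper does not publish its training environments, so the following is only a rough sketch, under assumed conventions, of what a flag-capture reward in an agentic CTF fine-tuning loop could look like. All names here (CTFTask, flag_reward, rollout) are hypothetical placeholders, not the authors' code.

```python
# Hypothetical sketch of a binary flag-capture reward for agentic CTF
# fine-tuning; illustrative placeholders only, not the paper's code.
from dataclasses import dataclass

@dataclass
class CTFTask:
    challenge_id: str
    prompt: str  # challenge description shown to the agent
    flag: str    # secret string the agent must recover

def flag_reward(task: CTFTask, transcript: str) -> float:
    """Return 1.0 if the hidden flag appears anywhere in the agent's
    transcript, else 0.0. Real environments typically also grade
    partial progress, tool-use validity, and answer formatting."""
    return 1.0 if task.flag in transcript else 0.0

def rollout(prompt: str) -> str:
    """Toy stand-in for an agentic rollout; a real one would let the
    model run commands in a sandbox and read their output."""
    return "ls; strings ./challenge | grep flag"  # no flag found here

task = CTFTask("demo-01", "Recover the flag from the provided binary.", "flag{example}")
print(flag_reward(task, rollout(task.prompt)))  # -> 0.0
```

A binary flag-in-transcript check is a natural RL reward here because it is machine-checkable; richer graders trade that objectivity for a denser training signal.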
Why it matters (impact for developers/enterprises)
The work highlights important considerations for organizations weighing open-weight releases against potential harms. By framing a concrete methodology to estimate frontier-risk harm through MFT, the authors aim to provide a useful guide for assessing potential misuse in future open-weight releases. The findings contribute to the ongoing discussion about how to quantify and compare risk across model families, particularly when evaluating open-weight options that can be fine-tuned by third parties.
Technical details or Implementation
- Malicious Fine-Tuning (MFT): An adversarial fine-tuning procedure in which an open-weight model is steered toward higher capabilities in two risk domains. In biology, the focus is on threat-creation tasks; in cybersecurity, the model is pushed to excel in CTF-like coding tasks.
- Domains and environments: Biological risk is approached via RL with web browsing to curate and optimize threat-related tasks. Cybersecurity risk is approached via an agentic coding environment designed to solve CTF challenges.
- Evaluation: MFT gpt-oss is evaluated against frontier closed-weight models and open-weight baselines on frontier risk metrics. The comparison shows MFT gpt-oss underperforms a frontier closed-weight model (OpenAI o3) but may marginally lift biological capabilities versus open-weight baselines (a toy comparison sketch follows after this list).
- Release decision: The authors note that, taken together, these results contributed to the decision to release gpt-oss, and they frame the MFT approach as a useful guide for estimating harm from future open-weight releases.
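To make the shape of that comparison concrete, here is a toy harness, not the authors' evaluation code, that averages per-item grades for each model on a shared risk eval set and checks them against the o3 reference. All grades are invented placeholders, not the paper's numbers.

```python
# Toy comparison sketch: mean per-item grades on a shared frontier-risk
# eval set, measured against the OpenAI o3 reference. All numbers are
# invented placeholders, not results from the paper.
from statistics import mean

grades = {
    "mft-gpt-oss":      [0.40, 0.50, 0.30, 0.60],
    "openai-o3":        [0.70, 0.80, 0.60, 0.70],
    "open-weight-base": [0.35, 0.50, 0.30, 0.55],
}

reference = mean(grades["openai-o3"])  # frontier closed-weight reference
for model in ("mft-gpt-oss", "open-weight-base"):
    score = mean(grades[model])
    verdict = "below" if score < reference else "at/above"
    print(f"{model}: {score:.2f} ({verdict} the o3 reference of {reference:.2f})")
```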
Tables: model comparisons at a glance
| Model category | Example | Frontier risk evaluation note |
|---|---|---|
| MFT gpt-oss | Maliciously fine-tuned open-weight model | Underperforms OpenAI o3 on frontier risk metrics |
| OpenAI o3 | Frontier-level closed-weight model | Below Preparedness High for biorisk and cybersecurity |
| Open-weight baselines | Baseline open-weight models | gpt-oss may marginally increase biological capabilities but does not substantially advance the frontier |
Key takeaways
- Malicious Fine-Tuning (MFT) is a framework to test worst-case frontier risks by emphasizing targeted domain capabilities.
- In practice, MFT gpt-oss does not match the frontier strength of a closed-weight model like OpenAI o3, though it may marginally exceed open-weight baselines on some biological capability metrics.
- The results informed the decision to release gpt-oss and provide a reference point for estimating harm from future open-weight releases.
- The study underscores the value of structured frontier-risk evaluations when considering open-weight release strategies, especially in domains with dual-use implications.
FAQ
- What does MFT stand for and aim to do?
  Malicious Fine-Tuning; it seeks to elicit maximum capabilities from a model by fine-tuning it for high-risk domains such as biology and cybersecurity.
- Which domains were used to test MFT on gpt-oss?
  Biology (threat-creation tasks with RL and web browsing) and cybersecurity (an agentic coding environment for CTF challenges).
- How did MFT gpt-oss perform compared with other models?
  It underperformed OpenAI o3 among frontier closed-weight models and, compared to open-weight models, showed only marginal biological gains without substantially advancing the frontier.
- What is the practical takeaway for future model releases?
  The results provide a reference framework for estimating harm from open-weight releases and informed the decision to release gpt-oss.