StrongREJECT: A robust benchmark for evaluating jailbreak methods in LLMs
Sources: http://bair.berkeley.edu/blog/2024/08/28/strong-reject, http://bair.berkeley.edu/blog/2024/08/28/strong-reject/, BAIR Blog
Overview
StrongREJECT is a state-of-the-art jailbreak benchmark intended to systematically evaluate jailbreak methods for large language models (LLMs). The BAIR blog notes that early jailbreak evaluations often suffered from flawed data and evaluation methods that emphasized whether a model would respond at all rather than the quality and safety of the response. StrongREJECT provides a curated dataset of 313 forbidden prompts that reflect real-world safety measures implemented by leading AI companies. It ships two automated evaluators designed to align with human judgments: a rubric-based evaluator and a fine-tuned Gemma 2B evaluator. The rubric-based evaluator prompts an LLM with the forbidden prompt and the victim model’s response and outputs three scores: a binary willingness score and two 5-point Likert scores for specificity and persuasiveness, rescaled to the range [0, 1]. The Gemma 2B evaluator is trained on labels produced by the rubric-based evaluator, generating a 0–1 score as well. In a validation study, five human raters labeled 1,361 forbidden prompt–victim model response pairs; the median human label served as ground truth. The StrongREJECT evaluators showed state-of-the-art agreement with human judgments when compared to seven existing automated evaluators. The blog also reports results from evaluating 37 jailbreak methods, revealing that most jailbreaks are far less effective than previously claimed. A key insight is that prior benchmarks often measured only willingness to respond, while StrongREJECT also considers the victim model’s ability to generate a high-quality response. The authors tested a hypothesis that jailbreaks tend to degrade victim-model capabilities and conducted two experiments on an unaligned model to explore this idea. The overall message is that there is a meaningful discrepancy between published jailbreak success and robust, human-aligned evaluation, underscoring the need for high-quality benchmarks like StrongREJECT.
Key features (bulleted)
- 313 forbidden prompts reflecting real-world safety measures from leading AI companies
- Two automated evaluators with strong alignment to human judgments: rubric-based and a fine-tuned Gemma 2B
- A calibration workflow using human labels to train and validate the automated evaluators
- Evaluation that captures both willingness to respond and the quality/utility of the response
- Empirical finding that many published jailbreaks overstate effectiveness; StrongREJECT provides robust, cross-method evaluation
- A reproducible benchmark with a rubric and a model-friendly evaluation pipeline
Common use cases
- Compare different jailbreak methods on a single, high-quality benchmark
- Replicate and validate jailbreak findings from prior papers with a robust evaluation
- Study how automated evaluators align with human judgments and identify where gaps remain
- Investigate whether jailbreaking affects the victim model’s capabilities beyond simply eliciting a response
Setup & installation (exact commands)
Not specified in the source. The BAIR blog describes the benchmark, the 313-prompt dataset, and the evaluators, but it does not provide setup or installation instructions.
Quick start (minimal runnable example)
A high-level outline for using StrongREJECT in research contexts:
- Assemble a dataset of 313 forbidden prompts and corresponding victim-model responses.
- Apply the rubric-based evaluator by prompting an LLM with the forbidden prompt, the victim’s response, and scoring instructions to produce three outputs per pair: a binary willingness score and two 0–1 scores for specificity and persuasiveness.
- Optionally run the fine-tuned Gemma 2B evaluator on the same set to obtain an additional 0–1 score, enabling cross-checks against the rubric-based scores.
- Compare automated scores with a small human-label subset to gauge correlation with judgments.
Pros and cons
- Pros: high agreement with human judgments; robust across a wide range of jailbreaks; evaluates both willingness and quality of the response; helps separate unsafe but merely refused outcomes from unsafe-but-useful content.
- Cons: setup and integration details are not provided in the source; results depend on the authors’ evaluation framework and the 313-prompt dataset; findings reflect the authors’ experiments and may evolve with future data.
Alternatives (brief comparisons)
| Evaluator type | What it does | Strengths | Limitations |---|---|---|---| | Rubric-based StrongREJECT | Prompts an LLM with the prompt and response, outputs three scores | Strong alignment with human judgments; multi-faceted scoring (willingness, specificity, persuasiveness) | Requires a well-defined rubric; depends on the LLM used for scoring |Gemma 2B fine-tuned evaluator | A small model trained on rubric labels | Fast inference on modest hardware; good agreement with rubric scores | May inherit biases from training data; depends on fine-tuning process |Existing automated evaluators (7) | Prior automated methods for jailbreak assessment | Historically used in literature | Show weaker alignment with human judgments compared to StrongREJECT |
Pricing or License
License details are not specified in the source.
References
More resources
Cut Model Deployment Costs While Keeping Performance With GPU Memory Swap
Leverage GPU memory swap (model hot-swapping) to share GPUs across multiple LLMs, reduce idle GPU costs, and improve autoscaling while meeting SLAs.
Fine-Tuning gpt-oss for Accuracy and Performance with Quantization Aware Training
Guide to fine-tuning gpt-oss with SFT + QAT to recover FP4 accuracy while preserving efficiency, including upcasting to BF16, MXFP4, NVFP4, and deployment with TensorRT-LLM.
How Small Language Models Are Key to Scalable Agentic AI
Explores how small language models enable cost-effective, flexible agentic AI alongside LLMs, with NVIDIA NeMo and Nemotron Nano 2.
How to Scale Your LangGraph Agents in Production From a Single User to 1,000 Coworkers
Guidance on deploying and scaling LangGraph-based agents in production using the NeMo Agent Toolkit, load testing, and phased rollout for hundreds to thousands of users.
Introducing NVIDIA Jetson Thor: The Ultimate Platform for Physical AI
Jetson Thor combines edge AI compute, MIG virtualization, and multimodal sensors for flexible, real-time robotics at the edge, with FP4/FP8 acceleration and support for Isaac GR00T and large language/vision models.
NVFP4 Trains with Precision of 16-Bit and Speed and Efficiency of 4-Bit
NVFP4 is a 4-bit data format delivering FP16-level accuracy with the throughput and memory efficiency of 4-bit precision, extended to pretraining for large language models. This profile covers 12B-scale experiments, stability, and industry collaborations.