Benchmark

Items tagged with “Benchmark”.

Sep 16, 2025 developer.nvidia.com

Reducing Cold Start Latency for LLM Inference with NVIDIA Run:ai Model Streamer

A detailed look at how NVIDIA Run:ai Model Streamer lowers cold-start times for LLM inference by streaming weights into GPU memory, with benchmarks across GP3, IO2, and S3 storage.

Nvidia LLM Inference

Sep 15, 2025 aws.amazon.com

How msg enhanced HR workforce transformation with Amazon Bedrock and msg.ProfileMap

This post explains how msg automated data harmonization for msg.ProfileMap using Amazon Bedrock to power LLM-driven data enrichment, boosting HR concept matching accuracy, reducing manual workload, and aligning with EU AI Act and GDPR.

Amazon LLM Inference

Sep 11, 2025 developer.nvidia.com

How Quantization Aware Training Enables Low-Precision Accuracy Recovery

Explores quantization aware training (QAT) and distillation (QAD) as methods to recover accuracy in low-precision models, leveraging NVIDIA's TensorRT Model Optimizer and FP8/NVFP4/MXFP4 formats.

Nvidia Inference Quantization

Sep 10, 2025 developer.nvidia.com

Accelerate Protein Structure Inference Over 100x with NVIDIA RTX PRO 6000 Blackwell Server Edition

NVIDIA’s RTX PRO 6000 Blackwell Server Edition dramatically speeds protein-structure inference, enabling end-to-end GPU-resident workflows with OpenFold and TensorRT—achieving up to 138x faster folding than AlphaFold2.

Nvidia Inference Benchmark

Sep 05, 2025 openai.com

Why language models hallucinate—and how OpenAI is changing evaluations to boost reliability

OpenAI explains that hallucinations in language models stem from evaluation incentives that favor guessing over uncertainty. The article outlines how updated scoring and uncertainty-focused benchmarks can reduce confident errors and improve reliability.

Openai Benchmark

Aug 29, 2025 developer.nvidia.com

Fine-Tuning gpt-oss for Accuracy and Performance with Quantization Aware Training

Guide to fine-tuning gpt-oss with SFT + QAT to recover FP4 accuracy while preserving efficiency, including upcasting to BF16, MXFP4, NVFP4, and deployment with TensorRT-LLM.

Nvidia LLM Inference

Aug 22, 2025 machinelearning.apple.com

Checklist-Based Feedback Outperforms Reward Models for Aligning Language Models

A new RL approach uses instruction-derived checklists to guide alignment, outperforming fixed-criteria reward models across multiple benchmarks on Qwen2.5-7B-Instruct, presented at ICLR 2025.

Apple RL Benchmark

Aug 22, 2025 machinelearning.apple.com

SlowFast-LLaVA-1.5: Token-Efficient Video LLMs for Long-Form Understanding

Apple ML Research introduces SlowFast-LLaVA-1.5 (SF-LLaVA-1.5), a family of token-efficient video LLMs designed for long-form video understanding. It leverages SlowFast two-streams and public data to achieve state-of-the-art results at 1B–7B scales, with mobile-friendly implications.

Apple LLM Benchmark

A photo illustration of Daivd Luan, Amazon’s head of AGI Labs.

Aug 21, 2025 theverge.com

Amazon bets on AI agents to win the race, says AGI Labs chief David Luan

David Luan, head of Amazon's AGI Lab, argues that solving AI agents is the next major frontier, outlining a factory-like approach to building smarter models and stressing real-world task completion beyond chat.

Theverge Benchmark Open Source

Aug 19, 2025 aws.amazon.com

Benchmarking document information localization with Amazon Nova Pro on Bedrock

A benchmarking study shows how Amazon Nova Pro on Bedrock enables high-accuracy document field localization using multimodal models, with FATURA benchmarking and practical implementation guidance.

Amazon Benchmark

Aug 18, 2025 machinelearning.apple.com

Investigating Intersectional Bias in Large Language Models via Coreference Confidence Disparities

An in-depth look at intersectional bias in LLMs through a new benchmark and a confidence-based fairness metric, revealing reliability gaps in decision-support scenarios.

Apple LLM Benchmark

Aug 13, 2025 developer.nvidia.com

Scaling LLM Reinforcement Learning with ProRL v2: Prolonged Training for Continuous Improvement

NVIDIA Research introduces ProRL v2, the latest evolution of Prolonged Reinforcement Learning for LLMs. It explores thousands of extra RL steps, new stabilization techniques, and broad benchmarking to push sustained improvements beyond traditional RL schedules.

Nvidia LLM RL

Aug 13, 2025 developer.nvidia.com

Streamlining Quantum Error Correction and Application Development with CUDA-QX 0.4

CUDA-QX 0.4 advances quantum error correction workflows with automated detector error-model generation, a tensor-network decoder, improved BP+OSD decoding, and a Generative Quantum Eigensolver (GQE) in the Solvers library, accelerating end-to-end QEC development.

Nvidia Benchmark Research

Aug 12, 2025 machinelearning.apple.com

ICR2: Benchmarking In-Context Retrieval and Reasoning for Long-Context Language Models

A deep dive into In-Context Retrieval and Reasoning (ICR2) for long-context LLMs, including benchmarks, methods, and implications for retrieval-augmented generation (RAG).

Apple LLM RAG

Aug 12, 2025 huggingface.co

TextQuests: Evaluating LLMs in Text-Based Adventure Games

TextQuests is a benchmark testing LLM agents in 25 classic text-based Infocom games, emphasizing long-context reasoning and autonomous exploration.

Hugging Face LLM Inference

Aug 11, 2025 developer.nvidia.com

Maximize Robotics Performance with Post-Training NVIDIA Cosmos Reason

NVIDIA Cosmos Reason is an open, fully customizable reasoning vision-language model for physical AI and robotics. It enables step-by-step multimodal reasoning and boosts robotics performance through post-training refinements.

Nvidia LLM Robotics

Aug 07, 2025 openai.com

Introducing GPT-5: OpenAI’s Unified Thinking AI with Real-Time Routing

OpenAI unveils GPT-5, its most capable AI system yet, featuring unified thinking, a real-time router, reduced hallucinations, and strong performance in coding, writing, health, and more.

Openai Benchmark Open Source

Aug 07, 2025 openai.com

OpenAI Launches GPT-5 for Developers

OpenAI releases GPT-5 in the API with enhanced reasoning, new developer controls, and three serviceable sizes to optimize coding, agentic tasks, and tool use.

Openai Benchmark Open Source

Aug 05, 2025 openai.com

OpenAI launches gpt-oss-120b and gpt-oss-20b under Apache 2.0 license

OpenAI unveils gpt-oss-120b and gpt-oss-20b—two open-weight LLMs designed for strong real-world performance at low cost. Licensed under Apache 2.0, they emphasize reasoning, tool use, and efficient on-device deployment across consumer hardware.

Openai LLM Benchmark

Jul 23, 2025 huggingface.co

TimeScope Benchmark: How Long Can Vision-Language Models Understand Long Videos?

Open-source TimeScope tests long-video understanding by inserting short needle clips into base videos (1 minute to 8 hours), evaluating retrieval, synthesis, localization, and motion analysis.

Hugging Face Benchmark

Jul 23, 2025 huggingface.co

TimeScope: Benchmarking Long-Video Understanding in Vision-Language Models

TimeScope introduces a new open-source benchmark to measure how vision-language models process long videos by inserting short needles and evaluating retrieval, synthesis, localization, and motion analysis.

Hugging Face Benchmark Open Source

Jul 17, 2025 huggingface.co

Back to The Future: Evaluating AI Agents on Predicting Future Events

A deep dive into FutureBench, a benchmark that tests AI agents on forecasting future events using real-world data sources, with a focus on reasoning under uncertainty and verifiable outcomes.

Hugging Face Benchmark

Jul 17, 2025 huggingface.co

Consilium: When Multiple LLMs Collaborate to Reach Consensus

A deep dive into Consilium, the multi-LLM platform that enables models to discuss, debate, and reach consensus via MCP servers and a visual Gradio roundtable.

Hugging Face LLM Benchmark

Jul 16, 2025 huggingface.co

Seq vs Seq: Ettin — Paired Encoders and Decoders Redefine Open-Data LLM Benchmarks

Ettin introduces the first state-of-the-art paired encoder-only and decoder-only models trained with identical data and recipes, measuring apples-to-apples performance across tasks and scales.

Hugging Face LLM Benchmark

Sep 09, 2024 thegradient.pub

What’s Missing From LLM Chatbots: A Sense of Purpose in Dialogue

Explores why purposeful, multi round dialogue matters for LLM chatbots beyond one shot prompts, and outlines training, evaluation, and implementation challenges for engineers and enterprises.

Thegradient LLM Benchmark

Aug 28, 2024 bair.berkeley.edu

How StrongREJECT Improves Jailbreak Evaluation for Frontier LLMs

StrongREJECT advances jailbreak evaluation by pairing a high-quality forbidden-prompt dataset with automated evaluators aligned to human judgments, delivering more reliable measurements of jailbreak effectiveness against frontier LLMs.

Berkeley LLM Benchmark

Aug 28, 2024 bair.berkeley.edu

StrongREJECT: A robust benchmark for evaluating jailbreak methods in LLMs

Overview of a high-quality jailbreak benchmark with dual automated evaluators, a 313-prompt dataset, and findings that many jailbreaks underperform claims from earlier work.

Berkeley LLM Benchmark

Jul 20, 2024 bair.berkeley.edu

Visual Haystacks benchmark exposes limits of multi-image reasoning in LMMs

A new MIQA benchmark tests Large Multimodal Models on visual retrieval and reasoning across 1–10K images, revealing key limitations and introducing MIRAGE, a single-stage approach to scale LMMs.

Berkeley Benchmark Open Source

Jul 20, 2024 bair.berkeley.edu

Visual Haystacks (VHs): Benchmark for Visual Multi-Image Reasoning

Benchmark for long-context visual reasoning across large, uncorrelated image sets; introduces MIRAGE to extend LMMs beyond single-image VQA.

Berkeley Benchmark Open Source

Apr 08, 2024 thegradient.pub

A Brief Overview of Gender Bias in AI: Research, Findings, and Mitigations

Curated overview of key studies showing how AI systems reproduce and amplify gender bias, with concrete measures, benchmarks, and mitigations across embeddings, vision, NLP, and generative models.

Thegradient LLM NLP

Apr 08, 2024 thegradient.pub

A Resource Overview: Measuring and Mitigating Gender Bias in AI

Survey of key work measuring gender bias in AI, across word embeddings, coreference, facial recognition, QA benchmarks, and image generation; discusses mitigation, gaps, and the need for robust auditing.

Thegradient LLM Benchmark