TextQuests: Evaluating LLMs on Classic Text-Based Video Games
TL;DR
- TextQuests is a new benchmark built on 25 classic Infocom interactive fiction games to test LLMs as autonomous agents in long, exploratory environments.
- Each model is evaluated twice (With Clues and No Clues), for up to 500 steps with the full game history preserved; metrics include Game Progress and Harm.
- Long-context challenges are prominent: context windows can exceed 100K tokens, and current models hallucinate, repeat actions, and struggle with spatial reasoning.
Context and background
The recent rapid advancement of large language models (LLMs) has produced strong results on many static knowledge benchmarks, such as MMLU and GPQA, and shown progress on expert evaluations like HLE. However, success on static, knowledge-based tasks does not necessarily translate to strong performance in dynamic, interactive environments where agents must act, plan, and learn over long sessions. There are two broad avenues for evaluating autonomous agents: real-world or tool-enabled environments that focus on specific skills, or simulated open-world environments that demand sustained, self-directed reasoning. The latter better captures an agent’s ability to operate autonomously in exploratory settings and is easier to evaluate reproducibly. Recent interest in this direction includes benchmarks and systems such as Balrog, ARC-AGI, and demonstrations of models like Claude and Gemini playing Pokémon. In that vein, Hugging Face introduces TextQuests as a targeted testbed for the reasoning backbone of agentic LLMs. The benchmark is described in the Hugging Face blog post: TextQuests.
What’s new
TextQuests uses 25 classic Infocom interactive fiction games — text-based video games that historically could take human players more than 30 hours and hundreds of precise actions to solve. These games provide a compact, reproducible environment that requires:
- Long-context reasoning: agents must maintain and act on a long and growing history of actions and observations.
- Learning through exploration: agents must improve through trial and error, interpreting failures and incrementally advancing a plan.
For evaluation, each model receives two runs: one with access to the game’s official hints (“With Clues”) and one without (“No Clues”). Each run is capped at 500 steps and ends early if the agent completes the game. Importantly, the full game history is preserved throughout a run without truncation; modern LLM inference optimizations like prompt caching make this long-context evaluation computationally feasible.
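To make the setup concrete, here is a minimal sketch of such an evaluation loop under the constraints above (500-step cap, early stop on completion, untruncated history). The object names and methods (`env.reset`, `env.step`, `env.progress`, `agent.act`) are illustrative assumptions, not the benchmark's actual API.

```python
from dataclasses import dataclass, field

MAX_STEPS = 500  # per-run cap described above


@dataclass
class Transcript:
    """Full, untruncated game history: alternating observations and actions."""
    lines: list[str] = field(default_factory=list)

    def add(self, role: str, text: str) -> None:
        self.lines.append(f"{role}: {text}")

    def as_prompt(self) -> str:
        return "\n".join(self.lines)


def run_episode(env, agent, with_clues: bool = False) -> float:
    """Run one evaluation run; `env` and `agent` are hypothetical stand-ins."""
    transcript = Transcript()
    transcript.add("observation", env.reset(clues=with_clues))

    for _ in range(MAX_STEPS):
        # The agent always sees the entire transcript; prompt caching on the
        # inference side keeps re-reading the ever-growing prefix affordable.
        action = agent.act(transcript.as_prompt())
        observation, done = env.step(action)
        transcript.add("action", action)
        transcript.add("observation", observation)
        if done:  # stop early if the game is completed
            break

    return env.progress()  # e.g., fraction of labeled checkpoints reached
```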
Why it matters (impact for developers/enterprises)
TextQuests probes capabilities that matter for real-world agentic systems: the ability to plan across long sequences, adapt by learning from experience, and operate efficiently at test time.
- For developers of autonomous assistants and agents, TextQuests highlights where model improvements are required to support multi-step, exploratory workflows that unfold over many interactions.
- For enterprises evaluating LLMs for agentic use cases, the benchmark surfaces trade-offs between performance and inference cost: models that spend more compute at test time tend to perform better, but gains taper past a certain budget.
- For teams focused on safety and alignment, the Harm metric (see below) gives a simple, averaged signal on agents’ tendency to perform in-game actions classified as harmful, showing how behavior-oriented evaluations can be integrated into benchmarking pipelines.
Technical details and implementation
Evaluation design and core metrics:
| Aspect | Specification |
|---|---|
| Games | 25 classic Infocom interactive fiction titles |
| Runs per model | Two: With Clues and No Clues |
| Max steps | 500 steps per run (stop early if game is completed) |
| History policy | Full game history retained without truncation |
| Long-context scale | Context windows can exceed 100K tokens |
| Metrics | Game Progress; Harm |
Game Progress is computed from labeled checkpoints that represent necessary objectives on the path to finishing a game. Harm is measured by tracking in-game actions the benchmark classifies as harmful to some degree, then averaging that score across games to produce a model-level Harm signal.
Long-context evaluation is made practical by prompt caching and other inference-framework optimizations, so maintaining an ever-growing history remains computationally feasible. During runs, agents receive no external tools; the benchmark is designed to test the LLM itself as the reasoning backbone of an agent system.
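As a rough illustration of how these two metrics could be computed, the snippet below assumes a per-game set of labeled checkpoints and per-game harm scores; the checkpoint names and numbers are invented for the example and do not reflect the benchmark's actual annotations.

```python
def game_progress(reached: set[str], checkpoints: list[str]) -> float:
    """Fraction of labeled, necessary objectives reached in a single game."""
    return sum(cp in reached for cp in checkpoints) / len(checkpoints)


def harm_score(per_game_harm: list[float]) -> float:
    """Model-level Harm: harmful-action scores averaged across games."""
    return sum(per_game_harm) / len(per_game_harm)


# Illustrative values only (hypothetical checkpoint labels and harm counts).
progress = game_progress(
    reached={"open-mailbox", "enter-house"},
    checkpoints=["open-mailbox", "enter-house", "find-lantern", "defeat-troll"],
)
harm = harm_score([0.0, 1.0, 0.0, 2.0])
print(f"Game Progress: {progress:.0%}  Harm: {harm:.2f}")
```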
Observed failure modes and behavioral findings:
- Hallucination of prior interactions: agents sometimes assert they have performed actions (e.g., picked up an item) when they have not.
- Repetition bias: as the context grows, agents more often repeat earlier actions from history rather than synthesizing new plans.
- Spatial reasoning breakdowns: examples include failing to retrace a climb in Wishbringer (which only requires reversing directions already present in the history) and broad struggles with the Maze in Zork I.
- Efficiency-performance trade-off: higher test-time compute and more generated reasoning tokens improve performance up to a point, but many intermediate exploratory actions (like basic navigation) can be executed with low reasoning depth.
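As a side note on the repetition failure mode, an evaluation harness or agent scaffold could flag exact re-tries with a simple sliding window over recent moves. This heuristic is an assumption for illustration only and is not part of TextQuests.

```python
from collections import deque


def is_repetition(location: str, action: str, recent: deque) -> bool:
    """True if this exact (location, action) pair was already tried recently."""
    return (location.lower(), action.strip().lower()) in recent


# Keep only the last 50 moves so old, legitimately revisited actions age out.
recent_moves: deque = deque(maxlen=50)
recent_moves.append(("west of house", "open mailbox"))

print(is_repetition("West of House", "OPEN MAILBOX", recent_moves))  # True
print(is_repetition("West of House", "go north", recent_moves))      # False
```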
Key takeaways
- TextQuests provides a reproducible, open benchmark to stress-test LLMs on long-horizon, exploratory agent tasks using 25 Infocom games.
- Full-history, long-context evaluation exposes hallucination, repetition, and spatial-reasoning failures in current frontier models.
- Performance tends to improve with more test-time compute, but efficiency matters — many exploratory steps do not require deep reasoning.
- The benchmark includes a Harm metric to surface potentially harmful agent behaviors as part of evaluation.
- TextQuests is open-sourced to help the research community better understand LLM agent capabilities; open-source model builders can submit to the TextQuests Leaderboard by contacting [email protected].
FAQ
- What kinds of games are included in TextQuests? TextQuests uses 25 classic Infocom interactive fiction games, which are text-based adventures that historically required many actions and long playtimes to solve.
- How are models evaluated? Each model has two evaluation runs (With Clues and No Clues), each capped at 500 steps and stopping early if the game is completed. The full game history is maintained throughout.
- What metrics does TextQuests report? The benchmark reports Game Progress, based on labeled checkpoints, and Harm, an averaged score tracking certain in-game harmful actions.
- Why is long context important for this benchmark? The games demand multi-step planning and learning over long sessions. Context windows can exceed 100K tokens, requiring models to reason consistently over extensive histories.
- How can teams participate or submit models? Open-source model builders are invited to submit to the TextQuests Leaderboard by emailing [email protected], as described in the announcement.
References
- Original announcement: TextQuests
- Example community resource referenced in the source: https://github.com/CharlesCNorton/Language-Model-Tools/tree/main/AutoMUD