Scaling LangGraph Agents in Production: From One User to 1,000 Coworkers
Source: NVIDIA Dev Blog, https://developer.nvidia.com/blog/how-to-scale-your-langgraph-agents-in-production-from-a-single-user-to-1000-coworkers/
TL;DR
- NVIDIA scaled its LangGraph-based AI-Q research agent from a single user to hundreds of concurrent users, using the NeMo Agent Toolkit and an OpenShift-based production stack.
- The process began with evaluation and profiling to quantify behavior, timing, and token usage, guiding bottleneck identification and hardware decisions.
- Load testing with the toolkit sizing calculator explored 10–50 concurrent users, enabling hardware forecasting (e.g., one GPU for about 10 concurrent users) and replication planning.
- Observability and a phased rollout, using OTEL collectors and Datadog, provided traces, logs, and performance visibility across user sessions.
Context and background
You’ve built a powerful AI agent and are ready to share it with your colleagues, but you have one big fear: will the agent work if 10, 100, or even 1,000 coworkers try to use it at the same time? Answering this question is a key part of bringing an AI agent to production. NVIDIA faced this exact challenge during the internal deployment of a deep-research agent built with LangGraph, using the AI-Q NVIDIA Blueprint. The blueprint is open source and designed for on-premise deployment, and it formed the starting point for our production rollout of a deep-research assistant.

The AI-Q research agent supports document upload with metadata extraction, access to internal data sources, and web search to generate research reports. The blueprint is implemented with the NeMo Agent Toolkit and uses NVIDIA NeMo Retriever models for document ingest, retrieval, and large language model (LLM) invocations. Our production environment runs on an internal OpenShift cluster, following an AI factory reference architecture, with access to locally deployed NVIDIA NIM microservices and third-party observability tools.

The central question was: which parts of the system needed to scale to support hundreds of users across different NVIDIA teams? We approached this with a three-step process, applying NeMo Agent Toolkit capabilities at each phase. A key insight is that there is no universal guideline like “one GPU per 100 users,” because each agentic application behaves differently.

The first step is to understand how the application behaves for a single user and to quantify that behavior through evaluation and profiling. The NeMo Agent Toolkit provides an evaluation and profiling system that makes it practical to gather data and reach a quantitative understanding of the application’s behavior. To run the evaluation, we added an eval section to the application’s config file, including a dataset of sample user inputs. Agentic applications are not deterministic, so profiling across a range of inputs helps reveal performance characteristics under realistic usage.

The AI-Q research agent is a LangGraph application that uses the NeMo Agent Toolkit function wrappers, which let the profiler automatically capture timing and token usage for different components. We can also mark sub-steps within the application by adding simple decorators to the functions of interest. The eval workflow executes across the input dataset and computes a range of useful metrics. One output is a Gantt (waterfall) chart showing which functions run during each phase of a user session, which helps identify bottlenecks. In our case, the main bottleneck was calls to the NVIDIA Llama Nemotron Super 49B reasoning LLM, which guided us to replicate and scale out the NVIDIA NIM deployment for that LLM.

Beyond timing and token metrics, the evaluation tool can compute domain-specific metrics. We benchmarked different code versions to ensure optimizations did not compromise report quality. The toolkit supports multiple output formats, including exporting results to platforms such as Weights & Biases to track experiments over time. The combination of profiling, evaluation, and external visualization let us establish a baseline for the single-user experience and prepare for multi-user load testing.

For deployment orchestration and observability, we relied on our internal OpenShift stack and the NeMo Agent Toolkit OpenTelemetry (OTEL) collector, along with Datadog, to capture logs, performance data, and LLM traces.
The OTEL collector integration enables tracing at the level of individual user sessions, while aggregated traces provide a broader view of platform and LLM behavior across users. This integrated approach was essential for understanding both application performance and the LLM’s behavior as concurrency increased. Working with the NeMo Agent Toolkit and our AI factory reference partners allowed us to deploy an internal version of the AI-Q NVIDIA Blueprint and build a research agent with confidence.
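To make per-session tracing concrete, the sketch below shows how an agent could emit one parent span per user session and child spans per sub-step using the standard OpenTelemetry Python SDK, exporting OTLP to a local collector (which can forward to Datadog). The service name, collector endpoint, and step names are illustrative assumptions; this is not the NeMo Agent Toolkit’s built-in exporter.

```python
# Minimal OpenTelemetry tracing sketch (illustrative; not the toolkit's built-in exporter).
# Assumes: pip install opentelemetry-sdk opentelemetry-exporter-otlp
# and an OTEL collector listening on localhost:4317 (endpoint is an assumption).
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider(resource=Resource.create({"service.name": "ai-q-research-agent"}))
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317")))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("ai-q.agent")

def run_research_session(query: str, session_id: str) -> str:
    # One parent span per user session, with child spans per sub-step, so traces
    # can be inspected per session and aggregated across sessions.
    with tracer.start_as_current_span("research_session") as session_span:
        session_span.set_attribute("session.id", session_id)
        with tracer.start_as_current_span("retrieve_documents"):
            docs = ["..."]  # placeholder for the retrieval sub-step
        with tracer.start_as_current_span("llm_generate_report") as llm_span:
            report = f"report for {query} using {len(docs)} docs"  # placeholder for the LLM call
            llm_span.set_attribute("llm.output_chars", len(report))
        return report
```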
What’s new
The three-step approach described above culminated in a scalable deployment plan informed by concrete data rather than intuition. The key milestones included:
- Deep understanding of a single-user workflow using evaluation and profiling to capture timing, tokens, and sub-step execution.
- Identification of bottlenecks, notably LLM invocations to the NVIDIA Llama Nemotron Super 49B, which focused scaling efforts on replicating the NIM deployment serving that LLM.
- Load testing across increasing concurrency (10, 20, 30, 40, 50 users) with the NeMo Agent Toolkit sizing calculator to simulate parallel workflows and gather metrics such as p95 timing for LLM invocations and the overall workflow (a simplified sketch of this kind of measurement appears at the end of this section).
- Use of the sizing calculator to forecast hardware requirements, including extrapolating GPU needs (e.g., one GPU supports about 10 concurrent users within the latency threshold, implying roughly 10 GPUs for 100 concurrent users).
- Bug discovery and remediation during load testing, including a CPU allocation issue in the NIM service caused by a misconfigured Helm chart, and the introduction of retries and improved error handling to tolerate LLM timeouts without cascading failures.
- Phased rollout: deploying with replicas across system components, guided by performance observations and a phased expansion from small teams to broader user groups.
- Enhanced observability: employing the NeMo Agent Toolkit OTEL collector and Datadog to capture and visualize traces, logs, and performance data, enabling per-session transparency and cross-session aggregation.

This combination of profiling, load testing, and phased deployment provided the confidence required to scale the AI-Q research agent to an internal user base spread across NVIDIA teams. The results demonstrated how the NeMo Agent Toolkit can drive data-driven decisions about resource allocation and system design when expanding agentic applications in production.
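As a rough illustration of what the sizing calculator measures (not its actual implementation), the sketch below runs simulated workflows at increasing concurrency with asyncio, reports the p95 latency at each level, and applies the article’s roughly-10-users-per-GPU observation to extrapolate GPU counts. The simulated latencies and the latency threshold are placeholder assumptions.

```python
# Minimal concurrency sketch illustrating the kind of measurement a sizing
# calculator performs; this is not the NeMo Agent Toolkit implementation.
import asyncio
import math
import random
import statistics
import time

async def run_workflow(user_id: int) -> float:
    """Placeholder for one end-to-end agent workflow; returns latency in seconds."""
    start = time.perf_counter()
    await asyncio.sleep(random.uniform(1.0, 3.0))  # stand-in for retrieval + LLM calls
    return time.perf_counter() - start

async def measure(concurrency: int) -> float:
    """Run `concurrency` workflows in parallel and return the p95 latency."""
    latencies = await asyncio.gather(*(run_workflow(i) for i in range(concurrency)))
    # p95: the latency below which 95% of workflows completed.
    return statistics.quantiles(latencies, n=20)[18]

async def main() -> None:
    users_per_gpu = 10   # observed capacity at the latency threshold (from the article)
    target_users = 100
    for concurrency in (10, 20, 30, 40, 50):
        p95 = await measure(concurrency)
        print(f"{concurrency:>3} concurrent users -> p95 workflow latency {p95:.2f}s")
    print(f"Estimated GPUs for {target_users} users: {math.ceil(target_users / users_per_gpu)}")

asyncio.run(main())
```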
Why it matters (impact for developers/enterprises)
For developers and enterprises delivering advanced agent-based applications, the NVIDIA approach exemplifies how to reduce risk when moving from a prototype to a production rollout. The key takeaways include:
- There is no universal heuristic for scaling AI agents; a data-driven approach rooted in profiling for a single-user scenario informs realistic multi-user capacity planning.
- A structured evaluation and profiling workflow helps identify bottlenecks early, enabling targeted replication of the most demanding components (for example, LLM calls) to meet concurrency targets.
- Sizing tools that simulate concurrent workflows and extract latency metrics provide concrete guidance for hardware investments and deployment topology, preventing over- or under-provisioning.
- Observability is essential: end-to-end traces, logs, and performance data enable operators to understand both application behavior and LLM dynamics under load, supporting graceful degradation and faster incident response.
- Phased rollouts coupled with monitoring minimize risk and allow teams to verify performance at scale before committing to broad deployment.

By adopting these practices and leveraging open-source tooling like the NeMo Agent Toolkit and the NVIDIA Blueprint, organizations can systematically scale complex agent-based applications while preserving the quality of research reports and user experience.
Technical details and implementation
The practical implementation followed a repeatable pattern anchored in the NeMo Agent Toolkit and NVIDIA blueprint philosophy. The steps included:
- Establishing a baseline with evaluation: add an eval section to the app’s config file, provide a dataset of representative user inputs, and run the evaluation to collect timing, token usage, and sub-step metrics. The profiler attached to the LangGraph-based AI-Q agent captures timing and token data across function wrappers; simple decorators enable sub-step timing (a hypothetical decorator sketch appears at the end of this section).
- Visualizing and interpreting results: a Gantt/Waterfall chart shows active functions during user sessions, highlighting which parts of the workflow are most sensitive to concurrency and where bottlenecks are likely to appear.
- Identifying bottlenecks: in the AI-Q case, the main bottleneck was calls to the NVIDIA Llama Nemotron Super 49B LLM, guiding replication and scaling of the NVIDIA NIM deployment to handle the LLM workload more effectively.
- Extending to multi-user load testing: the NeMo Agent Toolkit sizing calculator runs simulated workflows in parallel at different concurrency levels (10, 20, 30, 40, 50). The calculator records p95 timing for LLM invocations and for the overall workflow, enabling capacity planning and performance forecasting.
- Extrapolating hardware needs: using one GPU as a baseline, the team concluded that one GPU could support about 10 concurrent users within the latency threshold; this supported extrapolation to roughly 10 GPUs for 100 concurrent users, guiding replication and deployment design.
- Addressing issues uncovered during testing: a misconfiguration in the Helm chart led to insufficient CPU allocation for a NIM microservice, and timeouts in LLM calls prompted the addition of retries and better error handling for graceful degradation under failure scenarios (a generic retry sketch appears at the end of this section).
- Observability and monitoring: the OpenTelemetry (OTEL) collector, in concert with Datadog, captures per-session traces and aggregates performance data across traces. This visibility supports both application performance assessment and LLM behavior analysis during rollout.
- Phased rollout and observation: after validating performance on smaller teams, the deployment progressed in phases, with careful observation of latency trends and session counts to ensure stable operation during scaling.

These technical steps demonstrate how a production-grade deployment can be planned, tested, and scaled using the NeMo Agent Toolkit and related NVIDIA tooling, while maintaining focus on report quality and user experience. The approach is aligned with NVIDIA’s AI factory reference architecture and the on-premise blueprint for deep-research applications.
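The toolkit’s function wrappers and decorators capture sub-step timing automatically; to illustrate what “marking a sub-step” amounts to, here is a generic, hypothetical timing decorator. It is not the NeMo Agent Toolkit API, and the step and function names are made up for illustration.

```python
# Hypothetical sub-step timing decorator -- illustrates the idea of marking
# functions for profiling; it is NOT the NeMo Agent Toolkit decorator API.
import functools
import time
from collections import defaultdict

STEP_TIMINGS: dict[str, list[float]] = defaultdict(list)

def profile_step(name: str):
    """Record wall-clock time for each call to the wrapped sub-step."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                STEP_TIMINGS[name].append(time.perf_counter() - start)
        return wrapper
    return decorator

@profile_step("web_search")
def web_search(query: str) -> list[str]:
    return [f"result for {query}"]  # placeholder for the real sub-step

web_search("test query")
print({step: sum(times) for step, times in STEP_TIMINGS.items()})
```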
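Similarly, the retry and error-handling change can be sketched as a generic retry-with-backoff wrapper around an LLM call. The timeout exception, retry counts, and backoff schedule below are illustrative assumptions rather than the production configuration.

```python
# Generic retry-with-exponential-backoff sketch for LLM calls; timeout values,
# retry counts, and the placeholder call itself are illustrative assumptions.
import random
import time

class LLMTimeoutError(Exception):
    """Raised by the (placeholder) LLM client when a request times out."""

def call_llm(prompt: str) -> str:
    # Placeholder for the real NIM/LLM request; fails intermittently to show retries.
    if random.random() < 0.3:
        raise LLMTimeoutError("simulated timeout")
    return f"completion for: {prompt}"

def call_llm_with_retries(prompt: str, max_attempts: int = 3, base_delay: float = 1.0) -> str:
    for attempt in range(1, max_attempts + 1):
        try:
            return call_llm(prompt)
        except LLMTimeoutError:
            if attempt == max_attempts:
                raise  # surface the error instead of letting it cascade silently
            # Exponential backoff with jitter so retries from many users don't align.
            time.sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.5))
    raise RuntimeError("unreachable")

print(call_llm_with_retries("Summarize the latest findings"))
```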
Key takeaways
- Start with a thorough single-user evaluation to quantify behavior, timing, and token usage before scaling.
- Use profiling and a Gantt-style view to identify bottlenecks early, with a specific focus on LLM invocations when scaling LangGraph agents.
- Employ a load-testing calculator to simulate real-world concurrency and forecast hardware needs before full deployment.
- Plan for hardware scaling in a phased manner, using replicas and observability to validate performance at each step.
- Implement robust error handling and retries to prevent intermittent LLM timeouts from cascading into broader failures.
- Leverage OpenTelemetry and Datadog to gain end-to-end visibility and to monitor both application performance and LLM behavior.
FAQ
- What tooling was central to the scaling effort?
  The NeMo Agent Toolkit was used for evaluation, profiling, and load testing, complemented by the NVIDIA Blueprint for on-premise deployment and OpenTelemetry collectors with Datadog for observability. [NVIDIA Dev Blog](https://developer.nvidia.com/blog/how-to-scale-your-langgraph-agents-in-production-from-a-single-user-to-1000-coworkers/)
- What bottleneck guided the scaling strategy in AI-Q?
  The main bottleneck identified was calls to the NVIDIA Llama Nemotron Super 49B reasoning LLM, which informed where to scale the LLM deployment (NIM) to meet concurrency demands. [NVIDIA Dev Blog](https://developer.nvidia.com/blog/how-to-scale-your-langgraph-agents-in-production-from-a-single-user-to-1000-coworkers/)
- How were hardware needs estimated for higher concurrency?
  The team used the toolkit sizing calculator to run simulated workflows at increasing concurrency (10, 20, 30, 40, 50) and then extrapolated GPU requirements based on observed p95 timings and latency thresholds. [NVIDIA Dev Blog](https://developer.nvidia.com/blog/how-to-scale-your-langgraph-agents-in-production-from-a-single-user-to-1000-coworkers/)
- How was observability implemented during rollout?
  The OpenTelemetry collector, together with Datadog, captured logs, performance data, and LLM traces, enabling per-session tracing and cross-session performance analysis. [NVIDIA Dev Blog](https://developer.nvidia.com/blog/how-to-scale-your-langgraph-agents-in-production-from-a-single-user-to-1000-coworkers/)
- What was a key practical outcome of the load-testing phase?
  The load test uncovered configuration and timeout-handling issues, which were addressed with a CPU allocation fix and improved retry/error-handling logic to prevent failures at higher concurrency. [NVIDIA Dev Blog](https://developer.nvidia.com/blog/how-to-scale-your-langgraph-agents-in-production-from-a-single-user-to-1000-coworkers/)
References
- NVIDIA Dev Blog: How to Scale Your LangGraph Agents in Production From a Single User to 1,000 Coworkers — https://developer.nvidia.com/blog/how-to-scale-your-langgraph-agents-in-production-from-a-single-user-to-1000-coworkers/