
Maximizing Low-Latency Networking Performance for Financial Services with NVIDIA Rivermax and NEIO FastSockets

Source: https://developer.nvidia.com/blog/maximizing-low-latency-networking-performance-for-financial-services-with-nvidia-rivermax-and-neio-fastsocket/ (NVIDIA Developer Blog)

TL;DR

  • NVIDIA Rivermax provides a highly optimized, cross-platform software library that delivers ultra-high throughput with precise, hardware-paced packet delivery and low CPU utilization for data streaming workloads.
  • NEIO Systems' FastSockets builds on Rivermax's kernel-bypass techniques for dropless UDP/TCP communication, enabling data to move directly from the NIC to the application and reducing serialization delays and latency.
  • Using NVIDIA ConnectX adapters, FastSockets enables zero‑copy data paths and hardware‑assisted pacing, delivering markedly higher packet rates and lower latency than traditional sockets or RIO in applicable scenarios.
  • GPUDirect complements this stack by allowing direct NIC-to-GPU memory transfers, enabling real‑time AI inference on market data with minimal CPU involvement.
  • The combination targets latency‑sensitive domains such as algorithmic trading, live media, and real‑time data distribution, while acknowledging UDP’s unreliability and the need for application‑level handling of packet loss.

Context and background

Ultra-low latency and reliable packet delivery are critical requirements for modern applications in sectors such as financial services, cloud gaming, and media and entertainment. In these domains, microseconds of delay or even a single dropped packet can have significant consequences, including financial losses or degraded user experiences. Traditional socket stacks struggle to sustain line-rate throughput and low latency as network speeds scale to 10/25/50/100/200 GbE and beyond.

NVIDIA Rivermax addresses these challenges as a highly optimized, IP-based, cross-platform software library that delivers exceptional performance for media and data streaming applications. By combining NVIDIA GPU-accelerated computing with high-performance NICs, Rivermax achieves a unique mix of ultra-high throughput, precise hardware-paced packet delivery, minimal latency, and low CPU utilization. Its architecture emphasizes pushing data efficiently through the stack as speeds rise, avoiding the classic bottlenecks of kernel-space networking.

FastSockets from NEIO Systems Ltd. is a flexible middleware library focused on high-performance UDP and TCP communications, with a primary emphasis on "dropless" delivery at the lowest possible latency and the highest possible bandwidth. When used with NVIDIA ConnectX adapters, FastSockets leverages Rivermax to enable kernel-bypass techniques that deliver data directly from the NIC to the application, minimizing latency and maximizing packet rates. This direct path reduces the system overhead and serialization delays that typically constrain high-speed data flows in traditional kernel-based sockets.

In many modern networking applications, UDP is the preferred transport for low-latency data delivery (for example, machine-vision video streams or real-time market data distribution) because it avoids connection setup and per-packet acknowledgments. However, UDP is connectionless and does not guarantee delivery, so applications must handle losses themselves. FastSockets' dropless UDP reception and Rivermax's kernel-bypass pathways minimize the impact of packet loss on latency by providing the fastest possible path from NIC to application, while still allowing application-level recovery or retransmission strategies when needed.

This article presents the high-level rationale, architecture, and observed performance characteristics of the Rivermax + FastSockets stack, with a focus on Windows, where Rivermax offers particular advantages. RIO (Registered I/O) benchmarks are included for context, though RIO offers only a limited view for a comprehensive networking performance evaluation in this use case. See the source post for the original hardware and software context.
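Because UDP itself provides no delivery guarantees, the recovery mentioned above lives at the application level. As a rough illustration (hypothetical C++ code, not part of Rivermax or FastSockets), the sketch below shows one common pattern: a per-stream sequence number carried in each datagram lets the receiver detect gaps and trigger a snapshot request or retransmission out of band.

```cpp
#include <cstdint>
#include <cstdio>
#include <utility>
#include <vector>

// Hypothetical market-data header: the feed is assumed to stamp each UDP
// datagram with a monotonically increasing sequence number.
struct PacketHeader {
    uint64_t sequence;
};

// Tracks the next expected sequence number and records gaps so the
// application can request a snapshot or retransmission out of band.
class GapDetector {
public:
    // Returns how many packets were missed immediately before this one.
    uint64_t onPacket(const PacketHeader& hdr) {
        uint64_t missed = 0;
        if (expected_ != 0 && hdr.sequence > expected_) {
            missed = hdr.sequence - expected_;                 // gap detected
            gaps_.emplace_back(expected_, hdr.sequence - 1);   // inclusive range
        }
        expected_ = hdr.sequence + 1;
        return missed;
    }

    const std::vector<std::pair<uint64_t, uint64_t>>& gaps() const { return gaps_; }

private:
    uint64_t expected_ = 0;                            // 0 means no packet seen yet
    std::vector<std::pair<uint64_t, uint64_t>> gaps_;  // [first, last] missing ranges
};

int main() {
    GapDetector detector;
    // Simulated arrival order: packets 5 and 6 are lost.
    for (uint64_t seq : {1, 2, 3, 4, 7, 8}) {
        if (uint64_t missed = detector.onPacket(PacketHeader{seq})) {
            std::printf("gap of %llu packet(s) before seq %llu\n",
                        static_cast<unsigned long long>(missed),
                        static_cast<unsigned long long>(seq));
        }
    }
    return 0;
}
```

In practice, the recorded gap ranges would drive a recovery channel (for example, a retransmission or snapshot request) while the fast UDP path keeps latency low.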

What’s new

The integration of NVIDIA Rivermax with NEIO FastSockets represents a concerted effort to push end‑to‑end latency down while sustaining line‑rate throughput at 25 GbE and beyond. Key elements include:

  • Kernel bypass: By bypassing the traditional Linux/Windows socket stack, data is placed directly into the application’s buffers, enabling zero‑copy paths and reducing serialization overhead. This diminishes CPU load and allows the NIC to sustain higher packet rates.
  • Dropless UDP reception: FastSockets emphasizes delivering UDP packets with minimal or no loss‑triggered retransmission delays from the application perspective, which is critical for latency‑sensitive workloads such as market data streams and machine‑vision pipelines.
  • Hardware pacing and throughput: Rivermax provides hardware‑paced packet delivery that, when paired with FastSockets, enables sustained line‑rate throughput on high‑speed NICs. In contrast, traditional sockets tend to lag behind at elevated speeds, and RIO benchmarks offer a limited view of performance in this context.
  • GPU‑accelerated data paths: GPUDirect technology enables direct memory access between NICs and GPUs, bypassing the CPU to reduce latency. Market data can stream directly into GPU memory, enabling rapid AI model inference and decision logic without PCIe bottlenecks.
  • AI and streaming integration: AI models deployed for ultra‑low‑latency inference leverage ONNX, NVIDIA TensorRT, and CUDA, with models often distilled and quantized for minimal size and latency. Rivermax + GPUDirect enables real‑time data to flow from network to GPU, enabling faster quoting decisions and risk assessment during volatile periods.
  • Platform coverage: FastSockets is available for both Linux and Windows, with a focus on Windows performance advantages in Rivermax deployments using NVIDIA ConnectX adapters. The combination yields a compelling path for low-latency trading, streaming analytics, and end-to-end AI inference pipelines.

The table below contrasts traditional sockets, RIO, and FastSockets via Rivermax on 25 GbE hardware across the core performance dimensions observed in the cited context, highlighting relative strengths rather than numeric values; a baseline receive-loop sketch follows the table.

| Metric | Traditional sockets | RIO | FastSockets via Rivermax |
|---|---|---|---|
| Sustained throughput | Generally below line rate at high speeds | Limited by RIO capabilities in this context | Reaches sustained line-rate throughput |
| Average packet rate | Lower, with more serialization delay | Higher than traditional sockets but limited by RIO | Dramatically higher, with reduced serialization delays |
| End-to-end latency | Higher overall | Higher latency than FastSockets | Significantly lower min, mean, median, and max latency |
| Serialization delay | Not optimized for high rates | Moderate optimization | Substantially lower due to zero-copy and bypass |
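To ground the "traditional sockets" column, here is a generic POSIX UDP receive loop (a minimal baseline sketch, not FastSockets or Rivermax code; the port number is arbitrary). Every datagram is copied from kernel buffers into user memory and each call crosses the kernel boundary, which are exactly the overheads a kernel-bypass, zero-copy path removes.

```cpp
// Conventional BSD-sockets UDP receiver: one kernel->user copy per datagram
// and one system call per receive. Error handling kept minimal for brevity.
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdio>
#include <cstring>

int main() {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(5555);             // example port, adjust as needed
    if (bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) < 0) {
        perror("bind");
        return 1;
    }

    char buf[2048];                           // each datagram is copied here
    for (;;) {
        ssize_t n = recvfrom(fd, buf, sizeof(buf), 0, nullptr, nullptr);
        if (n < 0) { perror("recvfrom"); break; }
        // Application processing starts only after the kernel->user copy.
        std::printf("received %zd bytes\n", n);
    }
    close(fd);
    return 0;
}
```

On Windows, the equivalent baseline uses Winsock's recvfrom; RIO reduces some per-call overhead but, as the table indicates, still trails the kernel-bypass path in this context.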

Why it matters (impact for developers/enterprises)

For developers and enterprises operating latency-sensitive workloads, the Rivermax + FastSockets stack offers a practical path to dramatically reduce data path latency while maintaining or increasing throughput. In algorithmic trading, every microsecond matters; data arriving in GPU memory via GPUDirect enables AI inference to react to market changes more rapidly, supporting improved quoting decisions and risk controls. In machine vision pipelines, dropped packets translate into visible glitches or buffering delays, so a dropless UDP path with low serialization delays can improve the reliability of real-time inspection and analytics. In media and entertainment workflows, the combination helps sustain high-quality streams with minimal buffering under heavy network load.

For enterprises evaluating cloud or on-premises deployments, the Rivermax + FastSockets approach provides a hardware-accelerated software foundation that ties together NIC capabilities, kernel bypass, and GPU acceleration. The result is lower CPU overhead, higher packet rates, and a path to real-time AI processing on streaming data. These benefits are particularly relevant as networks scale toward 25, 50, 100 GbE and beyond, where traditional socket stacks increasingly become the bottleneck.

Technical details and implementation

  • Rivermax is a highly optimized, IP-based, cross-platform software library designed to deliver exceptional performance for media and data streaming applications. It leverages GPU-accelerated computing and high-performance NICs to achieve ultra-high throughput, precise hardware pacing, minimal latency, and low CPU utilization, addressing the mismatch between rising network speeds and traditional socket performance.
  • FastSockets by NEIO Systems focuses on high-performance UDP/TCP communications with a core emphasis on zero-copy, low latency, and high bandwidth. It is designed to overcome the limitations of traditional sockets at high speeds and integrates with Rivermax through NVIDIA ConnectX adapters.
  • Kernel bypass and zero-copy: Rivermax delivers data directly into application buffers, removing kernel copying steps and enabling dynamic buffer sizing. This reduces serialization delays and allows higher packet rates to be sustained at line rate.
  • Dropless UDP reception: FastSockets' dropless pathway reduces the latency associated with retransmissions and recovery logic in the application while maintaining delivery semantics appropriate for the use case. UDP's lack of built-in reliability is mitigated by application design and the fast data path.
  • GPUDirect: Direct memory access from NICs to GPUs reduces PCIe bottlenecks and enables rapid data ingestion for AI inference. This hardware path supports streaming market data into GPU memory for near-instantaneous processing with the ONNX, TensorRT, and CUDA ecosystems.
  • Platforms and scope: FastSockets is available for Linux and Windows; Windows results are emphasized because of Rivermax-specific advantages on that platform in this context. The RIO benchmark has limited scope for a comprehensive evaluation here.
  • AI and inference stack: Market data can be streamed into GPUs for real-time AI inference, with models built on ONNX, TensorRT, and CUDA. The goal is rapid decision logic during high-volume or high-risk periods, such as volatile trading windows. A data-movement sketch follows this list.
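As a rough sketch of the data flow described above (assuming a CUDA toolkit is available; this is not Rivermax, FastSockets, or GPUDirect API code), the example below shows the conventional host-staged path for getting received bytes into GPU memory. A GPUDirect-enabled deployment removes the staging copy by letting the NIC DMA directly into device memory, which is the latency saving this stack relies on.

```cpp
// Conventional host-staged ingestion into GPU memory; error checking
// omitted for brevity. GPUDirect would eliminate hostBuf and the
// host-to-device copy entirely.
#include <cuda_runtime.h>
#include <cstdio>
#include <cstring>

int main() {
    const size_t kBufSize = 1 << 20;        // 1 MiB staging buffer (example size)

    char* hostBuf = nullptr;                // pinned host memory, required for async DMA
    cudaHostAlloc(reinterpret_cast<void**>(&hostBuf), kBufSize, cudaHostAllocDefault);

    void* devBuf = nullptr;                 // destination buffer in GPU memory
    cudaMalloc(&devBuf, kBufSize);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // In a real receiver, the network stack would fill hostBuf with market
    // data; placeholder bytes stand in here.
    std::memset(hostBuf, 0, kBufSize);

    // Stage the received bytes into device memory. This is the copy a
    // GPUDirect-enabled path avoids.
    cudaMemcpyAsync(devBuf, hostBuf, kBufSize, cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);

    // An inference engine (for example a TensorRT or ONNX Runtime session
    // bound to the same stream) would consume devBuf at this point.
    std::printf("staged %zu bytes into GPU memory\n", kBufSize);

    cudaFree(devBuf);
    cudaFreeHost(hostBuf);
    cudaStreamDestroy(stream);
    return 0;
}
```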

Key takeaways

  • Rivermax and NEIO FastSockets provide a high‑performance path for ultra‑low latency networking at scale, leveraging kernel bypass and zero‑copy data movement.
  • The combination reduces CPU overhead, increases packet rates, and achieves lower latency than traditional sockets and RIO in the tested context.
  • GPUDirect enables direct NIC‑to‑GPU data paths, enabling near‑real‑time AI inference on streaming data without CPU‑bound PCIe bottlenecks.
  • UDP’s low‑overhead nature is preserved for latency‑sensitive use cases while allowing application‑level management of packet loss.
  • The solution spans Windows and Linux, with particular performance considerations highlighted for Windows deployments using Rivermax and ConnectX adapters.

FAQ

  • What is Rivermax?

    Rivermax is an optimized IP‑based cross‑platform software library designed to deliver ultra‑high throughput, precise packet pacing in hardware, minimal latency, and low CPU utilization for media and data streaming workloads. [NVIDIA Rivermax description](https://developer.nvidia.com/blog/maximizing-low-latency-networking-performance-for-financial-services-with-nvidia-rivermax-and-neio-fastsocket/)

  • What is FastSockets?

    FastSockets is a flexible middleware library from NEIO Systems for high‑performance UDP and TCP communications, focused on dropless, low‑latency delivery and high bandwidth, integrated with Rivermax via NVIDIA ConnectX adapters. [NEIO FastSockets overview](https://developer.nvidia.com/blog/maximizing-low-latency-networking-performance-for-financial-services-with-nvidia-rivermax-and-neio-fastsocket/)

  • How does GPUDirect fit into this workflow?

    GPUDirect enables direct memory access between NICs and GPUs, bypassing the CPU to reduce latency and allowing streaming of market data directly into GPU memory for rapid AI inference. [GPUDirect context](https://developer.nvidia.com/blog/maximizing-low-latency-networking-performance-for-financial-services-with-nvidia-rivermax-and-neio-fastsocket/)

  • Which platforms are supported?

    FastSockets is available for Linux and Windows, with Rivermax providing performance advantages on Windows in this configuration; RIO benchmarks are noted as limited for comprehensive evaluation in this context. [Platform note](https://developer.nvidia.com/blog/maximizing-low-latency-networking-performance-for-financial-services-with-nvidia-rivermax-and-neio-fastsocket/)

  • Why is UDP used if reliability is a concern?

    UDP offers low latency and minimal overhead, which is critical for real-time streaming and trading data; the fast data path and the application together handle loss scenarios to meet latency targets while maintaining throughput. [UDP characteristics](https://developer.nvidia.com/blog/maximizing-low-latency-networking-performance-for-financial-services-with-nvidia-rivermax-and-neio-fastsocket/)
