Networking in an AI-Centric Data Center
AI workloads require:
- Ultra-low latency
- High bandwidth
- Deterministic performance
- Scalability across nodes
Networking must support:
- GPU-to-GPU communication
- Storage access
- Cluster management
- Infrastructure monitoring
Distributed training requires:
- High bandwidth
- Low latency
- Efficient collective communication
Typical stack:
- NCCL
- RDMA
- InfiniBand
- NVLink
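For a concrete feel, here is a minimal sketch of collective communication with PyTorch's NCCL backend (an assumption for illustration: a CUDA-capable node, launched with torchrun):

```python
# Minimal NCCL all-reduce sketch.
# Launch with: torchrun --nproc_per_node=2 allreduce_demo.py
import os
import torch
import torch.distributed as dist

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE in the environment.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each rank contributes a tensor; all_reduce sums them in place,
    # with NCCL routing traffic over NVLink/PCIe/RDMA as available.
    t = torch.ones(1024, device="cuda") * dist.get_rank()
    dist.all_reduce(t, op=dist.ReduceOp.SUM)

    print(f"rank {dist.get_rank()}: sum = {t[0].item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```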
Latency
Time taken for a single data transfer.
Important for:
- Real-time inference
- Synchronization
Throughput
Total data transferred per second.
Important for:
- Large distributed training
- Checkpointing
- Dataset streaming
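A back-of-the-envelope model makes the latency/throughput distinction concrete: transfer time ≈ latency + size / bandwidth. The numbers below are illustrative assumptions, not measurements:

```python
# Simple transfer-time model: time = latency + size / bandwidth.
def transfer_time(size_bytes, latency_s, bandwidth_Bps):
    return latency_s + size_bytes / bandwidth_Bps

LAT = 2e-6   # 2 µs link latency (illustrative)
BW = 50e9    # 50 GB/s link bandwidth (illustrative)

for size in (1_000, 1_000_000, 1_000_000_000):  # 1 KB, 1 MB, 1 GB
    t = transfer_time(size, LAT, BW)
    print(f"{size:>13,} B -> {t * 1e6:12.1f} µs")
# Small messages are latency-bound; large transfers are bandwidth-bound.
```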
Network Separation
AI data centers use separate network planes.
1. Compute Network
- GPU-to-GPU communication
- Used for training & distributed workloads
- Technologies:
- InfiniBand
- RoCE (RDMA over Converged Ethernet)
- NVLink (inside node)
- Priority: Ultra-low latency & high throughput
2. In-Band Management Network
- Management traffic over the servers' regular OS network stack
- Lower bandwidth, higher latency than compute fabric
- Critical for cluster operations and monitoring
- Technologies:
- SSH
- Job scheduling (Slurm)
- Kubernetes traffic
- DNS, cluster APIs
- Priority: Reliability and availability
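As a sketch of typical in-band management traffic, a small script that polls node state on a Slurm cluster (assumes sinfo is on PATH):

```python
# Sketch: poll node state over the in-band management network,
# assuming a Slurm cluster where `sinfo` is available.
import subprocess

def node_states():
    # -h suppresses the header; -N is node-oriented output;
    # %n = node hostname, %t = compact node state.
    out = subprocess.run(
        ["sinfo", "-h", "-N", "-o", "%n %t"],
        capture_output=True, text=True, check=True,
    ).stdout
    return dict(line.split() for line in out.splitlines() if line)

if __name__ == "__main__":
    for node, state in node_states().items():
        print(f"{node}: {state}")
```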
3. Out-of-Band Management Network
- Reachable through the BMC even when the host OS is down or the server is powered off
- Remote power control
- Remote console (IPMI, Redfish)
- Priority: Always-on access for management and recovery
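A hedged sketch of out-of-band access: querying a server's power state through its BMC with the standard Redfish REST API (the BMC address, credentials, and system ID are placeholders; resource paths can vary by vendor):

```python
# Sketch: query power state via the BMC's Redfish API over the
# out-of-band network. Address and credentials are placeholders.
import requests

BMC = "https://10.0.0.100"       # BMC IP (placeholder)
AUTH = ("admin", "password")     # use real secrets management in practice

def power_state(system_id="1"):
    # Standard Redfish ComputerSystem resource; IDs vary by vendor.
    r = requests.get(
        f"{BMC}/redfish/v1/Systems/{system_id}",
        auth=AUTH, verify=False, timeout=10,
    )
    r.raise_for_status()
    return r.json().get("PowerState")

if __name__ == "__main__":
    print("PowerState:", power_state())
```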
4. Storage Network
- High throughput for dataset access and checkpointing
- Technologies:
- NVMe-oF (NVMe over Fabrics)
- Parallel file systems (Lustre, BeeGFS)
- Often uses RDMA for low latency
- Priority: High bandwidth and low contention
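As an illustration, a sketch that attaches an NVMe-oF target over RDMA by wrapping nvme-cli (requires root; the target address and NQN below are placeholders):

```python
# Sketch: attach an NVMe-oF target over RDMA using nvme-cli.
# 4420 is the conventional NVMe-oF port.
import subprocess

def nvme_connect(traddr, nqn, trsvcid="4420"):
    subprocess.run(
        ["nvme", "connect",
         "--transport", "rdma",
         "--traddr", traddr,
         "--trsvcid", trsvcid,
         "--nqn", nqn],
        check=True,
    )

if __name__ == "__main__":
    # Placeholder target address and NQN.
    nvme_connect("192.168.10.20", "nqn.2024-01.io.example:storage01")
```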
DMA (Direct Memory Access)
Lets devices move data to and from memory without the CPU copying it.
- Bypasses CPU for data transfer
- Reduces latency
- Increases throughput
- Used in GPU interconnects and storage access
- Enables GPUDirect for efficient data movement
- Critical for high-performance AI workloads
- Supports zero-copy transfers between GPU and network/storage
RDMA (Remote Direct Memory Access)
- Extends DMA across servers: direct access to a remote host's memory over the network
- With GPUDirect RDMA, this includes direct access to remote GPU memory
Traditional Networking
CPU handles:
- Packet processing
- Memory copying
- Interrupts
RDMA
- Bypasses CPU
- Direct memory access across hosts
- Reduces latency
- Reduces CPU utilization
- Increases throughput
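On Linux, RDMA-capable devices (InfiniBand HCAs and RoCE NICs alike) register under /sys/class/infiniband, so a quick check looks like:

```python
# Sketch: list RDMA-capable devices on a Linux host.
# Both InfiniBand HCAs and RoCE NICs appear under this sysfs path.
import os

def rdma_devices(path="/sys/class/infiniband"):
    try:
        return sorted(os.listdir(path))
    except FileNotFoundError:
        return []  # no RDMA stack / devices present

if __name__ == "__main__":
    devs = rdma_devices()
    print("RDMA devices:", devs if devs else "none found")
```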
InfiniBand vs Ethernet
1. Ethernet
- General-purpose networking, widely deployed
- Higher latency (~10–100 µs typical)
- Uses the TCP/IP stack
- Commodity hardware
- Widely supported
2. InfiniBand
High throughput and low latency with low CPU overhead, for both compute and storage traffic
- Ultra-low latency (1–2 µs)
- Uses a native RDMA stack to access remote memory directly without CPU involvement
- Used in large HPC / AI clusters: over 50% of HPC clusters use InfiniBand
- HCAs (InfiniBand network interface cards) offload RDMA operations in hardware
- Managed by a Subnet Manager (SM) such as OpenSM
3. RDMA over Converged Ethernet (RoCE)
RDMA + Ethernet: enables RDMA over standard Ethernet
- Open alternative to InfiniBand
- More flexible than InfiniBand
- Cheaper and enterprise-friendly
- Used in enterprise AI clusters
NVIDIA hardware:
- Spectrum switches (Ethernet) + BlueField DPUs support RoCE for high-performance Ethernet-based AI clusters
- Quantum-X800 InfiniBand switches for high-performance InfiniBand-based AI clusters
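To see why the latency gap matters, an illustrative ring all-reduce model using the latency figures above (the standard 2×(N−1) communication steps, each carrying 1/N of the message; the bandwidth and message size are assumptions, not benchmarks):

```python
# Illustrative: per-message latency dominates small collectives.
# Ring all-reduce does 2*(N-1) steps, each moving 1/N of the data.
def allreduce_time(n_ranks, msg_bytes, latency_s, bandwidth_Bps):
    steps = 2 * (n_ranks - 1)
    return steps * (latency_s + (msg_bytes / n_ranks) / bandwidth_Bps)

for name, lat in (("InfiniBand", 2e-6), ("Ethernet/TCP", 50e-6)):
    t = allreduce_time(n_ranks=64, msg_bytes=64_000,
                       latency_s=lat, bandwidth_Bps=40e9)
    print(f"{name:>12}: {t * 1e3:.2f} ms for a 64 KB all-reduce on 64 ranks")
```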
GPU Interconnects (Compute Fabric)
1. PCIe
- Standard connection
- Higher latency
- Limited bandwidth (16–32 GB/s)
- Not ideal for multi-GPU scaling
2. NVLink
Chip-to-chip interconnect: GPU to GPU inside the same node → NVLink
- High-speed GPU-to-GPU communication inside a server
- Up to 600 GB/s
- Faster than PCIe
- Enables multi-GPU scaling
- Uses NVSwitch for scale
3. NVSwitch Fabric
Connects multiple GPUs in large systems
- Enables full bandwidth communication across large GPU arrays
- Used in DGX systems (e.g., connecting 8× H100 GPUs within a node) that make up the DGX SuperPOD
- Provides non-blocking, high-speed interconnect for large multi-GPU systems
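A quick sketch for checking which GPU pairs in a node can use peer-to-peer paths (PyTorch; note that nvidia-smi topo -m shows whether a given pair actually rides NVLink/NVSwitch or PCIe):

```python
# Sketch: check GPU peer-to-peer reachability inside one node.
# P2P may run over NVLink/NVSwitch or PCIe depending on topology.
import torch

n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU{i} -> GPU{j}: P2P {'yes' if ok else 'no'}")
```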
4. GPUDirect RDMA
GPU to GPU across nodes → GPUDirect RDMA
- GPU memory ↔ remote GPU memory over the network
- No CPU involvement
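NCCL uses GPUDirect RDMA automatically when the fabric supports it; a hedged way to verify is to raise NCCL's log level before initialization and inspect the transport lines it prints (on capable systems these mention GDRDMA, though the exact wording varies by NCCL version):

```python
# Sketch: surface which transport NCCL selects. NCCL_DEBUG=INFO makes
# NCCL log its chosen paths at communicator setup; on GPUDirect-RDMA-
# capable fabrics the transport lines mention GDRDMA (wording varies).
import os
os.environ["NCCL_DEBUG"] = "INFO"
# Optional: NCCL_NET_GDR_LEVEL tunes when NCCL uses GPUDirect RDMA.

import torch.distributed as dist
dist.init_process_group(backend="nccl")  # then run a collective as usual
```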
5. GPUDirect Storage
- Storage ↔ GPU memory
- Avoids system memory bottleneck
- No CPU involvement
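A sketch of GPUDirect Storage from Python via kvikio, NVIDIA's binding for cuFile (an assumption for illustration: kvikio and cupy are installed, and the file path is a placeholder):

```python
# Sketch: read a file straight into GPU memory with kvikio
# (NVIDIA's Python binding for cuFile / GPUDirect Storage).
import cupy
import kvikio

buf = cupy.empty(1_000_000, dtype=cupy.uint8)   # destination in GPU memory
f = kvikio.CuFile("/data/shard-000.bin", "r")   # placeholder path
try:
    nbytes = f.read(buf)                        # DMA: storage -> GPU memory
    print(f"read {nbytes} bytes directly into GPU memory")
finally:
    f.close()
```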
NVIDIA Network Operator
Simplifies the provisioning and management of NVIDIA networking resources in a Kubernetes cluster.
