

AI Infrastructure

AI infrastructure fundamentals covering GPU hardware selection, cluster scaling, power and cooling design, networking, high-speed interconnects, and DPU integration for modern data centers.

Written by Hitesh Sahu, a passionate developer and blogger.

Thu Feb 19 2026


Networking in an AI-Centric Data Center

AI workloads require:

  • Ultra-low latency
  • High bandwidth
  • Deterministic performance
  • Scalability across nodes

Networking must support:

  • GPU-to-GPU communication
  • Storage access
  • Cluster management
  • Infrastructure monitoring

Distributed training requires:

  • High bandwidth
  • Low latency
  • Efficient collective communication

It typically relies on (see the sketch below):

  • NCCL
  • RDMA
  • InfiniBand
  • NVLink
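
A minimal sketch of how these pieces come together: one all-reduce with PyTorch's NCCL backend, which rides on NVLink inside a node and RDMA/InfiniBand across nodes when available. This assumes a node with 2+ GPUs and a torchrun launch; the script name is illustrative, not from the original post.

```python
# Minimal sketch: one all-reduce with PyTorch's NCCL backend.
# Launch with e.g.: torchrun --nproc_per_node=2 allreduce_demo.py
# (the script name is illustrative).
import os
import torch
import torch.distributed as dist


def main():
    # torchrun exports RANK, LOCAL_RANK and WORLD_SIZE for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each rank contributes its rank id; NCCL runs the collective over
    # NVLink inside the node and RDMA/InfiniBand across nodes when available.
    x = torch.full((1024,), float(dist.get_rank()), device="cuda")
    dist.all_reduce(x, op=dist.ReduceOp.SUM)

    print(f"rank {dist.get_rank()}: reduced value = {x[0].item()}")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```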

Latency

Time taken for a single data transfer.

Important for:

  • Real-time inference
  • Synchronization

Throughput

Total data transferred per second.

Important for:

  • Large distributed training
  • Checkpointing
  • Dataset streaming
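
A quick way to see why both metrics matter is the simple cost model time ≈ latency + size / bandwidth: small synchronization messages are latency-bound, while large checkpoint or dataset transfers are bandwidth-bound. The numbers below (2 µs latency, 50 GB/s link) are illustrative assumptions, not measurements.

```python
# Illustrative cost model: time ≈ latency + size / bandwidth.
def transfer_time(size_bytes: float, latency_s: float, bandwidth_bytes_per_s: float) -> float:
    return latency_s + size_bytes / bandwidth_bytes_per_s


KB, GB = 1e3, 1e9

# A 4 KB synchronization message is dominated by latency...
print(transfer_time(4 * KB, latency_s=2e-6, bandwidth_bytes_per_s=50 * GB))   # ≈ 2.1 µs

# ...while a 10 GB checkpoint shard is dominated by bandwidth.
print(transfer_time(10 * GB, latency_s=2e-6, bandwidth_bytes_per_s=50 * GB))  # ≈ 0.2 s
```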

Network Separation

AI data centers use separate network planes.

1. Compute Network

  • GPU-to-GPU communication
  • Used for training & distributed workloads
  • Technologies:
    • InfiniBand
    • RoCE (RDMA over Converged Ethernet)
    • NVLink (inside node)
  • Priority: Ultra-low latency & high throughput

2. In-Band Management Network

  • Carries day-to-day cluster operations traffic
  • Lower bandwidth and higher latency than the compute fabric
  • Critical for cluster operations and monitoring
  • Technologies:
    • SSH
    • Job scheduling (Slurm)
    • Kubernetes traffic
    • DNS, cluster APIs
  • Priority: Reliability and availability

3. Out-of-Band Management Network

  • Always available, even when the server is offline
  • Remote power control
  • Remote console (IPMI, Redfish)
  • Priority: Always-on access for management and recovery

4. Storage Network

  • High throughput for dataset access and checkpointing
  • Technologies:
    • NVMe-oF (NVMe over Fabrics)
    • Parallel file systems (Lustre, BeeGFS)
  • Often uses RDMA for low latency
  • Priority: High bandwidth and low contention

DMA (Direct Memory Access)

Hardware moves data directly to or from memory, without the CPU copying it.

  • Bypasses CPU for data transfer
  • Reduces latency
  • Increases throughput
  • Used in GPU interconnects and storage access
  • Enables GPUDirect for efficient data movement
  • Critical for high-performance AI workloads
  • Supports zero-copy transfers between GPU and network/storage
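
To make this concrete, the sketch below stages data in pinned (page-locked) host memory and issues an asynchronous copy in PyTorch, letting the GPU's copy engine DMA the buffer without an extra staging copy. The buffer size and stream usage are illustrative, and a single CUDA GPU is assumed.

```python
# Sketch of DMA-friendly host-to-GPU copies in PyTorch (assumes one CUDA GPU).
import torch

n = 64 * 1024 * 1024  # 64M float32 values ≈ 256 MB (arbitrary size)

# Pageable host memory: the driver must stage it before the DMA transfer.
pageable = torch.randn(n)  # shown only for contrast

# Pinned (page-locked) host memory: eligible for direct, asynchronous DMA.
pinned = torch.randn(n, pin_memory=True)

copy_stream = torch.cuda.Stream()
with torch.cuda.stream(copy_stream):
    # Returns immediately; the copy engine DMAs the buffer on copy_stream.
    on_gpu = pinned.to("cuda", non_blocking=True)

copy_stream.synchronize()  # wait for the async copy before using the tensor
print(on_gpu.device, on_gpu.shape)
```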

RDMA (Remote Direct Memory Access)

  • Direct access to a remote host's memory over the network
  • Extends to remote GPU memory when combined with GPUDirect

Traditional Networking

CPU handles:

  • Packet processing
  • Memory copying
  • Interrupts

RDMA

  • Bypasses CPU
  • Direct memory access across hosts
  • Reduces latency
  • Reduces CPU utilization
  • Increases throughput
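
In practice, whether NCCL rides RDMA (InfiniBand/RoCE) or falls back to plain TCP sockets is steered through its environment variables. A sketch of the common knobs; the HCA and interface names are hypothetical and cluster-specific.

```python
# Sketch: steering NCCL toward (or away from) RDMA via its environment
# variables. The HCA and interface names are hypothetical and cluster-specific.
import os

os.environ["NCCL_DEBUG"] = "INFO"          # log which transport NCCL actually selects
os.environ["NCCL_IB_DISABLE"] = "0"        # allow InfiniBand/RoCE ("1" forces TCP sockets)
os.environ["NCCL_IB_HCA"] = "mlx5_0"       # hypothetical HCA; list real ones with ibv_devinfo
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"  # hypothetical interface for bootstrap traffic
os.environ["NCCL_NET_GDR_LEVEL"] = "PHB"   # permit GPUDirect RDMA up to a shared PCIe host bridge

# These must be set before torch.distributed.init_process_group(backend="nccl")
# runs in the training script.
```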

InfiniBand vs Ethernet

1. Ethernet

  • General-purpose networking widely used
  • Higher latency (~10–100 µs typical)
  • Uses TCP/IP stack
  • Commodity hardware
  • Widely supported

2. InfiniBand

High throughput and low latency with low CPU overhead, used for compute fabrics and storage access

  • Ultra-low latency (1–2 µs)
  • Uses Native RDMA Stack to access remote memory directly without CPU involvement
  • Used in large HPC / AI clusters: over 50% of HPC clusters use InfiniBand
  • HCA (Host Channel Adapter): the InfiniBand network interface card; offloads RDMA operations in hardware
  • Fabric is managed by a Subnet Manager (SM), e.g. OpenSM

3. RDMA over Converged Ethernet (RoCE)

RDMA + Ethernet: Enables RDMA over Ethernet

  • Open standard alternative to InfiniBand
  • More flexible than InfiniBand
  • Cheaper and more enterprise-friendly, since it runs on Ethernet switching
  • Used in enterprise AI clusters

NVIDIA hardware:

  • Spectrum switches (Ethernet) + BlueField DPUs support RoCE for high-performance Ethernet-based AI clusters
  • NVIDIA Quantum-X800 InfiniBand switches for high-performance InfiniBand-based AI clusters
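
Whether an RDMA-capable NIC is running native InfiniBand or RoCE can be read from the Linux rdma sysfs tree. A minimal sketch, assuming rdma-core drivers are loaded and at least one RDMA NIC is present:

```python
# Sketch: tell native InfiniBand apart from RoCE by reading the Linux
# rdma sysfs tree (assumes rdma-core drivers and at least one RDMA NIC).
from pathlib import Path

for device in sorted(Path("/sys/class/infiniband").glob("*")):
    for port in sorted((device / "ports").glob("*")):
        link_layer = (port / "link_layer").read_text().strip()
        # "InfiniBand" -> native IB fabric, "Ethernet" -> RoCE
        print(f"{device.name} port {port.name}: {link_layer}")
```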

GPU Interconnects (Compute Fabric)

1. PCIe

  • Standard connection
  • Higher latency
  • Limited bandwidth (16–32 GB/s)
  • Not ideal for multi-GPU scaling

2. NVLink (Chip-to-Chip Interconnect)

GPU to GPU inside same node → NVLink

  • High-speed GPU-to-GPU communication inside server
  • Up to 600 GB/s per GPU (A100 generation; 900 GB/s on H100)
  • Faster than PCIe
  • Enables multi-GPU scaling
  • Uses NVSwitch for scale
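
A quick way to check whether the GPUs in a node can reach each other peer-to-peer (over NVLink or PCIe P2P) is PyTorch's CUDA API; running nvidia-smi topo -m on the host then shows whether each path is NVLink or PCIe. A sketch assuming a node with two or more GPUs:

```python
# Sketch: check GPU peer-to-peer reachability inside one node with PyTorch
# (assumes 2+ GPUs). Whether each path is NVLink or PCIe is visible in
# nvidia-smi topo -m on the host.
import torch

count = torch.cuda.device_count()
for src in range(count):
    for dst in range(count):
        if src != dst:
            ok = torch.cuda.can_device_access_peer(src, dst)
            print(f"GPU {src} -> GPU {dst}: peer access {'yes' if ok else 'no'}")
```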

3. NVSwitch Fabric

Connects multiple GPUs in large systems

  • Enables full bandwidth communication across large GPU arrays
  • Used in DGX H100 systems (the building blocks of a DGX SuperPOD) to connect 8× H100 GPUs
  • Provides non-blocking, high-speed interconnect for large multi-GPU systems

4. GPUDirect RDMA

GPU to GPU across nodes → GPUDirect RDMA

  • GPU memory ↔ remote GPU
  • No CPU involvement

5. GPUDirect Storage

  • Storage ↔ GPU memory
  • Avoids system memory bottleneck
  • No CPU involvement
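
From Python, GPUDirect Storage is exposed through NVIDIA's KvikIO bindings to cuFile. A sketch assuming the kvikio and cupy packages and a GDS-capable filesystem; the file path is hypothetical, and without GDS support KvikIO falls back to a bounce buffer in host memory.

```python
# Sketch of GPUDirect Storage from Python via NVIDIA's KvikIO (cuFile) bindings.
# Assumes kvikio + cupy and a GDS-capable filesystem; the path is hypothetical.
import cupy as cp
import kvikio

buf = cp.empty(64 * 1024 * 1024, dtype=cp.float32)  # 256 MB GPU buffer

f = kvikio.CuFile("/data/checkpoints/shard_000.bin", "r")  # hypothetical path
try:
    nbytes = f.read(buf)  # storage -> GPU memory, no staging copy in host RAM
    print(f"read {nbytes} bytes directly into GPU memory")
finally:
    f.close()
```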

NVIDIA Network Operator

Simplifies the provisioning and management of NVIDIA networking resources in a Kubernetes cluster.
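
Once the operator is installed, RDMA and GPU devices surface as extended resources on the cluster nodes. A sketch that lists them with the official Kubernetes Python client; the resource-name prefixes are assumptions that depend on the operator configuration.

```python
# Sketch: list RDMA/GPU extended resources advertised by cluster nodes,
# using the official Kubernetes Python client. The resource-name prefixes
# below are assumptions that depend on operator configuration.
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() inside a pod
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    for name, quantity in (node.status.allocatable or {}).items():
        if "rdma" in name or name.startswith("nvidia.com/"):
            print(f"{node.metadata.name}: {name} = {quantity}")
```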
