Networking in an AI-Centric Data Center
AI workloads require:
- Ultra-low latency
- High bandwidth
- Deterministic performance
- Scalability across nodes
Networking must support:
- GPU-to-GPU communication
- Storage access
- Cluster management
- Infrastructure monitoring
Distributed training requires:
- High bandwidth
- Low latency
- Efficient collective communication
Typical stack:
- NCCL
- RDMA
- InfiniBand
- NVLink
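For a concrete feel, here is a minimal sketch of collective communication with PyTorch's NCCL backend (an assumption for illustration: a CUDA-capable node, launched with torchrun):

```python
# Minimal NCCL all-reduce sketch.
# Launch with: torchrun --nproc_per_node=2 allreduce_demo.py
import os
import torch
import torch.distributed as dist

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE in the environment.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each rank contributes a tensor; all_reduce sums them in place,
    # with NCCL routing traffic over NVLink/PCIe/RDMA as available.
    t = torch.ones(1024, device="cuda") * dist.get_rank()
    dist.all_reduce(t, op=dist.ReduceOp.SUM)

    print(f"rank {dist.get_rank()}: sum = {t[0].item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```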
Latency
Time taken for a single data transfer.
Important for:
- Real-time inference
- Synchronization
Throughput
Total data transferred per second.
Important for:
- Large distributed training
- Checkpointing
- Dataset streaming
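A back-of-the-envelope model makes the latency/throughput distinction concrete: transfer time ≈ latency + size / bandwidth. The numbers below are illustrative assumptions, not measurements:

```python
# Simple transfer-time model: time = latency + size / bandwidth.
def transfer_time(size_bytes, latency_s, bandwidth_Bps):
    return latency_s + size_bytes / bandwidth_Bps

LAT = 2e-6   # 2 µs link latency (illustrative)
BW = 50e9    # 50 GB/s link bandwidth (illustrative)

for size in (1_000, 1_000_000, 1_000_000_000):  # 1 KB, 1 MB, 1 GB
    t = transfer_time(size, LAT, BW)
    print(f"{size:>13,} B -> {t * 1e6:12.1f} µs")
# Small messages are latency-bound; large transfers are bandwidth-bound.
```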
Network Separation
AI data centers use separate network planes.
1. Compute Network
- GPU-to-GPU communication
- Used for training & distributed workloads
- Technologies:
- InfiniBand
- RoCE (RDMA over Converged Ethernet)
- NVLink (inside node)
- Priority: Ultra-low latency & high throughput
2. In-Band Management Network
- Management traffic over the servers' regular OS network stack
- Lower bandwidth, higher latency than compute fabric
- Critical for cluster operations and monitoring
- Technologies:
- SSH
- Job scheduling (Slurm)
- Kubernetes traffic
- DNS, cluster APIs
- Priority: Reliability and availability
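As a sketch of typical in-band management traffic, a small script that polls node state on a Slurm cluster (assumes sinfo is on PATH):

```python
# Sketch: poll node state over the in-band management network,
# assuming a Slurm cluster where `sinfo` is available.
import subprocess

def node_states():
    # -h suppresses the header; -N is node-oriented output;
    # %n = node hostname, %t = compact node state.
    out = subprocess.run(
        ["sinfo", "-h", "-N", "-o", "%n %t"],
        capture_output=True, text=True, check=True,
    ).stdout
    return dict(line.split() for line in out.splitlines() if line)

if __name__ == "__main__":
    for node, state in node_states().items():
        print(f"{node}: {state}")
```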
3. Out-of-Band Management Network
- Reachable through the BMC even when the host OS is down or the server is powered off
- Remote power control
- Remote console (IPMI, Redfish)
- Priority: Always-on access for management and recovery
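A hedged sketch of out-of-band access: querying a server's power state through its BMC with the standard Redfish REST API (the BMC address, credentials, and system ID are placeholders; resource paths can vary by vendor):

```python
# Sketch: query power state via the BMC's Redfish API over the
# out-of-band network. Address and credentials are placeholders.
import requests

BMC = "https://10.0.0.100"       # BMC IP (placeholder)
AUTH = ("admin", "password")     # use real secrets management in practice

def power_state(system_id="1"):
    # Standard Redfish ComputerSystem resource; IDs vary by vendor.
    r = requests.get(
        f"{BMC}/redfish/v1/Systems/{system_id}",
        auth=AUTH, verify=False, timeout=10,
    )
    r.raise_for_status()
    return r.json().get("PowerState")

if __name__ == "__main__":
    print("PowerState:", power_state())
```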
4. Storage Network
- High throughput for dataset access and checkpointing
- Technologies:
- NVMe-oF (NVMe over Fabrics)
- Parallel file systems (Lustre, BeeGFS)
- Often uses RDMA for low latency
- Priority: High bandwidth and low contention
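As an illustration, a sketch that attaches an NVMe-oF target over RDMA by wrapping nvme-cli (requires root; the target address and NQN below are placeholders):

```python
# Sketch: attach an NVMe-oF target over RDMA using nvme-cli.
# 4420 is the conventional NVMe-oF port.
import subprocess

def nvme_connect(traddr, nqn, trsvcid="4420"):
    subprocess.run(
        ["nvme", "connect",
         "--transport", "rdma",
         "--traddr", traddr,
         "--trsvcid", trsvcid,
         "--nqn", nqn],
        check=True,
    )

if __name__ == "__main__":
    # Placeholder target address and NQN.
    nvme_connect("192.168.10.20", "nqn.2024-01.io.example:storage01")
```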
DMA (Direct Memory Access)
Lets devices move data to and from memory without the CPU copying it.
- Bypasses CPU for data transfer
- Reduces latency
- Increases throughput
- Used in GPU interconnects and storage access
- Enables GPUDirect for efficient data movement
- Critical for high-performance AI workloads
- Supports zero-copy transfers between GPU and network/storage
RDMA (Remote Direct Memory Access)
- Extends DMA across servers: direct access to a remote host's memory over the network
- With GPUDirect RDMA, this includes direct access to remote GPU memory
Traditional Networking
CPU handles:
- Packet processing
- Memory copying
- Interrupts
RDMA
- Bypasses CPU
- Direct memory access across hosts
- Reduces latency
- Reduces CPU utilization
- Increases throughput
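On Linux, RDMA-capable devices (InfiniBand HCAs and RoCE NICs alike) register under /sys/class/infiniband, so a quick check looks like:

```python
# Sketch: list RDMA-capable devices on a Linux host.
# Both InfiniBand HCAs and RoCE NICs appear under this sysfs path.
import os

def rdma_devices(path="/sys/class/infiniband"):
    try:
        return sorted(os.listdir(path))
    except FileNotFoundError:
        return []  # no RDMA stack / devices present

if __name__ == "__main__":
    devs = rdma_devices()
    print("RDMA devices:", devs if devs else "none found")
```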
InfiniBand vs Ethernet
1. Ethernet
- General-purpose networking, widely deployed
- Higher latency (~10–100 µs typical)
- Uses the TCP/IP stack
- Commodity hardware
- Widely supported
2. InfiniBand
High throughput and low latency with low CPU overhead, for both compute and storage traffic
- Ultra-low latency (1–2 µs)
- Uses a native RDMA stack to access remote memory directly without CPU involvement
- Used in large HPC / AI clusters: over 50% of HPC clusters use InfiniBand
- HCAs (InfiniBand network interface cards) offload RDMA operations in hardware
- Managed by a Subnet Manager (SM) such as OpenSM
3. RDMA over Converged Ethernet (RoCE)
RDMA + Ethernet: enables RDMA over standard Ethernet
- Open alternative to InfiniBand
- More flexible than InfiniBand
- Cheaper and enterprise-friendly
- Used in enterprise AI clusters
NVIDIA hardware:
- Spectrum switches (Ethernet) + BlueField DPUs support RoCE for high-performance Ethernet-based AI clusters
- Quantum-X800 InfiniBand switches for high-performance InfiniBand-based AI clusters
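To see why the latency gap matters, an illustrative ring all-reduce model using the latency figures above (the standard 2×(N−1) communication steps, each carrying 1/N of the message; the bandwidth and message size are assumptions, not benchmarks):

```python
# Illustrative: per-message latency dominates small collectives.
# Ring all-reduce does 2*(N-1) steps, each moving 1/N of the data.
def allreduce_time(n_ranks, msg_bytes, latency_s, bandwidth_Bps):
    steps = 2 * (n_ranks - 1)
    return steps * (latency_s + (msg_bytes / n_ranks) / bandwidth_Bps)

for name, lat in (("InfiniBand", 2e-6), ("Ethernet/TCP", 50e-6)):
    t = allreduce_time(n_ranks=64, msg_bytes=64_000,
                       latency_s=lat, bandwidth_Bps=40e9)
    print(f"{name:>12}: {t * 1e3:.2f} ms for a 64 KB all-reduce on 64 ranks")
```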
GPU Interconnects (Compute Fabric)
1. PCIe
- Standard connection
- Higher latency
- Limited bandwidth (16–32 GB/s)
- Not ideal for multi-GPU scaling
2. NVLink
Chip-to-chip interconnect: GPU to GPU inside the same node → NVLink
- High-speed GPU-to-GPU communication inside a server
- Up to 600 GB/s
- Faster than PCIe
- Enables multi-GPU scaling
- Uses NVSwitch for scale
3. NVSwitch Fabric
Connects multiple GPUs in large systems
- Enables full bandwidth communication across large GPU arrays
- Used in DGX systems (e.g., connecting 8× H100 GPUs within a node) that make up the DGX SuperPOD
- Provides non-blocking, high-speed interconnect for large multi-GPU systems
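A quick sketch for checking which GPU pairs in a node can use peer-to-peer paths (PyTorch; note that nvidia-smi topo -m shows whether a given pair actually rides NVLink/NVSwitch or PCIe):

```python
# Sketch: check GPU peer-to-peer reachability inside one node.
# P2P may run over NVLink/NVSwitch or PCIe depending on topology.
import torch

n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU{i} -> GPU{j}: P2P {'yes' if ok else 'no'}")
```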
4. GPUDirect RDMA
GPU to GPU across nodes → GPUDirect RDMA
- GPU memory ↔ remote GPU memory over the network
- No CPU involvement
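NCCL uses GPUDirect RDMA automatically when the fabric supports it; a hedged way to verify is to raise NCCL's log level before initialization and inspect the transport lines it prints (on capable systems these mention GDRDMA, though the exact wording varies by NCCL version):

```python
# Sketch: surface which transport NCCL selects. NCCL_DEBUG=INFO makes
# NCCL log its chosen paths at communicator setup; on GPUDirect-RDMA-
# capable fabrics the transport lines mention GDRDMA (wording varies).
import os
os.environ["NCCL_DEBUG"] = "INFO"
# Optional: NCCL_NET_GDR_LEVEL tunes when NCCL uses GPUDirect RDMA.

import torch.distributed as dist
dist.init_process_group(backend="nccl")  # then run a collective as usual
```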
5. GPUDirect Storage
- Storage ↔ GPU memory
- Avoids system memory bottleneck
- No CPU involvement
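A sketch of GPUDirect Storage from Python via kvikio, NVIDIA's binding for cuFile (an assumption for illustration: kvikio and cupy are installed, and the file path is a placeholder):

```python
# Sketch: read a file straight into GPU memory with kvikio
# (NVIDIA's Python binding for cuFile / GPUDirect Storage).
import cupy
import kvikio

buf = cupy.empty(1_000_000, dtype=cupy.uint8)   # destination in GPU memory
f = kvikio.CuFile("/data/shard-000.bin", "r")   # placeholder path
try:
    nbytes = f.read(buf)                        # DMA: storage -> GPU memory
    print(f"read {nbytes} bytes directly into GPU memory")
finally:
    f.close()
```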
NVIDIA Network Operator
Simplifies the provisioning and management of NVIDIA networking resources in a Kubernetes cluster.
