NCCL and Distributed GPU Communication: CUDA, AllReduce, Multi-GPU and AI Cluster Networking

Comprehensive overview of NVIDIA NCCL covering GPU-to-GPU communication, AllReduce operations, distributed AI training, CUDA integration, tensor synchronization, multi-node scaling, InfiniBand networking, and high performance communication for large-scale AI and HPC workloads.

Written by Hitesh Sahu, a passionate developer and blogger.

Tue May 19 2026

NCCL (NVIDIA Collective Communications Library) 🔗

NCCL is NVIDIA's high-performance communication library designed for fast GPU-to-GPU communication.

NCCL Communication Architecture

flowchart TD

    subgraph NCCL_Ring["NCCL Ring Communication"]

        A["GPU 0 🧮"]
        B["GPU 1 🧮"]
        C["GPU 2 🧮"]
        D["GPU 3 🧮"]

        A <--> B
        B <--> C
        C <--> D
        D <--> A
    end

    E["NCCL<br/>Collective Communication Layer 🔗"]

    E -. Manages AllReduce / Broadcast / AllGather .-> NCCL_Ring

NCCL often uses:

  • Ring topology
  • Tree topology
  • Hybrid communication patterns

depending on hardware and cluster size.
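
To make the ring pattern concrete, here is a minimal pure-Python simulation of a ring all-reduce (sum). It is only a sketch of the textbook two-phase algorithm (reduce-scatter, then all-gather) and assumes the buffer length is divisible by the number of ranks; NCCL's real implementation runs as CUDA kernels and is considerably more involved. The function name `ring_all_reduce` is made up for this illustration.

```python
def ring_all_reduce(buffers):
    """Simulate a sum all-reduce on a ring. buffers[r] is rank r's data."""
    n = len(buffers)
    size = len(buffers[0])
    assert size % n == 0, "sketch assumes length divisible by rank count"
    chunk = size // n
    data = [list(b) for b in buffers]  # work on copies

    def span(c):
        # Index range covered by chunk c
        return range(c * chunk, (c + 1) * chunk)

    # Phase 1: reduce-scatter. In each step every rank sends one chunk
    # to its right neighbour, which adds it to its own copy.
    for step in range(n - 1):
        sends = []
        for r in range(n):
            c = (r - step) % n
            sends.append((c, [data[r][i] for i in span(c)]))
        for r in range(n):
            dst = (r + 1) % n
            c, payload = sends[r]
            for off, i in enumerate(span(c)):
                data[dst][i] += payload[off]

    # Phase 2: all-gather. Each rank now owns one fully reduced chunk
    # and passes completed chunks around the ring.
    for step in range(n - 1):
        sends = []
        for r in range(n):
            c = (r + 1 - step) % n
            sends.append((c, [data[r][i] for i in span(c)]))
        for r in range(n):
            dst = (r + 1) % n
            c, payload = sends[r]
            for off, i in enumerate(span(c)):
                data[dst][i] = payload[off]

    return data
```

Each rank only ever talks to its right neighbour, which is why the per-link traffic stays roughly constant as more GPUs join the ring.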

Why NCCL Exists

Training large AI models requires GPUs to exchange data continuously.

Examples of exchanged data:

  • gradients
  • embeddings
  • activations
  • model parameters

Without optimized communication:

GPU compute becomes idle waiting for data transfer.

NCCL minimizes this bottleneck.

NCCL provides the following collective communication primitives:

  • Reduce
  • Gather
  • Scatter
  • ReduceScatter
  • AllReduce
  • AllGather
  • AlltoAll
  • Broadcast
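
The less familiar primitives in this list (Scatter, Gather, AlltoAll) are easiest to understand as data movement between ranks. The sketch below models each rank as a list entry in plain Python; the function names are illustrative only and are not NCCL API calls.

```python
def scatter(chunks_on_root):
    """Root holds one chunk per rank; rank i receives chunk i."""
    return [chunk for chunk in chunks_on_root]


def gather(value_per_rank):
    """Every rank contributes one value; the root ends up with all of them in rank order."""
    return list(value_per_rank)


def all_to_all(send):
    """send[i][j] is what rank i sends to rank j.

    Returns recv, where recv[j][i] == send[i][j] (a transpose across ranks).
    """
    n = len(send)
    return [[send[i][j] for i in range(n)] for j in range(n)]
```

For example, with two ranks `all_to_all([["a0", "a1"], ["b0", "b1"]])` hands rank 0 the items addressed to it (`"a0"` and `"b0"`) and rank 1 the rest.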

Why NCCL Matters

Modern AI training is often limited by:

Communication speed between GPUs

not raw compute.

NCCL helps scale:

  • from 1 GPU
  • to thousands of GPUs efficiently.

Without NCCL:

  • distributed training becomes slow
  • GPU utilization drops
  • scaling efficiency collapses

NCCL optimizes communication between GPUs across:

  • Single machine
  • Multiple machines/Nodes
  • NVLink
  • PCIe
  • InfiniBand
  • Ethernet clusters

Use Cases

It is heavily used in:

  • Distributed AI training
  • Multi-GPU inference
  • RAPIDS
  • PyTorch Distributed
  • TensorFlow Distributed
  • DeepSpeed
  • Megatron-LM
  • TensorRT-LLM

Common NCCL Operations

| Operation | Purpose |
|---|---|
| AllReduce | Aggregate gradients across GPUs |
| Broadcast | Send data from one GPU to all GPUs |
| Reduce | Combine data to one GPU |
| AllGather | Collect tensors from all GPUs |
| ReduceScatter | Reduce + distribute chunks |
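
The operations in the table can be described by what each rank holds before and after the call. The following is a plain-Python reference model with a sum reduction, where `per_rank[r]` is rank r's input buffer. It is meant as a sketch of the semantics, not of how NCCL executes them, and the function names are chosen for this example.

```python
def _elementwise_sum(per_rank):
    # Add the buffers of all ranks position by position
    return [sum(col) for col in zip(*per_rank)]


def broadcast(per_rank, root=0):
    """Every rank ends up with the root's buffer."""
    return [list(per_rank[root]) for _ in per_rank]


def reduce_to_root(per_rank, root=0):
    """Only the root receives the reduced result (None marks 'no result')."""
    total = _elementwise_sum(per_rank)
    return [total if r == root else None for r in range(len(per_rank))]


def all_reduce(per_rank):
    """Every rank receives the reduced result."""
    total = _elementwise_sum(per_rank)
    return [list(total) for _ in per_rank]


def all_gather(per_rank):
    """Every rank receives the concatenation of all buffers, in rank order."""
    gathered = [x for buf in per_rank for x in buf]
    return [list(gathered) for _ in per_rank]


def reduce_scatter(per_rank):
    """Reduce, then give rank r only the r-th chunk of the result."""
    n = len(per_rank)
    total = _elementwise_sum(per_rank)
    chunk = len(total) // n  # assumes the length divides evenly
    return [total[r * chunk:(r + 1) * chunk] for r in range(n)]
```

Note that in this model `all_reduce` gives the same result as `reduce_scatter` followed by `all_gather`, which is the decomposition ring-based implementations typically rely on.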

Most Important Operation: AllReduce

During distributed training:

  1. Each GPU computes gradients locally
  2. NCCL synchronizes gradients
  3. All GPUs receive the same updated values
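
In PyTorch, the three steps above can be sketched by hand with `torch.distributed.all_reduce`. This assumes a process group has already been initialized and that every rank holds an identical copy of the model; the helper name `average_gradients` is made up for this example. In practice `DistributedDataParallel` performs an equivalent synchronization automatically, so this is for illustration rather than something to copy into a training loop.

```python
import torch
import torch.distributed as dist


def average_gradients(model: torch.nn.Module) -> None:
    """Sum each gradient across ranks, then divide by the number of ranks."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is None:
            continue  # parameter did not take part in the backward pass
        # Step 2: synchronize. After this call every rank holds the summed gradient.
        dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
        # Step 3: turn the sum into an average so all ranks apply the same update.
        param.grad.div_(world_size)
```

The helper would be called after `loss.backward()` and before `optimizer.step()`.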

AllReduce Example

flowchart LR

    A["GPU 0 🧮 <br/>Gradients"]
    B["GPU 1 🧮 <br/>Gradients"]
    C["GPU 2 🧮 <br/>Gradients"]

    D["NCCL AllReduce 🔗"]

    E["Shared Averaged<br/>Gradients"]

    A --> D
    B --> D
    C --> D

    D --> E

Hardware NCCL Can Use

| Interconnect | Speed |
|---|---|
| PCIe | Standard |
| NVLink | Very fast |
| NVSwitch | Extremely fast |
| InfiniBand | Multi-node high-speed networking |
| Ethernet | Slower fallback |
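
To see why the interconnect matters, a rough cost model helps. For a ring all-reduce, each GPU sends about 2(N-1)/N times the message size, so a lower bound on the transfer time is that traffic divided by the link bandwidth. The sketch below ignores latency and protocol overhead, and the bandwidth values in the example calls are hypothetical placeholders, not measurements of any particular hardware.

```python
def ring_all_reduce_seconds(message_bytes, n_gpus, link_bytes_per_s):
    """Bandwidth-only estimate of ring all-reduce time (latency ignored)."""
    if n_gpus < 2:
        return 0.0  # nothing to exchange with a single GPU
    traffic_per_gpu = 2 * (n_gpus - 1) / n_gpus * message_bytes
    return traffic_per_gpu / link_bytes_per_s


# Hypothetical comparison: 1 GB of gradients on 8 GPUs over a slower and a faster link
grad_bytes = 1e9
slow_link = ring_all_reduce_seconds(grad_bytes, 8, 25e9)   # placeholder bandwidth
fast_link = ring_all_reduce_seconds(grad_bytes, 8, 300e9)  # placeholder bandwidth
```

Under this model the estimated time scales inversely with link bandwidth, while adding GPUs barely changes the per-GPU traffic.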

NCCL + CUDA

NCCL is deeply integrated with CUDA.

Communication operations execute directly on GPUs using CUDA streams.

Benefits:

  • asynchronous execution
  • overlap communication with computation
  • reduced CPU overhead
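
At the PyTorch level, overlapping communication with computation can be sketched with the `async_op=True` flag of `torch.distributed.all_reduce`, which returns a work handle instead of waiting. The helper name `all_reduce_overlapped` and the `other_work` callback are hypothetical and exist only for this example; an initialized process group is assumed.

```python
import torch
import torch.distributed as dist


def all_reduce_overlapped(grad: torch.Tensor, other_work):
    """Start an all-reduce, run independent work, then wait for the result."""
    # Launch the collective without waiting for it to finish
    handle = dist.all_reduce(grad, op=dist.ReduceOp.SUM, async_op=True)
    # Anything that does not read or write `grad` can run in the meantime
    result = other_work()
    # Ensure the collective is complete before `grad` is used again
    handle.wait()
    return result
```

With the NCCL backend, `wait()` is expected to order the current CUDA stream after the collective rather than stall the CPU, which is what allows the GPU to keep computing while data is in flight.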

NCCL in Distributed Training

Typical stack:

flowchart TD

    A["PyTorch Distributed 🐍"]
        --> B["NCCL Backend 🔗"]

    B --> C["CUDA Runtime 📟"]

    C --> D["NVIDIA GPUs 🧮"]

Example:


import torch.distributed as dist

dist.init_process_group(backend="nccl")
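
A slightly fuller sketch of the same initialization is shown below. It assumes the script is started by a launcher such as `torchrun`, which sets `RANK`, `WORLD_SIZE`, `LOCAL_RANK`, `MASTER_ADDR` and `MASTER_PORT` for each process. The function name `setup_distributed` is made up for this example, and the Gloo fallback is an addition so that the snippet also runs on machines without GPUs.

```python
import os

import torch
import torch.distributed as dist


def setup_distributed() -> tuple[int, int, int]:
    """Initialize the default process group from launcher-provided environment variables."""
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    if torch.cuda.is_available():
        # Bind this process to one GPU before NCCL creates its communicator
        torch.cuda.set_device(local_rank)
        backend = "nccl"
    else:
        # CPU-only fallback so the sketch can be tried without GPUs
        backend = "gloo"
    # The default env:// init method reads RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT
    dist.init_process_group(backend=backend)
    return dist.get_rank(), dist.get_world_size(), local_rank
```

A typical single-node launch would look like `torchrun --nproc_per_node=4 train.py`, where `train.py` is a placeholder script that calls `setup_distributed()` at startup.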

NCCL + Multi-Node Training

Example cluster:

flowchart LR

    subgraph N1["Node 1 🧾"]
        A["GPU 0 🧮"]
        B["GPU 1 🧮"]

        A <--> B
    end

    subgraph N2["Node 2 🧾"]
        C["GPU 0 🧮"]
        D["GPU 1 🧮"]

        C <--> D
    end

    B <--> |"NCCL / InfiniBand 🔗"| C

Cross-node communication often uses:

  • InfiniBand
  • RoCE
  • High-speed Ethernet
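
Which transport NCCL picks across nodes can be influenced through environment variables such as `NCCL_DEBUG`, `NCCL_IB_DISABLE` and `NCCL_SOCKET_IFNAME`. The helper below is a small sketch that assembles such settings; the function name `nccl_env` and the interface name `eth0` in the usage note are placeholders, and the right values depend entirely on the cluster.

```python
from typing import Dict, Optional


def nccl_env(use_infiniband: bool,
             socket_ifname: Optional[str] = None,
             debug: bool = False) -> Dict[str, str]:
    """Build NCCL environment variables for a multi-node job (illustrative helper)."""
    env: Dict[str, str] = {}
    if debug:
        # Ask NCCL to log which transports and topology it selected
        env["NCCL_DEBUG"] = "INFO"
    if not use_infiniband:
        # Skip the InfiniBand/RoCE transport and fall back to IP sockets
        env["NCCL_IB_DISABLE"] = "1"
    if socket_ifname is not None:
        # Restrict socket traffic to a specific network interface
        env["NCCL_SOCKET_IFNAME"] = socket_ifname
    return env
```

The result would be applied with `os.environ.update(nccl_env(False, "eth0", debug=True))` before the process group is initialized, since NCCL reads these variables when it sets up its communicators.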

NCCL vs MPI

MPI (Message Passing Interface) is a standardized message-passing API for distributed computing across multiple processes, CPUs, and machines.

  • MPI is broader: it targets general-purpose parallel computing.
  • NCCL is specialized for GPU deep learning workloads.

MPI is one of the core technologies behind supercomputers, and it underpins:

  • High Performance Computing (HPC)
  • Scientific simulations
  • Distributed CPU clusters

MPI allows processes running on different machines to:

  • send messages
  • exchange data
  • synchronize computation
  • coordinate parallel workloads

Difference between MPI & NCCL

| Feature | NCCL | MPI |
|---|---|---|
| GPU-aware | YES | Partial / implementation dependent |
| Optimized for AI | YES | NO |
| CPU distributed workloads | Limited | Excellent |
| HPC general-purpose | NO | YES |
| Deep learning focus | Strong | Moderate |
| GPU communication optimization | Excellent | Limited |
| Ease of integration with PyTorch/TensorFlow | Native | Requires wrappers |
| Multi-node training support | Excellent | Excellent |
| Typical use case | Deep learning | Scientific computing / HPC |

flowchart LR

    A["Deep Learning Framework<br/>PyTorch / TensorFlow"]
        --> B["NCCL"]

    C["HPC Applications"]
        --> D["MPI"]

    B --> E["NVIDIA GPUs"]

    D --> F["CPU Clusters / Supercomputers"]

NCCL Performance Optimizations

NCCL improves performance using:

  • topology-aware routing
  • peer-to-peer GPU transfers
  • kernel fusion
  • pipelined communication
  • overlapping compute + communication
  • zero-copy transfers where possible

NCCL in Modern AI Systems

NCCL powers:

  • LLM training
  • Distributed inference
  • Tensor parallelism
  • Pipeline parallelism
  • Data parallelism
  • Mixture-of-Experts routing
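
As one illustration of how a parallelism strategy maps onto a collective, consider a row-split linear layer, one common form of tensor parallelism: each GPU holds a slice of the input and the matching rows of the weight matrix, computes a partial output, and an all-reduce (sum) combines the partials. The pure-Python sketch below checks that arithmetic on small lists; it is a toy model under those assumptions, not how any specific framework implements it.

```python
def matvec(x, w):
    """Multiply a length-k vector x by a k x m matrix w (given as a list of rows)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def row_parallel_matvec(x, w, n_ranks):
    """Split x and the rows of w across ranks, then sum the partial results."""
    k = len(x)
    assert k % n_ranks == 0, "sketch assumes the inner dimension divides evenly"
    shard = k // n_ranks
    # Each rank multiplies only its own slice of x with its own rows of w
    partials = [
        matvec(x[r * shard:(r + 1) * shard], w[r * shard:(r + 1) * shard])
        for r in range(n_ranks)
    ]
    # All-reduce (sum): every rank would end up holding this same vector
    return [sum(col) for col in zip(*partials)]
```

Because the partial outputs add up to the full product, the only communication this layer needs is a single all-reduce per forward pass.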

Frameworks using NCCL:

  • PyTorch DDP
  • DeepSpeed
  • Megatron-LM
  • Ray Train
  • RAPIDS
  • TensorRT-LLM
