AI Programming Model
Overview of NVIDIA's AI programming model, including core libraries (CUDA, NCCL, cuDNN), training vs inference workloads, and compute scaling models (data parallelism and model parallelism) for AI infrastructure.
Core Libraries & Frameworks
1. CUDA (Compute Unified Device Architecture)
Parallel computing platform enabling GPU programming.
- Thousands of parallel threads
- Native C/C++/Python integration
- General-purpose GPU computing
CUDA parallel model:
- Break the problem into many small, identical tasks
- Launch thousands of threads (workers) to run them simultaneously
- Collect the results when every thread finishes
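The launch-and-collect model above can be sketched in plain Python (a stand-in for a real CUDA kernel launch; the names `kernel` and `launch` are illustrative, not CUDA API):

```python
from concurrent.futures import ThreadPoolExecutor

def kernel(x):
    """One small, identical task per element -- the CUDA-style 'thread'."""
    return x * x

def launch(data, workers=8):
    """Launch workers over all elements; collect results when all finish."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(kernel, data))  # map preserves input order

print(launch([1, 2, 3, 4]))  # -> [1, 4, 9, 16]
```

On a GPU the same pattern runs with thousands of hardware threads instead of a small thread pool, but the mental model is identical: one tiny function applied to every element at once.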
2. NCCL (NVIDIA Collective Communications Library)
NCCL implements both collective communication and point-to-point send/receive primitives.
- pronounced “Nickel”
- Used by PyTorch & TensorFlow
- Not a full-blown parallel programming framework; a library focused on accelerating inter-GPU communication
Provides the following collective communication primitives:
- Reduce
- Gather
- Scatter
- ReduceScatter
- AllReduce
- AllGather
- AlltoAll
- Broadcast
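What two of these primitives compute can be shown with a pure-Python sketch (this models the *result* of the collectives, not NCCL's ring/tree algorithms; function names are illustrative):

```python
def all_reduce(rank_buffers):
    """Sum-AllReduce: every rank ends up with the elementwise sum
    of all ranks' buffers."""
    total = [sum(vals) for vals in zip(*rank_buffers)]
    return [list(total) for _ in rank_buffers]  # every rank gets the sum

def all_gather(rank_buffers):
    """AllGather: every rank receives the concatenation of all buffers."""
    gathered = [x for buf in rank_buffers for x in buf]
    return [list(gathered) for _ in rank_buffers]

# Two "GPUs" holding gradient shards [1, 2] and [3, 4]:
print(all_reduce([[1, 2], [3, 4]]))  # -> [[4, 6], [4, 6]]
```

AllReduce is the workhorse of data-parallel training: each GPU computes local gradients, and AllReduce gives every GPU the summed (or averaged) gradient.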
3. cuDNN (CUDA Deep Neural Network library)
GPU-accelerated library for deep learning primitives.
Provides highly tuned implementations for standard routines such as:
- forward and backward convolution
- attention
- matmul
- pooling
- normalization
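To make two of these routines concrete, here are naive pure-Python versions of a 1-D convolution forward pass and max pooling (cuDNN provides heavily optimized GPU implementations of the same math; this is only a reference sketch):

```python
def conv1d_forward(x, w):
    """Naive 1-D 'valid' convolution (cross-correlation, the deep
    learning convention): slide the kernel w across the input x."""
    k = len(w)
    return [sum(x[i + j] * w[j] for j in range(k))
            for i in range(len(x) - k + 1)]

def max_pool1d(x, size=2, stride=2):
    """Non-overlapping 1-D max pooling: keep the max of each window."""
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, stride)]

print(conv1d_forward([1, 2, 3, 4], [1, 0, -1]))  # -> [-2, -2]
print(max_pool1d([1, 3, 2, 5]))                  # -> [3, 5]
```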
Training vs Inference
AI Workflow:
Data Preparation
|--> Model Training
|--> Optimization
|--> Inference/Deployment
Model Training
Compute-intensive
- Forward + backward pass
- Multi-GPU scaling
- High memory + compute demand
- Uses NCCL, NVLink, RDMA
Model Inference
Latency-optimized
- Forward pass only
- Lower latency focus
- Often containerized (Kubernetes)
| Training | Inference |
|---|---|
| Model learning | Model usage |
| High compute + memory | Lower latency focus |
| Batch workloads | Real-time workloads |
| Multi-GPU scaling | Edge + cloud deployment |
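The forward/backward distinction in the table can be sketched on a one-parameter model `y = w * x` (a minimal illustration; the function names are made up):

```python
def forward(w, x):
    """Inference: forward pass only."""
    return w * x

def train_step(w, x, y, lr=0.1):
    """Training: forward pass + backward pass + parameter update."""
    pred = forward(w, x)           # forward
    grad = 2 * (pred - y) * x      # backward: d/dw of (pred - y)^2
    return w - lr * grad           # gradient descent update

w = 0.0
for _ in range(50):                # training loop: forward + backward
    w = train_step(w, x=2.0, y=6.0)   # learn w such that w * 2 = 6
print(round(w, 3))                 # converges toward 3.0
print(forward(w, 5.0))             # inference: forward pass only
```

Training repeats the expensive forward+backward cycle over large batches; deployment only ever calls `forward`, which is why inference can be optimized for latency rather than throughput.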
Compute Scaling Models
1. Data Parallelism
- Same model on multiple GPUs
- Split dataset across GPUs
2. Model Parallelism
- Model split across GPUs
- Used for very large models
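Data parallelism can be sketched end-to-end in plain Python: the same weight lives on every "GPU", the dataset is sharded, and local gradients are averaged before a shared update (the averaging step is what NCCL's AllReduce does on real hardware; all names here are illustrative):

```python
def grad(w, x, y):
    """Gradient of squared error for the model y = w * x on one sample."""
    return 2 * (w * x - y) * x

def data_parallel_step(w, dataset, n_gpus=2, lr=0.01):
    """One step of data parallelism: same model (w) on every 'GPU',
    dataset split across them, local gradients averaged (AllReduce)
    before a single shared update."""
    shards = [dataset[i::n_gpus] for i in range(n_gpus)]   # split dataset
    local = [sum(grad(w, x, y) for x, y in s) / len(s) for s in shards]
    g = sum(local) / n_gpus        # AllReduce-style gradient average
    return w - lr * g              # every replica applies the same update

data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]  # y = 3x
w = 0.0
for _ in range(200):
    w = data_parallel_step(w, data)
print(round(w, 2))  # converges toward 3.0
```

Model parallelism, by contrast, splits the *model itself* (layers or tensor slices) across GPUs, so a single forward pass flows through several devices; it is used when the model no longer fits in one GPU's memory.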
