
AI Programming Model

Overview of NVIDIA's AI programming model, including core libraries (CUDA, NCCL, cuDNN), training vs inference workloads, and compute scaling models (data parallelism and model parallelism) for AI infrastructure.

Written by Hitesh Sahu, a passionate developer and blogger.

Thu Feb 19 2026


Core Libraries & Frameworks

1. CUDA (Compute Unified Device Architecture)

Parallel computing platform enabling GPU programming.

  • Thousands of parallel threads
  • Native C/C++/Python integration
  • General-purpose GPU computing

CUDA parallel model:

  • Break the problem into many small, identical tasks
  • Launch thousands of threads (workers) to run them simultaneously
  • Collect the results when all threads finish
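The steps above can be sketched on the CPU with a thread pool. This is only an analogy for the CUDA model, not CUDA code: the `parallel_saxpy` name and the chunking scheme are illustrative choices, with each chunk standing in for a block of GPU threads computing `y[i] = a*x[i] + y[i]` (the classic SAXPY kernel).

```python
from concurrent.futures import ThreadPoolExecutor

def saxpy_chunk(args):
    """One 'worker': compute a*x[i] + y[i] for a slice of indices."""
    a, x, y, lo, hi = args
    return [a * x[i] + y[i] for i in range(lo, hi)]

def parallel_saxpy(a, x, y, workers=4):
    """Break the problem into chunks, run them concurrently, collect results."""
    n = len(x)
    step = (n + workers - 1) // workers
    chunks = [(a, x, y, i, min(i + step, n)) for i in range(0, n, step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(saxpy_chunk, chunks)   # launch the workers
    out = []
    for part in parts:                          # collect in order when done
        out.extend(part)
    return out
```

On a GPU the same decomposition would launch thousands of lightweight threads instead of a handful of OS threads.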

2. NCCL (NVIDIA Collective Communications Library)

NCCL implements both collective communication and point-to-point send/receive primitives.

  • Pronounced “nickel”
  • Used by PyTorch & TensorFlow
  • Not a full-blown parallel programming framework; rather, a library focused on accelerating inter-GPU communication

Provides the following collective communication primitives:

  • Reduce
  • Gather
  • Scatter
  • ReduceScatter
  • AllReduce
  • AllGather
  • All-to-All
  • Broadcast
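The semantics of two of these primitives can be illustrated with a toy single-process simulation. This is not the NCCL API (which operates on GPU buffers across ranks); the function names and list-of-lists representation of per-rank buffers are assumptions for the sketch.

```python
def all_reduce(buffers):
    """Toy AllReduce with a sum op: every rank ends up holding the
    elementwise sum of all ranks' input buffers."""
    total = [sum(vals) for vals in zip(*buffers)]
    return [list(total) for _ in buffers]

def all_gather(buffers):
    """Toy AllGather: every rank ends up holding the concatenation
    of all ranks' input buffers."""
    gathered = [v for buf in buffers for v in buf]
    return [list(gathered) for _ in buffers]
```

AllReduce is the workhorse of distributed training: it is how per-GPU gradients are summed so every replica sees the same update.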

3. cuDNN (CUDA Deep Neural Network library)

GPU-accelerated library for deep learning primitives.

Provides highly tuned implementations for standard routines such as:

  • forward and backward convolution
  • attention
  • matmul
  • pooling
  • normalization
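To make two of these routines concrete, here are naive reference versions of a 1-D convolution (as cross-correlation) and non-overlapping max pooling. These are illustrative CPU sketches of what cuDNN computes, not its API; the function names and 1-D shapes are assumptions for brevity.

```python
def conv1d_forward(x, w):
    """Naive 'valid' 1-D convolution (cross-correlation) of signal x
    with kernel w: out[i] = sum_j x[i+j] * w[j]."""
    k = len(w)
    return [sum(x[i + j] * w[j] for j in range(k))
            for i in range(len(x) - k + 1)]

def max_pool1d(x, size):
    """Naive non-overlapping max pooling with the given window size."""
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, size)]
```

cuDNN provides the same mathematical operations as heavily tuned GPU kernels, often with multiple algorithm choices per operation.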

Training vs Inference

AI Workflow:

 Data Preparation 
  |--> Model Training 
     |--> Optimization 
         |--> Inference/Deployment

Model Training

Compute-intensive

  • Forward + backward pass
  • Multi-GPU scaling
  • High memory + compute demand
  • Uses NCCL, NVLink, RDMA
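The forward + backward cycle that makes training compute-intensive can be sketched with a minimal gradient-descent loop. The model (a single weight fitting y = w·x by mean squared error) and all hyperparameters here are illustrative assumptions, not part of the article.

```python
def train_linear(xs, ys, lr=0.01, epochs=200):
    """Minimal training loop: forward pass, gradient (backward pass),
    parameter update -- repeated every epoch."""
    w = 0.0
    for _ in range(epochs):
        preds = [w * x for x in xs]                 # forward pass
        grad = sum(2 * (p - y) * x                  # backward pass:
                   for p, y, x in zip(preds, ys, xs)) / len(xs)  # dMSE/dw
        w -= lr * grad                              # update step
    return w
```

Inference, by contrast, runs only the forward-pass line, which is why it is so much cheaper per example.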

Model Inference

Latency-optimized

  • Forward pass only
  • Lower latency focus
  • Often containerized (Kubernetes)

Training               | Inference
---------------------- | -----------------------
Model learning         | Model usage
High compute + memory  | Lower latency focus
Batch workloads        | Real-time workloads
Multi-GPU scaling      | Edge + cloud deployment

Compute Scaling Models

1. Data Parallelism

  • Same model on multiple GPUs
  • Split dataset across GPUs
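A single-process sketch of one data-parallel step, under stated assumptions: the model is a single weight with an MSE gradient, the `data_parallel_grad` name and round-robin sharding are illustrative, and the final averaging stands in for the AllReduce that NCCL performs across real GPUs.

```python
def mse_grad(w, batch):
    """Mean gradient of (w*x - y)^2 with respect to w over one shard."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def data_parallel_grad(w, full_batch, num_replicas=2):
    """One data-parallel step: shard the batch across model replicas,
    compute each shard's gradient locally, then average the gradients
    (the AllReduce step) so every replica applies the same update."""
    shards = [full_batch[i::num_replicas] for i in range(num_replicas)]
    local_grads = [mse_grad(w, shard) for shard in shards]
    return sum(local_grads) / len(local_grads)
```

With equal-size shards, the averaged gradient matches the gradient computed on the full batch, which is why data parallelism preserves training semantics.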

2. Model Parallelism

  • Model split across GPUs
  • Used for very large models
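Model parallelism can be sketched the same way: the layers themselves are split across devices, and activations cross the interconnect between them. Everything here is assumed for illustration; `layer` is a stand-in for a real neural-network layer (just a scalar multiply), and "device 0"/"device 1" are plain Python loops.

```python
def layer(w, xs):
    """Hypothetical layer: scale every activation by w."""
    return [w * x for x in xs]

def model_parallel_forward(xs, device0_weights, device1_weights):
    """Forward pass split across two 'devices': the first layers run on
    device 0, the rest on device 1, with activations handed over between."""
    h = xs
    for w in device0_weights:   # layers resident on GPU 0
        h = layer(w, h)
    # activations would cross NVLink / PCIe to the second GPU here
    for w in device1_weights:   # layers resident on GPU 1
        h = layer(w, h)
    return h
```

The handover point is the key difference from data parallelism: instead of exchanging gradients, the devices exchange activations, so a model too large for one GPU's memory can still run.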
