RAPIDS and GPU Accelerated Data Science: cuDF, cuML, CUDA, NCCL and Distributed AI Pipelines

Comprehensive overview of the RAPIDS ecosystem covering GPU accelerated DataFrames, machine learning, graph analytics, CUDA execution, distributed computing with Dask and NCCL, TensorRT integration, and large-scale AI data processing pipelines on NVIDIA GPUs.

Written by Hitesh Sahu, a passionate developer and blogger.

Tue May 19 2026


NVIDIA RAPIDS

How RAPIDS Works Under the Hood

Data stays on the GPU

One of RAPIDS' biggest advantages is minimizing data movement.

Traditional workflows often look like this:

flowchart TD

    A["Disk 🛢"]
        --> B["CPU RAM"]
        --> C["GPU 🧮"]
        --> D["CPU 🧾"]
        --> E["GPU 🧮"]

RAPIDS pipelines are closer to:

flowchart TD

    A["Disk 🛢"]
        --> B["GPU Memory 📼"]
        --> C["GPU Processing 🧮"]
        --> D["GPU Training"]

This avoids repeated PCIe transfers between host and device memory, which often take longer than the GPU computation itself.
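To make the cost of those round trips concrete, here is a back-of-the-envelope sketch in plain Python. The bandwidth figures and the 8 GB working set are illustrative assumptions, not measurements, and `transfer_seconds` is a hypothetical helper rather than part of any RAPIDS API.

```python
def transfer_seconds(num_bytes: int, bandwidth_gb_per_s: float) -> float:
    """Time to move num_bytes over a link (1 GB = 1e9 bytes)."""
    return num_bytes / (bandwidth_gb_per_s * 1e9)


# Assumed, illustrative bandwidths (not benchmarks):
PCIE_GB_PER_S = 25.0     # what a PCIe 4.0 x16 link might sustain in practice
VRAM_GB_PER_S = 1000.0   # on-device memory bandwidth of a data-center GPU

dataset_bytes = 8 * 10**9  # hypothetical 8 GB working set

one_hop = transfer_seconds(dataset_bytes, PCIE_GB_PER_S)
on_device_pass = transfer_seconds(dataset_bytes, VRAM_GB_PER_S)

# Traditional flow: CPU RAM -> GPU -> CPU -> GPU crosses PCIe three times.
traditional = 3 * one_hop
# GPU-resident flow: one load into GPU memory, then the data stays there.
gpu_resident = 1 * one_hop

print(f"one PCIe hop:          {one_hop:.3f} s")
print(f"traditional (3 hops):  {traditional:.3f} s")
print(f"GPU-resident (1 hop):  {gpu_resident:.3f} s")
print(f"one pass over VRAM:    {on_device_pass:.3f} s")
```

Under these assumptions each PCIe hop costs about 0.32 s, while a full pass over the same data in GPU memory costs about 0.008 s, so the extra hops dominate the traditional pipeline.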

Workflow comparison

| Traditional CPU Workflow | RAPIDS GPU Workflow |
|---|---|
| Few CPU cores | Thousands of CUDA cores |
| Frequent memory transfers | Data remains on GPU |
| Sequential execution | Massive parallelism |
| Slower for large datasets | Optimized for big data + AI |

Example pipeline:


import cudf
from cuml.linear_model import LinearRegression

# Load data into GPU memory
gdf = cudf.read_parquet("train.parquet")

X = gdf[["feature1", "feature2"]]
y = gdf["target"]

# Train directly on GPU data
model = LinearRegression()
model.fit(X, y)

Typical RAPIDS Use Cases

  • Large-scale ETL pipelines
  • Feature engineering
  • Recommendation systems
  • Fraud detection
  • Real-time analytics
  • Graph analytics
  • GPU-accelerated ML training
  • LLM preprocessing pipelines

RAPIDS Architecture Overview

flowchart TD

    A["Python API<br/>cuDF / cuML"]
        --> B["CUDA Kernels 📟<br/>Parallel Compute"]

    B --> C["GPU Memory (VRAM) 📼"]

    C --> D["NVIDIA GPU 🧮"]

Data is loaded directly into GPU memory

RAPIDS uses the GPU-accelerated readers in cuDF to parse formats such as CSV, Parquet, ORC, and JSON on the GPU and place the decoded columns directly in GPU memory.

Because the decoded table does not have to be built in host memory first, this minimizes expensive CPU ↔ GPU memory copies and reduces ingestion bottlenecks.

import cudf

# Load CSV directly into GPU memory
gdf = cudf.read_csv("large_dataset.csv")

K-Means Example

With CPU


# CPU (Pandas + NumPy + scikit-learn)

import pandas as pd
import numpy as np
from sklearn.cluster import KMeans

# Create DataFrame
df = pd.DataFrame({
    "x": np.random.rand(1000),
    "y": np.random.rand(1000)
})

# Train ML model
model = KMeans(n_clusters=3, random_state=42)
model.fit(df)

print(model.labels_[:10])

With GPU and CUDA

# GPU (RAPIDS: cuDF + CuPy + cuML)

import cudf
import cupy as cp
from cuml.cluster import KMeans

# Create GPU DataFrame
gdf = cudf.DataFrame({
    "x": cp.random.rand(1000),
    "y": cp.random.rand(1000)
})

# Train GPU-accelerated ML model
model = KMeans(n_clusters=3, random_state=42)
model.fit(gdf)

print(model.labels_[:10])

Graph Analytics Example

With CPU


# CPU: NetworkX
import networkx as nx

G = nx.karate_club_graph()
pagerank_scores = nx.pagerank(G)

print(list(pagerank_scores.items())[:5])

With GPU


# GPU: cuGraph

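import cugraph
from cugraph.datasets import karate

# Load the Karate Club graph onto the GPU
# (download=True fetches the dataset file if it is not cached locally)
G = karate.get_graph(download=True)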

# Run PageRank on GPU
pagerank_df = cugraph.pagerank(G)

print(pagerank_df.head())

GPU-native operations

Libraries such as cuDF, cuML, and cuGraph execute operations directly on the GPU.

Instead of using a few CPU cores, RAPIDS distributes work across thousands of CUDA cores simultaneously.

GPU-Equivalent Libraries

| Category | Common Python (CPU) | RAPIDS (GPU) |
|---|---|---|
| DataFrames | Pandas | cuDF |
| Arrays | NumPy | CuPy |
| Data Ingestion | Pandas / PyArrow | cuIO |
| Machine Learning | scikit-learn | cuML |
| Graph Analytics | NetworkX | cuGraph |

Typical accelerated operations include:

  • Filtering
  • GroupBy aggregations
  • Sorting
  • Joins
  • Machine learning training
  • Graph traversal algorithms

# GPU DataFrame filtering
filtered = gdf[gdf["sales"] > 1000]

# GPU aggregation
summary = gdf.groupby("region").sales.mean()
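The snippet above assumes an existing `gdf` with `sales` and `region` columns. The following self-contained sketch builds a small, hypothetical table so the same filter and aggregation can be tried end to end. It imports cuDF when available and otherwise falls back to pandas, which accepts the same calls for this case; the data values and the `xdf` alias are made up for illustration.

```python
try:
    import cudf as xdf    # GPU path: needs an NVIDIA GPU and a RAPIDS install
except ImportError:
    import pandas as xdf  # CPU fallback exposing the same calls used below

# Hypothetical sales data; column names follow the snippet above
df = xdf.DataFrame({
    "region": ["north", "south", "north", "south", "east"],
    "sales": [1200, 800, 1800, 1500, 400],
})

# Boolean-mask filtering: keep rows with sales above 1000
filtered = df[df["sales"] > 1000]

# GroupBy aggregation: mean sales per region
summary = df.groupby("region")["sales"].mean()

print(filtered)
print(summary)
```

Because the calls are identical on both back ends, a pipeline can often be prototyped on a laptop with pandas and moved to cuDF later, although cuDF does not cover every pandas feature.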

Parallel execution with CUDA

RAPIDS is built on NVIDIA CUDA.

CUDA enables GPUs to launch thousands of lightweight threads in parallel.

In broad terms:

  • CPUs → optimized for sequential tasks
  • GPUs → optimized for massively parallel workloads

Because each row can be handled by its own thread, a GPU can work through millions of rows in a few parallel waves instead of one row at a time.


import cupy as cp

# Array lives on GPU
arr = cp.random.rand(10_000_000)

# Parallel GPU computation
result = cp.sqrt(arr)
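To show how an element-wise operation like the square root above maps onto CUDA threads, here is a plain-Python model of a one-dimensional kernel launch. It replays sequentially what the GPU threads would do in parallel. The function names and the block size of 256 are assumptions chosen for illustration; CuPy picks its own launch configuration internally.

```python
import math


def launch_config(n_elements: int, threads_per_block: int = 256):
    """Return (blocks, threads_per_block) so every element gets a thread."""
    blocks = (n_elements + threads_per_block - 1) // threads_per_block  # ceiling division
    return blocks, threads_per_block


def global_index(block_idx: int, thread_idx: int, block_dim: int) -> int:
    # Same formula a CUDA kernel uses: blockIdx.x * blockDim.x + threadIdx.x
    return block_idx * block_dim + thread_idx


def simulated_sqrt_kernel(values, threads_per_block: int = 256):
    """Sequentially replay the work each GPU thread would do in parallel."""
    n = len(values)
    blocks, block_dim = launch_config(n, threads_per_block)
    out = [0.0] * n
    for b in range(blocks):
        for t in range(block_dim):
            i = global_index(b, t, block_dim)
            if i < n:  # bounds guard: the last block may have spare threads
                out[i] = math.sqrt(values[i])
    return out


blocks, tpb = launch_config(10_000_000)
print(f"{blocks} blocks x {tpb} threads cover 10,000,000 elements")
print(simulated_sqrt_kernel([4.0, 9.0, 16.0], threads_per_block=2))
```

With these assumed settings, 10 million elements need 39,063 blocks of 256 threads, and the bounds guard keeps the unused threads in the last block from writing past the end of the array.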

Optional scaling

Multi-GPU with Dask + RAPIDS

When a dataset exceeds a single GPU's memory, RAPIDS can distribute workloads across multiple GPUs using Dask.

from dask_cuda import LocalCUDACluster
from dask.distributed import Client

cluster = LocalCUDACluster()
client = Client(cluster)

Benefits:

  • Parallel processing across GPUs
  • Larger-than-memory datasets
  • Distributed ML training
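The idea behind larger-than-memory processing can be sketched without a cluster: split the dataset into partitions that each fit in GPU memory, then spread the partitions over the available GPUs. The helpers below are hypothetical and only model that idea. Dask derives partitions from file layout and chunk-size settings and places tasks dynamically, so this is not its actual algorithm, and the 50% usable-memory fraction is an assumption.

```python
def partitions_needed(dataset_bytes: int, gpu_memory_bytes: int,
                      usable_fraction: float = 0.5) -> int:
    """Number of partitions so each fits in the assumed usable share of one GPU."""
    budget = int(gpu_memory_bytes * usable_fraction)
    return max(1, -(-dataset_bytes // budget))  # ceiling division


def partition_rows(n_rows: int, n_partitions: int):
    """Split [0, n_rows) into contiguous, near-equal (start, stop) ranges."""
    base, extra = divmod(n_rows, n_partitions)
    bounds, start = [], 0
    for p in range(n_partitions):
        size = base + (1 if p < extra else 0)  # first `extra` partitions get one more row
        bounds.append((start, start + size))
        start += size
    return bounds


def assign_round_robin(n_partitions: int, n_gpus: int):
    """Map each partition index to a GPU index in round-robin order."""
    return {p: p % n_gpus for p in range(n_partitions)}


# Hypothetical numbers: a 100 GB dataset, 40 GB GPUs, two GPUs in the machine
n_parts = partitions_needed(100 * 10**9, 40 * 10**9)
print(n_parts)
print(partition_rows(10, 3))
print(assign_round_robin(n_parts, n_gpus=2))
```

With these numbers the dataset is cut into five partitions of at most 20 GB each, so two GPUs can work through them in turn even though the whole dataset would never fit on either one.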

Multi-node scaling with NCCL + Dask

For clusters spanning multiple machines:

  • Dask handles task scheduling
  • NCCL handles fast GPU-to-GPU communication

NCCL is optimized for:

  • GPU collectives
  • All-reduce operations
  • High-speed NVLink / InfiniBand transfers
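An all-reduce leaves every GPU holding the element-wise sum of all GPUs' buffers. The sketch below simulates one well-known way to do this, a ring all-reduce (reduce-scatter followed by all-gather), using plain Python lists in place of GPU buffers. It is a conceptual model only: NCCL chooses among several algorithms depending on topology, and `ring_all_reduce` is a hypothetical function, not an NCCL call.

```python
def ring_all_reduce(buffers):
    """Simulate a ring all-reduce (sum) over equally sized per-rank buffers."""
    n = len(buffers)
    length = len(buffers[0])
    if any(len(b) != length for b in buffers):
        raise ValueError("all buffers must have the same length")
    if length % n != 0:
        raise ValueError("buffer length must be divisible by the number of ranks")
    chunk = length // n
    data = [list(b) for b in buffers]  # work on copies, one list per rank

    def span(c):
        return range(c * chunk, (c + 1) * chunk)

    def exchange(chunk_for_rank, combine):
        # Snapshot all outgoing chunks first so sends within a step are simultaneous.
        sends = []
        for rank in range(n):
            c = chunk_for_rank(rank)
            sends.append((c, [data[rank][i] for i in span(c)]))
        for rank, (c, payload) in enumerate(sends):
            dst = (rank + 1) % n  # each rank sends to its right-hand neighbour
            for offset, i in enumerate(span(c)):
                data[dst][i] = combine(data[dst][i], payload[offset])

    # Phase 1, reduce-scatter: after n-1 steps rank r owns the full sum of chunk (r+1) % n.
    for step in range(n - 1):
        exchange(lambda r, s=step: (r - s) % n, lambda old, new: old + new)

    # Phase 2, all-gather: circulate the finished chunks until every rank has all of them.
    for step in range(n - 1):
        exchange(lambda r, s=step: (r + 1 - s) % n, lambda old, new: new)

    return data


result = ring_all_reduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
print(result)
```

Each rank sends and receives only one chunk per step, so the traffic per link stays roughly constant as more GPUs join the ring, which is why this pattern suits gradient synchronization in distributed training.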

Architecture example:

Node 1 GPU 0  ←→  Node 2 GPU 1
        ↑              ↑
      NCCL communication

flowchart TD

    A["Node 1 🧾<br/>GPU 0 🧮"]
    B["Node 2 🧾<br/>GPU 1 🧮"]
    A <--> B

    C["NCCL Communication"]
    C -.-> A
    C -.-> B

