Hitesh Sahu

AI & Machine Learning

GPU Fabric Bench

Personal / Open Source

Ongoing

Creator / Maintainer

AI Infrastructure & LLM

Tech Stack
NCCL
EFA
RDMA
MPI
Terraform
Ansible

Summary

RDMA/EFA fabric benchmarking for multi-node GPU training, measuring NCCL collective communication throughput at near-peak EFA bandwidth.


What I Built

Project Overview

GPU Fabric Bench is an open-source benchmarking platform designed to evaluate the performance of high-speed networking fabrics used in distributed AI training environments.

Modern large-scale model training depends heavily on efficient GPU-to-GPU communication across multiple nodes. The project focuses on measuring and analyzing the performance characteristics of NCCL collective operations running over AWS Elastic Fabric Adapter (EFA) and RDMA-based networking infrastructures.

The goal is to provide reproducible benchmarks that help engineers understand communication bottlenecks, validate cluster configurations, and optimize infrastructure for large-scale distributed training workloads.

By combining infrastructure automation, GPU benchmarking, and network performance analysis, the project enables direct evaluation of the systems that underpin modern AI supercomputing environments.


Key Features

Distributed GPU Benchmarking

Benchmarks multi-node GPU communication performance across distributed training clusters.

NCCL Collective Evaluation

Measures the performance of critical collective operations including:

  • AllReduce
  • AllGather
  • ReduceScatter
  • Broadcast
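As a reference for what each of these collectives computes, here is a minimal single-process sketch in plain Python. It only models the result each rank holds afterwards; it is not how NCCL implements the operations, and the function names and the list-of-lists representation (one inner list per rank) are illustrative assumptions.

```python
from typing import List

Vector = List[float]


def all_reduce(buffers: List[Vector]) -> List[Vector]:
    """Every rank ends up with the element-wise sum of all input buffers."""
    total = [sum(column) for column in zip(*buffers)]
    return [list(total) for _ in buffers]


def all_gather(buffers: List[Vector]) -> List[Vector]:
    """Every rank ends up with all ranks' buffers concatenated in rank order."""
    gathered = [x for buf in buffers for x in buf]
    return [list(gathered) for _ in buffers]


def reduce_scatter(buffers: List[Vector]) -> List[Vector]:
    """Sum element-wise, then hand rank i the i-th equal chunk of the result."""
    n = len(buffers)
    total = [sum(column) for column in zip(*buffers)]
    chunk = len(total) // n  # assumes the buffer length is divisible by the rank count
    return [total[i * chunk:(i + 1) * chunk] for i in range(n)]


def broadcast(buffers: List[Vector], root: int = 0) -> List[Vector]:
    """Every rank ends up with a copy of the root rank's buffer."""
    return [list(buffers[root]) for _ in buffers]
```

With two ranks holding `[1.0, 2.0]` and `[3.0, 4.0]`, `all_reduce` leaves `[4.0, 6.0]` on both ranks, while `reduce_scatter` leaves `[4.0]` on rank 0 and `[6.0]` on rank 1.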

RDMA & EFA Validation

Evaluates AWS Elastic Fabric Adapter (EFA) networking and RDMA transport performance under realistic AI training workloads.
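The project's actual environment settings are not shown here, so the following is only a sketch of the kind of configuration commonly used to steer NCCL onto EFA. `FI_PROVIDER`, `FI_EFA_USE_DEVICE_RDMA`, and `NCCL_DEBUG` are real libfabric/NCCL environment variables; the helper name `efa_nccl_env` and the chosen values are assumptions for illustration.

```python
import os
from typing import Dict, Mapping, Optional


def efa_nccl_env(base: Optional[Mapping[str, str]] = None,
                 debug: bool = True) -> Dict[str, str]:
    """Return a copy of the environment with EFA-related settings applied.

    Hypothetical helper: the values below are a common starting point,
    not the project's confirmed configuration.
    """
    env = dict(os.environ if base is None else base)
    env["FI_PROVIDER"] = "efa"            # ask libfabric to use the EFA provider
    env["FI_EFA_USE_DEVICE_RDMA"] = "1"   # request GPUDirect RDMA over EFA where supported
    if debug:
        env["NCCL_DEBUG"] = "INFO"        # NCCL then logs which network transport it selected
    return env
```

The returned dictionary can be passed as `env=` to `subprocess.run` when launching a benchmark, which keeps the settings scoped to that one process.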

Infrastructure Automation

Automates provisioning and configuration of benchmark environments using Infrastructure as Code.

Performance Analysis

Generates metrics and reports for:

  • Collective latency
  • Bus bandwidth
  • Network utilization
  • Communication efficiency
  • Scaling behavior
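Bus bandwidth is easy to misread, so a short sketch of how it relates to the raw timing may help. It follows the convention documented for NCCL Tests: algorithm bandwidth is size divided by time, and bus bandwidth scales that by a per-collective factor. The function names are illustrative, and whether the project computes its reports exactly this way is an assumption.

```python
def bus_bandwidth_factor(collective: str, n_ranks: int) -> float:
    """Correction factor from algorithm bandwidth to bus bandwidth."""
    if n_ranks < 2:
        raise ValueError("need at least two ranks")
    factors = {
        "all_reduce": 2 * (n_ranks - 1) / n_ranks,
        "all_gather": (n_ranks - 1) / n_ranks,
        "reduce_scatter": (n_ranks - 1) / n_ranks,
        "broadcast": 1.0,
    }
    return factors[collective]


def algorithm_bandwidth_gbs(size_bytes: int, time_us: float) -> float:
    """Algorithm bandwidth in GB/s (1 GB = 1e9 bytes) from size and elapsed time."""
    return size_bytes / (time_us * 1e-6) / 1e9


def bus_bandwidth_gbs(collective: str, size_bytes: int,
                      time_us: float, n_ranks: int) -> float:
    """Bus bandwidth in GB/s: algorithm bandwidth times the collective's factor."""
    return algorithm_bandwidth_gbs(size_bytes, time_us) * bus_bandwidth_factor(collective, n_ranks)
```

For AllReduce on 16 ranks the factor is 2 × 15 / 16 = 1.875, so the reported bus bandwidth is almost twice the algorithm bandwidth for the same run.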

My Contributions

  • Designed the benchmarking architecture and test methodology.
  • Provisioned distributed GPU clusters using Terraform.
  • Automated cluster configuration and benchmarking workflows using Ansible.
  • Configured NCCL and EFA environments for high-performance GPU communication.
  • Executed large-scale AllReduce benchmark sweeps across message sizes ranging from 1 KB to 4 GB.
  • Analyzed communication efficiency and bandwidth utilization across distributed nodes.
  • Built reproducible benchmarking pipelines for validating AI infrastructure deployments.
  • Documented performance characteristics and optimization techniques for distributed training environments.
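The AllReduce sweep mentioned above can be sketched as a list of message sizes plus the matching NCCL Tests arguments. The flags `-b`, `-e`, `-f`, and `-g` are real `all_reduce_perf` options; the doubling step, the reading of 1 KB as 1024 bytes, and the helper names are assumptions, since the exact invocation used in the project is not shown.

```python
from typing import List

KIB = 1024
GIB = 1024 ** 3


def sweep_sizes(min_bytes: int = KIB, max_bytes: int = 4 * GIB,
                factor: int = 2) -> List[int]:
    """Message sizes from min_bytes to max_bytes, multiplying by factor each step."""
    if min_bytes <= 0 or factor < 2:
        raise ValueError("min_bytes must be positive and factor at least 2")
    sizes = []
    size = min_bytes
    while size <= max_bytes:
        sizes.append(size)
        size *= factor
    return sizes


def all_reduce_perf_args(min_bytes: int = KIB, max_bytes: int = 4 * GIB,
                         factor: int = 2, gpus_per_process: int = 1) -> List[str]:
    """Argument list for all_reduce_perf covering the same sweep."""
    return [
        "all_reduce_perf",
        "-b", str(min_bytes),         # smallest message size in bytes
        "-e", str(max_bytes),         # largest message size in bytes
        "-f", str(factor),            # multiplicative step between sizes
        "-g", str(gpus_per_process),  # GPUs driven by each process
    ]
```

Doubling from 1 KiB to 4 GiB yields 23 sizes, which is enough to show both the latency-bound small-message region and the bandwidth-bound plateau.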

Technical Highlights

High-Performance AI Networking

Evaluated communication fabrics used by modern distributed training systems where network throughput directly impacts model training performance.

Near-Line-Rate Performance

Achieved approximately 55 GB/s bus bandwidth across multi-node GPU clusters, representing roughly 90% utilization of the available 400 Gbps EFA networking capacity.
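When comparing a bandwidth in GB/s against a link rate in Gbps, the unit conversion and the choice of baseline both matter. The sketch below makes both explicit; the function names are illustrative, and `link_gbps` is left as a parameter because the baseline behind the utilization figure above is not stated.

```python
def gbps_to_gbytes_per_s(gbps: float) -> float:
    """Convert gigabits per second to gigabytes per second (8 bits per byte)."""
    return gbps / 8.0


def link_utilization(measured_gbytes_per_s: float, link_gbps: float) -> float:
    """Fraction of the chosen baseline link rate reached by the measurement."""
    if link_gbps <= 0:
        raise ValueError("link_gbps must be positive")
    return measured_gbytes_per_s / gbps_to_gbytes_per_s(link_gbps)
```

A 400 Gbps link corresponds to 50 GB/s in one direction, so, as a hypothetical example, a measurement of 45 GB/s against that baseline gives a utilization of 0.9. Bus bandwidth from NCCL Tests is a normalized figure rather than a direct wire measurement, so the ratio should always be reported together with the baseline it was computed against.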

Multi-Node GPU Infrastructure

Executed benchmarks across two p4d.24xlarge instances (8 NVIDIA A100 GPUs each, 16 in total) connected through AWS EFA networking.

Distributed Systems Engineering

Analyzed communication bottlenecks and scaling characteristics that affect large-scale model training workloads.

Reproducible Infrastructure

Automated cluster provisioning and benchmarking workflows, enabling consistent performance validation across environments.


Challenges & Solutions

Challenge

Large language model training is often limited by communication overhead rather than raw GPU compute. Small infrastructure misconfigurations can significantly reduce cluster efficiency and increase training costs.

Solution

Built an automated benchmarking framework capable of validating networking performance, measuring collective communication efficiency, and identifying infrastructure bottlenecks before production workloads are deployed.
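A pre-production check of this kind can be reduced to a simple gate: take the measured bus bandwidth for large messages and compare it with a required minimum. The sketch below is a hypothetical illustration of that idea, not the project's framework; `Sample`, `peak_bus_bandwidth`, `fabric_is_healthy`, and the 1 GiB cutoff are all assumed names and values.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Sample:
    """One benchmark data point: message size and measured bus bandwidth."""
    size_bytes: int
    bus_bw_gbs: float


def peak_bus_bandwidth(samples: Iterable[Sample], min_size_bytes: int = 0) -> float:
    """Highest bus bandwidth among samples at or above min_size_bytes."""
    eligible = [s.bus_bw_gbs for s in samples if s.size_bytes >= min_size_bytes]
    if not eligible:
        raise ValueError("no samples at or above min_size_bytes")
    return max(eligible)


def fabric_is_healthy(samples: Iterable[Sample], required_gbs: float,
                      min_size_bytes: int = 1 << 30) -> bool:
    """True if large-message peak bus bandwidth meets the required threshold."""
    return peak_bus_bandwidth(samples, min_size_bytes) >= required_gbs
```

Restricting the check to large messages avoids failing a healthy cluster on small-message results, which are dominated by latency rather than link bandwidth.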

Outcome

Delivered a reproducible benchmarking platform that enables engineers to validate distributed GPU infrastructure, optimize communication performance, and maximize utilization of expensive AI compute resources.


Technology Stack

Distributed Training: NCCL, MPI

Networking: AWS EFA, RDMA

Infrastructure: Terraform, Ansible

Compute: AWS p4d.24xlarge, NVIDIA A100 GPUs

Benchmarking: NCCL Tests, Collective Communication Analysis

Cloud: AWS HPC Infrastructure

Domain: Distributed AI Training, GPU Networking, HPC, AI Infrastructure, Performance Engineering


© 2026 Hitesh Sahu. All rights reserved.