
Deploying Agents at Scale

Learn how to deploy AI agents reliably in production using containerization, orchestration, observability, evaluation pipelines, guardrails, retries, scaling strategies, and resilient architectures. Explore best practices for running agentic systems across cloud environments while maintaining performance, reliability, security, and cost efficiency.

Written by Hitesh Sahu, a passionate developer and blogger.

Sun Jun 07 2026


NVIDIA Inference Stack ⚡

flowchart TD

    Training["Training <br/> PyTorch / NeMo Model"] --> Model["ONNX 📦"]

    User --> NIM["NIM <br/> Production APIs"]

    NIM --> Triton["Triton Server 🐳 <br/>Inference"]

    Triton --> TensorRT-LLM["TensorRT-LLM 🖲 <br/> Runtime"]

    Model --> TensorRT-LLM

    TensorRT-LLM --> GPU["GPU Rack 🧮"]

Components

| Component | Purpose |
| --- | --- |
| CUDA | GPU execution platform |
| TensorRT-LLM | LLM optimization |
| Triton | Model serving |
| NIM | Packaged inference microservice |
| Kubernetes | Deployment & scaling |

Build using Docker 🐳

Docker packages the agent, its model config, tool dependencies, and runtime into a single reproducible image.

Pin all versions (model weights, libraries, Python) for deterministic builds.
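Version pinning is easy to check automatically before the image is built. The snippet below is a minimal sketch, assuming dependencies are listed in pip's `name==version` style. The function name `unpinned_requirements` and the package versions are illustrative, not taken from any real project.

```python
import re

# Matches "name==version", with an optional extras marker such as "uvicorn[standard]".
PINNED = re.compile(r"^[A-Za-z0-9_.\-]+(\[[A-Za-z0-9_,\-]+\])?==[A-Za-z0-9_.!+\-]+$")


def unpinned_requirements(lines):
    """Return requirement lines that are not pinned to an exact version."""
    offenders = []
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if not line:
            continue
        if not PINNED.match(line):
            offenders.append(line)
    return offenders


# Illustrative requirements; the versions are placeholders.
requirements = [
    "torch==2.3.1",
    "fastapi>=0.110",  # a range, not an exact pin
    "redis",           # no version at all
    "uvicorn[standard]==0.30.1",
]

print(unpinned_requirements(requirements))  # ['fastapi>=0.110', 'redis']
```

A check like this could run as a CI step and fail the build when the returned list is not empty.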

Pipeline

flowchart TD

    Commit["code commit 📤"]
    Eval["Automated Eval 🔎"]
    Build["Container Build 📦"]
    Deploy["Shadow Deployment 🚛"]
    rollback["Promote/Rollback 🐳"]

    Commit-->Eval-->Build-->Deploy-->rollback

Automated Eval 🔎

Runs a benchmark suite against a golden dataset to catch regressions in agent behaviour before they reach users.
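A golden-dataset gate can be very small. The sketch below assumes exact-match scoring and a tolerated drop of 0.02 against a stored baseline. Both are illustrative choices, and `toy_agent` is a hypothetical stand-in for a real agent call.

```python
def evaluate(agent, golden_dataset):
    """Return the fraction of golden cases where the agent output matches exactly."""
    passed = sum(1 for case in golden_dataset if agent(case["input"]) == case["expected"])
    return passed / len(golden_dataset)


def gate(score, baseline, max_drop=0.02):
    """Allow the build only if the score is at most max_drop below the baseline."""
    return score >= baseline - max_drop


def toy_agent(prompt):
    # Hypothetical agent: normalises the prompt instead of calling a model.
    return prompt.strip().lower()


golden = [
    {"input": " Hello ", "expected": "hello"},
    {"input": "WORLD", "expected": "world"},
    {"input": "Mixed Case", "expected": "mixed case"},
    {"input": "tab\t", "expected": "tab"},
]

score = evaluate(toy_agent, golden)
print(score, gate(score, baseline=0.95))  # 1.0 True
```

Real agent outputs are rarely exact-match comparable, so the scoring function is the part most likely to be swapped for a rubric or model-based grader.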

Shadow Deployment 🚛

Benchmarks the new build against the existing production deployment using mirrored real user traffic, without affecting any user.

Goal

  • Observe performance with real-world traffic
  • Validate that the model performs well
  • Benchmark against the current live model

More on this in the next post.

Validate with real traffic.

flowchart LR

    User --> Production

    User -. Mirror .-> Shadow

    Production --> Response

    Shadow -. Discard .-> Trash
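The mirroring in the diagram can be sketched in application code. In practice it is often done at the proxy or service-mesh layer, so treat this as an illustration of the idea only. `ShadowRouter` is a hypothetical class, and the two callables stand in for the production and shadow builds.

```python
from concurrent.futures import ThreadPoolExecutor


class ShadowRouter:
    """Serve from production, mirror each request to shadow, never return shadow output."""

    def __init__(self, production, shadow):
        self.production = production
        self.shadow = shadow
        self.comparisons = []  # (request, production response, shadow response or exception)
        self._pool = ThreadPoolExecutor(max_workers=4)

    def _mirror(self, request, prod_response):
        try:
            shadow_response = self.shadow(request)
        except Exception as exc:  # a failing shadow build must never affect users
            shadow_response = exc
        self.comparisons.append((request, prod_response, shadow_response))

    def handle(self, request):
        response = self.production(request)
        self._pool.submit(self._mirror, request, response)  # runs off the request path
        return response  # only the production response reaches the user

    def close(self):
        self._pool.shutdown(wait=True)


router = ShadowRouter(production=lambda q: q.upper(), shadow=lambda q: q.title())
print(router.handle("hello agent"))  # HELLO AGENT
router.close()
print(router.comparisons)  # [('hello agent', 'HELLO AGENT', 'Hello Agent')]
```

The recorded comparisons are what gets analysed offline. The shadow response itself is discarded from the user's point of view.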

Promote/Rollback 🐳

The final rollout to production uses a blue-green or canary deployment.

With canary, the new build is exposed to real users gradually.
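One common way to expose a build gradually is to hash each user into a stable bucket and send the lowest buckets to the canary. This is a sketch under that assumption. The function names and the percentages are illustrative.

```python
import hashlib


def canary_bucket(user_id: str) -> int:
    """Map a user id to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def route(user_id: str, canary_percent: int) -> str:
    """Send roughly canary_percent of users to the canary build."""
    return "canary" if canary_bucket(user_id) < canary_percent else "stable"


users = [f"user-{i}" for i in range(1000)]
for percent in (0, 5, 25, 100):
    share = sum(route(u, percent) == "canary" for u in users) / len(users)
    print(percent, share)
```

Because the bucket depends only on the user id, a user who lands on the canary at 5% stays on it when the rollout moves to 25%. Setting the percentage back to 0 acts as the rollback.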


Kubernetes ☸️

Orchestrates multiple containers at scale.

Handles scheduling, health checks, rolling updates, and auto-scaling.

Each agent type runs as a Deployment with its own replica count and resource limits.

A production-grade pattern used by many AI agent platforms:

  1. Deploy stateless agent workers as a Kubernetes Deployment.
  2. Use Redis or Kafka as a task queue.
  3. Expose queue depth as an external metric.
  4. Configure an HPA or KEDA ScaledObject.
  5. Scale replicas based on queue depth.
  6. Store session state and memory externally.

flowchart LR

    Producers --> Queue

    Queue --> Agent1
    Queue --> Agent2
    Queue --> Agent3

    Queue -. task_queue_depth .-> HPA

    HPA -. Scale Up/Down .-> Deployment

    Deployment --> Agent1
    Deployment --> Agent2
    Deployment --> Agent3
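The pattern works because workers keep nothing between tasks. The sketch below uses `queue.Queue` and a dictionary-backed `InMemoryStore` as stand-ins for Redis, Kafka, or DynamoDB. All names are hypothetical.

```python
import queue


class InMemoryStore:
    """Stand-in for an external session store such as Redis or DynamoDB."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key, [])

    def put(self, key, value):
        self._data[key] = value


def worker_step(task_queue, store, worker_name):
    """Process one task. The worker holds no state between calls."""
    task = task_queue.get_nowait()
    history = store.get(task["session_id"])            # load session state
    history = history + [f"{worker_name}:{task['payload']}"]
    store.put(task["session_id"], history)             # write it back
    task_queue.task_done()


tasks = queue.Queue()
store = InMemoryStore()
for payload in ("plan", "search", "summarise"):
    tasks.put({"session_id": "s1", "payload": payload})

# Three different workers each pick up one step of the same session.
for name in ("agent-1", "agent-2", "agent-3"):
    worker_step(tasks, store, name)

print(store.get("s1"))  # ['agent-1:plan', 'agent-2:search', 'agent-3:summarise']
```

Since the session history lives in the store, scaling replicas up or down does not lose conversation context.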

Horizontal Pod Autoscaler (HPA) ↔️

Scales replica count up/down based on CPU, memory, or custom metrics (e.g. queue depth).

Handles traffic spikes without manual intervention.

Example HPA manifest (application/deployment.yaml)

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler # Horizontally scales pods
metadata:
  name: agent-worker-hpa  # name of this HPA object
spec:
  scaleTargetRef:         # the workload this HPA scales
    apiVersion: apps/v1
    kind: Deployment
    name: agent-worker
  minReplicas: 2 # never scale below 2 replicas
  maxReplicas: 20 # never scale above 20 replicas
  metrics:
  - type: External
    external:
      metric:
        name: task_queue_depth   # external metric, e.g. Redis queue depth exposed through a metrics adapter
      target:
        type: AverageValue
        averageValue: "10"       # scale out when depth exceeds 10 tasks per replica
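For an `AverageValue` target the HPA effectively divides the total metric by the target per replica, rounds up, and clamps the result to the min/max bounds. The sketch below reproduces only that core arithmetic with the values from the manifest. It ignores the controller's tolerance band and stabilization windows, so it is a simplification of the real behaviour.

```python
import math


def desired_replicas(queue_depth, target_per_replica=10, min_replicas=2, max_replicas=20):
    """Approximate the replica count an HPA with an AverageValue target converges to."""
    wanted = math.ceil(queue_depth / target_per_replica)
    return max(min_replicas, min(max_replicas, wanted))


for depth in (0, 35, 120, 500):
    print(depth, desired_replicas(depth))
# 0 2
# 35 4
# 120 12
# 500 20
```

An empty queue still leaves 2 replicas running, and a backlog of 500 tasks is capped at 20 replicas.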

Deploying agent-worker-hpa

Deployment

kubectl apply -f application/deployment.yaml

Validate

# Verify deployment
kubectl get deployment agent-worker

# Verify HPA
kubectl get hpa

# Debug HPA
kubectl describe hpa agent-worker-hpa


Hierarchical orchestration

One orchestrator agent fans out sub-tasks to specialist workers.

The orchestrator holds the plan; workers are stateless executors.

At scale, the orchestrator itself can be replicated with task queues (Redis, Kafka) providing coordination.
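A minimal fan-out can be sketched with a thread pool. The `WORKERS` registry, the two toy workers, and the plan format are assumptions made for illustration. A real system would dispatch through a queue rather than in-process threads.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical specialist workers: stateless functions keyed by task kind.
WORKERS = {
    "search": lambda arg: f"results for {arg}",
    "math": lambda arg: str(sum(int(x) for x in arg.split("+"))),
}


def orchestrate(plan):
    """Fan sub-tasks out to specialist workers and collect results in plan order."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(WORKERS[kind], arg) for kind, arg in plan]
        return [future.result() for future in futures]


plan = [("search", "GPU pricing"), ("math", "2+3+5")]
print(orchestrate(plan))  # ['results for GPU pricing', '10']
```

The orchestrator owns the plan and the ordering of results. The workers only see one sub-task at a time.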

Horizontal agent scaling

Deploy multiple identical worker agent replicas behind a load balancer.

Each replica handles independent tasks.

Stateless design is key so any replica can pick up any task.

flowchart TD

    Producers["Producers / API Gateway 🔀"]
    LoadBalancer["Load Balancer 🚦"]

    AgentA["Agent Replica A 🤖 <br/>Stateless"]
    AgentB["Agent Replica B 🤖 <br/>Stateless"]
    AgentC["Agent Replica C 🤖 <br/>Stateless"]

    Queue["Task Queue 📥 <br/>Redis / Kafka"]

    Session["Session State ℹ️ <br/>Redis / DynamoDB"]
    LTM["Long-Term Memory 🛢 <br/>Vector DB / RAG"]
    ToolCache[Tool Cache<br/>Redis / Memcached]

    Producers --> LoadBalancer

    LoadBalancer --> |monitor| AgentA
    LoadBalancer --> |monitor| AgentB
    LoadBalancer --> |monitor| AgentC

    Producers --> Queue

    Queue --> |poll| AgentA
    Queue --> |poll| AgentB
    Queue --> |poll| AgentC

    AgentA <--> Session
    AgentB <--> Session
    AgentC <--> Session

    AgentA <--> LTM
    AgentB <--> LTM
    AgentC <--> LTM

    AgentA <--> ToolCache
    AgentB <--> ToolCache
    AgentC <--> ToolCache

KEDA (Kubernetes Event-Driven Autoscaling)

Extends Kubernetes autoscaling beyond CPU and memory.

Common triggers:

  • Redis Queue Depth
  • Kafka Lag
  • RabbitMQ Queue Length
  • AWS SQS Messages

KEDA automatically creates and manages an HPA.

flowchart LR

    Queue -. Queue Depth .-> KEDA

    KEDA --> HPA

    HPA --> Deployment

Task queue decoupling 📥

Decouple task submission from execution via a queue.

Producers push tasks; agent workers pull and process.

Enables backpressure, retry, and independent scaling of producers vs consumers.
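Backpressure and retry can both be shown with a bounded in-process queue. This is a single-threaded sketch. The queue size, the `max_attempts` limit, and the `flaky` handler are illustrative assumptions, and a broker such as Redis or Kafka would replace `queue.Queue` in practice.

```python
import queue

tasks = queue.Queue(maxsize=2)  # a bounded queue is what creates backpressure


def submit(task):
    """Producer side: report rejection instead of blocking when the queue is full."""
    try:
        tasks.put_nowait(task)
        return True
    except queue.Full:
        return False


def consume(handler, max_attempts=3):
    """Consumer side: run one task and requeue it on failure until attempts run out."""
    task = tasks.get_nowait()
    task["attempts"] = task.get("attempts", 0) + 1
    try:
        return handler(task)
    except Exception:
        if task["attempts"] < max_attempts:
            tasks.put_nowait(task)  # retry later, behind the tasks already waiting
        return None


def flaky(task):
    # Hypothetical handler: task 0 fails on its first attempt only.
    if task["id"] == 0 and task["attempts"] == 1:
        raise RuntimeError("transient failure")
    return f"done {task['id']}"


accepted = [submit({"id": i}) for i in range(3)]
print(accepted)  # [True, True, False]

results = [consume(flaky) for _ in range(3)]
print(results)  # [None, 'done 1', 'done 0']
```

The third submission is rejected because the queue is full, which is the signal a producer can use to slow down. Task 0 fails once, goes back on the queue, and succeeds on its second attempt.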

Load balancing strategies

1. Routes by queue depth

Reads the actual task queue depth from each replica (or from Redis) and routes to the one with the shortest queue.

This is the most accurate signal for agent workloads because it accounts for queued but not-yet-started tasks.

2. Round Robin

Cycles through replicas in order regardless of their current load.

Simple and fair for uniform workloads, but can create hotspots when some agent tasks take much longer than others.
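The hotspot effect is easy to reproduce. The sketch below assumes three replicas and a repeating pattern of one long task followed by two short ones. The durations are made-up numbers chosen to line up badly with the rotation.

```python
import itertools

replicas = ["agent-a", "agent-b", "agent-c"]
next_replica = itertools.cycle(replicas)  # round robin: a, b, c, a, b, c, ...

# Illustrative task durations in seconds.
durations = [20, 2, 2, 20, 2, 2]

busy_seconds = {name: 0 for name in replicas}
for duration in durations:
    busy_seconds[next(next_replica)] += duration

print(busy_seconds)  # {'agent-a': 40, 'agent-b': 4, 'agent-c': 4}
```

Every replica received the same number of tasks, yet one of them did ten times the work.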

3. Least Connection

Always picks the replica with the fewest active tasks.

Ideal for agents because task duration varies wildly: a 20-step reasoning chain holds a connection far longer than a 2-step lookup.
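A least-connection picker only needs a counter per replica. The class below is a hypothetical in-memory sketch. A real load balancer tracks these counts itself.

```python
class LeastConnections:
    """Pick the replica with the fewest active tasks."""

    def __init__(self, replicas):
        self.active = {name: 0 for name in replicas}

    def acquire(self):
        # Ties go to the replica listed first.
        name = min(self.active, key=self.active.get)
        self.active[name] += 1
        return name

    def release(self, name):
        self.active[name] -= 1


lb = LeastConnections(["agent-a", "agent-b"])
first = lb.acquire()   # agent-a starts a long reasoning chain
second = lb.acquire()  # agent-b takes a short lookup
lb.release(second)     # the short lookup finishes
third = lb.acquire()   # agent-b is idle again, so it gets the next task

print(first, second, third)  # agent-a agent-b agent-b
```

The replica stuck on the long task is skipped until it frees up, which is the behaviour round robin lacks.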

4. Weighted

Replicas are assigned weights reflecting their capacity.

  • A GPU-backed replica might get weight 3 and a CPU replica weight 1, meaning the GPU replica receives 3× the traffic.

Good when replicas have different specs.
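Weighted selection can be sketched with the standard library's `random.choices`. The replica names are hypothetical and the seed is fixed only to make the run repeatable.

```python
import random
from collections import Counter

replicas = ["gpu-replica", "cpu-replica"]
weights = [3, 1]  # the GPU replica should receive about 3x the traffic

rng = random.Random(42)  # fixed seed so repeated runs give the same picks
picks = Counter(rng.choices(replicas, weights=weights, k=10_000))

gpu_share = picks["gpu-replica"] / 10_000
print(picks, round(gpu_share, 2))
```

Over many requests the GPU replica's share settles near 75%, which matches the 3:1 weighting.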

5. Random

Picks a replica at random.

Statistically converges to even distribution at scale but can cluster by chance on small request counts.

Low overhead: no state to track.

ConfigMap / Secret 🔑

Externalise model endpoint URLs, API keys, and policy configs from the image. This enables config changes without a rebuild.
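On the application side, ConfigMap and Secret values are commonly injected as environment variables. The sketch below assumes that setup. The variable names `MODEL_ENDPOINT`, `MODEL_API_KEY`, and `AGENT_MAX_STEPS`, and the endpoint URL, are hypothetical.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentConfig:
    model_endpoint: str
    api_key: str
    max_steps: int


def load_config(env=os.environ):
    """Build the agent config from environment variables (names are illustrative)."""
    return AgentConfig(
        model_endpoint=env["MODEL_ENDPOINT"],             # required, e.g. from a ConfigMap
        api_key=env["MODEL_API_KEY"],                     # required, e.g. from a Secret
        max_steps=int(env.get("AGENT_MAX_STEPS", "10")),  # optional policy setting
    )


# A plain dict stands in for the pod environment here.
config = load_config({
    "MODEL_ENDPOINT": "http://nim.internal:8000/v1",
    "MODEL_API_KEY": "dummy-key",
})
print(config.model_endpoint, config.max_steps)  # http://nim.internal:8000/v1 10
```

Required settings raise `KeyError` when missing, so a misconfigured pod fails at startup rather than on its first request.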

GPU node pool

Schedule inference pods on GPU nodes using nodeSelector or taints/tolerations. The NVIDIA device plugin exposes GPUs as a schedulable resource (nvidia.com/gpu) that pods can request.

