Deploying Agents at Scale

Learn how to deploy AI agents reliably in production using containerization, orchestration, observability, evaluation pipelines, guardrails, retries, scaling strategies, and resilient architectures. Explore best practices for running agentic systems across cloud environments while maintaining performance, reliability, security, and cost efficiency.

Artificial Intelligence

Agentic AI

AI Agents

Deployment

MLOps

Kubernetes


NVIDIA Inference Stack ⚡

flowchart TD

    Training["Training <br/> PyTorch / NeMo Model"]--> Model["ONNX 📦"]

    User --> NIM["NIM <br/> Production APIs"]

    NIM --> Triton["Triton Server 🐳 <br/>Inference"]


    Triton--> TensorRT-LLM["TensorRT-LLM 🖲 <br/> Runtime"]

    Model-->TensorRT-LLM

    TensorRT-LLM--> GPU["GPU Rack 🧮"]

Components

Component	Purpose
`CUDA`	GPU execution platform
`TensorRT-LLM`	LLM optimization
`Triton`	Model serving
`NIM`	Packaged inference microservice
`Kubernetes`	Deployment & scaling

Build using Docker 🐳

Packages the agent, its model config, tool dependencies, and runtime into a single reproducible image.

Pin all versions (model weights, libraries, Python) for deterministic builds.

Pipeline

flowchart TD
    
    Commit["code commit 📤"]
    Eval["Automated Eval 🔎"]
    Build["Container Build 📦"]
    Deploy["Shadow Deployment 🚛 "]
    rollback["Promote/Rollback 🐳"]

    Commit-->Eval-->Build-->Deploy-->rollback

Automated Eval 🔎

Runs a benchmark suite against a golden dataset to catch regressions in agent behaviour before they reach users.
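As a rough illustration of such a gate, the sketch below scores an agent against a tiny golden dataset and blocks the build if the pass rate drops too far below a baseline. The dataset, the `stub_agent` function, the exact-match scoring, and the `max_drop` threshold are all assumptions made for the example, not part of any specific eval framework.

```python
from typing import Callable

# Hypothetical golden dataset: (prompt, expected answer) pairs.
GOLDEN_SET = [
    ("2 + 2", "4"),
    ("capital of France", "Paris"),
    ("largest planet", "Jupiter"),
]


def pass_rate(agent: Callable[[str], str], dataset: list[tuple[str, str]]) -> float:
    """Fraction of golden examples the agent answers exactly right."""
    passed = sum(1 for prompt, expected in dataset if agent(prompt).strip() == expected)
    return passed / len(dataset)


def eval_gate(candidate_rate: float, baseline_rate: float, max_drop: float = 0.02) -> bool:
    """Allow the build only if the candidate has not regressed by more than max_drop."""
    return candidate_rate >= baseline_rate - max_drop


def stub_agent(prompt: str) -> str:
    """Stand-in for a real agent call; gets one answer wrong on purpose."""
    answers = {"2 + 2": "4", "capital of France": "Paris", "largest planet": "Saturn"}
    return answers[prompt]


rate = pass_rate(stub_agent, GOLDEN_SET)
print(f"pass rate: {rate:.2f}, gate passed: {eval_gate(rate, baseline_rate=1.0)}")
```

Real agent evals usually need fuzzier scoring than exact string match (rubrics, LLM judges, tool-call checks); the gate logic stays the same.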

Shadow Deployment 🚛

Benchmarks the new build against the existing production deployment using real user traffic, without affecting any user.

Goal

Observe performance with real-world traffic
Validate that the model performs well
Benchmark against the current live model
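The goals above can be sketched as a request handler that always returns the live answer and runs the shadow build only for comparison. The function names, the log format, and the stub agents are hypothetical; a real setup would typically mirror traffic asynchronously at the gateway or service-mesh layer so the shadow call adds no latency.

```python
from typing import Callable

Agent = Callable[[str], str]


def handle_request(prompt: str, live: Agent, shadow: Agent, log: list[dict]) -> str:
    """Serve the live answer; run the shadow build only to record a comparison."""
    live_answer = live(prompt)
    try:
        shadow_answer = shadow(prompt)
        log.append({"prompt": prompt, "match": shadow_answer == live_answer, "error": None})
    except Exception as exc:  # a shadow failure must never reach the user
        log.append({"prompt": prompt, "match": False, "error": repr(exc)})
    return live_answer


def agreement_rate(log: list[dict]) -> float:
    """Share of mirrored requests where the shadow build agreed with the live one."""
    return sum(entry["match"] for entry in log) / len(log)


def live_agent(prompt: str) -> str:
    return prompt.upper()  # stand-in for the current production model


def shadow_agent(prompt: str) -> str:
    if prompt == "boom":
        raise ValueError("new build crashed")
    return prompt.upper()  # stand-in for the candidate build


comparison_log: list[dict] = []
for text in ("hello", "boom"):
    print(handle_request(text, live_agent, shadow_agent, comparison_log))
print("agreement:", agreement_rate(comparison_log))
```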

Promote/Rollback 🐳

Final rollout to production using a blue-green or canary deployment.

With a canary, the new build is exposed to real users gradually.
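One way to picture a gradual canary is a deterministic hash bucket per user plus a simple promote-or-rollback rule. This is a minimal sketch: the bucket scheme, the doubling schedule, and the `tolerance` value are assumptions for illustration, and in Kubernetes this logic usually lives in the ingress, service mesh, or a rollout controller rather than in application code.

```python
import hashlib


def in_canary(user_id: str, canary_percent: int) -> bool:
    """Deterministically place a user in the canary using a hash bucket (0-99)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < canary_percent


def next_step(canary_error_rate: float, baseline_error_rate: float,
              canary_percent: int, tolerance: float = 0.01) -> int:
    """Return the next canary percentage: 0 means roll back, 100 means fully promoted."""
    if canary_error_rate > baseline_error_rate + tolerance:
        return 0
    return min(100, max(1, canary_percent * 2))


print(next_step(0.02, 0.02, canary_percent=10))  # healthy canary: widen exposure
print(next_step(0.10, 0.02, canary_percent=10))  # unhealthy canary: roll back
```

Because the bucket depends only on the user ID, a user who is in the canary at 10% stays in it as the percentage grows.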

Kubernetes ☸️

Orchestrates multiple containers at scale.

Handles scheduling, health checks, rolling updates, and auto-scaling.

Each agent type runs as a Deployment with its own replica count and resource limits.

A production-grade pattern used by many AI agent platforms:

Deploy stateless agent workers as a Kubernetes Deployment.
Use Redis or Kafka as a task queue.
Expose queue depth as an external metric.
Configure an HPA or KEDA ScaledObject.
Scale replicas based on queue depth.
Store session state and memory externally.


flowchart LR

    Producers --> Queue

    Queue --> Agent1
    Queue --> Agent2
    Queue --> Agent3

    Queue -. task_queue_depth .-> HPA

    HPA -. Scale Up/Down .-> Deployment

    Deployment --> Agent1
    Deployment --> Agent2
    Deployment --> Agent3

Horizontal Pod Autoscaler (`HPA`) ↔️

Scales replica count up/down based on CPU, memory, or custom metrics (e.g. queue depth).

Handles traffic spikes without manual intervention.

Example HPA manifest (application/deployment.yaml)

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler # Horizontally scale pods
metadata:
  name: agent-worker-hpa  # name of this HPA object
spec:
  scaleTargetRef:         # the workload this HPA scales
    apiVersion: apps/v1
    kind: Deployment
    name: agent-worker
  minReplicas: 2 # HPA will never scale below 2 replicas
  maxReplicas: 20 # Never scale above 20 pods
  metrics:
  - type: External
    external:
      metric:
        name: task_queue_depth   # external metric (e.g. Redis queue depth); must be served by a metrics adapter
      target:
        type: AverageValue
        averageValue: "10"       # scale out when the queue holds more than 10 tasks per replica
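To make the `AverageValue` target concrete, the sketch below computes the replica count for a given queue depth: the total metric divided by the per-replica target, rounded up, then clamped to the min/max bounds. It is a simplification that ignores the HPA's tolerance band, stabilization windows, and scaling policies; the function name is made up for the example.

```python
import math


def desired_replicas(queue_depth: int, target_per_replica: int,
                     min_replicas: int = 2, max_replicas: int = 20) -> int:
    """ceil(total metric / target average), clamped to the HPA bounds."""
    raw = math.ceil(queue_depth / target_per_replica)
    return max(min_replicas, min(max_replicas, raw))


for depth in (0, 35, 120, 500):
    print(depth, "->", desired_replicas(depth, target_per_replica=10))
```

With the bounds from the manifest, an empty queue still keeps 2 replicas and a backlog of 500 tasks is capped at 20.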

Deploying agent-worker-hpa

Deployment

# Apply the HPA manifest above from its local path
kubectl apply -f application/deployment.yaml

Validate

# Verify deployment
kubectl get deployment agent-worker

# Verify HPA
kubectl get hpa

# Debug HPA
kubectl describe hpa agent-worker-hpa

Hierarchical orchestration

One orchestrator agent fans out sub-tasks to specialist workers.

The orchestrator holds the plan; workers are stateless executors.

At scale, the orchestrator itself can be replicated with task queues (Redis, Kafka) providing coordination.
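A minimal sketch of the fan-out idea, assuming an in-process thread pool in place of a real task queue: the orchestrator owns the plan, and each worker is a stateless function selected by skill. The worker names and the plan format are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical specialist workers: stateless functions keyed by skill.
def search_worker(task: str) -> str:
    return f"search results for {task!r}"


def summarize_worker(task: str) -> str:
    return f"summary of {task!r}"


WORKERS = {"search": search_worker, "summarize": summarize_worker}


def orchestrate(plan: list[tuple[str, str]]) -> list[str]:
    """Fan sub-tasks out to workers in parallel and gather results in plan order."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(WORKERS[skill], task) for skill, task in plan]
        return [future.result() for future in futures]


plan = [("search", "GPU pricing"), ("summarize", "pricing report")]
print(orchestrate(plan))
```

In a distributed deployment the `pool.submit` call would become a push to Redis or Kafka, and the results would come back through a reply queue or a shared store.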

Horizontal agent scaling

Deploy multiple identical worker agent replicas behind a load balancer.

Each replica handles independent tasks.

Stateless design is key so any replica can pick up any task.

flowchart TD

    Producers["Producers / API Gateway 🔀 "]
    LoadBalancer["Load Balancer 🚦 "]

    AgentA["Agent Replica A 🤖 <br/>Stateless"]
    AgentB["Agent Replica B🤖 <br/>Stateless"]
    AgentC["Agent Replica C 🤖<br/>Stateless"]

    Queue["Task Queue 📥 <br/>Redis / Kafka"]

    Session["Session State ℹ️ <br/>Redis / DynamoDB"]
    LTM["Long-Term Memory 🛢 <br/>Vector DB / RAG"]
    ToolCache[Tool Cache<br/>Redis / Memcached]

    Producers --> LoadBalancer

    LoadBalancer --> |monitor| AgentA
    LoadBalancer --> |monitor| AgentB
    LoadBalancer --> |monitor| AgentC

    Producers --> Queue

    Queue --> |poll| AgentA
    Queue --> |poll| AgentB
    Queue --> |poll| AgentC

    AgentA <--> Session
    AgentB <--> Session
    AgentC <--> Session

    AgentA <--> LTM
    AgentB <--> LTM
    AgentC <--> LTM

    AgentA <--> ToolCache
    AgentB <--> ToolCache
    AgentC <--> ToolCache
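The stateless design can be illustrated with two replicas that share an external session store, so either one can continue a conversation. Here `SessionStore` is an in-memory stand-in for Redis or DynamoDB, and the class and method names are assumptions for the sketch.

```python
class SessionStore:
    """In-memory stand-in for an external store such as Redis or DynamoDB."""

    def __init__(self) -> None:
        self._data: dict[str, list[str]] = {}

    def load(self, session_id: str) -> list[str]:
        return list(self._data.get(session_id, []))

    def save(self, session_id: str, history: list[str]) -> None:
        self._data[session_id] = list(history)


class AgentReplica:
    """Keeps no per-session state between calls; everything lives in the store."""

    def __init__(self, name: str, store: SessionStore) -> None:
        self.name = name
        self.store = store

    def handle(self, session_id: str, message: str) -> str:
        history = self.store.load(session_id)
        history.append(message)
        self.store.save(session_id, history)
        return f"{self.name} has seen {len(history)} message(s) in this session"


store = SessionStore()
replica_a, replica_b = AgentReplica("A", store), AgentReplica("B", store)
print(replica_a.handle("s1", "hello"))
print(replica_b.handle("s1", "what next?"))  # a different replica continues the session
```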

KEDA (Kubernetes Event-Driven Autoscaling)

Extends Kubernetes autoscaling beyond CPU and memory.

Common triggers:

Redis Queue Depth
Kafka Lag
RabbitMQ Queue Length
AWS SQS Messages

KEDA automatically creates and manages an HPA.

flowchart LR

    Queue -. Queue Depth .-> KEDA

    KEDA --> HPA

    HPA --> Deployment

Task queue decoupling 📥

Decouple task submission from execution via a queue.

Producers push tasks; agent workers pull and process.

Enables backpressure, retry, and independent scaling of producers vs consumers.
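The producer/consumer split can be sketched with Python's standard `queue.Queue` standing in for Redis or Kafka. The bounded queue gives backpressure (producers block when it is full), and failed tasks are re-queued up to `MAX_ATTEMPTS`. The task format, the `process` function, and the "flaky" flag are invented for the example.

```python
import queue
import threading

MAX_ATTEMPTS = 3


def process(task: dict) -> str:
    """Hypothetical unit of agent work; flaky tasks fail on their first attempt."""
    if task["flaky"] and task["attempt"] == 1:
        raise RuntimeError("transient failure")
    return f"done:{task['id']}"


def worker(tasks: queue.Queue, results: list, lock: threading.Lock) -> None:
    while True:
        task = tasks.get()
        try:
            if task is None:  # sentinel: shut this worker down
                return
            outcome = process(task)
            with lock:
                results.append(outcome)
        except RuntimeError:
            if task["attempt"] < MAX_ATTEMPTS:
                tasks.put({**task, "attempt": task["attempt"] + 1})  # retry
        finally:
            tasks.task_done()


def run(task_ids: list[int], n_workers: int = 3) -> list[str]:
    tasks: queue.Queue = queue.Queue(maxsize=100)  # bounded: put() blocks when full
    results: list[str] = []
    lock = threading.Lock()
    threads = [threading.Thread(target=worker, args=(tasks, results, lock))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for i in task_ids:
        tasks.put({"id": i, "attempt": 1, "flaky": i % 2 == 0})
    tasks.join()  # wait until every task, including retries, is finished
    for _ in threads:
        tasks.put(None)
    for t in threads:
        t.join()
    return sorted(results)


print(run([1, 2, 3, 4]))
```

Producers and workers only share the queue, so the number of workers can change without touching the producer code.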

1. Routes by queue depth

Reads the actual task queue depth from each replica (or from Redis) and routes to the one with the shortest queue.

This is the most accurate signal for agent workloads because it accounts for queued but not-yet-started tasks.
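A minimal sketch of this routing rule, assuming the per-replica queue depths have already been collected into a dictionary (for example from Redis); the replica names are placeholders.

```python
def pick_shortest_queue(queue_depths: dict[str, int]) -> str:
    """Return the replica with the fewest queued tasks (ties broken by name)."""
    return min(sorted(queue_depths), key=lambda replica: queue_depths[replica])


depths = {"agent-a": 7, "agent-b": 2, "agent-c": 5}
print(pick_shortest_queue(depths))
```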

2. Round Robin

Cycles through replicas in order regardless of their current load.

Simple and fair for uniform workloads, but can create hotspots when some agent tasks take much longer than others.
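The hotspot effect is easy to reproduce. In the sketch below, tasks are dealt out strictly in order, and because every third task is long, one replica ends up with almost all the work. The replica names and durations are made up.

```python
import itertools

replicas = ["agent-a", "agent-b", "agent-c"]
task_durations = [20, 1, 1, 20, 1, 1]  # seconds; every third task is long


def round_robin_load(names: list[str], durations: list[int]) -> dict[str, int]:
    """Assign tasks to replicas in fixed rotation and total the work per replica."""
    load = {name: 0 for name in names}
    for replica, duration in zip(itertools.cycle(names), durations):
        load[replica] += duration
    return load


print(round_robin_load(replicas, task_durations))
```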

3. Least Connection

Always picks the replica with the fewest active tasks.

Ideal for agents because task duration varies wildly: a 20-step reasoning chain holds a connection far longer than a 2-step lookup.
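A small sketch of least-connection selection, assuming the balancer tracks the number of in-flight tasks per replica itself; the class and method names are hypothetical.

```python
class LeastConnections:
    """Tracks in-flight tasks per replica and picks the least busy one."""

    def __init__(self, replicas: list[str]) -> None:
        self.active = {name: 0 for name in replicas}

    def acquire(self) -> str:
        # Ties go to the replica listed first.
        replica = min(self.active, key=self.active.get)
        self.active[replica] += 1
        return replica

    def release(self, replica: str) -> None:
        self.active[replica] -= 1


lb = LeastConnections(["agent-a", "agent-b"])
first = lb.acquire()    # a long reasoning chain lands on agent-a
second = lb.acquire()   # a short lookup lands on agent-b
lb.release(second)      # the short lookup finishes first
print(lb.acquire())     # the next task goes to the replica that is now idle
```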

4. Weighted

Replicas are assigned weights reflecting their capacity.

A GPU-backed replica might get weight 3 and a CPU replica weight 1, so the GPU replica receives 3× the traffic.

Good when replicas have different specs.
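The 3:1 split can be sketched with weighted random selection from the standard library. The replica names and the seeded generator are assumptions for the example; production balancers often use smooth weighted round robin instead of random draws.

```python
import random
from collections import Counter

WEIGHTS = {"gpu-replica": 3, "cpu-replica": 1}


def pick_weighted(weights: dict[str, int], rng: random.Random) -> str:
    """Pick a replica with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


rng = random.Random(0)  # seeded only to make the demo repeatable
counts = Counter(pick_weighted(WEIGHTS, rng) for _ in range(10_000))
print(counts)  # the GPU replica should receive roughly three times the traffic
```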

5. Random

Picks a replica at random.

Statistically converges to even distribution at scale but can cluster by chance on small request counts.

Low overhead: no state to track.
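The convergence claim can be checked with a short simulation that measures the gap between the busiest and the idlest replica's share of traffic. The replica names, request counts, and seed are arbitrary choices for the sketch.

```python
import random
from collections import Counter

REPLICAS = ["agent-a", "agent-b", "agent-c"]


def spread(n_requests: int, rng: random.Random) -> float:
    """Gap between the largest and smallest traffic share after random routing."""
    counts = Counter(rng.choice(REPLICAS) for _ in range(n_requests))
    shares = [counts[name] / n_requests for name in REPLICAS]
    return max(shares) - min(shares)


rng = random.Random(42)
print("spread over 10 requests:     ", spread(10, rng))
print("spread over 100,000 requests:", spread(100_000, rng))
```

The gap is typically large for a handful of requests and shrinks toward zero as the request count grows.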

ConfigMap / Secret 🔑

Externalise model endpoint URLs, API keys, and policy configs from the image; this enables config changes without a rebuild.
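On the application side, values from a ConfigMap or Secret commonly arrive as environment variables. The sketch below reads them into a typed config object; the variable names (`MODEL_ENDPOINT`, `MODEL_API_KEY`, `AGENT_MAX_STEPS`) and the defaults are hypothetical, not a convention of any particular platform.

```python
import os
from collections.abc import Mapping
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentConfig:
    model_endpoint: str
    api_key: str
    max_steps: int


def load_config(env: Mapping[str, str] = os.environ) -> AgentConfig:
    """Build the agent config from environment variables (hypothetical names)."""
    return AgentConfig(
        # Non-secret settings, typically injected from a ConfigMap.
        model_endpoint=env.get("MODEL_ENDPOINT", "http://localhost:8000/v1"),
        max_steps=int(env.get("AGENT_MAX_STEPS", "20")),
        # Secret value, typically injected from a Secret; fail fast if it is missing.
        api_key=env["MODEL_API_KEY"],
    )


config = load_config({"MODEL_API_KEY": "dummy-key", "AGENT_MAX_STEPS": "5"})
print(config.model_endpoint, config.max_steps)
```

Changing the endpoint or step limit then only requires updating the ConfigMap and restarting the pods, not rebuilding the image.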

GPU node pool

Schedule inference pods on GPU nodes using nodeSelector or taints/tolerations. The NVIDIA device plugin exposes GPUs as a schedulable resource (nvidia.com/gpu) that pods request in their resource limits.

Related Posts

NVIDIA NIM: Optimized Inference Microservices - the NVIDIA inference stack referenced in this post; NIM Operator manages the LLM endpoints that agents call
GPU Autoscaling: KEDA, HPA, Cluster Autoscaler - deep dive into the HPA and KEDA patterns described here: DCGM metrics pipeline, ScaledObject for queue depth, Cluster Autoscaler for GPU node groups
Kueue: Job Queuing and Quota Management - managing GPU quota across multiple agent deployments sharing the same cluster
Kubernetes Networking: Pods, Services, Ingress, and CNI - Service types and Ingress routing for exposing agent API endpoints
Helm: Kubernetes Package Manager - deploying agent infrastructure (NIM, monitoring, Kueue) via Helm charts

Written by Hitesh Sahu, a passionate developer and blogger.

Sun Jun 07 2026

