Technical Documentation

On-Premises LLM Architecture Blueprint

Version 1.0 | Last Updated: December 2024

Abstract

This document provides a reference architecture for deploying large language models (LLMs) in air-gapped enterprise environments. It covers infrastructure requirements (compute, storage, networking), deployment patterns (single-node development through geo-distributed production), model serving stack components, and scaling considerations. Target audience: infrastructure engineers, ML engineers, and technical architects planning on-premises LLM deployments.

Infrastructure Requirements

Compute: GPU Specifications

GPU Model            VRAM          Recommended Use            Max Model Size
NVIDIA A100 (40GB)   40 GB         Development, small models  7B-13B parameters
NVIDIA A100 (80GB)   80 GB         Production, medium models  13B-34B parameters
NVIDIA H100          80 GB         Production, large models   34B-70B parameters
4x A100 (80GB)       320 GB total  High-availability cluster  70B+ parameters
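
As a rough sanity check when matching models to the GPUs above, weight memory can be estimated from parameter count and precision. The sketch below is a back-of-the-envelope estimate assuming FP16 weights and a flat ~20% overhead for the KV cache and framework buffers (both assumptions, not figures from the table); quantized serving (INT8/INT4) shrinks the weight footprint proportionally, which is presumably how a single 80 GB card can host models beyond what the FP16 math suggests.

# Back-of-the-envelope VRAM estimate. Assumptions (not from the table above):
# FP16 weights at 2 bytes per parameter and a flat 20% overhead for the KV
# cache, CUDA context, and framework buffers. Quantized serving (INT8/INT4)
# lowers bytes_per_param to 1.0 or 0.5.

def estimate_vram_gb(params_billion: float,
                     bytes_per_param: float = 2.0,
                     overhead: float = 0.2) -> float:
    weights_gb = params_billion * bytes_per_param   # 1B params * 2 bytes ~= 2 GB
    return weights_gb * (1.0 + overhead)

if __name__ == "__main__":
    for size in (7, 13, 34, 70):
        print(f"{size}B params: ~{estimate_vram_gb(size):.0f} GB VRAM at FP16")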

Storage Requirements

  • Model Storage: 500GB - 2TB NVMe SSD (model weights, checkpoints)
  • Vector Database: 1TB - 10TB depending on document corpus size
  • Application Data: 100GB - 500GB (user sessions, queries, logs)
  • Backup Storage: 2x primary storage (for disaster recovery)
  • IOPS: Minimum 50,000 IOPS for model serving
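
The quick sizing sketch below simply totals the upper bounds listed above and applies the 2x backup multiplier; the per-component figures are placeholders to replace with your own corpus and retention estimates.

# Storage sizing sketch using the upper bounds from the list above.
# Replace the placeholder figures with your own corpus and log volumes.

components_gb = {
    "model_storage": 2_000,     # model weights and checkpoints (up to 2 TB)
    "vector_database": 10_000,  # scales with document corpus size
    "application_data": 500,    # user sessions, queries, logs
}

primary_gb = sum(components_gb.values())
backup_gb = 2 * primary_gb      # backup storage sized at 2x primary

print(f"Primary storage: {primary_gb / 1000:.1f} TB")
print(f"Backup storage:  {backup_gb / 1000:.1f} TB")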

Networking

  • Internal Network: 10Gbps+ (40Gbps recommended for multi-GPU)
  • Load Balancer: Layer 7 (NGINX, HAProxy) for request distribution
  • External Access: VPN or private link (if not fully air-gapped)
  • Latency: Sub-1ms between GPU nodes (InfiniBand preferred for clusters)

Backup & Disaster Recovery

  • Model Backups: Daily snapshots of model weights and configurations
  • Data Backups: Hourly vector DB snapshots, continuous for critical data
  • Recovery Time Objective (RTO): 4 hours for production systems
  • Recovery Point Objective (RPO): 1 hour (maximum acceptable data loss)
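
A monitoring hook for the RPO above can be as simple as checking the age of the newest snapshot. The sketch below assumes a local snapshot directory and file-naming convention purely for illustration; wire the alert into whatever scheduler or monitoring stack you already run.

# Hedged sketch: verify that the most recent vector DB snapshot is within the
# 1-hour RPO stated above. The snapshot directory layout and naming are
# assumptions for illustration only.

import time
from pathlib import Path

RPO_SECONDS = 60 * 60                      # 1 hour maximum acceptable data loss
SNAPSHOT_DIR = Path("/backups/vector-db")  # hypothetical path

def latest_snapshot_age_seconds(snapshot_dir: Path) -> float:
    snapshots = sorted(snapshot_dir.glob("*.snapshot"),
                       key=lambda p: p.stat().st_mtime)
    if not snapshots:
        raise RuntimeError("no snapshots found")
    return time.time() - snapshots[-1].stat().st_mtime

if __name__ == "__main__":
    age = latest_snapshot_age_seconds(SNAPSHOT_DIR)
    if age > RPO_SECONDS:
        print(f"ALERT: latest snapshot is {age / 3600:.1f}h old, exceeds 1h RPO")
    else:
        print(f"OK: latest snapshot is {age / 60:.0f} min old")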

Deployment Patterns

Pattern 1: Single-Node Development

┌─────────────────────────────────────┐
│        Development Workstation       │
│                                     │
│  ┌─────────────────────────────┐   │
│  │   LLM Inference Server      │   │
│  │   (vLLM / TGI)              │   │
│  │   GPU: 1x A100 40GB         │   │
│  └─────────────────────────────┘   │
│                                     │
│  ┌─────────────────────────────┐   │
│  │   Vector Database           │   │
│  │   (Qdrant / Milvus)         │   │
│  └─────────────────────────────┘   │
│                                     │
│  ┌─────────────────────────────┐   │
│  │   Application Layer         │   │
│  │   (API, UI)                 │   │
│  └─────────────────────────────┘   │
└─────────────────────────────────────┘
              

Use Case: Development, testing, proof-of-concept

Capacity: 1-10 concurrent users

Cost: $15K - $30K (GPU + server)

Limitations: No redundancy, single point of failure, limited scale
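
For development work against this pattern, the application layer typically talks to the inference server over HTTP. The sketch below assumes the server exposes an OpenAI-compatible completions endpoint (both vLLM and TGI can serve one); the port, path, and model name are placeholders.

# Minimal client sketch for the single-node pattern, assuming an
# OpenAI-compatible /v1/completions endpoint on localhost (vLLM's default
# port is 8000). Model id and sampling settings are placeholders.

import requests

API_URL = "http://localhost:8000/v1/completions"

def complete(prompt: str, max_tokens: int = 256) -> str:
    payload = {
        "model": "meta-llama/Llama-2-13b-hf",  # hypothetical 13B checkpoint
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.2,
    }
    response = requests.post(API_URL, json=payload, timeout=60)
    response.raise_for_status()
    return response.json()["choices"][0]["text"]

if __name__ == "__main__":
    print(complete("List three risks of running LLMs without network isolation."))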

Pattern 2: Multi-Node Production

┌──────────────────────────────────────────────────┐
│               Load Balancer (HA)                  │
└────────┬────────────────────────┬─────────────────┘
         │                        │
    ┌────▼─────┐            ┌────▼─────┐
    │  Node 1  │            │  Node 2  │
    │ (Primary)│            │ (Standby)│
    │          │            │          │
    │ LLM      │            │ LLM      │
    │ Server   │            │ Server   │
    │ 2x A100  │            │ 2x A100  │
    └────┬─────┘            └────┬─────┘
         │                        │
         └────────┬───────────────┘
                  │
         ┌────────▼────────┐
         │  Vector DB      │
         │  (Replicated)   │
         │  3-node cluster │
         └─────────────────┘
              

Use Case: Production deployment, 100-500 users

Capacity: 50-100 concurrent users

Cost: $100K - $200K

Features: Active-standby failover, replicated data, 99.9% uptime
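
In this pattern the load balancer owns failover, but a client-side or sidecar health probe is a useful belt-and-braces check. The sketch below assumes each node exposes a /health endpoint (vLLM's OpenAI-compatible server and TGI both do); hostnames and ports are placeholders.

# Hedged sketch of an active-standby probe for Pattern 2. Hostnames, ports,
# and the /health path are assumptions; in production the load balancer's own
# health checks (see Supporting Components below) perform this role.

import requests

PRIMARY = "http://llm-node-1:8000"   # hypothetical hostnames
STANDBY = "http://llm-node-2:8000"

def pick_endpoint() -> str:
    for endpoint in (PRIMARY, STANDBY):
        try:
            if requests.get(f"{endpoint}/health", timeout=2).ok:
                return endpoint
        except requests.RequestException:
            continue
    raise RuntimeError("no healthy inference node available")

if __name__ == "__main__":
    print("routing requests to", pick_endpoint())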

Pattern 3: High-Availability Cluster

                 ┌───────────────┐
                 │  API Gateway  │
                 │  (NGINX HA)   │
                 └───────┬───────┘
                         │
        ┌────────────────┼────────────────┐
        │                │                │
   ┌────▼────┐      ┌────▼────┐      ┌────▼────┐
   │ Node 1  │      │ Node 2  │      │ Node 3  │
   │ 4x H100 │      │ 4x H100 │      │ 4x H100 │
   │ Active  │      │ Active  │      │ Active  │
   └────┬────┘      └────┬────┘      └────┬────┘
        │                │                │
        └────────────────┼────────────────┘
                         │
                 ┌───────▼───────┐
                 │  Vector DB    │
                 │  5-node       │
                 │  HA Cluster   │
                 └───────────────┘
              

Use Case: Enterprise production, 1000+ users

Capacity: 200+ concurrent users

Cost: $400K - $800K

Features: Active-active load balancing, auto-scaling, 99.99% uptime

Pattern 4: Geo-Distributed

Use Case: Multi-region deployment for latency optimization or data residency requirements. Each region has a full cluster (Pattern 3), with asynchronous replication between regions.

Capacity: 500+ concurrent users per region

Cost: $1M+ (multiple clusters)

Features: Sub-50ms regional latency, disaster recovery across regions

Model Serving Stack

Inference Servers

Server                           Best For                      Pros                               Cons
vLLM                             High throughput, production   Fastest inference, PagedAttention  Limited model support
TensorRT-LLM                     NVIDIA GPUs, max performance  Optimized kernels, low latency     Complex setup, NVIDIA-only
Text Generation Inference (TGI)  Hugging Face models           Easy deployment, broad support     Slower than vLLM
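
As a quick feel for the developer experience, the sketch below queries a locally deployed TGI instance through the huggingface_hub client; the endpoint URL and generation parameters are placeholders, and an equivalent request against vLLM would go to its OpenAI-compatible endpoint (see the Pattern 1 sketch above).

# Sketch of calling a locally deployed TGI server via huggingface_hub.
# The endpoint URL and generation parameters are placeholders.

from huggingface_hub import InferenceClient  # pip install huggingface_hub

client = InferenceClient(model="http://localhost:8080")  # local TGI endpoint

response = client.text_generation(
    "Draft a change-management note for a model upgrade.",
    max_new_tokens=128,
    do_sample=True,
    temperature=0.2,
)
print(response)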

Supporting Components

  • Load Balancer: NGINX or HAProxy for request distribution
    - Round-robin or least-connections algorithm
    - Health checks every 10 seconds
    - Automatic failover in <1 second
  • Caching Layer: Redis for frequently requested completions (see the sketch after this list)
    - Cache hit rate: 30-40% typical
    - Reduces GPU load by 25-35%
    - TTL: 1 hour for dynamic content
  • Monitoring & Observability:
    - Prometheus for metrics (GPU utilization, request latency)
    - Grafana for visualization
    - Alerts: GPU utilization > 90%, latency > 5s, error rate > 1%
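
A minimal version of the caching layer might look like the following; the Redis connection details are placeholders and the generate callable stands in for whatever inference client the application layer uses. Hashing the prompt keeps keys bounded in size, and setex applies the 1-hour TTL noted above.

# Sketch of the Redis caching layer described above: completions are cached
# under a hash of the prompt with a 1-hour TTL. Connection details and the
# generate() callable are placeholders.

import hashlib
import redis  # pip install redis

cache = redis.Redis(host="localhost", port=6379, db=0)
TTL_SECONDS = 60 * 60  # 1 hour, as noted above

def cached_completion(prompt: str, generate) -> str:
    key = "completion:" + hashlib.sha256(prompt.encode()).hexdigest()
    hit = cache.get(key)
    if hit is not None:
        return hit.decode()            # cache hit: skip the GPU entirely
    text = generate(prompt)            # cache miss: call the inference server
    cache.setex(key, TTL_SECONDS, text)
    return text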

Scaling Considerations

Horizontal vs Vertical Scaling

Scaling Type            When to Use                      Example
Vertical (Scale Up)     Model too large for current GPU  Replace 2x A100 40GB with 2x H100 80GB
Horizontal (Scale Out)  Concurrent users increasing      Add 2 more inference nodes to cluster

Model Sharding

For models larger than single-GPU memory (e.g., 70B+ parameter models):

  • Tensor Parallelism: Split model layers across GPUs (requires fast interconnect)
  • Pipeline Parallelism: Different layers on different GPUs (sequential processing)
  • Recommended: 4-8 GPUs for 70B models, 8-16 GPUs for 175B+ models
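
As one concrete example of tensor parallelism, vLLM's offline API takes a tensor_parallel_size argument that shards the weights across the GPUs on a node; the model id below is a placeholder, and the same setting is available when launching vLLM as an HTTP server.

# Sketch of tensor parallelism with vLLM's offline API: the 70B checkpoint is
# sharded across 4 GPUs on one node. The model id is a placeholder.

from vllm import LLM, SamplingParams  # pip install vllm

llm = LLM(
    model="meta-llama/Llama-2-70b-hf",  # hypothetical 70B checkpoint
    tensor_parallel_size=4,             # split weights across 4 GPUs
)

params = SamplingParams(max_tokens=128, temperature=0.2)
outputs = llm.generate(["Explain the difference between RTO and RPO."], params)
print(outputs[0].outputs[0].text)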

Request Batching

Request batching dramatically improves throughput by processing multiple requests simultaneously:

  • Batch Size: 8-32 requests typical (depends on GPU memory)
  • Throughput Gain: 3-5x vs sequential processing
  • Latency Impact: +50-200ms per request (acceptable for most use cases)
  • Dynamic Batching: vLLM automatically batches incoming requests
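
From the client side, dynamic batching only requires that requests arrive concurrently; the sketch below fires 16 requests in parallel at an OpenAI-compatible endpoint (URL and model id are placeholders) and lets the server schedule them into shared batches.

# Client-side sketch: send requests concurrently so the server's dynamic
# batching (e.g. vLLM's continuous batching) groups them into shared batches
# instead of processing them one at a time. URL and model id are placeholders.

from concurrent.futures import ThreadPoolExecutor
import requests

API_URL = "http://localhost:8000/v1/completions"

def complete(prompt: str) -> str:
    payload = {"model": "meta-llama/Llama-2-70b-hf", "prompt": prompt, "max_tokens": 64}
    r = requests.post(API_URL, json=payload, timeout=120)
    r.raise_for_status()
    return r.json()["choices"][0]["text"]

prompts = [f"Summarize section {i} of the operations runbook." for i in range(16)]

with ThreadPoolExecutor(max_workers=16) as pool:   # 16 in-flight requests
    results = list(pool.map(complete, prompts))

print(f"received {len(results)} completions")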

Performance Benchmarks

Measured on the Llama 2 70B model with vLLM, batch size 16, context length 2048 tokens.

Configuration  Throughput (req/sec)  Latency (P95)  Cost/1M Tokens
2x A100 80GB   12                    850ms          $8
4x A100 80GB   28                    720ms          $4
4x H100 80GB   45                    580ms          $2.50
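
The cost column follows from token throughput and a fully loaded hourly node cost (hardware amortization, power, support); the benchmark's exact cost basis isn't stated here, so the figures in the sketch below are hypothetical placeholders that only illustrate the arithmetic.

# Worked formula behind a cost-per-1M-tokens figure. Both inputs are
# hypothetical placeholders; the table's own amortization and utilization
# assumptions are not stated in this document.

def cost_per_million_tokens(tokens_per_second: float, hourly_cost_usd: float) -> float:
    tokens_per_hour = tokens_per_second * 3600
    return hourly_cost_usd / tokens_per_hour * 1_000_000

# Example: 3,000 tokens/sec sustained on a node with a fully loaded cost of
# $25/hour works out to roughly $2.31 per 1M tokens.
print(f"${cost_per_million_tokens(3_000, 25.0):.2f} per 1M tokens")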
