On-Premises LLM Architecture Blueprint
Version 1.0 | Last Updated: December 2024
Abstract
This document provides a reference architecture for deploying large language models (LLMs) in air-gapped enterprise environments. It covers infrastructure requirements (compute, storage, networking), deployment patterns (single-node development through geo-distributed production), model serving stack components, and scaling considerations. Target audience: infrastructure engineers, ML engineers, and technical architects planning on-premises LLM deployments.
Infrastructure Requirements
Compute: GPU Specifications
| GPU Configuration | VRAM | Recommended Use | Max Model Size |
|---|---|---|---|
| NVIDIA A100 (40GB) | 40 GB | Development, small models | 7B-13B parameters |
| NVIDIA A100 (80GB) | 80 GB | Production, medium models | 13B-34B parameters |
| NVIDIA H100 (80GB) | 80 GB | Production, large models | 34B-70B parameters (70B requires quantization) |
| 4x A100 (80GB) | 320 GB total | High-availability cluster | 70B+ parameters |
Storage Requirements
- Model Storage: 500GB - 2TB NVMe SSD for model weights and checkpoints (see the sizing sketch after this list)
- Vector Database: 1TB - 10TB, depending on document corpus size
- Application Data: 100GB - 500GB (user sessions, queries, logs)
- Backup Storage: At least 2x primary storage capacity (for disaster recovery)
- IOPS: Minimum 50,000 for model-serving volumes
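As a rough sizing aid, weight storage scales with parameter count and numeric precision. The sketch below estimates on-disk weight size for a few model sizes; the precision table and overhead factor are assumptions for illustration, not vendor figures.

```python
# Rough storage sizing for model weights (sketch; the overhead factor is an assumption
# covering tokenizer files, configs, and checkpoint metadata).

BYTES_PER_PARAM = {"fp16": 2, "int8": 1, "int4": 0.5}

def weight_storage_gb(params_billions: float, precision: str = "fp16",
                      overhead: float = 1.2) -> float:
    """Estimate on-disk size of model weights in GB."""
    total_bytes = params_billions * 1e9 * BYTES_PER_PARAM[precision] * overhead
    return total_bytes / 1e9

if __name__ == "__main__":
    for size in (7, 13, 34, 70):
        print(f"{size}B: fp16 ≈ {weight_storage_gb(size):.0f} GB, "
              f"int4 ≈ {weight_storage_gb(size, 'int4'):.0f} GB")
```

At FP16 a single 70B checkpoint alone lands in the 140-170 GB range, which is why the 500GB - 2TB guidance above leaves room for multiple model versions and checkpoints.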
Networking
- Internal Network: 10Gbps+ (40Gbps+ recommended for multi-node GPU clusters)
- Load Balancer: Layer 7 (NGINX, HAProxy) for request distribution
- External Access: VPN or private link (if not fully air-gapped)
- Latency: Sub-1ms between GPU nodes (InfiniBand preferred for clusters)
Backup & Disaster Recovery
- Model Backups: Daily snapshots of model weights and configurations
- Data Backups: Hourly vector DB snapshots, with continuous replication for critical data (see the snapshot sketch after this list)
- Recovery Time Objective (RTO): 4 hours for production systems
- Recovery Point Objective (RPO): 1 hour (maximum acceptable data loss)
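A minimal sketch of the hourly vector DB snapshot job, assuming Qdrant as the vector database (one of the options in the deployment patterns below); the endpoint and collection name are placeholders. In practice the job would run from cron or a scheduler, and the resulting snapshot would be copied off the node to backup storage.

```python
# Snapshot job sketch, assuming a Qdrant vector DB and its Python client.
# Endpoint and collection name are placeholders for illustration.
from qdrant_client import QdrantClient

QDRANT_URL = "http://localhost:6333"   # assumption: Qdrant reachable on the node
COLLECTION = "enterprise_docs"         # hypothetical collection name

def snapshot_collection() -> str:
    client = QdrantClient(url=QDRANT_URL)
    # Qdrant creates the snapshot server-side; copying it to backup storage
    # (to satisfy the RPO above) is a separate step.
    snapshot = client.create_snapshot(collection_name=COLLECTION)
    return snapshot.name

if __name__ == "__main__":
    print(f"created snapshot: {snapshot_collection()}")
```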
Deployment Patterns
Pattern 1: Single-Node Development
┌─────────────────────────────────────┐
│       Development Workstation       │
│                                     │
│   ┌─────────────────────────────┐   │
│   │  LLM Inference Server       │   │
│   │  (vLLM / TGI)               │   │
│   │  GPU: 1x A100 40GB          │   │
│   └─────────────────────────────┘   │
│                                     │
│   ┌─────────────────────────────┐   │
│   │  Vector Database            │   │
│   │  (Qdrant / Milvus)          │   │
│   └─────────────────────────────┘   │
│                                     │
│   ┌─────────────────────────────┐   │
│   │  Application Layer          │   │
│   │  (API, UI)                  │   │
│   └─────────────────────────────┘   │
└─────────────────────────────────────┘
Use Case: Development, testing, proof-of-concept
Capacity: 1-10 concurrent users
Cost: $15K - $30K (GPU + server)
Limitations: No redundancy, single point of failure, limited scale
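A minimal sketch of the application layer in this pattern, assuming vLLM serves an OpenAI-compatible API on port 8000 and Qdrant holds a pre-embedded document collection; the model id, collection name, and the externally supplied query embedding are placeholder assumptions.

```python
# Application-layer sketch for the single-node pattern: retrieve context from
# Qdrant, then ask the locally served model via vLLM's OpenAI-compatible API.
# Endpoints, model id, and collection name are assumptions for illustration.
from openai import OpenAI
from qdrant_client import QdrantClient

llm = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")  # vLLM server
vectors = QdrantClient(url="http://localhost:6333")

MODEL = "meta-llama/Llama-2-13b-chat-hf"   # hypothetical model id
COLLECTION = "enterprise_docs"             # hypothetical collection name

def answer(question: str, query_vector: list[float]) -> str:
    # query_vector comes from whatever embedding model indexed the corpus.
    hits = vectors.search(collection_name=COLLECTION,
                          query_vector=query_vector, limit=3)
    context = "\n\n".join((hit.payload or {}).get("text", "") for hit in hits)
    response = llm.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```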
Pattern 2: Multi-Node Production
┌─────────────────────────────────┐
│       Load Balancer (HA)        │
└──────┬───────────────────┬──────┘
       │                   │
┌──────▼─────┐      ┌──────▼─────┐
│   Node 1   │      │   Node 2   │
│ (Primary)  │      │ (Standby)  │
│            │      │            │
│    LLM     │      │    LLM     │
│   Server   │      │   Server   │
│  2x A100   │      │  2x A100   │
└──────┬─────┘      └──────┬─────┘
       │                   │
       └─────────┬─────────┘
                 │
        ┌────────▼────────┐
        │    Vector DB    │
        │  (Replicated)   │
        │ 3-node cluster  │
        └─────────────────┘
Use Case: Production deployment, 100-500 users
Capacity: 50-100 concurrent users
Cost: $100K - $200K
Features: Active-standby failover, replicated data, 99.9% uptime
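A minimal sketch of the failover logic the load balancer applies to this active-standby pair, assuming each inference server exposes a /health endpoint (both vLLM and TGI do); node URLs are placeholders, and in production NGINX or HAProxy health checks would perform this role rather than a script.

```python
# Active-standby failover probe sketch. Node URLs are placeholders; /health is
# the liveness endpoint exposed by vLLM's and TGI's HTTP servers.
import time
import requests

NODES = ["http://node1:8000", "http://node2:8000"]  # primary first, standby second

def healthy(url: str) -> bool:
    try:
        return requests.get(f"{url}/health", timeout=2).status_code == 200
    except requests.RequestException:
        return False

def pick_active() -> str | None:
    """Return the first healthy node, preferring the primary."""
    for node in NODES:
        if healthy(node):
            return node
    return None

if __name__ == "__main__":
    while True:
        print("routing to:", pick_active() or "NO HEALTHY NODE")
        time.sleep(10)  # mirrors the 10-second health-check cadence described below
```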
Pattern 3: High-Availability Cluster
              ┌───────────────┐
              │  API Gateway  │
              │  (NGINX HA)   │
              └───────┬───────┘
                      │
     ┌────────────────┼────────────────┐
     │                │                │
┌────▼────┐      ┌────▼────┐      ┌────▼────┐
│ Node 1  │      │ Node 2  │      │ Node 3  │
│ 4x H100 │      │ 4x H100 │      │ 4x H100 │
│ Active  │      │ Active  │      │ Active  │
└────┬────┘      └────┬────┘      └────┬────┘
     │                │                │
     └────────────────┼────────────────┘
                      │
              ┌───────▼───────┐
              │   Vector DB   │
              │    5-node     │
              │  HA Cluster   │
              └───────────────┘
Use Case: Enterprise production, 1000+ users
Capacity: 200+ concurrent users
Cost: $400K - $800K
Features: Active-active load balancing, auto-scaling, 99.99% uptime
Pattern 4: Geo-Distributed
Use Case: Multi-region deployment for latency optimization or data residency requirements. Each region has a full cluster (Pattern 3), with asynchronous replication between regions.
Capacity: 500+ concurrent users per region
Cost: $1M+ (multiple clusters)
Features: Sub-50ms regional latency, disaster recovery across regions
Model Serving Stack
Inference Servers
| Server | Best For | Pros | Cons |
|---|---|---|---|
| vLLM | High throughput, production | Fastest inference, PagedAttention | Limited model support |
| TensorRT-LLM | NVIDIA GPUs, max performance | Optimized kernels, low latency | Complex setup, NVIDIA-only |
| Text Generation Inference (TGI) | Hugging Face models | Easy deployment, broad support | Slower than vLLM |
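Before standing up the full HTTP serving stack, it can help to validate a model with vLLM's offline Python API; the sketch below is a minimal example, with an illustrative model id and sampling settings.

```python
# Minimal offline-inference sketch using vLLM as a library (model id is illustrative).
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")        # loads weights onto the GPU
params = SamplingParams(temperature=0.7, max_tokens=128)

prompts = ["Summarize the benefits of on-premises LLM deployment in two sentences."]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```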
Supporting Components
- Load Balancer: NGINX or HAProxy for request distribution
  - Round-robin or least-connections algorithm
  - Health checks every 10 seconds
  - Automatic failover in <1 second
- Caching Layer: Redis for frequently requested completions (see the sketch after this list)
  - Cache hit rate: 30-40% typical
  - Reduces GPU load by 25-35%
  - TTL: 1 hour for dynamic content
- Monitoring & Observability:
  - Prometheus for metrics (GPU utilization, request latency)
  - Grafana for visualization
  - Alerts: GPU utilization > 90%, latency > 5s, error rate > 1%
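A minimal sketch of the Redis completion cache described above, assuming exact-match caching keyed on a hash of the prompt; the generate_fn callable stands in for a call to the inference server, and the 1-hour TTL follows the guidance above.

```python
# Completion-cache sketch (exact-match on a prompt hash), assuming a local Redis.
import hashlib
from typing import Callable

import redis

cache = redis.Redis(host="localhost", port=6379, decode_responses=True)
TTL_SECONDS = 3600  # 1-hour TTL, as recommended above

def cached_completion(prompt: str, generate_fn: Callable[[str], str]) -> str:
    """Return a cached completion if present; otherwise generate and cache it."""
    key = "completion:" + hashlib.sha256(prompt.encode()).hexdigest()
    hit = cache.get(key)
    if hit is not None:
        return hit
    completion = generate_fn(prompt)   # e.g., a call to the inference endpoint
    cache.setex(key, TTL_SECONDS, completion)
    return completion
```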
Scaling Considerations
Horizontal vs Vertical Scaling
| Scaling Type | When to Use | Example |
|---|---|---|
| Vertical (Scale Up) | Model too large for current GPU | Replace 2x A100 40GB with 2x H100 80GB |
| Horizontal (Scale Out) | Concurrent users increasing | Add 2 more inference nodes to cluster |
Model Sharding
For models larger than single-GPU memory (e.g., 70B+ parameter models):
- Tensor Parallelism: Split each layer's weight matrices across GPUs (requires fast interconnect; see the sketch after this list)
- Pipeline Parallelism: Different layers on different GPUs (sequential processing)
- Recommended: 4-8 GPUs for 70B models, 8-16 GPUs for 175B+ models
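As a concrete example, vLLM exposes tensor parallelism as a single engine argument; the sketch below shards a hypothetical 70B checkpoint across 4 GPUs on one node (the model id is illustrative, and the GPUs need a fast interconnect such as NVLink).

```python
# Tensor-parallel sketch with vLLM: shard one model across 4 GPUs on a single node.
# Model id is illustrative; 4x 80GB GPUs hold a 70B model in fp16 with headroom.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-70b-chat-hf",  # hypothetical 70B checkpoint
    tensor_parallel_size=4,                  # split each layer's weights across 4 GPUs
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=32))[0].outputs[0].text)
```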
Request Batching
Request batching dramatically improves throughput by processing multiple requests simultaneously (see the client sketch after this list):
- Batch Size: 8-32 requests typical (depends on GPU memory)
- Throughput Gain: 3-5x vs sequential processing
- Latency Impact: +50-200ms per request (acceptable for most use cases)
- Dynamic Batching: vLLM automatically batches incoming requests
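Clients do not manage batching themselves; they only need to keep enough requests in flight. The sketch below fires a burst of concurrent requests at an OpenAI-compatible endpoint (URL and model id are assumptions) and lets the serving engine batch them.

```python
# Concurrent-request sketch: the serving engine (e.g. vLLM) batches the in-flight
# requests automatically. Endpoint and model id are placeholders.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="unused")
MODEL = "meta-llama/Llama-2-13b-chat-hf"   # hypothetical model id

async def one_request(prompt: str) -> str:
    resp = await client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=64,
    )
    return resp.choices[0].message.content

async def main() -> None:
    prompts = [f"Question {i}: what is retrieval-augmented generation?" for i in range(16)]
    answers = await asyncio.gather(*(one_request(p) for p in prompts))
    print(f"received {len(answers)} completions")

if __name__ == "__main__":
    asyncio.run(main())
```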
Performance Benchmarks
Measured on Llama 2 70B model with vLLM, batch size 16, context length 2048 tokens.
| Configuration | Throughput (req/sec) | Latency (P95) | Cost per 1M Tokens |
|---|---|---|---|
| 2x A100 80GB | 12 | 850ms | $8 |
| 4x A100 80GB | 28 | 720ms | $4 |
| 4x H100 80GB | 45 | 580ms | $2.50 |
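The cost column depends heavily on amortization assumptions. The sketch below shows one way to derive a cost-per-million-tokens figure from fully loaded hardware cost, utilization, and measured token throughput; every input value is illustrative and not the basis of the table above.

```python
# Illustrative cost-per-million-tokens calculation; all inputs are assumptions,
# not the figures behind the benchmark table above.
def cost_per_million_tokens(loaded_cost_usd: float,
                            amortization_years: float,
                            utilization: float,
                            tokens_per_second: float) -> float:
    """Amortized cost divided by tokens produced over the same period, per 1M tokens."""
    seconds = amortization_years * 365 * 24 * 3600 * utilization
    total_tokens = tokens_per_second * seconds
    return loaded_cost_usd / total_tokens * 1_000_000

if __name__ == "__main__":
    # Hypothetical inputs: $400K fully loaded (hardware, power, operations) over
    # 3 years at 50% utilization, sustaining ~3,000 output tokens/sec under batching.
    print(f"${cost_per_million_tokens(400_000, 3, 0.5, 3_000):.2f} per 1M tokens")
```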