On-Premises LLM Architecture Blueprint
Version 1.0 | Last Updated: December 2024
Abstract
This document provides a reference architecture for deploying large language models (LLMs) in air-gapped enterprise environments. It covers infrastructure requirements (compute, storage, networking), deployment patterns (single-node development through geo-distributed production), model serving stack components, and scaling considerations. Target audience: infrastructure engineers, ML engineers, and technical architects planning on-premises LLM deployments.
Infrastructure Requirements
Compute: GPU Specifications
| GPU Configuration | VRAM | Recommended Use | Max Model Size |
|---|---|---|---|
| NVIDIA A100 (40GB) | 40 GB | Development, small models | 7B-13B parameters |
| NVIDIA A100 (80GB) | 80 GB | Production, medium models | 13B-34B parameters |
| NVIDIA H100 (80GB) | 80 GB | Production, large models | 34B-70B parameters (70B requires quantization) |
| 4x A100 (80GB) | 320 GB total | High-availability cluster | 70B+ parameters |
Storage Requirements
- Model Storage: 500GB - 2TB NVMe SSD for model weights and checkpoints (see the sizing sketch after this list)
- Vector Database: 1TB - 10TB, depending on document corpus size
- Application Data: 100GB - 500GB (user sessions, queries, logs)
- Backup Storage: At least 2x primary storage capacity (for disaster recovery)
- IOPS: Minimum 50,000 for model-serving volumes
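As a rough sizing aid, weight storage scales with parameter count and numeric precision. The sketch below estimates on-disk weight size for a few model sizes; the precision table and overhead factor are assumptions for illustration, not vendor figures.

```python
# Rough storage sizing for model weights (sketch; the overhead factor is an assumption
# covering tokenizer files, configs, and checkpoint metadata).

BYTES_PER_PARAM = {"fp16": 2, "int8": 1, "int4": 0.5}

def weight_storage_gb(params_billions: float, precision: str = "fp16",
                      overhead: float = 1.2) -> float:
    """Estimate on-disk size of model weights in GB."""
    total_bytes = params_billions * 1e9 * BYTES_PER_PARAM[precision] * overhead
    return total_bytes / 1e9

if __name__ == "__main__":
    for size in (7, 13, 34, 70):
        print(f"{size}B: fp16 ≈ {weight_storage_gb(size):.0f} GB, "
              f"int4 ≈ {weight_storage_gb(size, 'int4'):.0f} GB")
```

At FP16 a single 70B checkpoint alone lands in the 140-170 GB range, which is why the 500GB - 2TB guidance above leaves room for multiple model versions and checkpoints.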
Networking
- Internal Network: 10Gbps+ (40Gbps+ recommended for multi-node GPU clusters)
- Load Balancer: Layer 7 (NGINX, HAProxy) for request distribution
- External Access: VPN or private link (if not fully air-gapped)
- Latency: Sub-1ms between GPU nodes (InfiniBand preferred for clusters)
Backup & Disaster Recovery
- Model Backups: Daily snapshots of model weights and configurations
- Data Backups: Hourly vector DB snapshots, with continuous replication for critical data (see the snapshot sketch after this list)
- Recovery Time Objective (RTO): 4 hours for production systems
- Recovery Point Objective (RPO): 1 hour (maximum acceptable data loss)
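A minimal sketch of the hourly vector DB snapshot job, assuming Qdrant as the vector database (one of the options in the deployment patterns below); the endpoint and collection name are placeholders. In practice the job would run from cron or a scheduler, and the resulting snapshot would be copied off the node to backup storage.

```python
# Snapshot job sketch, assuming a Qdrant vector DB and its Python client.
# Endpoint and collection name are placeholders for illustration.
from qdrant_client import QdrantClient

QDRANT_URL = "http://localhost:6333"   # assumption: Qdrant reachable on the node
COLLECTION = "enterprise_docs"         # hypothetical collection name

def snapshot_collection() -> str:
    client = QdrantClient(url=QDRANT_URL)
    # Qdrant creates the snapshot server-side; copying it to backup storage
    # (to satisfy the RPO above) is a separate step.
    snapshot = client.create_snapshot(collection_name=COLLECTION)
    return snapshot.name

if __name__ == "__main__":
    print(f"created snapshot: {snapshot_collection()}")
```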
Deployment Patterns
Pattern 1: Single-Node Development
┌─────────────────────────────────────┐
│       Development Workstation       │
│                                     │
│   ┌─────────────────────────────┐   │
│   │  LLM Inference Server       │   │
│   │  (vLLM / TGI)               │   │
│   │  GPU: 1x A100 40GB          │   │
│   └─────────────────────────────┘   │
│                                     │
│   ┌─────────────────────────────┐   │
│   │  Vector Database            │   │
│   │  (Qdrant / Milvus)          │   │
│   └─────────────────────────────┘   │
│                                     │
│   ┌─────────────────────────────┐   │
│   │  Application Layer          │   │
│   │  (API, UI)                  │   │
│   └─────────────────────────────┘   │
└─────────────────────────────────────┘
Use Case: Development, testing, proof-of-concept
Capacity: 1-10 concurrent users
Cost: $15K - $30K (GPU + server)
Limitations: No redundancy, single point of failure, limited scale
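A minimal sketch of the application layer in this pattern, assuming vLLM serves an OpenAI-compatible API on port 8000 and Qdrant holds a pre-embedded document collection; the model id, collection name, and the externally supplied query embedding are placeholder assumptions.

```python
# Application-layer sketch for the single-node pattern: retrieve context from
# Qdrant, then ask the locally served model via vLLM's OpenAI-compatible API.
# Endpoints, model id, and collection name are assumptions for illustration.
from openai import OpenAI
from qdrant_client import QdrantClient

llm = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")  # vLLM server
vectors = QdrantClient(url="http://localhost:6333")

MODEL = "meta-llama/Llama-2-13b-chat-hf"   # hypothetical model id
COLLECTION = "enterprise_docs"             # hypothetical collection name

def answer(question: str, query_vector: list[float]) -> str:
    # query_vector comes from whatever embedding model indexed the corpus.
    hits = vectors.search(collection_name=COLLECTION,
                          query_vector=query_vector, limit=3)
    context = "\n\n".join((hit.payload or {}).get("text", "") for hit in hits)
    response = llm.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```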
Pattern 2: Multi-Node Production
┌─────────────────────────────────┐
│       Load Balancer (HA)        │
└──────┬───────────────────┬──────┘
       │                   │
┌──────▼─────┐      ┌──────▼─────┐
│   Node 1   │      │   Node 2   │
│ (Primary)  │      │ (Standby)  │
│            │      │            │
│    LLM     │      │    LLM     │
│   Server   │      │   Server   │
│  2x A100   │      │  2x A100   │
└──────┬─────┘      └──────┬─────┘
       │                   │
       └─────────┬─────────┘
                 │
        ┌────────▼────────┐
        │    Vector DB    │
        │  (Replicated)   │
        │ 3-node cluster  │
        └─────────────────┘
Use Case: Production deployment, 100-500 users
Capacity: 50-100 concurrent users
Cost: $100K - $200K
Features: Active-standby failover, replicated data, 99.9% uptime
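A minimal sketch of the failover logic the load balancer applies to this active-standby pair, assuming each inference server exposes a /health endpoint (both vLLM and TGI do); node URLs are placeholders, and in production NGINX or HAProxy health checks would perform this role rather than a script.

```python
# Active-standby failover probe sketch. Node URLs are placeholders; /health is
# the liveness endpoint exposed by vLLM's and TGI's HTTP servers.
import time
import requests

NODES = ["http://node1:8000", "http://node2:8000"]  # primary first, standby second

def healthy(url: str) -> bool:
    try:
        return requests.get(f"{url}/health", timeout=2).status_code == 200
    except requests.RequestException:
        return False

def pick_active() -> str | None:
    """Return the first healthy node, preferring the primary."""
    for node in NODES:
        if healthy(node):
            return node
    return None

if __name__ == "__main__":
    while True:
        print("routing to:", pick_active() or "NO HEALTHY NODE")
        time.sleep(10)  # mirrors the 10-second health-check cadence described below
```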
Pattern 3: High-Availability Cluster
              ┌───────────────┐
              │  API Gateway  │
              │  (NGINX HA)   │
              └───────┬───────┘
                      │
     ┌────────────────┼────────────────┐
     │                │                │
┌────▼────┐      ┌────▼────┐      ┌────▼────┐
│ Node 1  │      │ Node 2  │      │ Node 3  │
│ 4x H100 │      │ 4x H100 │      │ 4x H100 │
│ Active  │      │ Active  │      │ Active  │
└────┬────┘      └────┬────┘      └────┬────┘
     │                │                │
     └────────────────┼────────────────┘
                      │
              ┌───────▼───────┐
              │   Vector DB   │
              │    5-node     │
              │  HA Cluster   │
              └───────────────┘
Use Case: Enterprise production, 1000+ users
Capacity: 200+ concurrent users
Cost: $400K - $800K
Features: Active-active load balancing, auto-scaling, 99.99% uptime
Pattern 4: Geo-Distributed
Use Case: Multi-region deployment for latency optimization or data residency requirements. Each region has a full cluster (Pattern 3), with asynchronous replication between regions.
Capacity: 500+ concurrent users per region
Cost: $1M+ (multiple clusters)
Features: Sub-50ms regional latency, disaster recovery across regions
Model Serving Stack
Inference Servers
| Server | Best For | Pros | Cons |
|---|---|---|---|
| vLLM | High throughput, production | Fastest inference, PagedAttention | Limited model support |
| TensorRT-LLM | NVIDIA GPUs, max performance | Optimized kernels, low latency | Complex setup, NVIDIA-only |
| Text Generation Inference (TGI) | Hugging Face models | Easy deployment, broad support | Slower than vLLM |
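Before standing up the full HTTP serving stack, it can help to validate a model with vLLM's offline Python API; the sketch below is a minimal example, with an illustrative model id and sampling settings.

```python
# Minimal offline-inference sketch using vLLM as a library (model id is illustrative).
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")        # loads weights onto the GPU
params = SamplingParams(temperature=0.7, max_tokens=128)

prompts = ["Summarize the benefits of on-premises LLM deployment in two sentences."]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```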
Supporting Components
- Load Balancer: NGINX or HAProxy for request distribution
  - Round-robin or least-connections algorithm
  - Health checks every 10 seconds
  - Automatic failover in <1 second
- Caching Layer: Redis for frequently requested completions (see the sketch after this list)
  - Cache hit rate: 30-40% typical
  - Reduces GPU load by 25-35%
  - TTL: 1 hour for dynamic content
- Monitoring & Observability:
  - Prometheus for metrics (GPU utilization, request latency)
  - Grafana for visualization
  - Alerts: GPU utilization > 90%, latency > 5s, error rate > 1%
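A minimal sketch of the Redis completion cache described above, assuming exact-match caching keyed on a hash of the prompt; the generate_fn callable stands in for a call to the inference server, and the 1-hour TTL follows the guidance above.

```python
# Completion-cache sketch (exact-match on a prompt hash), assuming a local Redis.
import hashlib
from typing import Callable

import redis

cache = redis.Redis(host="localhost", port=6379, decode_responses=True)
TTL_SECONDS = 3600  # 1-hour TTL, as recommended above

def cached_completion(prompt: str, generate_fn: Callable[[str], str]) -> str:
    """Return a cached completion if present; otherwise generate and cache it."""
    key = "completion:" + hashlib.sha256(prompt.encode()).hexdigest()
    hit = cache.get(key)
    if hit is not None:
        return hit
    completion = generate_fn(prompt)   # e.g., a call to the inference endpoint
    cache.setex(key, TTL_SECONDS, completion)
    return completion
```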
Scaling Considerations
Horizontal vs Vertical Scaling
| Scaling Type | When to Use | Example |
|---|---|---|
| Vertical (Scale Up) | Model too large for current GPU | Replace 2x A100 40GB with 2x H100 80GB |
| Horizontal (Scale Out) | Concurrent users increasing | Add 2 more inference nodes to cluster |
Model Sharding
For models larger than single-GPU memory (e.g., 70B+ parameter models):
- Tensor Parallelism: Split each layer's weight matrices across GPUs (requires fast interconnect; see the sketch after this list)
- Pipeline Parallelism: Different layers on different GPUs (sequential processing)
- Recommended: 4-8 GPUs for 70B models, 8-16 GPUs for 175B+ models
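As a concrete example, vLLM exposes tensor parallelism as a single engine argument; the sketch below shards a hypothetical 70B checkpoint across 4 GPUs on one node (the model id is illustrative, and the GPUs need a fast interconnect such as NVLink).

```python
# Tensor-parallel sketch with vLLM: shard one model across 4 GPUs on a single node.
# Model id is illustrative; 4x 80GB GPUs hold a 70B model in fp16 with headroom.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-70b-chat-hf",  # hypothetical 70B checkpoint
    tensor_parallel_size=4,                  # split each layer's weights across 4 GPUs
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=32))[0].outputs[0].text)
```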
Request Batching
Request batching dramatically improves throughput by processing multiple requests simultaneously (see the client sketch after this list):
- Batch Size: 8-32 requests typical (depends on GPU memory)
- Throughput Gain: 3-5x vs sequential processing
- Latency Impact: +50-200ms per request (acceptable for most use cases)
- Dynamic Batching: vLLM automatically batches incoming requests
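Clients do not manage batching themselves; they only need to keep enough requests in flight. The sketch below fires a burst of concurrent requests at an OpenAI-compatible endpoint (URL and model id are assumptions) and lets the serving engine batch them.

```python
# Concurrent-request sketch: the serving engine (e.g. vLLM) batches the in-flight
# requests automatically. Endpoint and model id are placeholders.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="unused")
MODEL = "meta-llama/Llama-2-13b-chat-hf"   # hypothetical model id

async def one_request(prompt: str) -> str:
    resp = await client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=64,
    )
    return resp.choices[0].message.content

async def main() -> None:
    prompts = [f"Question {i}: what is retrieval-augmented generation?" for i in range(16)]
    answers = await asyncio.gather(*(one_request(p) for p in prompts))
    print(f"received {len(answers)} completions")

if __name__ == "__main__":
    asyncio.run(main())
```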
Performance Benchmarks
Measured on Llama 2 70B model with vLLM, batch size 16, context length 2048 tokens.
| Configuration | Throughput (req/sec) | Latency (P95) | Cost per 1M Tokens |
|---|---|---|---|
| 2x A100 80GB | 12 | 850ms | $8 |
| 4x A100 80GB | 28 | 720ms | $4 |
| 4x H100 80GB | 45 | 580ms | $2.50 |
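The cost column depends heavily on amortization assumptions. The sketch below shows one way to derive a cost-per-million-tokens figure from fully loaded hardware cost, utilization, and measured token throughput; every input value is illustrative and not the basis of the table above.

```python
# Illustrative cost-per-million-tokens calculation; all inputs are assumptions,
# not the figures behind the benchmark table above.
def cost_per_million_tokens(loaded_cost_usd: float,
                            amortization_years: float,
                            utilization: float,
                            tokens_per_second: float) -> float:
    """Amortized cost divided by tokens produced over the same period, per 1M tokens."""
    seconds = amortization_years * 365 * 24 * 3600 * utilization
    total_tokens = tokens_per_second * seconds
    return loaded_cost_usd / total_tokens * 1_000_000

if __name__ == "__main__":
    # Hypothetical inputs: $400K fully loaded (hardware, power, operations) over
    # 3 years at 50% utilization, sustaining ~3,000 output tokens/sec under batching.
    print(f"${cost_per_million_tokens(400_000, 3, 0.5, 3_000):.2f} per 1M tokens")
```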