Scaling LLM Inference: From Single GPU to Planet-Scale Serving

Training an LLM is a one-time cost. Inference is a perpetual tax on every interaction. A model that costs $2M to train might cost $20M per year to serve. Yet inference optimization remains an afterthought for most teams — they train a model, wrap it in a Flask API, and wonder why their P99 latency is 12 seconds and their GPU utilization is 15%.

After architecting inference systems serving 50M+ requests/day across production LLMs ranging from 7B to 405B parameters, I’ve learned that the difference between a naive deployment and an optimized one is 10–50x in throughput and 3–5x in cost. This article covers every technique that matters, from KV-cache mechanics to a novel Adaptive Inference Orchestrator (AIO) that routes requests to the cheapest model capable of handling them.

  - 50x    throughput gain (optimized vs naive)
  - 200ms  P99 TTFT (70B model)
  - 65%    cost reduction (AIO)
  - 95%    GPU utilization target

1. Anatomy of LLM Inference

LLM inference has two fundamentally different phases, each with distinct computational characteristics:

  LLM INFERENCE: TWO-PHASE PIPELINE
  ====================================

  Phase 1: PREFILL (Prompt Processing)          Phase 2: DECODE (Token Generation)
  ===================================          ====================================

  Input: "Explain quantum computing"           Input: Previously generated tokens
                                                      + KV-Cache
  +----------------------------------+
  | Tokenize: [Explain, quantum,     |         +----------------------------------+
  |            computing]             |         | Generate token 1: "Quantum"      |
  +----------------------------------+         +----------------------------------+
           |                                            |
           v                                            v
  +----------------------------------+         +----------------------------------+
  | Process ALL tokens in PARALLEL   |         | Process ONE token at a time      |
  |                                  |         | (autoregressive)                 |
  | - Self-attention over full       |         |                                  |
  |   prompt sequence                |         | - Attend to all previous tokens  |
  | - Compute-bound (matrix-matrix)  |         |   via KV-Cache lookup            |
  | - High arithmetic intensity      |         | - Memory-bound (matrix-vector)   |
  | - GPU compute saturated          |         | - Low arithmetic intensity       |
  +----------------------------------+         | - GPU memory bandwidth limited   |
           |                                   +----------------------------------+
           v                                            |
  +----------------------------------+                  v
  | Output: KV-Cache for all prompt  |         +----------------------------------+
  | tokens (stored for decode phase) |         | Output: one token per forward    |
  +----------------------------------+         | pass, appended to KV-Cache       |
                                               +----------------------------------+
                                                        |
  Characteristics:                                      | (repeat until EOS or max_tokens)
  - Latency: Time-to-First-Token (TTFT)                |
  - Scales with prompt length                  Characteristics:
  - Parallelizable                             - Latency: Inter-Token Latency (ITL)
  - Typically 10-100ms                         - Scales with output length
                                               - Sequential (not parallelizable)
                                               - Typically 20-50ms per token
                

The KV-Cache: Memory Hog

During decode, each new token must attend to all previous tokens. Without caching, this means recomputing every token’s key and value projections at every step — O(n²) computation. The KV-cache stores these projections, reducing decode to O(n) per step, but at significant memory cost:

# KV-Cache memory calculation
def kv_cache_memory_gb(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    max_seq_len: int,
    batch_size: int,
    dtype_bytes: int = 2  # FP16
) -> float:
    """Calculate KV-cache memory in GB.

    For Llama 3.1 70B:
      - num_layers = 80, num_kv_heads = 8 (GQA), head_dim = 128
      - max_seq_len = 8192, batch_size = 32

    KV-cache = 2 * 80 * 8 * 128 * 8192 * 32 * 2 bytes
             = 85,899,345,920 bytes
             = 80 GB  (!!!)
    """
    total_bytes = (
        2 *                  # K and V
        num_layers *
        num_kv_heads *
        head_dim *
        max_seq_len *
        batch_size *
        dtype_bytes
    )
    return total_bytes / (1024 ** 3)

# Llama 3.1 70B with batch=32, seq=8192
mem = kv_cache_memory_gb(80, 8, 128, 8192, 32)
print(f"KV-Cache: {mem:.1f} GB")  # 80.0 GB
# Model weights (FP16): ~140 GB
# Total: ~220 GB across GPUs — KV-cache is ~36% of total!

The 70/30 Rule

In production LLM serving, roughly 70% of GPU memory goes to model weights and 30% to KV-cache. But KV-cache scales with batch size and sequence length, so at high load, it can exceed model weight memory. Every optimization in this article ultimately addresses this memory pressure.
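To see how this memory pressure caps concurrency, here is a back-of-envelope helper. It is a sketch with illustrative numbers, not a measurement from any specific deployment:

```python
def max_concurrent_seqs(
    gpu_mem_gb: float,
    weight_mem_gb: float,
    kv_bytes_per_token: int,
    avg_seq_len: int,
    mem_utilization: float = 0.90,
) -> int:
    """Rough ceiling on concurrent sequences once weights are resident.

    kv_bytes_per_token = 2 (K,V) * num_layers * num_kv_heads * head_dim * dtype_bytes
    """
    usable_gb = gpu_mem_gb * mem_utilization - weight_mem_gb
    if usable_gb <= 0:
        return 0
    kv_per_seq_gb = kv_bytes_per_token * avg_seq_len / (1024 ** 3)
    return int(usable_gb / kv_per_seq_gb)

# Llama 3.1 70B on 4x 80GB GPUs: 320 GB total, ~140 GB of FP16 weights.
# kv_bytes_per_token = 2 * 80 * 8 * 128 * 2 = 327,680 bytes (~320 KB/token)
print(max_concurrent_seqs(320, 140, 2 * 80 * 8 * 128 * 2, 8192))  # 59
```

At 8K average context the whole 4-GPU node supports only ~59 concurrent sequences, which is exactly why the KV-cache optimizations below matter.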

2. KV-Cache Optimization

PagedAttention (vLLM)

The breakthrough insight from Kwon et al. (2023): treat KV-cache like virtual memory. Instead of pre-allocating a contiguous block for each sequence’s maximum length, PagedAttention divides the KV-cache into fixed-size pages (blocks) and allocates them on-demand using a page table.

  TRADITIONAL KV-CACHE                    PAGEDATTENTION (vLLM)
  ====================                    =====================

  Sequence A (len=1024, max=4096)         Sequence A (len=1024)
  +====+====+====+====+                   Page Table A:
  |Used|Used|Used|Used|                   [Block 0] -> Physical Block 7
  |256 |256 |256 |256 |                   [Block 1] -> Physical Block 2
  +----+----+----+----+                   [Block 2] -> Physical Block 15
  |    WASTED: 75%    |                   [Block 3] -> Physical Block 9
  |    (3072 tokens)  |                   (4 blocks * 256 tokens = 1024, no waste)
  +-------------------+
                                          Physical Memory Pool:
  Sequence B (len=512, max=4096)          +---+---+---+---+---+---+---+---+
  +====+====+----+----+                   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
  |Used|Used|    |    |                   | B | B |A1 | - |B2 | - | - |A0 |
  |256 |256 |    |    |                   +---+---+---+---+---+---+---+---+
  +----+----+----+----+                   | 8 | 9 |10 |11 |12 |13 |14 |15 |
  |    WASTED: 87.5%  |                   | - |A3 | - | - | - | - | - |A2 |
  |    (3584 tokens)  |                   +---+---+---+---+---+---+---+---+
  +-------------------+
                                          Benefits:
  Problem:                                - Near-zero internal fragmentation
  - ~60-80% memory wasted                 - Blocks allocated on demand
  - Max batch size severely limited       - Blocks freed immediately on completion
  - Can't serve variable-length well      - 2-4x higher batch size = 2-4x throughput
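The page-table mechanics above can be sketched in a few lines. This is a toy allocator (names like `BlockAllocator` and `append_token` are illustrative); vLLM's real block manager additionally handles prefix sharing and copy-on-write for beam search:

```python
class BlockAllocator:
    """Toy PagedAttention-style page table: logical blocks -> physical blocks."""

    def __init__(self, num_physical_blocks: int, block_size: int = 256):
        self.block_size = block_size
        self.free_blocks = list(range(num_physical_blocks))
        self.page_tables = {}  # seq_id -> list of physical block ids

    def append_token(self, seq_id: str, seq_len: int):
        """Allocate a new physical block only when the sequence crosses a block boundary."""
        table = self.page_tables.setdefault(seq_id, [])
        blocks_needed = (seq_len + self.block_size - 1) // self.block_size
        while len(table) < blocks_needed:
            if not self.free_blocks:
                raise MemoryError("KV-cache pool exhausted; preempt a sequence")
            table.append(self.free_blocks.pop())

    def free(self, seq_id: str):
        """Return a finished sequence's blocks to the pool immediately."""
        self.free_blocks.extend(self.page_tables.pop(seq_id, []))

alloc = BlockAllocator(num_physical_blocks=16)
alloc.append_token("A", seq_len=1024)    # 4 blocks, allocated on demand
print(len(alloc.page_tables["A"]))       # 4 -- no reservation out to max_len
alloc.free("A")
print(len(alloc.free_blocks))            # 16 -- reclaimed immediately
```

The key property is visible in the two prints: memory is claimed lazily as sequences grow and returned the moment they finish, so no block sits reserved for a maximum length that is never reached.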
                

Grouped-Query Attention (GQA)

Multi-Head Attention (MHA) uses separate K/V heads per query head (e.g., 32 Q heads, 32 K heads, 32 V heads). GQA groups multiple query heads to share K/V heads:

  Attention Type   Q Heads   K/V Heads   KV-Cache Size (relative)   Quality Impact
  ---------------  --------  ----------  -------------------------  -------------------
  MHA (GPT-3)      96        96          1.0x (baseline)            Best
  GQA (Llama 3)    32        8           0.25x                      ~Same
  MQA (Falcon)     64        1           0.016x                     Slight degradation
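The cache saving falls out of a single repeat: the engine stores only the K/V heads and expands them to match the query heads at attention time. A minimal NumPy sketch (`gqa_expand_kv` is an illustrative name):

```python
import numpy as np

def gqa_expand_kv(k: np.ndarray, num_q_heads: int) -> np.ndarray:
    """Share each stored K/V head across a group of consecutive query heads.

    k: (num_kv_heads, seq_len, head_dim). The cache stays num_kv_heads wide;
    only the view handed to attention is expanded.
    """
    num_kv_heads = k.shape[0]
    group_size = num_q_heads // num_kv_heads   # e.g. 32 // 8 = 4 for Llama 3
    return np.repeat(k, group_size, axis=0)    # (num_q_heads, seq_len, head_dim)

k_cache = np.zeros((8, 8192, 128), dtype=np.float16)   # what we store: 8 heads
k_full = gqa_expand_kv(k_cache, num_q_heads=32)        # what attention sees: 32
print(k_full.shape)                     # (32, 8192, 128)
print(k_full.nbytes / k_cache.nbytes)   # 4.0 -- cache is 4x smaller than MHA
```

In a real kernel the expansion is fused (no materialized copy), but the memory accounting is the same: the 0.25x row in the table is just 8/32.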

3. Batching: The Throughput Multiplier

The single biggest throughput lever is batching strategy. A naive static-batch approach wastes enormous GPU cycles; continuous batching can improve throughput by 10–23x.

  STATIC BATCHING (Naive)                CONTINUOUS BATCHING (vLLM/TGI)
  =======================                ==============================

  Time --->                              Time --->

  Batch 1:                               Iteration pool (dynamic):
  Req A: [====PREFILL====][DDDDDDDDD]   Req A: [==PF==][DDDDDDDDDD]
  Req B: [====PREFILL====][DDD]------    Req B: [==PF==][DDD]
  Req C: [====PREFILL====][DDDDDDD]--   Req C:              [=PF=][DDDDDDD]
                                ^        Req D:         [PF][DDDDD]
                                |        Req E:                   [PF][DDDD]
            Wait for LONGEST    |
            sequence to finish  |        Key: New requests JOIN mid-batch
            before accepting    |              Completed requests LEAVE immediately
            new requests        |              GPU always fully utilized
                                |
  Batch 2 (delayed):                     Throughput: up to 23x vs static
  Req D: [====PREFILL====][DDDDD]       Latency: P50 reduced by 5-8x
  Req E: [====PREFILL====][DDDD]-
  Req F: [====PREFILL====][DDDDDDDDD]

  GPU util: ~15-30%                      GPU util: ~85-95%
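The join/leave behavior on the right-hand side can be simulated with a toy iteration-level scheduler. This is illustrative only; real engines schedule per forward pass under a token budget rather than a fixed slot count:

```python
from collections import deque

def continuous_batching_sim(arrivals, max_batch: int = 4):
    """Toy iteration-level scheduler: a request joins the running batch the
    moment a slot frees up instead of waiting for the whole batch to drain.

    arrivals: list of (request_id, tokens_to_generate). Returns completion order.
    """
    queue = deque(arrivals)
    active = {}        # request_id -> tokens remaining
    completed = []
    while queue or active:
        # Admit new requests into free slots (mid-batch, no barrier)
        while queue and len(active) < max_batch:
            rid, n = queue.popleft()
            active[rid] = n
        # One decode iteration: every active request emits one token
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]        # leaves immediately, slot freed
                completed.append(rid)
    return completed

print(continuous_batching_sim([("A", 9), ("B", 3), ("C", 7), ("D", 5), ("E", 2)]))
# ['B', 'D', 'E', 'C', 'A'] -- E starts as soon as B finishes, not after batch 1
```

Under static batching, E would have waited for A (the longest request) to drain; here it completes before A has even finished.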
                

Chunked Prefill (Sarathi)

A key challenge: prefill is compute-bound, decode is memory-bound. When a new request arrives and triggers prefill, it can stall all decode operations in the batch. Sarathi (Agrawal et al., 2023) solves this by chunking the prefill into smaller pieces and interleaving them with decode iterations:

class ChunkedPrefillScheduler:
    """Sarathi-style chunked prefill interleaved with decode."""

    def __init__(self, chunk_size: int = 512, max_batch_tokens: int = 8192):
        self.chunk_size = chunk_size
        self.max_batch_tokens = max_batch_tokens
        self.prefill_queue = []    # Requests awaiting prefill
        self.decode_pool = []      # Requests in decode phase
        self.pending_chunks = {}   # Partial prefills

    def schedule_iteration(self):
        """Schedule one forward pass mixing prefill chunks + decodes."""
        batch = []
        token_budget = self.max_batch_tokens

        # 1. Always include active decode requests (1 token each)
        for req in self.decode_pool:
            if token_budget > 0:
                batch.append(("decode", req, 1))
                token_budget -= 1

        # 2. Fill remaining budget with prefill chunks
        while self.prefill_queue and token_budget >= self.chunk_size:
            req = self.prefill_queue[0]
            remaining = req.prompt_len - self.pending_chunks.get(req.id, 0)

            chunk = min(self.chunk_size, remaining, token_budget)
            batch.append(("prefill_chunk", req, chunk))
            token_budget -= chunk

            self.pending_chunks[req.id] = self.pending_chunks.get(req.id, 0) + chunk

            # If prefill complete, move to decode pool
            if self.pending_chunks[req.id] >= req.prompt_len:
                self.prefill_queue.pop(0)
                self.decode_pool.append(req)
                del self.pending_chunks[req.id]

        return batch

    def forward_pass(self, batch):
        """Execute mixed prefill-chunk + decode batch."""
        # All tokens in the batch are processed in a single forward pass
        # Prefill chunks build the KV-cache incrementally
        # Decode tokens generate the next output token
        # This eliminates prefill-induced stalls
        pass

4. Speculative Decoding: Breaking the Autoregressive Bottleneck

Autoregressive decode generates one token per forward pass through the full model. For a 70B model, each forward pass takes ~30ms. Generating 500 tokens = 15 seconds. Speculative decoding (Leviathan et al., 2022) breaks this bottleneck by using a small draft model to generate candidate tokens cheaply, then verifying them in parallel with the target model.

  SPECULATIVE DECODING PIPELINE
  ===============================

  Draft Model (1B, ~2ms/token)          Target Model (70B, ~30ms/token)
  ==============================         ================================

  Step 1: Draft generates K=5 tokens     Step 2: Target verifies ALL 5 in parallel

  "The capital of France"                "The capital of France"
        |                                      |
        v                                      v
  [is] -> [Paris] -> [,] -> [a] -> [city]    [is] [Paris] [,] [which] [is]
                                                ^     ^     ^     X
                                              ACCEPT ACCEPT ACCEPT REJECT
                                              (matches draft)     |
                                                                  v
                                                            Use target's token: "which"
                                                            Discard "a", "city"

  Result: 3 tokens accepted + 1 corrected token from target = 4 tokens in ~40ms
  Without speculation: 4 tokens * 30ms = 120ms
  Speedup: ~3x

  Acceptance Rate Formula:
  P(accept token i) = min(1, p_target(x_i) / p_draft(x_i))

  Expected tokens per step = (1 - alpha^(K+1)) / (1 - alpha), alpha = avg acceptance rate
  For alpha=0.8, K=5: ~3.7 tokens per verification step
                
import torch

class SpeculativeDecoder:
    """Speculative decoding with draft + target model."""

    def __init__(self, draft_model, target_model, tokenizer, num_speculative: int = 5):
        self.draft = draft_model
        self.target = target_model
        self.tokenizer = tokenizer
        self.K = num_speculative

    @torch.no_grad()
    def generate(self, prompt_ids: torch.Tensor, max_new_tokens: int = 256) -> torch.Tensor:
        """prompt_ids has shape (1, prompt_len); returns (1, prompt_len + generated)."""
        generated = prompt_ids.clone()
        tokens_generated = 0

        while tokens_generated < max_new_tokens:
            # Phase 1: Draft model generates K candidate tokens
            draft_tokens, draft_probs = self._draft_generate(generated, self.K)

            # Phase 2: Target model scores ALL candidates in ONE forward pass
            target_probs = self._target_verify(generated, draft_tokens)

            # Phase 3: Accept/reject using modified rejection sampling
            accepted, bonus_token = self._speculative_sample(
                draft_tokens, draft_probs, target_probs
            )

            # Append accepted tokens + bonus token, preserving the (1, seq) shape
            new_ids = torch.tensor([accepted + [bonus_token]], dtype=generated.dtype)
            generated = torch.cat([generated, new_ids], dim=-1)
            tokens_generated += len(accepted) + 1

            if bonus_token == self.tokenizer.eos_token_id:
                break

        return generated

    def _draft_generate(self, context, k):
        """Generate k tokens with the draft model, storing probabilities."""
        tokens = []
        probs = []
        current = context

        for _ in range(k):
            logits = self.draft(current).logits[:, -1, :]
            p = torch.softmax(logits, dim=-1)
            token = torch.multinomial(p, 1)
            tokens.append(token.item())
            probs.append(p[0, token.item()].item())
            current = torch.cat([current, token], dim=-1)

        return torch.tensor(tokens), torch.tensor(probs)

    def _target_verify(self, context, draft_tokens):
        """Verify all draft tokens in a single target model forward pass."""
        # Concatenate context + draft tokens
        full_seq = torch.cat([context, draft_tokens.unsqueeze(0)], dim=-1)
        logits = self.target(full_seq).logits

        # Extract probabilities at positions corresponding to draft tokens
        start = context.shape[-1] - 1
        target_probs = []
        for i, token in enumerate(draft_tokens):
            p = torch.softmax(logits[0, start + i, :], dim=-1)
            target_probs.append(p[token].item())

        return torch.tensor(target_probs)

    def _speculative_sample(self, draft_tokens, draft_probs, target_probs):
        """Modified rejection sampling; returns (accepted token ids, bonus token id)."""
        accepted = []

        for i in range(len(draft_tokens)):
            # Accept with probability min(1, p_target / p_draft)
            acceptance_prob = min(1.0, target_probs[i].item() / max(draft_probs[i].item(), 1e-10))

            if torch.rand(1).item() < acceptance_prob:
                accepted.append(int(draft_tokens[i]))
            else:
                # On rejection, a real implementation samples the replacement from
                # the residual distribution max(0, p_target - p_draft), renormalized;
                # simplified here to reusing the draft token
                return accepted, int(draft_tokens[i])

        # All K tokens accepted; the bonus token should be sampled from the
        # target's distribution at position K+1 (simplified here)
        return accepted, int(draft_tokens[-1])

Draft Model Selection Strategy

The draft model must be fast (small enough for negligible latency) and similar (high acceptance rate). Best pairs: Llama-3.1-8B drafting for Llama-3.1-70B (acceptance rate ~75–85%); Phi-3-mini drafting for Mixtral-8x22B (~65–75%). The draft model should be from the same family or fine-tuned on the same data for highest acceptance rates.
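Whether a pairing pays off can be estimated from the acceptance-rate formula plus per-pass latencies. A quick cost model (latencies are assumed round numbers; it ignores sampling overhead, so treat the result as an upper bound):

```python
def speculative_speedup(alpha: float, k: int, t_draft_ms: float, t_target_ms: float) -> float:
    """Expected speedup per the analysis in Leviathan et al. (2022).

    Expected accepted tokens per verification step: (1 - alpha^(k+1)) / (1 - alpha).
    Wall time per step: k draft passes + 1 target pass (verification is a
    single parallel forward through the target).
    """
    expected_tokens = (1 - alpha ** (k + 1)) / (1 - alpha)
    time_per_step_ms = k * t_draft_ms + t_target_ms
    ms_per_token = time_per_step_ms / expected_tokens
    return t_target_ms / ms_per_token   # vs. one token per target pass

# 8B drafting for 70B: alpha ~0.8, ~2ms draft pass vs ~30ms target pass
for k in (3, 5, 8):
    print(k, round(speculative_speedup(0.8, k, 2.0, 30.0), 2))
```

Note the curve flattens quickly: pushing K past ~5 buys little, because each extra draft token is accepted with shrinking probability while its draft latency is paid unconditionally.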

5. Quantization for Inference

Quantization reduces weight precision from FP16 (2 bytes) to INT8 (1 byte) or INT4 (0.5 bytes), directly halving or quartering memory requirements and improving throughput via reduced memory bandwidth.

  Method           Precision    Memory Reduction  Quality Loss (MMLU)  Throughput Gain  Calibration Needed
  ---------------  -----------  ----------------  -------------------  ---------------  ----------------------
  FP16 (baseline)  16-bit       1.0x              0%                   1.0x             No
  SmoothQuant      W8A8         2.0x              <0.5%                1.5–1.8x         Yes (128 samples)
  GPTQ             W4A16        3.5–4.0x          0.5–2%               2.5–3.0x         Yes (calibration set)
  AWQ              W4A16        3.5–4.0x          0.3–1.5%             2.5–3.2x         Yes (activation-aware)
  FP8 (H100)       E4M3/E5M2    2.0x              <0.2%                1.8–2.0x         No
  GGUF Q4_K_M      Mixed 4-6b   3.0–3.5x          0.5–2%               2.0–2.5x (CPU)   No (weight-only)

For production GPU serving, AWQ INT4 is the current sweet spot — near-FP16 quality with 3–4x memory reduction. For H100 deployments, FP8 offers the best quality-to-throughput ratio with zero calibration overhead. For CPU/edge deployment, GGUF Q4_K_M via llama.cpp remains unmatched.
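A quick footprint check behind these numbers (weights only, with GB meaning 1e9 bytes; real quantized checkpoints add a few percent for group-wise scales and zero-points, which is why the table's reductions land at 3.5-4.0x rather than a clean 4x):

```python
def weight_memory_gb(num_params_b: float, bits_per_weight: float) -> float:
    """Weight footprint in GB (1e9 bytes), excluding scales/zero-points."""
    return num_params_b * bits_per_weight / 8

for name, bits in [("FP16", 16), ("INT8", 8), ("AWQ INT4", 4)]:
    print(f"70B {name}: {weight_memory_gb(70, bits):.0f} GB")
# 70B FP16: 140 GB   (needs multi-GPU)
# 70B INT8:  70 GB   (fits one 80GB GPU, barely)
# 70B INT4:  35 GB   (fits one GPU with room for KV-cache)
```

The practical consequence: INT4 turns a 70B model from a mandatory 2-4 GPU deployment into a single-GPU one, freeing the rest of the card for KV-cache and therefore batch size.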

6. Distributed Inference Patterns

  TENSOR PARALLELISM (TP)                    PIPELINE PARALLELISM (PP)
  =======================                    =========================

  Single layer split across GPUs:            Different layers on different GPUs:

  GPU 0        GPU 1        GPU 2            GPU 0: Layers 0-19
  +--------+  +--------+  +--------+        +------------------+
  | W[0:d] |  | W[d:2d]|  |W[2d:3d]|        | Layers 0-19      |
  | (1/3)  |  | (1/3)  |  | (1/3)  |        | Forward -> send  |
  +--------+  +--------+  +--------+        +--------+---------+
       \          |          /                        |
        \         |         /                         v
     [AllReduce across GPUs]                 GPU 1: Layers 20-39
        /         |         \                +------------------+
       v          v          v               | Layers 20-39     |
  Output concatenated/summed                 | Forward -> send  |
                                             +--------+---------+
  Pros: Low latency (all GPUs active)                 |
  Cons: AllReduce comm overhead                       v
  Best for: Single-node multi-GPU            GPU 2: Layers 40-59
                                             +------------------+
                                             | Layers 40-59     |
                                             | Forward -> send  |
                                             +--------+---------+
                                                      |
                                                      v
                                             GPU 3: Layers 60-79
                                             +------------------+
                                             | Layers 60-79     |
                                             | Output           |
                                             +------------------+

                                             Pros: No AllReduce, scales across nodes
                                             Cons: Pipeline bubbles, higher latency
                                             Best for: Multi-node, very large models
                
# Tensor Parallel inference with vLLM
from vllm import LLM, SamplingParams

# vLLM handles TP automatically
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    tensor_parallel_size=4,          # Split across 4 GPUs
    gpu_memory_utilization=0.90,     # Use 90% of GPU memory
    max_model_len=8192,
    quantization="awq",              # requires an AWQ-quantized checkpoint
    enable_chunked_prefill=True,     # Sarathi-style chunked prefill
    max_num_batched_tokens=8192,     # Token budget per iteration
)

# Batch inference
prompts = ["Explain quantum computing", "Summarize this contract: ..."]
params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512,
    min_tokens=10,
)

outputs = llm.generate(prompts, params)
for output in outputs:
    print(output.outputs[0].text)

7. Novel Framework: Adaptive Inference Orchestrator (AIO)

Here’s the production insight that most teams miss: not every request needs the same model. A simple factual lookup (“What is the capital of France?”) doesn’t need a 70B model. A complex multi-step reasoning task does. Routing every request to the largest model wastes 60–70% of compute budget.

The Adaptive Inference Orchestrator (AIO) classifies incoming requests by complexity and routes them to the cheapest model capable of handling them:

  ADAPTIVE INFERENCE ORCHESTRATOR (AIO)
  ======================================

  Incoming Request
       |
       v
  +---------------------------+
  | Complexity Classifier     |
  | (lightweight DistilBERT)  |
  | Trained on (query,        |
  |  difficulty_label) pairs  |
  |                           |
  | Features:                 |
  | - Query length            |
  | - Entity count            |
  | - Reasoning keywords      |
  | - Domain classification   |
  | - Historical accuracy     |
  |   per tier                |
  +------+------+------+------+
         |      |      |      |
    Tier 1  Tier 2  Tier 3  Tier 4
    Simple  Medium  Complex Expert
         |      |      |      |
         v      v      v      v
  +--------+ +--------+ +----------+ +-----------+
  | SLM    | | SLM    | | LLM      | | LLM       |
  | 3B     | | 8B     | | 70B      | | 70B +     |
  | INT4   | | INT4   | | AWQ INT4 | | Speculative|
  | (CPU)  | | (1 GPU)| | (4 GPU)  | | + CoT     |
  +--------+ +--------+ +----------+ +-----------+
  Cost/req:  Cost/req:  Cost/req:    Cost/req:
  $0.0001   $0.0005    $0.003       $0.01

  Traffic Split (typical enterprise):
  [====60%====][==20%==][==15%=][=5%=]

  Weighted avg cost: 60%*0.0001 + 20%*0.0005 + 15%*0.003 + 5%*0.01
                   = $0.00111 per request
  vs. All-70B:     = $0.003 per request
  Savings:         = ~63% cost reduction

  +---------------------------+
  | Quality Monitor           |
  | - Sample 5% of Tier 1/2  |
  |   responses, verify with  |
  |   Tier 4 model            |
  | - Auto-promote queries    |
  |   if quality drops        |
  | - Retrain classifier      |
  |   weekly on new data      |
  +---------------------------+
                
import asyncio
from dataclasses import dataclass
from enum import Enum
import numpy as np
from typing import Optional

class ComplexityTier(Enum):
    SIMPLE = 1      # Factual, lookup, simple Q&A
    MEDIUM = 2      # Moderate reasoning, summarization
    COMPLEX = 3     # Multi-step reasoning, analysis
    EXPERT = 4      # Chain-of-thought, complex generation

@dataclass
class InferenceEndpoint:
    tier: ComplexityTier
    model_name: str
    base_url: str
    cost_per_token: float
    avg_latency_ms: float
    max_tokens: int

class AdaptiveInferenceOrchestrator:
    """Routes requests to the cheapest capable model."""

    def __init__(self, classifier_model, endpoints: list[InferenceEndpoint]):
        self.classifier = classifier_model  # Fine-tuned DistilBERT
        self.endpoints = {e.tier: e for e in endpoints}
        self.quality_monitor = QualityMonitor()

    def classify_complexity(self, query: str) -> ComplexityTier:
        """Classify query complexity using lightweight model."""
        features = self._extract_features(query)
        logits = self.classifier.predict(features)
        tier = ComplexityTier(np.argmax(logits) + 1)
        return tier

    def _extract_features(self, query: str) -> dict:
        """Extract complexity signals from query."""
        return {
            "length": len(query.split()),
            "has_reasoning_keywords": any(
                kw in query.lower() for kw in
                ["analyze", "compare", "explain why", "step by step",
                 "evaluate", "synthesize", "what would happen if"]
            ),
            "entity_count": self._count_entities(query),
            "question_depth": self._estimate_depth(query),
            "domain_specificity": self._domain_score(query),
        }

    async def route_request(self, query: str, context: Optional[str] = None) -> dict:
        """Route request to appropriate tier and return response."""
        tier = self.classify_complexity(query)
        endpoint = self.endpoints[tier]

        # Execute inference
        response = await self._call_endpoint(endpoint, query, context)

        # Quality monitoring: sample 5% of low-tier responses as a background
        # task, so verification never adds latency to the user-facing response
        if tier.value <= 2 and np.random.random() < 0.05:
            asyncio.create_task(self.quality_monitor.schedule_verification(
                query=query,
                response=response,
                tier=tier,
                verification_endpoint=self.endpoints[ComplexityTier.EXPERT],
            ))

        return {
            "response": response,
            "tier": tier.name,
            "model": endpoint.model_name,
            "cost": self._estimate_cost(response, endpoint),
            "latency_ms": response.latency_ms,
        }

class QualityMonitor:
    """Monitors routing quality and auto-adjusts classifier."""

    def __init__(self, quality_threshold: float = 0.85):
        self.threshold = quality_threshold
        self.verification_log = []

    async def schedule_verification(self, query, response, tier, verification_endpoint):
        """Verify low-tier response against expert model."""
        expert_response = await self._call_endpoint(verification_endpoint, query)

        # Semantic similarity between low-tier and expert response
        similarity = self._compute_similarity(response.text, expert_response.text)

        self.verification_log.append({
            "query": query,
            "tier": tier,
            "similarity": similarity,
            "should_upgrade": similarity < self.threshold,
        })

        # If too many low-quality responses, retrain classifier
        recent = self.verification_log[-100:]
        upgrade_rate = sum(1 for v in recent if v["should_upgrade"]) / len(recent)

        if upgrade_rate > 0.15:  # >15% of sampled responses need upgrade
            self._trigger_classifier_retrain()

8. Production Serving Stack Comparison

  Feature               vLLM                  TensorRT-LLM        TGI (HF)            SGLang
  --------------------  --------------------  ------------------  ------------------  --------------------
  PagedAttention        Yes (native)          Yes                 Yes (via vLLM)      Yes
  Continuous Batching   Yes                   Yes                 Yes                 Yes
  Speculative Decoding  Yes                   Yes                 Limited             Yes
  Multi-LoRA            Yes (S-LoRA)          Yes                 Yes                 Yes
  Quantization          AWQ, GPTQ, FP8        All + custom        AWQ, GPTQ, BnB      AWQ, GPTQ, FP8
  Structured Output     Outlines              Limited             Yes                 Native (fast)
  Best For              General purpose,      NVIDIA-optimised,   HF ecosystem,       Complex prompts,
                        highest throughput    lowest latency      easy deployment     structured output

9. Monitoring: The Three Metrics That Matter

Production LLM serving requires three key metrics, each capturing a different user experience dimension:

  1. TTFT (Time to First Token): How long until the user sees the first response character. Dominated by prefill latency. Target: <500ms for interactive, <2s for batch.
  2. TPS (Tokens Per Second): Decode throughput per request. Determines how fast the response streams. Target: >30 TPS for interactive (readable speed), higher for batch.
  3. P99 End-to-End Latency: The tail latency that determines worst-case user experience. Target: <5x the P50. If your P99/P50 ratio exceeds 10x, you have a scheduling or memory contention problem.
import time
from dataclasses import dataclass, field
from collections import defaultdict

@dataclass
class InferenceMetrics:
    """Production inference metrics collector."""

    ttft_samples: list = field(default_factory=list)
    tps_samples: list = field(default_factory=list)
    e2e_samples: list = field(default_factory=list)
    tier_counts: dict = field(default_factory=lambda: defaultdict(int))

    def record_request(self, ttft_ms, total_tokens, total_ms, tier):
        self.ttft_samples.append(ttft_ms)
        tps = total_tokens / (total_ms / 1000) if total_ms > 0 else 0
        self.tps_samples.append(tps)
        self.e2e_samples.append(total_ms)
        self.tier_counts[tier] += 1

    def report(self):
        import numpy as np
        return {
            "ttft_p50_ms": np.percentile(self.ttft_samples, 50),
            "ttft_p99_ms": np.percentile(self.ttft_samples, 99),
            "tps_mean": np.mean(self.tps_samples),
            "tps_p50": np.percentile(self.tps_samples, 50),
            "e2e_p50_ms": np.percentile(self.e2e_samples, 50),
            "e2e_p99_ms": np.percentile(self.e2e_samples, 99),
            "p99_p50_ratio": (
                np.percentile(self.e2e_samples, 99) /
                max(np.percentile(self.e2e_samples, 50), 1)
            ),
            "tier_distribution": dict(self.tier_counts),
        }

Key Takeaways

  1. PagedAttention + Continuous Batching are table stakes. If you’re not using vLLM/TGI/TensorRT-LLM, you’re leaving 10–50x throughput on the table.
  2. Speculative decoding is free throughput for the right use cases — 2–5x speedup with zero quality loss.
  3. AWQ INT4 is the production quantization sweet spot. FP8 on H100s where available.
  4. Route requests by complexity (AIO pattern). Sending simple queries to large models wastes 60–70% of your budget.
  5. Measure TTFT, TPS, and P99 E2E. These three metrics tell you everything about user experience and system health.

The best inference optimization is the one you don’t need — route simple queries to small models, and only invoke the heavy machinery for tasks that truly require it.

References & Resources

Research Papers

  1. Kwon, W., et al. “Efficient Memory Management for Large Language Model Serving with PagedAttention” (arXiv:2309.06180, 2023)
  2. Leviathan, Y., et al. “Fast Inference from Transformers via Speculative Decoding” (arXiv:2211.17192, 2022)
  3. Shazeer, N. “Fast Transformer Decoding: One Write-Head is All You Need” (arXiv:1911.02150, 2019)
  4. Ainslie, J., et al. “GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints” (arXiv:2305.13245, 2023)
  5. Frantar, E., et al. “GPTQ: Accurate Post-Training Quantization for Generative Pre-Trained Transformers” (arXiv:2210.17323, 2022)
  6. Lin, J., et al. “AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration” (arXiv:2306.00978, 2023)
  7. Xiao, G., et al. “SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models” (arXiv:2211.10438, 2022)
  8. Agrawal, A., et al. “Sarathi: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills” (arXiv:2308.16369, 2023)
  9. Cai, T., et al. “Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads” (arXiv:2401.10774, 2024)
  10. Zheng, L., et al. “SGLang: Efficient Execution of Structured Language Model Programs” (arXiv:2312.07104, 2023)

Frameworks & Tools

  1. vLLM — High-throughput LLM serving engine
  2. TensorRT-LLM — NVIDIA’s optimized inference library
  3. Text Generation Inference (TGI) — Hugging Face serving solution
  4. SGLang — Structured generation language