GPU Memory Optimization for Large Language Model Inference

[Figure: GPU memory utilization diagram for LLM inference with KV cache visualization]

GPU memory — VRAM — is the scarcest resource in large language model inference. A 70-billion parameter model loaded in FP16 requires approximately 140 GB of VRAM just for the model weights, before accounting for the KV cache, activations, and serving framework overhead required to handle concurrent requests. That's more VRAM than fits on a single A100 80GB or H100 80GB GPU, making multi-GPU deployment necessary even for mid-sized models. The economic consequence is direct: VRAM constraints determine how many models you can fit per GPU, how many concurrent requests you can serve, and ultimately the cost per inference token.

Memory optimization for LLM inference is not a single technique but a stack of complementary techniques, each targeting a different component of the memory budget. Understanding how VRAM is allocated — and which optimization reduces which allocation — is the prerequisite for making good decisions about which techniques to apply to a given deployment scenario.

Anatomy of LLM VRAM Consumption

VRAM consumption in LLM inference has three main components: model weights, KV cache, and activation memory. Model weights are fixed for a given model and precision: a model with P parameters loaded in FP16 requires approximately 2P bytes (e.g., 70B parameters × 2 bytes = 140 GB). The KV cache stores the key and value tensors computed during the prefill phase for each token in each request's context, enabling the autoregressive decode phase to reuse prior computations rather than recomputing them from scratch. KV cache size scales with batch size, sequence length, number of attention heads, and head dimension — for long-context requests at high concurrency, the KV cache can easily exceed the model weights in VRAM consumption. Activation memory is temporary memory required during the forward pass and is typically the smallest component for inference (unlike training, where it is often dominant).
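The two dominant terms can be estimated directly from the model's shape. A minimal sketch (the batch and sequence numbers below are illustrative assumptions, as is the Llama-2-70B-style shape: 80 layers, 8 KV heads under grouped-query attention, head dimension 128):

```python
def weight_bytes(num_params: float, bytes_per_param: float = 2.0) -> float:
    """Model weight footprint: P parameters x bytes per parameter (2 for FP16)."""
    return num_params * bytes_per_param

def kv_cache_bytes(batch_size: int, seq_len: int, num_layers: int,
                   num_kv_heads: int, head_dim: int,
                   bytes_per_elem: float = 2.0) -> float:
    """KV cache: 2 tensors (K and V) per layer, per token, per KV head."""
    return (2 * num_layers * num_kv_heads * head_dim
            * bytes_per_elem * batch_size * seq_len)

GB = 1e9  # decimal GB, matching the figures in the text

weights = weight_bytes(70e9) / GB                          # 140.0 GB
kv = kv_cache_bytes(32, 4096, 80, 8, 128) / GB             # ~42.9 GB
```

At batch 32 with 4k-token contexts, the KV cache alone approaches 43 GB; double the context length or the concurrency and it overtakes the quantized weights, which is why the text treats it as the variable cost.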

The critical insight is that model weights are a fixed cost that determines your minimum VRAM floor, while KV cache is a variable cost that determines your maximum concurrent request capacity within the remaining VRAM. Optimization strategies that reduce model weight size (quantization) lower the floor; strategies that improve KV cache efficiency (paged attention, prefix caching) raise the effective ceiling for concurrent requests.

Quantization: Reducing the Weight Floor

Quantization reduces model weight precision from the training format (typically BF16 or FP16) to a lower-bit representation, reducing both memory footprint and, on hardware with appropriate integer compute units, inference latency. The main quantization formats for production LLM inference are:

  • FP16/BF16: the baseline, no quantization.
  • GPTQ and AWQ: weight-only INT4 quantization, reducing model size by ~4x with modest quality loss.
  • INT8: 8-bit integer for both weights and activations, using libraries like bitsandbytes or TensorRT-LLM.
  • FP8: 8-bit floating point, supported natively on H100 and newer hardware with minimal quality degradation.

Weight-only INT4 quantization (GPTQ, AWQ) is the most widely deployed technique for reducing the weight floor in production. A 70B model quantized to 4-bit requires approximately 35 GB of VRAM for weights — a 4x reduction that enables single-GPU deployment of models that previously required two or four GPUs. The quality tradeoff is model-dependent but typically manageable: on benchmarks, 4-bit quantized versions of large models often match or slightly exceed the quality of smaller unquantized models, making quantization a clear win for cost-quality tradeoff at scale.
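The weight floor for each format is a one-line calculation. A sketch (note that in practice GPTQ/AWQ group scales and zero-points add a few percent on top of the raw 4-bit figure):

```python
# Bits per parameter for the formats discussed in the text.
BITS_PER_PARAM = {"fp16": 16, "fp8": 8, "int8": 8, "int4": 4}

def quantized_weight_gb(num_params: float, fmt: str) -> float:
    """Weight floor in decimal GB for a given quantization format."""
    return num_params * BITS_PER_PARAM[fmt] / 8 / 1e9

floor_fp16 = quantized_weight_gb(70e9, "fp16")  # 140.0 GB: needs multi-GPU
floor_int4 = quantized_weight_gb(70e9, "int4")  # 35.0 GB: fits one 80 GB GPU
```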

Choose your quantization format based on your hardware and quality requirements. FP8 quantization on H100s provides the best quality-performance tradeoff if you have access to that hardware. GPTQ and AWQ are the right choices for A100s and consumer GPUs. Avoid INT8 weight-and-activation quantization unless you've carefully validated quality on your specific use case — it is more sensitive to outliers in model activations than weight-only quantization methods.

Paged Attention: Eliminating KV Cache Fragmentation

The original motivation for the vLLM serving system was a specific KV cache memory management problem: pre-allocating a contiguous memory block for each request's maximum sequence length leads to severe internal fragmentation when requests don't use their full allocated length, and to a "memory cliff" failure mode where the system runs out of memory suddenly when a few long-context requests are served concurrently. PagedAttention, introduced by the vLLM team, solves this by managing KV cache memory using virtual memory page tables — the same technique operating systems use to manage physical RAM.

In PagedAttention, KV cache is stored in fixed-size blocks (pages) that are allocated and freed dynamically as each request's context grows and completes. Physical memory blocks are mapped to logical sequence positions through a block table maintained per request. This eliminates internal fragmentation almost entirely and enables memory sharing between requests that share a common prefix — a technique called prefix caching or prompt caching. For applications where many requests share a common system prompt, prefix caching can reduce KV cache consumption by 30-70% and reduce time-to-first-token for cached prefixes to near-zero.
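The block-table mechanics can be illustrated with a toy allocator. This is a simplified sketch of the idea, not vLLM's actual implementation; in particular it omits copy-on-write for a partially filled last block when forked sequences diverge:

```python
class PagedKVCache:
    """Toy PagedAttention-style block manager: fixed-size pages, per-request
    block tables, and reference counting to share prefix blocks."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # free physical block ids
        self.ref = [0] * num_blocks          # ref counts enable prefix sharing
        self.tables = {}                     # request id -> physical block ids
        self.lengths = {}                    # request id -> tokens stored

    def add_tokens(self, req, n: int):
        """Grow a request's context; allocate a page only when the last is full."""
        table = self.tables.setdefault(req, [])
        length = self.lengths.get(req, 0) + n
        while len(table) * self.block_size < length:
            blk = self.free.pop()
            self.ref[blk] = 1
            table.append(blk)
        self.lengths[req] = length

    def fork(self, parent, child):
        """Prefix caching: a new request reuses the parent's physical blocks."""
        self.tables[child] = list(self.tables[parent])
        self.lengths[child] = self.lengths[parent]
        for blk in self.tables[child]:
            self.ref[blk] += 1

    def release(self, req):
        """Free a finished request's pages once no other request shares them."""
        for blk in self.tables.pop(req):
            self.ref[blk] -= 1
            if self.ref[blk] == 0:
                self.free.append(blk)
        self.lengths.pop(req)
```

Because pages are allocated on demand, a request that stops at 500 tokens never holds memory for an unused 8k-token budget, and two requests sharing a system prompt hold the prompt's pages once, not twice.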

Memory Pooling and Tensor Parallelism

For multi-GPU deployments, tensor parallelism splits model weights across GPUs, with each GPU holding a shard of each layer's parameters. Tensor parallelism reduces per-GPU VRAM consumption proportionally to the number of GPUs but introduces inter-GPU communication overhead that increases latency. At small batch sizes, tensor parallelism across too many GPUs can increase latency rather than decrease it — the communication overhead dominates the compute savings. The optimal number of GPUs for tensor parallelism depends on your target batch size and the interconnect bandwidth between GPUs (NVLink significantly outperforms PCIe for this workload).
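The per-GPU weight arithmetic makes the tradeoff concrete. A sketch (the 80 GB A100 figures are from the text; the "remaining for KV cache" framing ignores activation and framework overhead):

```python
def per_gpu_weight_gb(num_params: float, bytes_per_param: float,
                      tp_degree: int) -> float:
    """Each GPU holds 1/tp_degree of every layer under tensor parallelism."""
    return num_params * bytes_per_param / tp_degree / 1e9

# 70B in FP16: 2-way TP puts 70 GB of weights on each A100 80GB, leaving
# almost nothing for KV cache; 4-way TP leaves ~45 GB per GPU for cache,
# at the cost of an extra all-reduce across more GPUs every layer.
two_way = per_gpu_weight_gb(70e9, 2, 2)   # 70.0 GB per GPU
four_way = per_gpu_weight_gb(70e9, 2, 4)  # 35.0 GB per GPU
```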

Memory pooling at the serving framework level — maintaining a single GPU memory pool shared across all requests rather than allocating and freeing per-request — reduces allocation overhead and fragmentation. Frameworks like vLLM and TensorRT-LLM implement memory pooling internally; when using these frameworks, configure the memory pool size (typically as a fraction of available VRAM) carefully. Setting the fraction too high leaves insufficient headroom for CUDA context and framework overhead, risking out-of-memory failures at runtime; setting it too low limits concurrent request capacity unnecessarily. A starting point of 85-90% of available VRAM allocated to the serving framework — covering model weights plus the KV cache pool, with the remainder left for CUDA context and overhead — works well for most configurations.
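A back-of-envelope capacity check follows from the same budget, assuming the configured fraction covers weights plus KV cache (as vLLM's `gpu_memory_utilization` setting does). The per-request KV figure here is an illustrative estimate for a 4096-token context with a GQA 70B-class shape (80 layers, 8 KV heads, head dimension 128, FP16 cache):

```python
def max_concurrent_requests(total_vram_gb: float, pool_fraction: float,
                            weight_gb: float, kv_gb_per_request: float) -> int:
    """Requests that fit in the KV budget left after weights inside the pool."""
    kv_budget = total_vram_gb * pool_fraction - weight_gb
    return int(kv_budget // kv_gb_per_request)

# A100 80GB, INT4 70B model (~35 GB weights), 90% pool fraction,
# ~1.34 GB of KV cache per full 4096-token request:
n = max_concurrent_requests(80, 0.90, 35, 1.34)  # 27 concurrent requests
```

The same calculation with the FP16 weight floor yields zero: the model does not fit at all, which is the quantization-raises-the-ceiling point in numbers.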

Continuous Batching for Throughput

Traditional static batching groups a fixed set of requests into a batch, waits for all requests to complete, then starts the next batch. This is VRAM-inefficient: requests that finish early hold their KV cache allocation until the slowest request in the batch completes, blocking new requests from entering service. Continuous batching (also called iteration-level scheduling) inserts new requests into the batch at each decode step, allowing completed requests to be replaced immediately. This technique, implemented in vLLM, TGI, and TensorRT-LLM, typically improves GPU utilization by 2-3x over static batching at the same VRAM budget by eliminating idle time between batch completions.
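The scheduling difference can be shown with a toy simulation that counts decode iterations only — it ignores prefill cost, memory limits, and preemption, so the numbers illustrate the mechanism rather than predict real throughput:

```python
def static_batch_steps(lengths, batch_size):
    """Static batching: each batch runs until its slowest member finishes."""
    return sum(max(lengths[i:i + batch_size])
               for i in range(0, len(lengths), batch_size))

def continuous_batch_steps(lengths, batch_size):
    """Iteration-level scheduling: a finished slot is refilled the same step."""
    pending, running, steps = list(lengths), [], 0
    while pending or running:
        while pending and len(running) < batch_size:
            running.append(pending.pop(0))
        running = [r - 1 for r in running if r > 1]  # one decode step for all
        steps += 1
    return steps

# One 50-token request mixed with ten 5-token requests, batch size 2:
decode_lengths = [50] + [5] * 10
static_batch_steps(decode_lengths, 2)      # 75 steps
continuous_batch_steps(decode_lengths, 2)  # 50 steps: the short requests
                                           # stream through the second slot
```

Under static batching, every short request batched with the long one wastes its slot for up to 45 idle steps; continuous batching refills that slot immediately, which is where the utilization gain comes from.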

Key Takeaways

  • VRAM consumption has two dominant components: model weights (fixed floor, reduced by quantization) and KV cache (variable ceiling, improved by paged attention and prefix caching), with activation memory a smaller third term.
  • INT4 weight-only quantization (GPTQ, AWQ) provides ~4x weight size reduction with manageable quality tradeoffs — use it by default for models above 13B parameters on A100 hardware.
  • PagedAttention eliminates KV cache fragmentation and enables prefix caching; use vLLM or TensorRT-LLM rather than implementing your own serving loop to get these benefits automatically.
  • Continuous batching improves GPU utilization 2-3x over static batching; it should be the default scheduling strategy for all production LLM serving deployments.
  • Tensor parallelism reduces per-GPU VRAM but adds latency; profile at your target batch size before committing to a GPU count — diminishing returns appear sooner than expected on small batches.
  • Set serving memory pool to 85-90% of available VRAM; benchmark prefix cache hit rates and tune block size for your specific workload's prefix sharing patterns.

Conclusion

GPU memory optimization for LLM inference is a multi-layer discipline. The teams that run the most cost-efficient LLM inference at scale apply all the layers: quantization to reduce the weight floor, paged attention to efficiently manage the KV cache, prefix caching to exploit shared prefixes, and continuous batching to maximize utilization within the available memory budget. No single technique is sufficient, but combining them systematically can reduce the VRAM requirement — and the associated GPU cost — by 4-8x compared to naive FP16 deployment with static batching.