The Inference Engine Architecture War — What Actually Differs Between vLLM, SGLang, TensorRT-LLM, and Custom Silicon
In a single week in May 2026, four deals landed at once: Cerebras's IPO at a $26.6B valuation, Sierra's $950M Series E, RadixArk's $100M seed at a $400M valuation, and Gimlet Labs's $80M round. All of that capital is betting on a single question: how do you serve already-trained models cheaper and faster?
Two kinds of answers are fighting it out: software inference engines (vLLM, SGLang, TensorRT-LLM) and custom silicon (Cerebras, Groq, Etched). But they are not direct competitors; each attacks a different bottleneck. Knowing which bottleneck each one attacks, and how, is the basis for practical decision-making.
Table of Contents
- KV Cache Strategy Determines Throughput
- vLLM — PagedAttention: What It Learned from the OS
- SGLang — RadixAttention: Caching Repeated Context
- TensorRT-LLM — A Hopper-Dedicated Compiler
- Throughput Comparison and Selection Criteria
- Custom Silicon: Which Bottleneck Does Each Solve?
- Gimlet Labs: The Heterogeneous Chip Integration Layer
- Decision Guide
KV Cache Strategy Determines Throughput
The largest cost component in Transformer model inference is KV cache (Key-Value Cache) management. When generating each token, the key/value pairs for all previous tokens are cached to avoid recomputation. How that cache is stored and managed is what drives large throughput differences.
GPU VRAM must be shared between model parameters and the KV cache. Loading Llama 3.1 70B in bf16 already requires 140GB (70B parameters × 2 bytes), nearly filling two H100 80GB cards with weights alone. The VRAM left over determines how many requests can be processed simultaneously, and if KV cache fragmentation is severe, new requests get rejected even when free VRAM physically exists.
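To make that budget concrete, here is a back-of-the-envelope sketch of per-token KV cache size. The layer and head counts are Llama 3.1 70B's published configuration (80 layers, 8 KV heads under GQA, head dim 128); the `kv_bytes_per_token` helper is invented for illustration, not any engine's API.

```python
# Rough per-token KV cache size: 2 tensors (K and V) per layer,
# num_kv_heads * head_dim values each, 2 bytes per value in bf16.
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

per_token = kv_bytes_per_token(80, 8, 128)   # 327,680 bytes ≈ 320 KB/token
budget = 160e9 - 140e9                       # 2x H100 80GB minus bf16 weights
# Ignoring activations and runtime overhead:
print(f"{per_token / 1024:.0f} KB/token -> ~{budget / per_token:,.0f} cacheable tokens")
# ~61,000 tokens of KV, shared across ALL concurrent requests
```

At roughly 320 KB per token, a handful of long-context requests can exhaust the cache, which is why how the cache is laid out matters as much as how big it is.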
vLLM — PagedAttention: What It Learned from the OS
vLLM's core innovation is PagedAttention, an idea borrowed from Linux virtual memory management. It splits the KV cache into fixed-size blocks (pages) of 16 tokens each by default.
The traditional approach's problem:

```
Request A: [token 1][token 2][token 3]...[token 512] ← contiguous memory reserved
Request B: [token 1][token 2] ← contiguous memory reserved
           ↑ empty space between requests = internal fragmentation
```

PagedAttention's solution:

```
Physical block pool:
[block 0][block 1][block 2][block 3][block 4][block 5]...

Request A → block table: [block 0 → tokens 1-16][block 3 → tokens 17-32]...
Request B → block table: [block 1 → tokens 1-16][block 4 → tokens 17-32]...

→ Allocate blocks on demand = no fragmentation
```

The same principle as an OS page table. The result: less KV cache waste and more concurrent batches.
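A toy sketch of the bookkeeping, not vLLM's implementation: the `BlockTable` class and its helpers are invented for illustration. The point is that logical token positions map to physical blocks through a per-request table, so blocks come out of a shared pool only as they fill.

```python
BLOCK_SIZE = 16

class BlockTable:
    """Maps a request's logical token positions to physical blocks on demand."""
    def __init__(self, free_blocks):
        self.free_blocks = free_blocks   # shared pool of physical block ids
        self.table = []                  # logical block index -> physical block id

    def append_token(self, position):
        # Allocate a new physical block only when the previous one is full
        if position // BLOCK_SIZE >= len(self.table):
            self.table.append(self.free_blocks.pop())

    def physical_slot(self, position):
        return self.table[position // BLOCK_SIZE], position % BLOCK_SIZE

pool = list(range(1024))                 # physical block pool
req_a, req_b = BlockTable(pool), BlockTable(pool)
for pos in range(20):
    req_a.append_token(pos)              # 20 tokens -> 2 blocks, non-contiguous
req_b.append_token(0)                    # B grabs whatever block is free next
print(req_a.table, req_b.table)          # no contiguity required, no fragmentation
```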
vLLM's strengths: support for 200+ models, fast adoption of new ones, and continuous batching that admits incoming requests into the running batch at every iteration instead of queuing them behind the current batch. An OpenAI-compatible API is built in. Fastest time to deploy.
```python
# Offline batch inference with vLLM
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", tensor_parallel_size=2)
outputs = llm.generate(
    ["The weather in Seoul is"],
    SamplingParams(temperature=0.8, max_tokens=100),
)
```

```bash
# Run as an OpenAI-compatible server
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --tensor-parallel-size 2 \
    --port 8000
```
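Once the server is up, any OpenAI-compatible client works against it. A minimal sketch with the official `openai` package; the `api_key` value is a placeholder, since the local server does not check it by default:

```python
from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "The weather in Seoul is"}],
    temperature=0.8,
    max_tokens=100,
)
print(response.choices[0].message.content)
```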
SGLang — RadixAttention: Caching Repeated Context
SGLang (commercialized by RadixArk; $100M seed at a $400M valuation) solves a different problem. In scenarios where the same system prompt repeats across every request (chatbots, agents, RAG), it eliminates the waste of recomputing the identical prefix KV every single time.
RadixAttention structures the KV cache using a radix tree (trie).
Shared system prompt:

```
"You are a customer service AI. Answer questions kindly and accurately..."
(500 tokens)

Request 1: [system prompt 500 tokens][user: "How do I get a refund?"]
Request 2: [system prompt 500 tokens][user: "How do I track my order?"]
Request 3: [system prompt 500 tokens][user: "How do I fix a payment error?"]

Traditional approach: all three requests recompute KV for the 500-token system prompt
RadixAttention:       system prompt KV computed once, shared across all three requests
→ 500 tokens × 2 KV computations saved
```

Because the prefix is managed via a radix tree, cache hit rates stay high, and later turns in multi-turn conversations automatically reuse the KV of earlier turns.
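A minimal sketch of the idea, reduced to plain prefix matching over token IDs. SGLang's real tree compresses edges and stores KV block references plus eviction metadata; the `RadixNode`, `match_prefix`, and `insert` names here are invented for illustration.

```python
class RadixNode:
    def __init__(self):
        self.children = {}        # token id -> RadixNode
        self.kv_block = None      # stand-in for a cached KV block reference

def insert(root, tokens):
    """Record a request's tokens, pretending to cache KV at each node."""
    node = root
    for t in tokens:
        node = node.children.setdefault(t, RadixNode())
        node.kv_block = node.kv_block or object()

def match_prefix(root, tokens):
    """Count how many leading tokens already have cached KV."""
    node, hits = root, 0
    for t in tokens:
        if t not in node.children:
            break
        node = node.children[t]
        hits += 1
    return hits

root = RadixNode()
system_prompt = list(range(500))            # stand-in for 500 prompt token ids
insert(root, system_prompt + [1001])        # request 1 populates the tree
print(match_prefix(root, system_prompt + [1002]))  # request 2: 500-token cache hit
```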
```bash
# Launch the SGLang server
python -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --port 30000 \
    --enable-cache-report  # monitor cache hit rate
# Result: 60-80% cache hit rate on workloads with shared prefixes
```
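The server speaks the OpenAI-compatible protocol, so the shared-prefix pattern above can be exercised directly. A hedged sketch: the second and later calls should hit the radix cache for the shared system prompt, and with `--enable-cache-report` the usage field reflects cached tokens.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
system = "You are a customer service AI. Answer questions kindly and accurately..."

for question in ["How do I get a refund?", "How do I track my order?"]:
    r = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": question}],
    )
    # With --enable-cache-report, usage includes cached-token counts
    print(r.usage)
```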
Throughput on the same workload (Llama 3.1 8B, bf16, batch inference):
- vLLM: ~12,553 tok/s
- SGLang: ~16,215 tok/s (+29%)
The +29% gap doesn’t mean “SGLang is better than vLLM.” It means RadixAttention’s cache hits reduce recomputation specifically on workloads with repeated system prompts. With many independent one-off requests, this gap narrows.
TensorRT-LLM — A Hopper-Dedicated Compiler
TensorRT-LLM is NVIDIA's official inference framework, tuned hardest for the Hopper architecture (H100/H200). Rather than simply loading a model, it compiles the computation graph into an engine that maximally exploits H100's Tensor Cores and FP8 data paths.
```bash
# Compile the model (~28 minutes, done once).
# checkpoint_dir expects a TensorRT-LLM checkpoint converted from the HF
# weights beforehand (see convert_checkpoint.py in the Llama examples).
trtllm-build \
    --checkpoint_dir ./llama-3.1-8b-hf \
    --output_dir ./llama-3.1-8b-trt \
    --gemm_plugin float16 \
    --use_fp8 \
    --max_batch_size 64 \
    --max_input_len 2048 \
    --max_output_len 512
# --use_fp8 leverages H100 FP8 Tensor Cores.
# Subsequent runs yield +20-40% throughput over vLLM.
```
The compilation artifact is a binary optimized for a specific model on specific hardware. Switching to a different model or to an A100 requires recompilation. This is both TensorRT-LLM’s constraint and its strength — in production environments running a single model stably on H100, it delivers the highest throughput.
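Running the compiled engine is version-dependent. A sketch using TensorRT-LLM's high-level LLM API, which mirrors vLLM's; treat the exact import surface and the engine-directory argument as assumptions to verify against your installed release (older versions use the lower-level ModelRunner instead):

```python
# Sketch only: this high-level API has shifted across TensorRT-LLM
# releases, so check the signatures against your version's docs.
from tensorrt_llm import LLM, SamplingParams

# Point at the engine directory produced by trtllm-build above
llm = LLM(model="./llama-3.1-8b-trt")
outputs = llm.generate(
    ["The weather in Seoul is"],
    SamplingParams(temperature=0.8, max_tokens=100),
)
print(outputs[0].outputs[0].text)
```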
Throughput Comparison and Selection Criteria
Throughput benchmarks (Llama 3.1 8B, bf16, batch inference)
| Engine | Throughput (tok/s) | Relative | Notes |
|---|---|---|---|
| vLLM (PagedAttention) | ~12,553 | baseline | continuous batching |
| SGLang (RadixAttention) | ~16,215 | +29% | high repeated prefix workloads |
| TensorRT-LLM (FP8) | ~17,500+ | +20–40% | H100-only, 28-min compile |
| Groq LPU | ~500 tok/s (per request) | latency determinism | no batching, lowest latency |
Figures are for a single H100 80GB, except the Groq row, which is per-request on Groq LPU hardware. All numbers vary with workload characteristics.
Selection criteria by workload
| Workload | Recommended engine | Reason |
|---|---|---|
| Rapid new model adoption, 200+ model operations | vLLM | Widest model support, fastest releases |
| Chatbot, RAG, agents (long system prompts) | SGLang | RadixAttention achieves high cache hit rates |
| Single-model H100 production (stable operations) | TensorRT-LLM | Highest throughput, but compilation required |
| Deterministic response latency required (telecom, medical) | Groq LPU | 0.8s/100 tokens, determinism from compile-time scheduling |
| Heterogeneous chip environments | Gimlet Labs | Workload slicing across chips |
Custom Silicon: Which Bottleneck Does Each Solve?
Where software engines compete on how efficiently they use GPU VRAM, custom silicon attacks the GPU's structural limits in hardware.
The Memory Wall — Cerebras WSE-3
The fundamental GPU problem: A100 and H100 DRAM bandwidth is 2TB/s and 3.35TB/s respectively. Compute is blazing fast, but the bottleneck is reading model parameters from DRAM. At batch size 1 with a 70B model, DRAM bandwidth caps throughput, which is why compute utilization sits at just 10-20%.
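The ceiling is easy to compute: at batch size 1, every generated token must stream all of the weights through DRAM once, so bandwidth divided by model size gives a hard upper bound. A back-of-the-envelope sketch that ignores KV cache reads and kernel overhead:

```python
# Memory-bandwidth ceiling for batch-1 decoding:
# each token reads every weight from DRAM exactly once
weights_bytes = 70e9 * 2                     # 70B parameters in bf16
for name, bw in [("A100", 2.0e12), ("H100", 3.35e12)]:
    print(f"{name}: <= {bw / weights_bytes:.0f} tok/s at batch size 1")
# A100: <= 14 tok/s, H100: <= 24 tok/s. The compute units sit mostly
# idle, which is why batching (or keeping weights in SRAM) matters.
```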
Cerebras WSE-3 eliminates this bottleneck at the root.
```
NVIDIA H100:
  DRAM (HBM) capacity:  80 GB
  DRAM (HBM) bandwidth: 3.35 TB/s
  Model access: HBM ↔ on-chip SRAM ↔ Tensor Cores

Cerebras WSE-3:
  On-chip SRAM:   44 GB (model parameters live on-chip)
  SRAM bandwidth: 21 PB/s (~6,000x H100's DRAM bandwidth)
  Model access:   SRAM → compute (no DRAM)
```

No DRAM access means no memory wall. The trade-off: Llama-70B inference requires 4 wafers and 336 chips. The cost is extreme and general-purpose applicability is limited.
Latency Determinism — Groq LPU
Groq's Language Processing Unit solves a different problem. On GPUs, inference passes through multiple software layers: the OS scheduler, the CUDA driver, the memory allocator. These layers make latency non-deterministic; the same request can take a different amount of time from one run to the next.
The Groq LPU determines all execution paths at compile time. There is no runtime scheduling. Results:
- Response latency: 0.8s/100 tokens (deterministic)
- Batch processing: not supported (single-request optimized)
- Training: not supported (inference only)
- Use cases: telecom SLAs, autonomous driving safety systems, real-time medical diagnostics
The 70B model requires 576 chips — also not a general-purpose solution.
Transformer-Dedicated ASIC — Etched Sohu
The most aggressive bet. Etched claims 96% of the Sohu chip's transistors are allocated to matrix multiplication (matmul), with almost no other compute capability. The premise: every transformer operation decomposes into matrix multiplication, so optimize that one operation completely.
Claimed performance: ~20x throughput vs 8-chip H100 cluster on Llama-70B.
The risk is just as clear. If the field moves toward linear-time models such as Mamba, RWKV, or other state-space models (SSMs), an ASIC specialized for matrix multiplication loses its edge overnight. Etched raised $500M at a $5B valuation on this bet; it has to be confident that the transformer architecture remains dominant for the next 10+ years.
Gimlet Labs: The Heterogeneous Chip Integration Layer
Gimlet Labs (Series A $80M) chose a different direction entirely. Not “build a better chip,” but “build a compiler that optimally uses all the chips you already have, simultaneously.”
Traditional approach:

```
GPU cluster → model → execution optimized for a single chip architecture
```

Gimlet Labs approach:

```
NVIDIA H100 + AMD MI300 + Intel Gaudi + Cerebras + d-Matrix
        ↓
Gimlet compiler: slices the model per chip → each chip handles what it's good at
        ↓
3-10x inference acceleration (no chip swap, existing infrastructure)
```

Many enterprises already run NVIDIA GPUs with some AMD mixed in. If you can optimize the existing heterogeneous environment without switching to new chips, adoption friction is low. That's Gimlet's go-to-market.
Decision Guide
Three questions narrow down the choice.
Q1. Do you need model diversity?
- YES → vLLM (200+ models, fastest new model support)
- NO → next question
Q2. Are your system prompts or shared contexts long? (RAG, agents, chatbots)
- YES → SGLang (RadixAttention is most effective)
- NO → next question
Q3. Are you running a single model stably on H100 long-term?
- YES → TensorRT-LLM (28-min initial compilation cost, then highest throughput)
- Deterministic latency required → Groq LPU
- Heterogeneous chip environment → Gimlet Labs
In real operations, layering is common: running vLLM with prefix caching enabled, or combining SGLang with TensorRT-LLM-style FP8 quantization. Understanding workload characteristics comes before engine selection.
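For example, vLLM's own prefix caching recovers part of SGLang's advantage on shared-prefix workloads with a single engine argument:

```python
from vllm import LLM

# Automatic prefix caching: requests that share a leading prompt
# reuse the cached KV blocks for that prefix instead of recomputing
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_prefix_caching=True,
)
```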
The software inference engine war hasn't concluded. Each engine carries a different adoption motivation and reflects a different design philosophy: vLLM optimizes for breadth and velocity of model support, SGLang for prefix-heavy agentic workloads, and TensorRT-LLM for stable single-model production throughput.
Custom silicon, as Cerebras’s IPO shows, is drawing capital market bets — but is more likely to establish itself as a complement for specialized workloads than to replace general-purpose GPUs. Whether Etched’s transformer-dedicated ASIC bet remains valid ten years from now depends on how the architecture landscape evolves.