The Inference Engine Architecture War — What Actually Differs Between vLLM, SGLang, TensorRT-LLM, and Custom Silicon
In a single week in May 2026, four deals landed at once: Cerebras's IPO at a $26.6B valuation, Sierra's $950M Series E, RadixArk's $100M seed at a $400M valuation, and Gimlet Labs's $80M round. All of that capital is betting on a single question: how do you serve already-trained models cheaper and faster?
Two kinds of answers are fighting it out: software inference engines (vLLM, SGLang, TensorRT-LLM) and custom silicon (Cerebras, Groq, Etched). But they are not direct competitors; each attacks a different bottleneck. Knowing which bottleneck each one attacks, and how, is the basis for practical decision-making.
Table of Contents
- KV Cache Strategy Determines Throughput
- vLLM — PagedAttention: What It Learned from the OS
- SGLang — RadixAttention: Caching Repeated Context
- TensorRT-LLM — A Hopper-Dedicated Compiler
- Throughput Comparison and Selection Criteria
- Custom Silicon: Which Bottleneck Does Each Solve?
- Gimlet Labs: The Heterogeneous Chip Integration Layer
- Decision Guide
KV Cache Strategy Determines Throughput
The largest cost component in Transformer model inference is KV cache (Key-Value Cache) management. When generating each token, the key/value pairs for all previous tokens are cached to avoid recomputation. How that cache is stored and managed is what drives large throughput differences.
GPU VRAM must be shared between model parameters and the KV cache. Loading Llama 3.1 70B in bf16 already requires 140GB (70B parameters × 2 bytes), nearly filling two H100 80GB cards with weights alone. The VRAM left over determines how many requests can be processed simultaneously, and if KV cache fragmentation is severe, new requests get rejected even when free VRAM physically exists.
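To make that budget concrete, here is a back-of-the-envelope sketch of per-token KV cache size. The layer and head counts are Llama 3.1 70B's published configuration (80 layers, 8 KV heads under GQA, head dim 128); the `kv_bytes_per_token` helper is invented for illustration, not any engine's API.

```python
# Rough per-token KV cache size: 2 tensors (K and V) per layer,
# num_kv_heads * head_dim values each, 2 bytes per value in bf16.
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

per_token = kv_bytes_per_token(80, 8, 128)   # 327,680 bytes ≈ 320 KB/token
budget = 160e9 - 140e9                       # 2x H100 80GB minus bf16 weights
# Ignoring activations and runtime overhead:
print(f"{per_token / 1024:.0f} KB/token -> ~{budget / per_token:,.0f} cacheable tokens")
# ~61,000 tokens of KV, shared across ALL concurrent requests
```

At roughly 320 KB per token, a handful of long-context requests can exhaust the cache, which is why how the cache is laid out matters as much as how big it is.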
vLLM — PagedAttention: What It Learned from the OS
vLLM's core innovation is PagedAttention, an idea borrowed from Linux virtual memory management. It splits the KV cache into fixed-size blocks (pages) of 16 tokens each by default.
The traditional approach's problem:

```
Request A: [token 1][token 2][token 3]...[token 512] ← contiguous memory reserved
Request B: [token 1][token 2] ← contiguous memory reserved
           ↑ empty space between requests = internal fragmentation
```

PagedAttention's solution:

```
Physical block pool:
[block 0][block 1][block 2][block 3][block 4][block 5]...

Request A → block table: [block 0 → tokens 1-16][block 3 → tokens 17-32]...
Request B → block table: [block 1 → tokens 1-16][block 4 → tokens 17-32]...

→ Allocate blocks on demand = no fragmentation
```

The same principle as an OS page table. The result: less KV cache waste and more concurrent batches.
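A toy sketch of the bookkeeping, not vLLM's implementation: the `BlockTable` class and its helpers are invented for illustration. The point is that logical token positions map to physical blocks through a per-request table, so blocks come out of a shared pool only as they fill.

```python
BLOCK_SIZE = 16

class BlockTable:
    """Maps a request's logical token positions to physical blocks on demand."""
    def __init__(self, free_blocks):
        self.free_blocks = free_blocks   # shared pool of physical block ids
        self.table = []                  # logical block index -> physical block id

    def append_token(self, position):
        # Allocate a new physical block only when the previous one is full
        if position // BLOCK_SIZE >= len(self.table):
            self.table.append(self.free_blocks.pop())

    def physical_slot(self, position):
        return self.table[position // BLOCK_SIZE], position % BLOCK_SIZE

pool = list(range(1024))                 # physical block pool
req_a, req_b = BlockTable(pool), BlockTable(pool)
for pos in range(20):
    req_a.append_token(pos)              # 20 tokens -> 2 blocks, non-contiguous
req_b.append_token(0)                    # B grabs whatever block is free next
print(req_a.table, req_b.table)          # no contiguity required, no fragmentation
```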
vLLM's strengths: support for 200+ models, fast adoption of new ones, and continuous batching that admits incoming requests into the running batch at every iteration instead of queuing them behind the current batch. An OpenAI-compatible API is built in. Fastest time to deploy.
```python
# Offline batch inference with vLLM
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", tensor_parallel_size=2)
outputs = llm.generate(
    ["The weather in Seoul is"],
    SamplingParams(temperature=0.8, max_tokens=100),
)
```

```bash
# Run as an OpenAI-compatible server
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --tensor-parallel-size 2 \
    --port 8000
```
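Once the server is up, any OpenAI-compatible client works against it. A minimal sketch with the official `openai` package; the `api_key` value is a placeholder, since the local server does not check it by default:

```python
from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "The weather in Seoul is"}],
    temperature=0.8,
    max_tokens=100,
)
print(response.choices[0].message.content)
```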
SGLang — RadixAttention: Caching Repeated Context
SGLang (commercialized by RadixArk; $100M seed at a $400M valuation) solves a different problem. In scenarios where the same system prompt repeats across every request (chatbots, agents, RAG), it eliminates the waste of recomputing the identical prefix KV every single time.
RadixAttention structures the KV cache using a radix tree (trie).
Shared system prompt:

```
"You are a customer service AI. Answer questions kindly and accurately..."
(500 tokens)

Request 1: [system prompt 500 tokens][user: "How do I get a refund?"]
Request 2: [system prompt 500 tokens][user: "How do I track my order?"]
Request 3: [system prompt 500 tokens][user: "How do I fix a payment error?"]

Traditional approach: all three requests recompute KV for the 500-token system prompt
RadixAttention:       system prompt KV computed once, shared across all three requests
→ 500 tokens × 2 KV computations saved
```

Because the prefix is managed via a radix tree, cache hit rates stay high, and later turns in multi-turn conversations automatically reuse the KV of earlier turns.
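A minimal sketch of the idea, reduced to plain prefix matching over token IDs. SGLang's real tree compresses edges and stores KV block references plus eviction metadata; the `RadixNode`, `match_prefix`, and `insert` names here are invented for illustration.

```python
class RadixNode:
    def __init__(self):
        self.children = {}        # token id -> RadixNode
        self.kv_block = None      # stand-in for a cached KV block reference

def insert(root, tokens):
    """Record a request's tokens, pretending to cache KV at each node."""
    node = root
    for t in tokens:
        node = node.children.setdefault(t, RadixNode())
        node.kv_block = node.kv_block or object()

def match_prefix(root, tokens):
    """Count how many leading tokens already have cached KV."""
    node, hits = root, 0
    for t in tokens:
        if t not in node.children:
            break
        node = node.children[t]
        hits += 1
    return hits

root = RadixNode()
system_prompt = list(range(500))            # stand-in for 500 prompt token ids
insert(root, system_prompt + [1001])        # request 1 populates the tree
print(match_prefix(root, system_prompt + [1002]))  # request 2: 500-token cache hit
```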
```bash
# Launch the SGLang server
python -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --port 30000 \
    --enable-cache-report  # monitor cache hit rate
# Result: 60-80% cache hit rate on workloads with shared prefixes
```
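The server speaks the OpenAI-compatible protocol, so the shared-prefix pattern above can be exercised directly. A hedged sketch: the second and later calls should hit the radix cache for the shared system prompt, and with `--enable-cache-report` the usage field reflects cached tokens.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
system = "You are a customer service AI. Answer questions kindly and accurately..."

for question in ["How do I get a refund?", "How do I track my order?"]:
    r = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": question}],
    )
    # With --enable-cache-report, usage includes cached-token counts
    print(r.usage)
```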
Throughput on the same workload (Llama 3.1 8B, bf16, batch inference):
- vLLM: ~12,553 tok/s
- SGLang: ~16,215 tok/s (+29%)
The +29% gap doesn’t mean “SGLang is better than vLLM.” It means RadixAttention’s cache hits reduce recomputation specifically on workloads with repeated system prompts. With many independent one-off requests, this gap narrows.
TensorRT-LLM — A Hopper-Dedicated Compiler
TensorRT-LLM is NVIDIA's official inference framework, tuned hardest for the Hopper architecture (H100/H200). Rather than simply loading a model, it compiles the computation graph into an engine that maximally exploits H100's Tensor Cores and FP8 data paths.
```bash
# Compile the model (~28 minutes, done once).
# checkpoint_dir expects a TensorRT-LLM checkpoint converted from the HF
# weights beforehand (see convert_checkpoint.py in the Llama examples).
trtllm-build \
    --checkpoint_dir ./llama-3.1-8b-hf \
    --output_dir ./llama-3.1-8b-trt \
    --gemm_plugin float16 \
    --use_fp8 \
    --max_batch_size 64 \
    --max_input_len 2048 \
    --max_output_len 512
# --use_fp8 leverages H100 FP8 Tensor Cores.
# Subsequent runs yield +20-40% throughput over vLLM.
```
The compilation artifact is a binary optimized for a specific model on specific hardware. Switching to a different model or to an A100 requires recompilation. This is both TensorRT-LLM’s constraint and its strength — in production environments running a single model stably on H100, it delivers the highest throughput.
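Running the compiled engine is version-dependent. A sketch using TensorRT-LLM's high-level LLM API, which mirrors vLLM's; treat the exact import surface and the engine-directory argument as assumptions to verify against your installed release (older versions use the lower-level ModelRunner instead):

```python
# Sketch only: this high-level API has shifted across TensorRT-LLM
# releases, so check the signatures against your version's docs.
from tensorrt_llm import LLM, SamplingParams

# Point at the engine directory produced by trtllm-build above
llm = LLM(model="./llama-3.1-8b-trt")
outputs = llm.generate(
    ["The weather in Seoul is"],
    SamplingParams(temperature=0.8, max_tokens=100),
)
print(outputs[0].outputs[0].text)
```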
Throughput Comparison and Selection Criteria
Throughput benchmarks (Llama 3.1 8B, bf16, batch inference)
| Engine | Throughput (tok/s) | Relative | Notes |
|---|---|---|---|
| vLLM (PagedAttention) | ~12,553 | baseline | continuous batching |
| SGLang (RadixAttention) | ~16,215 | +29% | high repeated prefix workloads |
| TensorRT-LLM (FP8) | ~17,500+ | +20–40% | H100-only, 28-min compile |
| Groq LPU | ~500 tok/s (per request) | latency determinism | no batching, lowest latency |
Figures are for a single H100 80GB, except the Groq row, which is per-request on Groq LPU hardware. All numbers vary with workload characteristics.
Selection criteria by workload
| Workload | Recommended engine | Reason |
|---|---|---|
| Rapid new model adoption, 200+ model operations | vLLM | Widest model support, fastest releases |
| Chatbot, RAG, agents (long system prompts) | SGLang | RadixAttention achieves high cache hit rates |
| Single-model H100 production (stable operations) | TensorRT-LLM | Highest throughput, but compilation required |
| Deterministic response latency required (telecom, medical) | Groq LPU | 0.8s/100 tokens, determinism from compile-time scheduling |
| Heterogeneous chip environments | Gimlet Labs | Workload slicing across chips |
Custom Silicon: Which Bottleneck Does Each Solve?
Where software engines compete on how efficiently they use GPU VRAM, custom silicon attacks the GPU's structural limits in hardware.
The Memory Wall — Cerebras WSE-3
The fundamental GPU problem: A100 and H100 DRAM bandwidth is 2TB/s and 3.35TB/s respectively. Compute is blazing fast, but the bottleneck is reading model parameters from DRAM. At batch size 1 with a 70B model, DRAM bandwidth caps throughput, which is why compute utilization sits at just 10-20%.
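The ceiling is easy to compute: at batch size 1, every generated token must stream all of the weights through DRAM once, so bandwidth divided by model size gives a hard upper bound. A back-of-the-envelope sketch that ignores KV cache reads and kernel overhead:

```python
# Memory-bandwidth ceiling for batch-1 decoding:
# each token reads every weight from DRAM exactly once
weights_bytes = 70e9 * 2                     # 70B parameters in bf16
for name, bw in [("A100", 2.0e12), ("H100", 3.35e12)]:
    print(f"{name}: <= {bw / weights_bytes:.0f} tok/s at batch size 1")
# A100: <= 14 tok/s, H100: <= 24 tok/s. The compute units sit mostly
# idle, which is why batching (or keeping weights in SRAM) matters.
```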
Cerebras WSE-3 eliminates this bottleneck at the root.
```
NVIDIA H100:
  DRAM (HBM) capacity:  80 GB
  DRAM (HBM) bandwidth: 3.35 TB/s
  Model access: HBM ↔ on-chip SRAM ↔ Tensor Cores

Cerebras WSE-3:
  On-chip SRAM:   44 GB (model parameters live on-chip)
  SRAM bandwidth: 21 PB/s (~6,000x H100's DRAM bandwidth)
  Model access:   SRAM → compute (no DRAM)
```

No DRAM access means no memory wall. The trade-off: Llama-70B inference requires 4 wafers and 336 chips. The cost is extreme and general-purpose applicability is limited.
Latency Determinism — Groq LPU
Groq's Language Processing Unit solves a different problem. On GPUs, inference passes through multiple software layers: the OS scheduler, the CUDA driver, the memory allocator. These layers make latency non-deterministic; the same request can take a different amount of time from one run to the next.
The Groq LPU determines all execution paths at compile time. There is no runtime scheduling. Results:
- Response latency: 0.8s/100 tokens (deterministic)
- Batch processing: not supported (single-request optimized)
- Training: not supported (inference only)
- Use cases: telecom SLAs, autonomous driving safety systems, real-time medical diagnostics
The 70B model requires 576 chips — also not a general-purpose solution.
Transformer-Dedicated ASIC — Etched Sohu
The most aggressive bet. Etched claims 96% of the Sohu chip's transistors are allocated to matrix multiplication (matmul), with almost no other compute capability. The premise: every transformer operation decomposes into matrix multiplication, so optimize that one operation completely.
Claimed performance: ~20x throughput vs 8-chip H100 cluster on Llama-70B.
The risk is just as clear. If the field moves toward linear-time models such as Mamba, RWKV, or other state-space models (SSMs), an ASIC specialized for matrix multiplication loses its edge overnight. Etched raised $500M at a $5B valuation on this bet; it has to be confident that the transformer architecture remains dominant for the next 10+ years.
Gimlet Labs: The Heterogeneous Chip Integration Layer
Gimlet Labs (Series A $80M) chose a different direction entirely. Not “build a better chip,” but “build a compiler that optimally uses all the chips you already have, simultaneously.”
Traditional approach:

```
GPU cluster → model → execution optimized for a single chip architecture
```

Gimlet Labs approach:

```
NVIDIA H100 + AMD MI300 + Intel Gaudi + Cerebras + d-Matrix
        ↓
Gimlet compiler: slices the model per chip → each chip handles what it's good at
        ↓
3-10x inference acceleration (no chip swap, existing infrastructure)
```

Many enterprises already run NVIDIA GPUs with some AMD mixed in. If you can optimize the existing heterogeneous environment without switching to new chips, adoption friction is low. That's Gimlet's go-to-market.
Decision Guide
Three questions narrow down the choice.
Q1. Do you need model diversity?
- YES → vLLM (200+ models, fastest new model support)
- NO → next question
Q2. Are your system prompts or shared contexts long? (RAG, agents, chatbots)
- YES → SGLang (RadixAttention is most effective)
- NO → next question
Q3. Are you running a single model stably on H100 long-term?
- YES → TensorRT-LLM (28-min initial compilation cost, then highest throughput)
- Deterministic latency required → Groq LPU
- Heterogeneous chip environment → Gimlet Labs
In real operations, layering is common: running vLLM with prefix caching enabled, or combining SGLang with TensorRT-LLM-style FP8 quantization. Understanding workload characteristics comes before engine selection.
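For example, vLLM's own prefix caching recovers part of SGLang's advantage on shared-prefix workloads with a single engine argument:

```python
from vllm import LLM

# Automatic prefix caching: requests that share a leading prompt
# reuse the cached KV blocks for that prefix instead of recomputing
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_prefix_caching=True,
)
```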
The software inference engine war hasn't concluded. Each engine carries a different adoption motivation and reflects a different design philosophy: vLLM optimizes for breadth and velocity of model support, SGLang for prefix-heavy agentic workloads, and TensorRT-LLM for stable single-model production throughput.
Custom silicon, as Cerebras’s IPO shows, is drawing capital market bets — but is more likely to establish itself as a complement for specialized workloads than to replace general-purpose GPUs. Whether Etched’s transformer-dedicated ASIC bet remains valid ten years from now depends on how the architecture landscape evolves.