Stream2llm Mlsys26 | RajveerBachkaniwala

1. Problem

Modern LLM serving systems that rely on external context retrieval (web search, vector databases) face a fundamental tension: waiting for complete context before starting inference increases time-to-first-token (TTFT) dramatically—web crawling takes ~10 seconds, vector search ~4.5 seconds—while starting without full context reduces response quality. Prior streaming approaches only handle single requests and break down under multi-tenant serving where 4 concurrent requests contend for limited GPU memory.

2. Motivation

Streaming context incrementally to the model as it arrives can eliminate the retrieval latency bottleneck. However, naive streaming under GPU memory pressure makes tail latency 5x worse than not streaming at all. The scheduler design is the difference between an 11x win and a 5x regression. Additionally, context retrieval exhibits two fundamentally different patterns—append-mode (web crawlers, where documents extend the input) and update-mode (vector search, where refined results replace earlier documents and invalidate cached computation)—requiring distinct cache management strategies.

3. Contribution / Solution

Stream2LLM extends vLLM with multi-tenant streaming support through:

Two-phase scheduling architecture: Decouples request prioritization from memory management—Phase 1 ranks requests by policy (FCFS, LCAS, MCPS) and checks feasibility; Phase 2 allocates GPU blocks and evicts if needed.
Two streaming modes: Supports both append-mode (progressive accumulation) and update-mode (iterative refinement with cache invalidation) through a unified mechanism.
LCP-based cache invalidation: When input changes, computes the longest common prefix between old and new token sequences, only invalidating cache blocks beyond the unchanged prefix.
Cost-based adaptive preemption: When GPU memory exhausts, chooses between recomputation and swapping to CPU based on hardware-specific performance models profiled offline.

4. Results / Observations

Up to 11x faster TTFT: On web crawling workloads at high load (QPS 4.0), median TTFT improves 10.8–11.0x. Vector search workloads see 2.5–2.6x P95 TTFT improvements.
Throughput parity: All streaming variants achieve near-identical trace completion times (within ~1%) compared to non-streaming baselines—latency gains at zero throughput cost.
Scheduling is critical under memory pressure: Without proper scheduling, streaming makes P99 latency 5x worse. With FCFS/LCAS and cost-based preemption, P99 improves up to 9.14x even under extreme memory contention.
Hardware-aware eviction matters: Recompute-only reaches 10.03x P99 on crawler workloads but only 6.69x with swap-only; the cost-based approach adapts per-GPU, achieving 8.62x by selecting the cheaper option at each eviction.
Evaluated on real-world traces: Web crawling (4,322 queries from OpenAI SimpleQA) and vector search (500 queries over 372GB Fineweb-edu corpus) on NVIDIA H200 and H100 GPUs with Llama-3.1-8B.