1. Problem

Modern LLM serving systems that rely on external context retrieval (web search, vector databases) face a fundamental tension. Waiting for the complete context before starting inference dramatically increases time-to-first-token (TTFT): web crawling takes ~10 seconds and vector search ~4.5 seconds. Starting without the full context, on the other hand, reduces response quality. Prior streaming approaches handle only single requests and break down under multi-tenant serving, where 4 concurrent requests contend for limited GPU memory.

2. Motivation

Streaming context to the model incrementally as it arrives can eliminate the retrieval latency bottleneck. However, naive streaming under GPU memory pressure makes tail latency 5x worse than not streaming at all, so scheduler design is the difference between an 11x win and a 5x regression. Additionally, context retrieval exhibits two fundamentally different patterns that require distinct cache management strategies: append-mode (web crawlers, where newly retrieved documents extend the input) and update-mode (vector search, where refined results replace earlier documents and invalidate cached computation).
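
To make the difference between the two retrieval patterns concrete, the following is a minimal sketch, not Stream2LLM's actual implementation. It assumes a prefix-based KV cache, in which cached entries remain valid only for the leading tokens that are unchanged between the old and the new input. The names `PrefillPlan` and `plan_prefill` are hypothetical and exist only for this illustration.

```python
from typing import NamedTuple, Sequence


class PrefillPlan(NamedTuple):
    reused: int       # cached tokens whose KV entries stay valid
    invalidated: int  # cached tokens whose KV entries must be dropped
    to_compute: int   # incoming tokens that still need a prefill pass


def plan_prefill(cached: Sequence[int], incoming: Sequence[int]) -> PrefillPlan:
    """Compare the cached input with the newly streamed input.

    Assumption: only the longest common token prefix can be reused,
    because each KV entry depends on every token that precedes it.
    """
    keep = 0
    for old, new in zip(cached, incoming):
        if old != new:
            break
        keep += 1
    return PrefillPlan(keep, len(cached) - keep, len(incoming) - keep)


# Append-mode: new documents extend the input, so nothing is invalidated.
append_plan = plan_prefill([1, 2, 3], [1, 2, 3, 4, 5])

# Update-mode: refined results replace earlier documents, so everything
# after the first changed token must be recomputed.
update_plan = plan_prefill([1, 2, 3, 4], [1, 2, 9, 9, 9])

print(append_plan)
print(update_plan)
```

In this toy model, append-mode reuses the entire cache and only pays for the new tokens, while update-mode discards every cached token after the first change. This is one way to see why the two patterns call for different cache management strategies.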

3. Contribution / Solution

Stream2LLM extends vLLM with multi-tenant streaming support through:

4. Results / Observations