<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://rajveerbachkaniwala.com/blog/feed.xml" rel="self" type="application/atom+xml" /><link href="https://rajveerbachkaniwala.com/blog/" rel="alternate" type="text/html" /><updated>2026-05-24T21:42:09+00:00</updated><id>https://rajveerbachkaniwala.com/blog/feed.xml</id><title type="html">@rajveerbach’s blog</title><subtitle></subtitle><author><name>@rajveerbach&apos;s blog</name></author><entry><title type="html">Why Ctrl+V won’t paste images in Claude Code on WSL, with a fix</title><link href="https://rajveerbachkaniwala.com/blog/2026/05/24/on-the-difficulty-of-pasting-a-picture/" rel="alternate" type="text/html" title="Why Ctrl+V won’t paste images in Claude Code on WSL, with a fix" /><published>2026-05-24T00:00:00+00:00</published><updated>2026-05-24T00:00:00+00:00</updated><id>https://rajveerbachkaniwala.com/blog/2026/05/24/on-the-difficulty-of-pasting-a-picture</id><content type="html" xml:base="https://rajveerbachkaniwala.com/blog/2026/05/24/on-the-difficulty-of-pasting-a-picture/"><![CDATA[<p>Copy an image in Windows. Open Claude Code inside a WSL terminal launched from Windows Terminal. Press Ctrl+V. Nothing happens.</p>

<p>Three small things between the Windows clipboard and Claude Code’s chat input are broken. Each one is harmless on its own. Together they make image paste fail completely. Here’s what they are and the workaround I built until the upstream fixes catch up.</p>

<h2 id="whats-actually-broken">What’s actually broken</h2>

<h3 id="1-the-windows-to-linux-clipboard-sync-only-knows-about-an-ancient-image-format">1. The Windows-to-Linux clipboard sync only knows about an ancient image format</h3>

<p>WSL has a built-in piece called <strong>WSLg</strong> (the “g” is for graphics). Its job is to make Windows and the Linux side share things — including the clipboard, so copy-paste works across the boundary. For text, it works fine. For images, it does two things badly.</p>

<p>First, it only syncs images in one direction: from Windows to Linux. Anything copied from a Linux app doesn’t flow back to Windows as an image.</p>

<p>Second, when WSLg sends a Windows image over to Linux, it converts it into a single, dated format — a specific old BMP variant that uses an obscure colour encoding (“BI_BITFIELDS”). Most software’s BMP readers can’t handle it. <strong>Claude Code’s reader is one of those.</strong> It tries to read what arrives, gets nothing useful, and gives up — silently. No error, no toast, no visible failure. The image just doesn’t attach.</p>
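<p>To see which BMP variant is involved, the header can be inspected directly. The following is a minimal sketch, not Claude Code’s reader: it assumes the clipboard payload is a complete BMP file (14-byte file header followed by a <code class="language-plaintext highlighter-rouge">BITMAPINFOHEADER</code>-style header), and the helper names are made up for illustration. The compression value 3 for BI_BITFIELDS comes from the Windows bitmap header definition.</p>

```python
BI_RGB = 0        # uncompressed pixels, the common case
BI_BITFIELDS = 3  # pixels described by explicit colour masks


def bmp_compression(data: bytes) -> int:
    """Return the compression field of a BMP file, or raise ValueError."""
    if len(data) < 34 or data[:2] != b"BM":
        raise ValueError("not a BMP file with a BITMAPINFOHEADER-style header")
    # 14-byte file header, then 16 bytes into the info header.
    return int.from_bytes(data[30:34], "little")


def uses_bitfields(data: bytes) -> bool:
    """True when the BMP declares the BI_BITFIELDS encoding."""
    return bmp_compression(data) == BI_BITFIELDS


def fake_bmp_header(compression: int) -> bytes:
    """Build a 54-byte, header-only BMP stub for demonstration (hypothetical helper)."""
    def u16(v):
        return v.to_bytes(2, "little")

    def u32(v):
        return v.to_bytes(4, "little")

    file_header = b"BM" + u32(54) + u16(0) + u16(0) + u32(54)
    info_header = (u32(40) + u32(1) + u32(1) + u16(1) + u16(32)
                   + u32(compression) + u32(0) * 5)
    return file_header + info_header


print(uses_bitfields(fake_bmp_header(BI_BITFIELDS)))  # True
print(uses_bitfields(fake_bmp_header(BI_RGB)))        # False
```

<p>On a real system the bytes could come from <code class="language-plaintext highlighter-rouge">wl-paste --type image/bmp</code>; the stub header above only stands in for that input.</p>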

<p>(This is a known bug: <a href="https://github.com/anthropics/claude-code/issues/50552">claude-code#50552</a>.)</p>

<h3 id="2-the-same-windows-to-linux-sync-silently-overwrites-your-workarounds">2. The same Windows-to-Linux sync silently overwrites your workarounds</h3>

<p>You might think: fine, I’ll bypass WSLg for images. Read the Windows clipboard myself, convert the image to PNG, push the PNG straight onto the Linux clipboard. Claude Code will then find a PNG when it looks for an image, and the paste will work.</p>

<p>There’s even a standard Linux command for putting things on the clipboard: <code class="language-plaintext highlighter-rouge">wl-copy</code>. So you do exactly that — Windows image → PNG → <code class="language-plaintext highlighter-rouge">wl-copy --type image/png</code>.</p>

<p>It works. For a moment. Then it stops working again. Here’s what WSL does to you:</p>

<ol>
  <li>You put a PNG on the Linux clipboard.</li>
  <li>WSLg notices the Linux clipboard changed, and dutifully syncs it back to Windows. The Windows clipboard now reflects “PNG.”</li>
  <li><strong>The Windows clipboard just changed.</strong> That fires WSLg’s other half — the half that pushes Windows changes over to Linux. As we know from problem #1, that half only knows how to push BMP.</li>
  <li>So your good PNG on the Linux side gets overwritten with the broken BMP, shortly after you put it there.</li>
</ol>

<p>The cruellest part: the program you wrote to watch the Windows clipboard never sees step 4 happen. WSLg writes to the Linux clipboard directly — it doesn’t go through the Windows clipboard to do it. So from your watcher’s point of view, <strong>the Linux clipboard just silently mutates</strong>, with nothing for you to react to.</p>

<h3 id="3-windows-terminal-eats-ctrlv-before-claude-code-sees-it">3. Windows Terminal eats Ctrl+V before Claude Code sees it</h3>

<p>Suppose you fix everything above and a real PNG sits reliably on the Linux clipboard. Press Ctrl+V in Claude Code. Still nothing happens.</p>

<p>The reason: <strong>Windows Terminal</strong> — the program you’re typing into — has its own meaning for Ctrl+V. It’s the standard “paste text from the Windows clipboard” shortcut. So when you press Ctrl+V inside Windows Terminal:</p>

<ol>
  <li>Windows Terminal sees the keystroke first.</li>
  <li>It pastes (or tries to paste) the Windows clipboard as text into the terminal input.</li>
  <li>The keystroke never makes it down to the Linux side.</li>
  <li>Claude Code’s image-paste code (internally named <code class="language-plaintext highlighter-rouge">chat:imagePaste</code>) never runs.</li>
</ol>

<p>The terminal is one layer above Claude Code. It eats the input before the program below can react.</p>

<h2 id="the-fix">The fix</h2>

<p>Three small components, one per failure, laid out in the diagram below.</p>

<ul>
  <li><strong><code class="language-plaintext highlighter-rouge">clip-listener.exe</code></strong> — runs on Windows and encodes each clipboard image as a real PNG via Windows’ own GDI+. Sidesteps the BMP problem (#1).</li>
  <li><strong><code class="language-plaintext highlighter-rouge">wsl-clip-bridge</code></strong> — runs in WSL, pushes the PNG onto the Linux clipboard with <code class="language-plaintext highlighter-rouge">wl-copy</code>, and re-asserts once half a second later if WSLg has overwritten our PNG with the broken BMP. Handles the silent clobber (#2).</li>
  <li><strong>Alt+V keybinding</strong> in <code class="language-plaintext highlighter-rouge">~/.claude/keybindings.json</code> — triggers Claude Code’s <code class="language-plaintext highlighter-rouge">chat:imagePaste</code> handler without going through Ctrl+V, which Windows Terminal would eat. Routes around #3.</li>
</ul>

<p><img src="/blog/assets/images/wsl-clip-bridge/architecture.svg" alt="End-to-end flow: Snipping Tool → clip-listener.exe (Windows) → wsl-clip-bridge (WSL) → Linux clipboard → Claude Code" class="figure-default" /></p>
<p class="image-caption">What changes when the bridge is installed. Without it (top), images flow straight to Claude Code as broken BMP and the paste fails. With it (bottom), a Windows listener encodes a real PNG and a Linux script puts it on the Linux clipboard, re-asserting once after WSLg's overwrite. The user presses Alt+V instead of Ctrl+V to bypass Windows Terminal.</p>

<p>End to end: snip an image in Windows, press Alt+V in Claude Code, image attaches.</p>

<p>The full source — the Go listener, the bash bridge, the install script, and a more detailed walkthrough — lives at <a href="https://github.com/rajveerb/wsl-clip-bridge">github.com/rajveerb/wsl-clip-bridge</a>.</p>
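<p>The re-assert step is the least obvious part, so here is a rough Python sketch of the idea. It is not the repo’s bash script: the function names are hypothetical, and the only external commands assumed are <code class="language-plaintext highlighter-rouge">wl-copy --type</code> and <code class="language-plaintext highlighter-rouge">wl-paste --list-types</code> from wl-clipboard. The clipboard calls are passed in as parameters so the logic can be exercised without a Wayland session.</p>

```python
import subprocess
import time


def list_types():
    """MIME types currently offered on the Linux clipboard (via wl-paste)."""
    result = subprocess.run(["wl-paste", "--list-types"],
                            capture_output=True, text=True, check=False)
    return result.stdout.split()


def push_png(png_bytes):
    """Offer the PNG on the Linux clipboard (via wl-copy)."""
    subprocess.run(["wl-copy", "--type", "image/png"],
                   input=png_bytes, check=True)


def push_with_reassert(png_bytes, push=push_png, types=list_types,
                       sleep=time.sleep, delay=0.5):
    """Push a PNG, wait once, and push again if image/png has disappeared.

    Returns True when a re-assertion was needed.
    """
    push(png_bytes)
    sleep(delay)
    if "image/png" in types():
        return False
    push(png_bytes)  # something replaced our PNG, so put it back once
    return True


# Simulated run: the clipboard reports only image/bmp after the delay.
pushed = []
needed = push_with_reassert(b"fake-png", push=pushed.append,
                            types=lambda: ["image/bmp"],
                            sleep=lambda seconds: None)
print(needed, len(pushed))  # True 2
```

<p>A single check after the delay mirrors the behaviour described above; a polling loop would also work but, per the log evidence later in this post, extra retries never fired.</p>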

<h2 id="try-it-yourself">Try it yourself</h2>

<p>Prerequisites: WSL2 with WSLg (Windows 11, or a recent Windows 10 + WSL update), and Go 1.20+ on the Linux side for cross-compiling the Windows binary.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone https://github.com/rajveerb/wsl-clip-bridge.git
<span class="nb">cd </span>wsl-clip-bridge
<span class="nb">sudo </span>apt <span class="nb">install </span>wl-clipboard
./install.sh <span class="nt">--with-autostart</span> <span class="nt">--with-keybinding</span>
</code></pre></div></div>

<p>That last command does four things: cross-compiles the Windows-side listener (<code class="language-plaintext highlighter-rouge">GOOS=windows GOARCH=amd64</code>), drops the binaries into <code class="language-plaintext highlighter-rouge">~/.local/share/wsl-clip-bridge/</code> and <code class="language-plaintext highlighter-rouge">~/.local/bin/</code>, appends a snippet to <code class="language-plaintext highlighter-rouge">~/.bashrc</code> that starts the bridge on every new WSL shell, and writes <code class="language-plaintext highlighter-rouge">~/.claude/keybindings.json</code> with <code class="language-plaintext highlighter-rouge">alt+v → chat:imagePaste</code>.</p>

<p>Open a fresh WSL terminal. The bridge starts in the background. Verify:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># copy any image in Windows (Win+Shift+S)</span>
wl-paste <span class="nt">-l</span>   <span class="c"># should print image/png</span>
</code></pre></div></div>

<p>Then in Claude Code, press <strong>Alt+V</strong>. The image attaches.</p>

<p>The repo README walks through each step in more detail (running without the install script, stopping or uninstalling the bridge, debugging via the log).</p>

<h2 id="whose-problem-is-it">Whose problem is it?</h2>

<p>Four things contribute to the failure. The first problem above (“an ancient image format”) is really two separate issues, so the table splits them apart:</p>

<table>
  <thead>
    <tr>
      <th>#</th>
      <th>Failure</th>
      <th>Whose problem</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1</td>
      <td>WSLg only sends Windows images to Linux as <code class="language-plaintext highlighter-rouge">image/bmp</code></td>
      <td>Microsoft (WSLg)</td>
    </tr>
    <tr>
      <td>2</td>
      <td>Claude Code can’t read the BMP it actually receives</td>
      <td><strong>Claude Code — one-PR fix</strong></td>
    </tr>
    <tr>
      <td>3</td>
      <td>WSLg overwrites the Linux clipboard from Windows, silently</td>
      <td>Microsoft (WSLg); only matters because of #1 and #2</td>
    </tr>
    <tr>
      <td>4</td>
      <td>Windows Terminal grabs Ctrl+V before WSL programs see it</td>
      <td>Microsoft (Windows Terminal); Claude Code could route around it</td>
    </tr>
  </tbody>
</table>

<p><strong>#1</strong> is hardcoded in WSLg’s clipboard bridge (<code class="language-plaintext highlighter-rouge">rdpclip.c</code> in Microsoft’s Weston fork): exactly five Windows→Linux format mappings, and the only image one is <code class="language-plaintext highlighter-rouge">CF_DIB → image/bmp</code>. The strings <code class="language-plaintext highlighter-rouge">image/png</code> and <code class="language-plaintext highlighter-rouge">image/jpeg</code> don’t appear in the source. The upstream issue <a href="https://github.com/microsoft/wslg/issues/833">microsoft/wslg#833</a> has been open since September 2022.</p>

<p><strong>#2 is broader than the post above implies, and more fixable than I first thought.</strong> Claude Code bundles <a href="https://sharp.pixelplumbing.com/">sharp</a> as its image library, in its WebAssembly build. That build’s bundled libvips has <em>no</em> BMP loader — not just no BI_BITFIELDS variant, no BMP support of any kind. Claude Code does detect BMP on the clipboard and try to convert it via <code class="language-plaintext highlighter-rouge">sharp(bmpBuffer).png().toBuffer()</code>, but the call dies with “Input buffer contains unsupported image format.” Despite the user-facing implication that BMP is supported, sharp’s WASM build can’t read <em>any</em> BMP. The actual upstream fixes are: ship sharp’s native libvips build (which has BMP support); ship a small BMP→PNG converter that doesn’t go through sharp at all; or shell out to ImageMagick or GDI+ on detection failure. Any of these obsoletes this entire bridge.</p>

<p><strong>#3</strong> isn’t hypothetical. The bridge’s log on this machine shows it directly, with timestamped lines like <code class="language-plaintext highlighter-rouge">re-asserted clip-1.png (was: image/bmp,)</code> after every Snipping Tool capture. Interestingly, synthetic <code class="language-plaintext highlighter-rouge">Clipboard::SetImage</code> calls from PowerShell never triggered it; only the Snipping Tool path did, which suggests WSLg keys off something specific in how Snipping Tool finalises its writes. A single re-assertion at +0.5 s catches it reliably; the three additional retries I tried earlier never fired.</p>

<p><strong>#4</strong> is hardcoded in a place that surprised me. Windows Terminal’s Ctrl+V handler isn’t in <code class="language-plaintext highlighter-rouge">defaults.json</code> (that file only binds <code class="language-plaintext highlighter-rouge">Ctrl+Shift+V</code>). It’s in ConHost’s <code class="language-plaintext highlighter-rouge">windowio.cpp</code>, where the inputKeyInfo check on <code class="language-plaintext highlighter-rouge">'V'</code> with <code class="language-plaintext highlighter-rouge">IsInVirtualTerminalInputMode</code> and <code class="language-plaintext highlighter-rouge">ShouldTakeOverKeyboardShortcuts</code> swallows the keystroke before the inner program ever sees it. Tracked at <a href="https://github.com/microsoft/terminal/issues/5790">microsoft/terminal#5790</a>, open and on “Backlog” since 2020.</p>

<p>The kicker on #4 — <strong>Claude Code already defaults <code class="language-plaintext highlighter-rouge">chat:imagePaste</code> to Alt+V on native Windows.</strong> It just doesn’t apply that default in WSL, because WSL reports as Linux and the Windows-specific code never runs. The cleanest upstream fix isn’t even new functionality: detect “running in WSL inside Windows Terminal” at startup and apply the existing Windows Alt+V default there too. No keybinding file required from the user.</p>

<h2 id="when-this-stops-being-needed">When this stops being needed</h2>

<ul>
  <li>WSL starts sending PNG to Linux → the Windows listener is no longer needed.</li>
  <li>Claude Code learns to read the BMP variant → the Windows listener is no longer needed.</li>
  <li>Windows Terminal stops grabbing Ctrl+V → the custom keybinding is no longer needed.</li>
</ul>

<p>When all three happen, removing the workaround is one <code class="language-plaintext highlighter-rouge">pkill</code>, a couple of <code class="language-plaintext highlighter-rouge">rm</code>s, and taking the relevant snippets out of <code class="language-plaintext highlighter-rouge">.bashrc</code> and <code class="language-plaintext highlighter-rouge">~/.claude/keybindings.json</code>. Nothing else on the system depends on it.</p>]]></content><author><name>Rajveer Bachkaniwala</name></author><category term="essays" /><category term="systems" /><summary type="html"><![CDATA[Three small problems pile up between the Windows clipboard and Claude Code. Each one is harmless alone. Together they make image paste fail completely. Here's what's happening and how to fix it.]]></summary></entry><entry><title type="html">Stream2LLM: Overlap Context Streaming and Prefill for Reduced Time-to-First-Token</title><link href="https://rajveerbachkaniwala.com/blog/2026/05/19/stream2llm-overlap-context-streaming-prefill-reduced-ttft/" rel="alternate" type="text/html" title="Stream2LLM: Overlap Context Streaming and Prefill for Reduced Time-to-First-Token" /><published>2026-05-19T00:00:00+00:00</published><updated>2026-05-19T00:00:00+00:00</updated><id>https://rajveerbachkaniwala.com/blog/2026/05/19/stream2llm-overlap-context-streaming-prefill-reduced-ttft</id><content type="html" xml:base="https://rajveerbachkaniwala.com/blog/2026/05/19/stream2llm-overlap-context-streaming-prefill-reduced-ttft/"><![CDATA[<p>A user asks a question. Behind the scenes, a web crawler fetches pages to build context over about 10 seconds, with each page arriving roughly 700 milliseconds apart. Without streaming, the user stares at a blank screen the entire time – because the model cannot start until every page has arrived. With streaming, the model starts reading the first page while the rest are still being fetched. The first word appears in under a second.</p>

<p><img src="/blog/assets/images/stream2llm/streaming-retrieval-prefill.png" alt="Context streaming overlaps retrieval with prefill" class="figure-small" /></p>
<p class="image-caption">Context streaming overlaps retrieval with prefill, reducing TTFT by beginning inference as chunks arrive.</p>

<p>For a single request, streaming context is straightforward and effective.</p>

<p>Now serve four requests at once. Each is streaming context at a different rate. Each holds KV cache blocks in GPU memory for its growing input. Memory fills up. The scheduler must decide: which request gets evicted? Should its cache be discarded or swapped to CPU? And what happens when a request’s input changes mid-flight because the search algorithm found better documents?</p>

<p>Prior streaming systems do not answer these questions – they evaluate single-request scenarios only. <strong>Stream2LLM</strong> extends vLLM to handle concurrent streaming, delivering up to <strong>11x faster time-to-first-token (TTFT)</strong> while maintaining throughput parity. But there is a catch: naive streaming under memory pressure makes tail latency <strong>5x worse</strong> than not streaming at all. Getting the scheduler right is the difference between an 11x win and a 5x regression.</p>

<h2 id="why-memory-and-scheduling-matter">Why Memory and Scheduling Matter</h2>

<p>LLM inference has two phases. The <strong>prefill phase</strong> processes all input tokens in parallel, storing a key vector and value vector for every input token in GPU memory – the KV cache. The <strong>decode phase</strong> generates tokens one at a time, attending to the full cache at each step.</p>

<p>The cache is expensive. For Llama-3.1-8B with a 32K-token input, it is <strong>4 GB per request</strong> in FP16. Serve four concurrent requests and you already have 16 GB of KV cache alone. The longer retrieval takes, the longer each request’s KV state stays resident, so more requests overlap in any given window and memory runs out. When it does, the inference system must evict requests: either discard their cache (recomputation) or move it to CPU (swapping). Both cost time.</p>
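<p>The 4 GB figure can be checked with a few lines of arithmetic. The sketch below assumes the publicly documented Llama-3.1-8B configuration (32 layers, 8 key/value heads under grouped-query attention, head dimension 128); those numbers are not stated in this post, and the function name is made up for illustration.</p>

```python
def kv_cache_bytes(num_tokens, num_layers=32, num_kv_heads=8,
                   head_dim=128, bytes_per_value=2):
    """KV cache size for one request; bytes_per_value=2 corresponds to FP16."""
    # One key vector and one value vector per token, per layer, per KV head.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return num_tokens * per_token


GIB = 1024 ** 3
one_request = kv_cache_bytes(32 * 1024)

print(one_request / GIB)      # 4.0
print(4 * one_request / GIB)  # 16.0
```

<p>That works out to 128 KiB per token, which is why long streamed contexts fill GPU memory quickly.</p>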

<p>Traditional inference waits for all documents before starting prefill. Time-to-first-token is retrieval time plus prefill time, and retrieval dominates – often seconds. Streaming eliminates this wait by feeding documents to the model as they arrive, but existing solutions only handle one request at a time. None address what happens when many requests stream simultaneously.</p>

<h2 id="two-kinds-of-streaming">Two Kinds of Streaming</h2>

<p>The complication is that context retrieval doesn’t follow a single pattern. There are two, and they break the KV cache in different ways:</p>

<ul>
  <li><strong>Append-mode</strong> (web crawlers): Documents arrive sequentially and extend the input. <code class="language-plaintext highlighter-rouge">[c1, q]</code> becomes <code class="language-plaintext highlighter-rouge">[c1, c2, q]</code>. The cached prefix stays valid, so only the new tokens need processing. Favors swapping to preserve reusable cache.</li>
  <li><strong>Update-mode</strong> (vector search): The search algorithm refines its results, replacing documents. <code class="language-plaintext highlighter-rouge">[c1, c3, q]</code> becomes <code class="language-plaintext highlighter-rouge">[c1, c2, q]</code> – the cached KV pairs for <code class="language-plaintext highlighter-rouge">c3</code> are now wrong. The longest common prefix (LCP) between old and new inputs determines how much cache survives. Favors recomputation to avoid retaining soon-invalid data.</li>
</ul>

<p><img src="/blog/assets/images/stream2llm/streaming-modes-lcp.png" alt="Longest Common Prefix invalidation for streaming modes" class="figure-small" /></p>
<p class="image-caption">Longest Common Prefix (LCP) invalidation for append-mode and update-mode streaming. Append mode preserves the full prefix; update mode may invalidate significant portions of the KV cache.</p>

<p>The figure above shows the difference concretely. Each row represents the token sequence at a point in time, with blocks showing which KV cache entries survive across updates. In append mode (left), the full prefix is preserved every time a new chunk arrives – the cache grows monotonically and nothing is wasted. In update mode (right), replacing a document in the middle invalidates everything after the longest common prefix. The marked blocks are cached KV pairs that must be discarded because the tokens they correspond to no longer exist in the new input.</p>

<p>A scheduler that treats these uniformly will either waste computation or serve stale cache. Stream2LLM handles both.</p>

<h2 id="a-decoupled-scheduler">A Decoupled Scheduler</h2>

<p>Stream2LLM splits the scheduling problem into two independent decisions. It targets the prefill instance in a disaggregated architecture (where prefill and decode run on separate GPU pools). TTFT and throughput are the relevant metrics; decode latency is handled by a separate instance.</p>

<p>The scheduler must answer two questions: which request gets the GPU next, and where do the evicted blocks go? Coupling these decisions makes both worse – fixed eviction rules prevent scheduling algorithms from expressing their priorities.</p>

<h3 id="two-phase-scheduling">Two-Phase Scheduling</h3>

<p>Stream2LLM separates these concerns into two phases:</p>

<p><strong>Phase 1: Decide who matters most.</strong> Rank all unfinished requests by priority and check which ones can fit within the current token budget and GPU block capacity. No memory is allocated – this phase only produces an ordered list.</p>

<p><strong>Phase 2: Make room.</strong> Attempt to allocate GPU blocks for the selected requests. If allocation fails, evict the lowest-priority request and use cost-based decisions to choose between recomputation and swapping.</p>

<p>The separation means you can swap in a different scheduling policy without touching the eviction logic, and vice versa.</p>

<p><img src="/blog/assets/images/stream2llm/sys-design.png" alt="Stream2LLM system design" class="figure-small" /></p>
<p class="image-caption">Stream2LLM system design showing the two-phase scheduling architecture with streaming input support.</p>

<p>On the left, multiple clients stream context chunks into the system concurrently – each request’s input grows over time as retrieval progresses. These streaming inputs feed into the two-phase scheduler at the center.</p>

<p>In Phase 1, the scheduler ranks all active requests by the chosen scheduling policy (FCFS, LCAS, or MCPS) and determines which ones fit within the current token budget and available GPU block capacity. No memory is allocated yet – this phase produces only a priority-ordered list of requests to schedule.</p>

<p>In Phase 2, the scheduler attempts to allocate GPU memory blocks for the selected requests. When the KV cache pool is full, it evicts the lowest-priority request. The eviction decision – recompute from scratch or swap blocks to CPU – is made by the cost-based module on the right, which consults hardware-profiled latency models to pick the cheaper option. Evicted requests re-enter the waiting queue and are reconsidered in the next scheduling cycle.</p>

<p>The key insight is that the arrows between Phase 1 and Phase 2 go in one direction: the priority policy never needs to know about eviction mechanics, and the eviction module only consumes the ranked list without needing to know how the priorities were computed. This decoupling is what makes it possible to swap in different policies without rewriting the memory management logic.</p>
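<p>A toy version of the two phases may make the separation easier to see. This is a simplified sketch, not Stream2LLM’s scheduler: the class and function names are invented, the token budget is left out, and the eviction choice is delegated to a callback so that either phase could be replaced independently.</p>

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    priority: float     # larger means more important; produced by any policy
    blocks_needed: int  # GPU blocks required for this scheduling step
    blocks_held: int = 0


def phase1_rank(requests, block_capacity):
    """Order by priority and keep requests that could fit; allocates nothing."""
    ranked = sorted(requests, key=lambda r: r.priority, reverse=True)
    selected, planned = [], 0
    for req in ranked:
        if planned + req.blocks_needed <= block_capacity:
            selected.append(req)
            planned += req.blocks_needed
    return selected


def phase2_allocate(selected, all_requests, total_blocks, choose_eviction):
    """Allocate blocks, evicting the lowest-priority holder when space runs out."""
    chosen = {r.name for r in selected}
    free = total_blocks - sum(r.blocks_held for r in all_requests)
    evictions = []
    for req in selected:
        extra = req.blocks_needed - req.blocks_held
        while free < extra:
            victims = [r for r in all_requests
                       if r.blocks_held > 0 and r.name not in chosen]
            if not victims:
                break
            victim = min(victims, key=lambda r: r.priority)
            evictions.append((victim.name, choose_eviction(victim)))
            free += victim.blocks_held
            victim.blocks_held = 0
        if free >= extra:
            free -= extra
            req.blocks_held = req.blocks_needed
    return evictions


a = Request("A", priority=3, blocks_needed=6)
b = Request("B", priority=2, blocks_needed=4, blocks_held=4)
c = Request("C", priority=1, blocks_needed=5, blocks_held=5)
requests = [a, b, c]

selected = phase1_rank(requests, block_capacity=10)
evicted = phase2_allocate(selected, requests, total_blocks=10,
                          choose_eviction=lambda victim: "swap")
print([r.name for r in selected])  # ['A', 'B']
print(evicted)                     # [('C', 'swap')]
```

<p>Changing the priority values changes who is selected and who is evicted, while the allocation code stays untouched.</p>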

<h3 id="lcp-based-cache-invalidation">LCP-Based Cache Invalidation</h3>

<p>A unique challenge for streaming inputs is that the token sequence itself changes dynamically. When a new chunk arrives, the scheduler must decide which cached KV blocks remain valid. Naive full-invalidation wastes memory and forces expensive recomputation; reusing blocks without verification risks incorrect output based on stale cache.</p>

<p>Stream2LLM computes the <strong>longest common prefix (LCP)</strong> between the old and new input token sequences – the portion that has not changed. Only KV cache blocks beyond this prefix are invalidated, preserving blocks for unchanged tokens.</p>

<p>For example, suppose a request previously computed KV cache for tokens <code class="language-plaintext highlighter-rouge">[c1, c2, q, output1, output2]</code>, and an input update replaces this with <code class="language-plaintext highlighter-rouge">[c1, c2', q, output1, output2]</code> where <code class="language-plaintext highlighter-rouge">c2' != c2</code>. Treating each element as a single token position for illustration, the LCP is <code class="language-plaintext highlighter-rouge">[c1]</code> (length 1). The scheduler invalidates cache blocks for positions 1 onward while preserving the KV cache for position 0.</p>

<p>The LCP approach is optimal when input updates append new chunks or replace suffix tokens – typical in context retrieval.</p>
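<p>The prefix computation itself is small. The sketch below works on lists of symbolic tokens and reuses the examples from this post; the function name is illustrative, and a real system would operate on token IDs and map the prefix length onto cache blocks.</p>

```python
def longest_common_prefix_len(old_tokens, new_tokens):
    """Number of leading positions where the two sequences agree."""
    length = 0
    for old, new in zip(old_tokens, new_tokens):
        if old != new:
            break
        length += 1
    return length


# Append mode: a new chunk is inserted before the query.
print(longest_common_prefix_len(["c1", "q"], ["c1", "c2", "q"]))        # 1

# Update mode: the second document is replaced.
print(longest_common_prefix_len(["c1", "c3", "q"], ["c1", "c2", "q"]))  # 1

# The worked example above: only position 0 survives.
old = ["c1", "c2", "q", "output1", "output2"]
new = ["c1", "c2'", "q", "output1", "output2"]
print(longest_common_prefix_len(old, new))                              # 1
```

<p>Everything from the returned index onward is treated as invalid, even positions whose tokens happen to match later in the sequence, because their cached values depended on the tokens before them.</p>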

<h3 id="cost-based-preemption">Cost-Based Preemption</h3>

<p>When GPU memory is exhausted, Stream2LLM chooses between recomputing tokens and swapping KV blocks to CPU. Both have hardware-specific costs – recomputing 30K tokens takes ~100ms on H200 but ~1000ms on A40 – so the scheduler profiles both offline on each target GPU and picks the cheaper option at eviction time.</p>

<p><img src="/blog/assets/images/stream2llm/hardware_comparison_combined_latency_plot.png" alt="Hardware-specific cost models" class="figure-small" /></p>
<p class="image-caption">Performance models for recomputation vs. total swap latency costs across token counts on H200 and A40. The scheduler selects the lower-cost strategy at preemption time.</p>
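<p>The decision rule reduces to comparing two predicted latencies. The sketch below is only an illustration: the recompute models are straight lines through the two data points quoted above (about 100 ms and 1000 ms for 30K tokens), the swap model is a made-up placeholder, and real profiled curves need not be linear.</p>

```python
def choose_eviction(num_tokens, recompute_ms, swap_ms):
    """Return the cheaper eviction strategy for a request holding num_tokens."""
    if recompute_ms(num_tokens) <= swap_ms(num_tokens):
        return "recompute"
    return "swap"


def h200_recompute_ms(num_tokens):
    # Linear fit through the quoted point: ~100 ms for 30K tokens (assumption).
    return 100.0 * num_tokens / 30_000


def a40_recompute_ms(num_tokens):
    # Linear fit through the quoted point: ~1000 ms for 30K tokens (assumption).
    return 1000.0 * num_tokens / 30_000


def assumed_swap_ms(num_tokens):
    # Hypothetical swap-out plus swap-in cost; not a measured value.
    return 300.0 * num_tokens / 30_000


print(choose_eviction(30_000, h200_recompute_ms, assumed_swap_ms))  # recompute
print(choose_eviction(30_000, a40_recompute_ms, assumed_swap_ms))   # swap
```

<p>With these placeholder numbers the same request is recomputed on the faster GPU and swapped on the slower one, which is the kind of hardware-dependent flip the profiled models are meant to capture.</p>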

<h3 id="scheduling-policies">Scheduling Policies</h3>

<p>A note on labels used throughout the rest of the post: <strong>vLLM-NS</strong> is the unmodified vLLM baseline with no streaming, <strong>vLLM-S</strong> is vLLM with streaming but no custom scheduling, and the <strong>Stream2LLM-</strong>* variants are streaming combined with each of the custom scheduling policies introduced below.</p>

<p>The table below compares four scheduling policies, the vLLM-S default plus three that Stream2LLM implements:</p>

<table>
  <thead>
    <tr>
      <th>Scheduling Policy</th>
      <th style="text-align: center">Append Mode</th>
      <th style="text-align: center">Update Mode</th>
      <th>Key Behavior</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>vLLM-S (default streaming)</td>
      <td style="text-align: center">Poor</td>
      <td style="text-align: center">Poor</td>
      <td>Arrival-time ordering; ignores when chunks arrive. Catastrophic under memory pressure</td>
    </tr>
    <tr>
      <td>Stream2LLM-FCFS (First-Come-First-Served)</td>
      <td style="text-align: center">Excellent</td>
      <td style="text-align: center">Excellent</td>
      <td>Ensures requests with complete context get processed before partially-streamed ones, preventing starvation</td>
    </tr>
    <tr>
      <td>Stream2LLM-MCPS (Most Chunks Processed)</td>
      <td style="text-align: center">Poor</td>
      <td style="text-align: center">Poor</td>
      <td>Prioritizes requests with most tokens computed; priority collapses when updates reset progress</td>
    </tr>
    <tr>
      <td>Stream2LLM-LCAS (Last Chunk Arrival)</td>
      <td style="text-align: center">Excellent</td>
      <td style="text-align: center">Good</td>
      <td>Prioritizes most recent chunk arrival; best general-purpose choice</td>
    </tr>
  </tbody>
</table>

<p class="table-caption">Stream2LLM scheduling policies and their behavior under append-mode and update-mode streaming.</p>

<p>The core principle is to <strong>prioritize requests that just received fresh context</strong>. LCAS does this directly and handles both modes well. FCFS achieves the strongest results under memory pressure by cleanly separating complete from partial requests. MCPS looks reasonable in theory but fails in practice. It prioritizes by absolute computed token count – requests with the most tokens already processed go first. In append mode, a request that just received a fresh chunk has fewer computed tokens than one that has been processing longer, so fresh content sits idle while nearly-complete requests monopolize the GPU. Under memory pressure, these low-priority fresh-content requests are also the first to be evicted. In update mode it is worse: invalidation resets the computed count to near zero, dropping the request to lowest priority entirely.</p>
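<p>One way to picture the three policies is as sort keys over the same request state. The sketch below is an interpretation of the descriptions in the table, not Stream2LLM’s code: the field names are invented, and breaking FCFS ties by arrival time is an assumption.</p>

```python
from dataclasses import dataclass


@dataclass
class StreamingRequest:
    name: str
    arrival_time: float     # when the request was submitted
    last_chunk_time: float  # when its most recent chunk arrived
    computed_tokens: int    # tokens already prefilled
    complete: bool          # True once all context has arrived


def fcfs_key(r):
    # Complete requests first, then earliest arrival (tie-break is assumed).
    return (not r.complete, r.arrival_time)


def lcas_key(r):
    # Most recent chunk arrival first.
    return -r.last_chunk_time


def mcps_key(r):
    # Most computed tokens first.
    return -r.computed_tokens


old = StreamingRequest("old", arrival_time=0.0, last_chunk_time=1.0,
                       computed_tokens=9000, complete=False)
fresh = StreamingRequest("fresh", arrival_time=2.0, last_chunk_time=5.0,
                         computed_tokens=500, complete=False)
queue = [old, fresh]

print([r.name for r in sorted(queue, key=lcas_key)])  # ['fresh', 'old']
print([r.name for r in sorted(queue, key=mcps_key)])  # ['old', 'fresh']
```

<p>The two orderings disagree exactly in the case described above: the request that just received a chunk is first under LCAS and last under MCPS.</p>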

<h2 id="does-it-work">Does It Work?</h2>

<p>There are no public streaming workload traces. We built two – one for each retrieval pattern – and replayed them at varying queries per second (QPS).</p>

<table>
  <thead>
    <tr>
      <th>Metric</th>
      <th style="text-align: center">Vector Search (Update)</th>
      <th style="text-align: center">Crawler (Append)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Queries</td>
      <td style="text-align: center">500</td>
      <td style="text-align: center">4,322</td>
    </tr>
    <tr>
      <td>Mean tokens/query</td>
      <td style="text-align: center">13K</td>
      <td style="text-align: center">9.1K</td>
    </tr>
    <tr>
      <td>P50 tokens/query</td>
      <td style="text-align: center">10K</td>
      <td style="text-align: center">5.8K</td>
    </tr>
    <tr>
      <td>P95 tokens/query</td>
      <td style="text-align: center">31K</td>
      <td style="text-align: center">28.9K</td>
    </tr>
    <tr>
      <td>Mean retrieval latency</td>
      <td style="text-align: center">4.5s</td>
      <td style="text-align: center">9.9s</td>
    </tr>
  </tbody>
</table>

<p class="table-caption">Workload statistics for the vector search (update mode) and crawler (append mode) traces.</p>

<p>The update-mode trace comes from a disk-based approximate nearest neighbor search (ANNS) – a form of vector search – over a 372 GB corpus. The append-mode trace comes from a web crawler fetching pages up to depth 2. Both are high-latency, I/O-bound retrieval scenarios.</p>

<p>The workloads differ sharply in chunk arrival patterns:</p>

<p><img src="/blog/assets/images/stream2llm/chunk_arrival_histogram.png" alt="Distribution of inter-chunk arrival times" /></p>
<p class="image-caption">Distribution of inter-chunk arrival times. Vector search chunks arrive with a median of 36.7ms; crawler chunks arrive with a median of 700.7ms with significantly higher variability.</p>

<p><strong>Hardware</strong>: NVIDIA H200 (141GB) and H100 (80GB) GPUs with Llama-3.1-8B-Instruct, tensor parallelism of 2, and 80% GPU memory utilization target. All experiments target the prefill instance.</p>

<h3 id="latency-up-to-11x-faster">Latency: Up to 11x Faster</h3>

<p>Each panel below shows what fraction of requests exceed a given TTFT – curves further left mean lower latency. The y-axis is log-scale, so differences at the bottom of each plot reflect tail behavior.</p>
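<p>For readers unfamiliar with this kind of plot, a complementary CDF can be computed in a couple of lines. The sketch below uses made-up TTFT samples purely to show the calculation; none of the values come from the experiments.</p>

```python
def ccdf(samples, thresholds):
    """Fraction of samples strictly greater than each threshold."""
    total = len(samples)
    return [sum(1 for s in samples if s > t) / total for t in thresholds]


ttft_seconds = [0.2, 0.4, 0.4, 0.9, 3.0]  # made-up values for illustration
print(ccdf(ttft_seconds, [0.1, 0.4, 1.0]))  # [1.0, 0.4, 0.2]
```

<p>Reading the last value: 20% of these sample requests took longer than one second, which is the kind of tail behavior the lower part of each panel shows.</p>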

<p><img src="/blog/assets/images/stream2llm/ttft_ccdf_stacked_2x4.png" alt="TTFT CCDF across load levels" /></p>
<p class="image-caption">TTFT CCDF across load levels for the crawler (top) and vector search (bottom) workloads on H200. Streaming achieves up to 10.8-11.0x faster median latencies on the crawler workload and 2.49-2.63x P95 speedups on the vector search workload.</p>

<p><strong>Crawler workload (append mode):</strong> Streaming improves TTFT at every load level. The gap widens with load: $\sim 4\times$ faster at low QPS, $\sim 11\times$ at high QPS. Scheduler choice matters only when page arrival rate approaches the prefill rate – at QPS 4, FCFS delivers $3.2\times$ P95 speedup over vLLM-S while MCPS degrades to $0.7\times$, deprioritizing requests with fresh content.</p>

<p><img src="/blog/assets/images/stream2llm/ttft_qps_comparison_crawler_0_4.0_qps.png" alt="Crawler TTFT vs QPS" /></p>
<p class="image-caption">Crawler workload: TTFT (average and P95) vs. QPS on H200 across all schedulers.</p>

<p><strong>Vector search workload (update mode):</strong> At low load, all streaming schedulers converge near $2.5\times$ P95 speedup. At QPS 2, differentiation emerges: FCFS achieves $2.3\times$, LCAS $1.8\times$, and MCPS drops to $1.5\times$. Despite significant cache invalidation, streaming still delivers $\sim 2\times$ faster TTFT than non-streaming.</p>

<p><img src="/blog/assets/images/stream2llm/ttft_qps_comparison_anns_0_2.0_qps.png" alt="Vector search TTFT vs QPS" /></p>
<p class="image-caption">Vector search workload: TTFT (average and P95) vs. QPS on H200 across all schedulers.</p>

<h3 id="throughput-is-unaffected">Throughput Is Unaffected</h3>

<p>The latency gains raise a natural question: does streaming create overhead that slows total job completion? No. Trace completion times are near-identical across all methods (within ~1%).</p>

<p><img src="/blog/assets/images/stream2llm/trace_completion_time_combined.png" alt="Trace completion times" /></p>
<p class="image-caption">Trace completion time across QPS levels for both workloads. All scheduler variants achieve near-identical completion times, confirming throughput parity.</p>

<h3 id="when-memory-runs-out">When Memory Runs Out</h3>

<p>The H200’s 141 GB is generous enough that preemption never triggers at normal loads. To force the issue, we increased chunk delays ($10\times$ for crawler, $30\times$ for vector search), saturating the KV cache pool. This is where scheduler choice becomes critical. The numbers to watch are the P99 columns – that is where naive streaming collapses.</p>

<p>In the tables below, the first row (vLLM-NS, the non-streaming baseline) reports absolute TTFT in seconds; every other row reports speedup relative to that baseline. Values below 1.0x mean streaming is <em>slower</em> than not streaming.</p>

<div class="table-wrapper">

  <table>
    <thead>
      <tr>
        <th>Scheduler</th>
        <th style="text-align: center">Recompute P50</th>
        <th style="text-align: center">Swap P50</th>
        <th style="text-align: center">Cost-Based P50</th>
        <th style="text-align: center">Recompute P99</th>
        <th style="text-align: center">Swap P99</th>
        <th style="text-align: center">Cost-Based P99</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>vLLM-NS</td>
        <td style="text-align: center">0.64s</td>
        <td style="text-align: center">0.64s</td>
        <td style="text-align: center">0.64s</td>
        <td style="text-align: center">8.97s</td>
        <td style="text-align: center">8.97s</td>
        <td style="text-align: center">8.97s</td>
      </tr>
      <tr>
        <td>vLLM-S</td>
        <td style="text-align: center">8.59x</td>
        <td style="text-align: center">7.29x</td>
        <td style="text-align: center">8.25x</td>
        <td style="text-align: center">0.78x</td>
        <td style="text-align: center">0.66x</td>
        <td style="text-align: center">0.71x</td>
      </tr>
      <tr>
        <td>Stream2LLM-FCFS</td>
        <td style="text-align: center"><strong>8.77x</strong></td>
        <td style="text-align: center"><strong>7.78x</strong></td>
        <td style="text-align: center"><strong>8.30x</strong></td>
        <td style="text-align: center"><strong>10.03x</strong></td>
        <td style="text-align: center"><strong>6.69x</strong></td>
        <td style="text-align: center">8.62x</td>
      </tr>
      <tr>
        <td>Stream2LLM-LCAS</td>
        <td style="text-align: center">8.61x</td>
        <td style="text-align: center">7.32x</td>
        <td style="text-align: center">7.82x</td>
        <td style="text-align: center">9.23x</td>
        <td style="text-align: center">4.80x</td>
        <td style="text-align: center"><strong>9.14x</strong></td>
      </tr>
      <tr>
        <td>Stream2LLM-MCPS</td>
        <td style="text-align: center">5.92x</td>
        <td style="text-align: center">3.86x</td>
        <td style="text-align: center">4.96x</td>
        <td style="text-align: center">0.73x</td>
        <td style="text-align: center">0.48x</td>
        <td style="text-align: center">0.77x</td>
      </tr>
    </tbody>
  </table>

</div>

<p class="table-caption">Crawler workload (append mode, 4.0 QPS, 10x delays): TTFT speedups at P50 and P99 relative to the non-streaming baseline.</p>

<div class="table-wrapper">

  <table>
    <thead>
      <tr>
        <th>Scheduler</th>
        <th style="text-align: center">Recompute P50</th>
        <th style="text-align: center">Swap P50</th>
        <th style="text-align: center">Cost-Based P50</th>
        <th style="text-align: center">Recompute P99</th>
        <th style="text-align: center">Swap P99</th>
        <th style="text-align: center">Cost-Based P99</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>vLLM-NS</td>
        <td style="text-align: center">0.72s</td>
        <td style="text-align: center">0.72s</td>
        <td style="text-align: center">0.72s</td>
        <td style="text-align: center">6.56s</td>
        <td style="text-align: center">6.56s</td>
        <td style="text-align: center">6.56s</td>
      </tr>
      <tr>
        <td>vLLM-S</td>
        <td style="text-align: center">2.53x</td>
        <td style="text-align: center">2.26x</td>
        <td style="text-align: center">2.63x</td>
        <td style="text-align: center">0.19x</td>
        <td style="text-align: center">0.18x</td>
        <td style="text-align: center">0.19x</td>
      </tr>
      <tr>
        <td>Stream2LLM-FCFS</td>
        <td style="text-align: center"><strong>2.70x</strong></td>
        <td style="text-align: center"><strong>2.37x</strong></td>
        <td style="text-align: center"><strong>2.70x</strong></td>
        <td style="text-align: center"><strong>1.84x</strong></td>
        <td style="text-align: center"><strong>1.26x</strong></td>
        <td style="text-align: center"><strong>2.04x</strong></td>
      </tr>
      <tr>
        <td>Stream2LLM-LCAS</td>
        <td style="text-align: center">2.38x</td>
        <td style="text-align: center">1.62x</td>
        <td style="text-align: center">2.44x</td>
        <td style="text-align: center">1.79x</td>
        <td style="text-align: center">0.97x</td>
        <td style="text-align: center">1.79x</td>
      </tr>
      <tr>
        <td>Stream2LLM-MCPS</td>
        <td style="text-align: center">2.20x</td>
        <td style="text-align: center">1.47x</td>
        <td style="text-align: center">2.23x</td>
        <td style="text-align: center">0.19x</td>
        <td style="text-align: center">0.22x</td>
        <td style="text-align: center">0.19x</td>
      </tr>
    </tbody>
  </table>

</div>

<p class="table-caption">Vector search workload (update mode, 2.0 QPS, 30x delays): TTFT speedups at P50 and P99 relative to the non-streaming baseline.</p>
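<p>Because the tables mix absolute seconds (baseline row) with speedups (all other rows), it can help to convert a speedup back into an absolute TTFT. The sketch below does that arithmetic with values copied from the tables; the helper name <code>absolute_ttft</code> is hypothetical, and the results are only approximate because the published numbers are rounded.</p>

```python
def absolute_ttft(baseline_seconds, speedup):
    """Convert a speedup relative to the baseline back into seconds."""
    return baseline_seconds / speedup


# P99 baselines (vLLM-NS row) from the two tables above, in seconds.
crawler_p99_baseline = 8.97
vector_p99_baseline = 6.56

# Cost-Based P99 column: vLLM-S versus Stream2LLM-FCFS.
crawler_vllm_s = absolute_ttft(crawler_p99_baseline, 0.71)
crawler_fcfs = absolute_ttft(crawler_p99_baseline, 8.62)
vector_vllm_s = absolute_ttft(vector_p99_baseline, 0.19)
vector_fcfs = absolute_ttft(vector_p99_baseline, 2.04)

for name, value in [
    ("crawler vLLM-S", crawler_vllm_s),
    ("crawler FCFS", crawler_fcfs),
    ("vector search vLLM-S", vector_vllm_s),
    ("vector search FCFS", vector_fcfs),
]:
    print(f"{name}: {value:.2f} s")
```

<p>Read this way, a $0.19\times$ "speedup" on the vector search workload corresponds to a P99 TTFT of roughly 34.5 seconds against a 6.56-second baseline.</p>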

<p>Two key findings:</p>

<ol>
  <li>
    <p><strong>vLLM-S collapses under memory pressure.</strong> At P99 with cost-based eviction, it degrades to $0.71\times$ (crawler) and $0.19\times$ (vector search), and the recompute-only and swap-only columns are no better – streaming without a proper scheduling policy is actively harmful under contention.</p>
  </li>
  <li>
    <p><strong>Eviction strategy is hardware-dependent.</strong> Recompute-only FCFS reaches $10.03\times$ at P99 on the crawler workload but swap-only drops to $6.69\times$. The cost-based approach adapts to the hardware, achieving $8.62\times$ (FCFS) and $9.14\times$ (LCAS) by picking the cheaper option at each eviction.</p>
  </li>
</ol>
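<p>To make "picking the cheaper option at each eviction" concrete, here is a minimal sketch of a cost-based choice between recompute and swap. It is not Stream2LLM's implementation: the <code>EvictionProfile</code> fields, the linear cost model, the assumption that a swap pays for the transfer twice, and all numeric values are illustrative assumptions standing in for whatever an offline profile of your hardware would measure.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvictionProfile:
    """Hypothetical per-GPU constants obtained from an offline profile."""
    prefill_tokens_per_s: float  # how fast evicted tokens can be recomputed
    swap_bytes_per_s: float      # host-to-device transfer bandwidth
    kv_bytes_per_token: int      # KV cache footprint of a single token


def recompute_cost_s(profile, num_tokens):
    # Assumption: prefill time grows linearly with the number of tokens.
    return num_tokens / profile.prefill_tokens_per_s


def swap_cost_s(profile, num_tokens):
    # Assumption: swapping moves the KV cache out to host memory and back in.
    total_bytes = 2 * num_tokens * profile.kv_bytes_per_token
    return total_bytes / profile.swap_bytes_per_s


def choose_eviction(profile, num_tokens):
    """Return the cheaper strategy; ties go to recompute (listed first)."""
    costs = {
        "recompute": recompute_cost_s(profile, num_tokens),
        "swap": swap_cost_s(profile, num_tokens),
    }
    return min(costs, key=costs.get)


# Two made-up profiles that differ only in prefill speed.
slow_prefill = EvictionProfile(20_000.0, 25e9, 131_072)
fast_prefill = EvictionProfile(200_000.0, 25e9, 131_072)

print(choose_eviction(slow_prefill, 4096))
print(choose_eviction(fast_prefill, 4096))
```

<p>With identical transfer bandwidth, the two made-up profiles pick different strategies, which is the sense in which the right eviction policy depends on the hardware.</p>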

<h2 id="what-this-means">What This Means</h2>

<p>If you serve LLM requests backed by retrieval – web crawling, vector search, tool calls – you are paying a latency tax every time you wait for complete context before starting prefill.</p>

<p>Three takeaways:</p>

<ol>
  <li><strong>Start prefilling on partial context.</strong> Median TTFT improves 4-11x for append-mode workloads. If your retrieval takes more than a second, this dominates every other optimization you could make.</li>
  <li><strong>Do not stream without a proper scheduling policy.</strong> vLLM-S under memory pressure produces P99 latencies up to 5x worse than not streaming ($0.19\times$ on the vector search workload). Use FCFS or LCAS.</li>
  <li><strong>Profile your eviction costs.</strong> Whether to recompute or swap depends on your GPU. A five-minute offline profile determines the right strategy.</li>
</ol>

<p>The system, the traces, and the evaluation code are all public.</p>

<p><strong>Paper</strong>: <a href="https://rajveerb.com/assets/stream2llm-mlsys26.pdf">Stream2LLM (MLSys 2026)</a></p>

<p><strong>Code</strong>: <a href="https://github.com/rajveerb/stream2llm/tree/mlsys_artifact">github.com/rajveerb/stream2llm</a></p>

<p><strong>Data</strong>: <a href="https://huggingface.co/datasets/rbachkaniwala3/stream2llm-data">HuggingFace: rbachkaniwala3/stream2llm-data</a></p>

<p><strong>DOI</strong>: <a href="https://doi.org/10.5281/zenodo.18906769">10.5281/zenodo.18906769</a></p>]]></content><author><name>Rajveer Bachkaniwala</name></author><category term="research" /><category term="ai-systems" /><category term="inference" /><summary type="html"><![CDATA[Stream2LLM extends vLLM to support streaming inputs in LLM inference under concurrency, achieving up to 11x faster time-to-first-token without sacrificing throughput.]]></summary></entry></feed>