OpenAI streaming looks simple from the outside. Set stream: true, iterate the response, pipe it to the client. One afternoon of work.

Then you ship it. A client disconnects mid-generation and you eat 2,000 tokens nobody received. A slow mobile client causes your proxy's memory to climb. An OpenAI rate limit hits after you've already sent a 200. Here's what actually happens when you proxy SSE at scale — and the patterns that fix each failure mode.

TL;DR

1. There are four production failure modes in SSE proxying: chunk boundary corruption, token leaks on client disconnect, unbounded buffering under backpressure, and mid-stream errors after a 200 is already sent.
2. Each has a clear fix — all implemented in ~50 lines of Go.
3. Preto proxies 5,000+ streaming req/s at <50ms p95 overhead. These are the patterns running in production.

How LLM SSE Actually Works

OpenAI's streaming response is plain HTTP with Content-Type: text/event-stream. The body is a sequence of newline-delimited frames:

data: {"id":"chatcmpl-...","choices":[{"delta":{"content":"Hello"},...}]}

data: {"id":"chatcmpl-...","choices":[{"delta":{"content":" world"},...}]}

data: [DONE]

Each event is data: {JSON}\n\n — a line starting with data: , then a blank line as a separator. The stream ends with the literal data: [DONE]\n\n sentinel.

A proxy needs to read this upstream, optionally inspect or annotate the frames, and write them downstream to the client without adding latency or modifying the framing. That's the theory. Here's what breaks in practice.

Failure Mode 1

Chunk boundary corruption

TCP delivers what it wants. A single SSE event may arrive split across multiple reads, or multiple events in one read.

Failure Mode 2

Token leak on disconnect

Client closes the connection. Your proxy keeps the upstream open. OpenAI generates 1,800 more tokens nobody receives. You pay for all of them.

Failure Mode 3

Backpressure and OOM

A slow client reads slower than OpenAI sends. Your buffer grows unboundedly. Under load, this kills the process.

Failure Mode 4

Mid-stream errors

OpenAI returns a 429 or 503 after you've already sent HTTP 200 downstream. You can't send a new status code. The client sees a truncated stream with no explanation.

Failure 1: The Chunk Boundary Problem

If you only need to pass the stream through without inspecting it, io.Copy is fine — TCP handles reassembly. The problem comes when you need to inspect each SSE frame: to extract token counts, inject annotations, or filter events.

The wrong approach: read into a fixed buffer and parse what you get. A single Read() call may return half an event, two events, or anything in between.

The right approach: use Go's bufio.Scanner with a custom split function that understands SSE framing:

func proxySSE(upstream io.ReadCloser, w http.ResponseWriter, onEvent func([]byte)) {
  scanner := bufio.NewScanner(upstream)
  scanner.Split(scanSSEEvents) // custom split — see below

  flusher := w.(http.Flusher)

  for scanner.Scan() {
    line := scanner.Bytes()
    w.Write(line)
    w.Write([]byte("\n\n"))
    flusher.Flush() // push each event to the client immediately

    if onEvent != nil && bytes.HasPrefix(line, []byte("data: ")) {
      onEvent(line[6:]) // strip "data: " prefix before passing to handler
    }
  }
}

// scanSSEEvents splits on double-newline (SSE event boundary)
func scanSSEEvents(data []byte, atEOF bool) (advance int, token []byte, err error) {
  if atEOF && len(data) == 0 {
    return 0, nil, nil
  }
  if i := bytes.Index(data, []byte("\n\n")); i >= 0 {
    return i + 2, bytes.TrimRight(data[:i], "\n"), nil
  }
  if atEOF {
    return len(data), bytes.TrimRight(data, "\n"), nil
  }
  return 0, nil, nil // request more data
}

The Flusher call on every event is critical. Without it, Go's http.ResponseWriter buffers writes and the client gets chunks in batches — defeating the purpose of streaming.

Want to see exactly how much overhead your streaming proxy adds?

Preto tracks proxy overhead vs. provider TTFT per request. See where your milliseconds go.

See Your LLM Costs Free — 10K Requests Included

No credit card required. Works with OpenAI, Anthropic, and more.

Failure 2: Token Leaks on Client Disconnect

This is the most expensive failure mode. When a client closes the connection mid-stream — browser tab closed, mobile app backgrounded, network timeout — a naive proxy keeps the upstream request to OpenAI running. OpenAI finishes generating the full completion. You're billed for every token.

At 1,000 req/s with a 5% disconnect rate and 500 average output tokens: that's 25,000 wasted tokens per second — hundreds of dollars per day at GPT-4.1-nano pricing, more at GPT-4.1.

The fix is Go context propagation. Pass the client's request context to the upstream HTTP call. When the client disconnects, Go's net/http server cancels the request context, which cascades to the upstream call:

func (p *Proxy) handleStream(w http.ResponseWriter, r *http.Request) {
  // r.Context() is cancelled automatically when the client disconnects
  ctx := r.Context()

  upstreamReq, err := http.NewRequestWithContext(ctx, "POST",
    "https://api.openai.com/v1/chat/completions",
    r.Body,
  )
  if err != nil {
    http.Error(w, "upstream error", 502)
    return
  }
  upstreamReq.Header = r.Header.Clone()

  resp, err := p.client.Do(upstreamReq)
  if err != nil {
    if errors.Is(err, context.Canceled) {
      // client disconnected — upstream cancelled, no token leak
      return
    }
    http.Error(w, "upstream error", 502)
    return
  }
  defer resp.Body.Close()

  // ... proxy the stream
}

The key: http.NewRequestWithContext(ctx, ...) instead of http.NewRequest. When the client context is cancelled, Go's HTTP client aborts the upstream connection. OpenAI stops generating. You stop paying.

Failure 3: Backpressure and Unbounded Buffering

OpenAI streams tokens at roughly 50–100 tokens per second for GPT-4.1. A client on a fast connection reads faster than that — no problem. A client on a slow connection, or one that's processing each chunk before reading the next, can fall behind.

If your proxy writes to the client's ResponseWriter synchronously, Go's HTTP server buffers unread data in kernel socket buffers. Those buffers are bounded (typically 64KB–256KB per connection). When they fill, the Write() call blocks — which blocks your upstream reader — which causes the upstream TCP window to fill — which causes OpenAI's server to pause sending. The stream stalls.

The dangerous alternative is an unbounded in-memory buffer between the upstream reader and the downstream writer. Under load with many slow clients, this causes OOM.

Our approach: a bounded channel between reader and writer goroutines, with a timeout on the write side:

func (p *Proxy) streamWithBackpressure(ctx context.Context,
  upstream io.ReadCloser, w http.ResponseWriter) {

  eventCh := make(chan []byte, 64) // bounded: 64 events max in-flight
  flusher := w.(http.Flusher)

  // Reader goroutine: upstream → channel
  go func() {
    defer close(eventCh)
    scanner := bufio.NewScanner(upstream)
    scanner.Split(scanSSEEvents)
    for scanner.Scan() {
      select {
      case eventCh <- append([]byte{}, scanner.Bytes()...):
      case <-ctx.Done():
        return
      case <-time.After(5 * time.Second):
        // client too slow — abort
        return
      }
    }
  }()

  // Writer loop (runs on the handler goroutine): channel → client
  for event := range eventCh {
    w.Write(event)
    w.Write([]byte("\n\n"))
    flusher.Flush()
  }
}

The 5-second timeout on the channel send is the backpressure relief valve. If the downstream writer hasn't consumed the last event within 5 seconds, the reader goroutine exits and closes the channel, which lets the writer loop drain the remaining buffered events and exit cleanly. The caller then cancels the upstream context (or closes the upstream body) so OpenAI stops generating.

Failure 4: Mid-Stream Errors After HTTP 200

HTTP status codes are sent in the response header — before the body. Once you've written 200 OK and started streaming, you cannot send a 429 Too Many Requests or 503 if something goes wrong upstream. The client already accepted the response as successful.

This happens in real production traffic. OpenAI sends a 200 header and begins streaming, then hits an internal rate limit or context overflow mid-generation. The stream ends abruptly. Without handling, the client sees a truncated response with no indication of what happened — and typically retries, paying for partial tokens twice.

The SSE spec has no built-in error channel. The convention (used by OpenAI themselves in their error handling) is to send a final event with an error payload before closing:

func writeSSEError(w http.ResponseWriter, code string, message string) {
  flusher, ok := w.(http.Flusher)
  if !ok {
    return
  }
  // Marshal the payload rather than string-formatting it, so quotes or
  // newlines in the message can't break the JSON or the SSE framing.
  payload, err := json.Marshal(map[string]any{
    "error": map[string]string{"code": code, "message": message},
  })
  if err != nil {
    return
  }
  fmt.Fprintf(w, "data: %s\n\n", payload)
  flusher.Flush()
}

// In your stream handler, after the scanner loop:
if err := scanner.Err(); err != nil {
  if !errors.Is(err, context.Canceled) {
    writeSSEError(w, "stream_error", "upstream stream interrupted")
  }
}

Clients handling streaming responses should check every data: payload for an error key, not just the HTTP status. This is especially important for retry logic — a truncated stream with an in-band error should retry differently from a clean stream that completed normally.

Cost Tracking Without Blocking the Stream

One more wrinkle: you can't calculate output token cost until the stream completes, and by default OpenAI's streaming responses don't include usage stats at all. You'd have to run your own tokenizer over the deltas — and even then you'd only be estimating.

The solution: pass stream_options: {"include_usage": true} in your OpenAI request body. OpenAI sends a final chunk before [DONE] with the exact usage stats:

data: {"id":"...","choices":[],"usage":{"prompt_tokens":142,"completion_tokens":387,"total_tokens":529}}

data: [DONE]

In your onEvent handler, parse each chunk and look for the usage field. When you find it, record the cost asynchronously (fire the log entry into your channel) and pass the chunk through to the client unmodified. Zero latency added — the usage chunk arrives after all content tokens have already streamed.

At Preto, we capture this usage chunk and attribute cost to the calling feature, team, and model — giving you per-feature cost breakdown that the OpenAI dashboard doesn't provide. See how we store and aggregate those logs at scale.

Frequently Asked Questions

What is SSE in the context of LLM APIs?
Server-Sent Events is the streaming protocol OpenAI, Anthropic, and other providers use to stream tokens as they're generated. Instead of waiting for the full response, the client receives a series of newline-delimited JSON chunks prefixed with data: . The stream ends with data: [DONE]. SSE uses standard HTTP with Content-Type: text/event-stream — no WebSocket required.
Why does LLM SSE proxying cause token leaks?
When a client disconnects mid-stream, a naive proxy keeps the upstream OpenAI connection open. OpenAI continues generating — tokens you're billed for that no client receives. The fix: use http.NewRequestWithContext(r.Context(), ...) for the upstream call. When the client disconnects, Go cancels the request context, which propagates to the upstream HTTP call and terminates it.
How do you handle errors mid-stream in an SSE proxy?
Once you've sent HTTP 200 and started streaming, you can't change the status code. If OpenAI errors mid-stream, signal it in-band: send a final SSE event with an error payload (data: {"error": {...}}) and close the connection. Clients should check every data: payload for an error key, not just the HTTP status code.
How do you track token costs on streaming responses?
Add stream_options: {"include_usage": true} to the OpenAI request. OpenAI sends a final chunk with full usage stats before the [DONE] sentinel. Your proxy captures this chunk, records the cost asynchronously, and passes it through to the client — zero latency added.
What is backpressure in SSE proxying?
Backpressure occurs when a client reads slower than the upstream LLM sends. If your proxy buffers naively, memory grows unboundedly under slow clients. The fix: a bounded channel between the upstream reader goroutine and the downstream writer goroutine. If the buffer fills within a timeout, close the connection rather than letting it stall indefinitely.

See what's happening inside your LLM request path.

Preto proxies streaming LLM requests at <50ms p95 overhead — logging cost, latency, and per-feature attribution without blocking a single token. One URL change to get started.

See Your LLM Costs Free — Start in 5 Minutes

Free forever up to 10K requests. No credit card required.

Gaurav Dagade

Founder of Preto.ai. 11 years engineering leadership. Previously Engineering Manager at Bynry. Building the cost intelligence layer for AI infrastructure.

LinkedIn · Twitter