We were logging 50,000 LLM requests per day to PostgreSQL. Query latency was fine. At 400,000 requests, cost aggregation queries started taking 3 seconds. At 2 million, the database was the slowest thing in the stack — slower than the LLM calls themselves.
We switched to ClickHouse. Here's exactly what we did, why, and what the latency numbers look like now.
1. LLM request logs are append-only, high-cardinality, and read through aggregation queries. That's a ClickHouse workload — not PostgreSQL.
2. Switching dropped our cost dashboard p95 query latency from 3.2s to 180ms on 10M+ rows.
3. Our async write path keeps logging overhead under 2ms p95 — the client never waits for a log write.
## The Schema That Broke PostgreSQL
Our initial log schema was straightforward:
```sql
CREATE TABLE llm_requests (
    id UUID PRIMARY KEY,
    tenant_id UUID NOT NULL,
    model TEXT NOT NULL,
    provider TEXT NOT NULL,
    feature TEXT,
    user_id TEXT,
    input_tokens INT,
    output_tokens INT,
    cost_usd NUMERIC(12,8),
    latency_ms INT,
    cached BOOLEAN,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX idx_tenant_created ON llm_requests (tenant_id, created_at DESC);
CREATE INDEX idx_model ON llm_requests (model);
CREATE INDEX idx_feature ON llm_requests (feature);
```
This worked fine at low volume. The problem was our query pattern. The dashboard runs queries like:
```sql
-- Cost by model, last 30 days
SELECT model, SUM(cost_usd), COUNT(*)
FROM llm_requests
WHERE tenant_id = $1
  AND created_at > NOW() - INTERVAL '30 days'
GROUP BY model;

-- Cost by feature, last 7 days
SELECT feature, SUM(cost_usd), AVG(latency_ms)
FROM llm_requests
WHERE tenant_id = $1
  AND created_at > NOW() - INTERVAL '7 days'
GROUP BY feature
ORDER BY SUM(cost_usd) DESC;
```
At 10 million rows, these GROUP BY queries scanned hundreds of thousands of rows per tenant despite the index. PostgreSQL is row-oriented — to compute SUM(cost_usd), it reads entire rows to extract one column. At scale, that's a lot of I/O for a field you could pre-aggregate.
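Before blaming the database, it's worth confirming where the time actually goes. A sketch of the diagnostic, using Postgres's `EXPLAIN (ANALYZE, BUFFERS)` — the tenant UUID is a placeholder, and the exact plan shape and buffer counts will depend on your data and Postgres version:

```sql
-- Expect an index scan on (tenant_id, created_at) feeding a
-- HashAggregate; the "shared read" buffer counts reveal how much
-- heap I/O the aggregation still has to do per query.
EXPLAIN (ANALYZE, BUFFERS)
SELECT model, SUM(cost_usd), COUNT(*)
FROM llm_requests
WHERE tenant_id = '00000000-0000-0000-0000-000000000000'
  AND created_at > NOW() - INTERVAL '30 days'
GROUP BY model;
```

If the buffer counts scale linearly with the number of matching rows, no index will save you; the workload is a scan-and-aggregate, which is exactly what a columnar store is built for.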
## Why LLM Logs Are a ClickHouse Workload
Three properties of LLM request logs make them a natural fit for ClickHouse:
- **Append-only writes.** LLM logs are never updated or deleted (outside retention policies). ClickHouse is optimized for high-throughput inserts and efficient sequential reads — it doesn't pay the overhead PostgreSQL does for MVCC (multi-version concurrency control) on write-heavy workloads.
- **Columnar storage.** When you query SUM(cost_usd) GROUP BY model, ClickHouse reads only the cost_usd and model columns from disk. PostgreSQL reads entire rows. For a table with 20 columns where you're aggregating 2, ClickHouse does 10% of the I/O.
- **Materialized views.** ClickHouse can pre-aggregate data at insert time. When a new log row lands, a materialized view can increment a running total in a summary table. Your dashboard query hits the summary — not the raw logs. This is what takes the query from 3.2 seconds to 12ms.
For comparison: Langfuse uses PostgreSQL for its logging backend. It works well at lower volumes and for teams that need ACID transactions. If you're running more than 1M requests/month and querying with analytics patterns, you'll eventually feel the difference.
Want to see your LLM costs broken down by model and feature?
Preto logs every request to ClickHouse and surfaces per-feature cost attribution in real time.
See What Your LLM Spend Looks Like
Free forever for up to 10K requests. No credit card.
## Our ClickHouse Schema
The production schema uses a MergeTree engine, partitioned by month, ordered by (tenant_id, created_at) to colocate each tenant's data on disk for fast tenant-scoped scans:
```sql
CREATE TABLE llm_requests (
    request_id String,
    tenant_id String,
    model LowCardinality(String),
    provider LowCardinality(String),
    feature LowCardinality(String),
    user_id String,
    input_tokens UInt32,
    output_tokens UInt32,
    cost_usd Float64,
    latency_ms UInt32,
    cached UInt8,
    created_at DateTime
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(created_at)
ORDER BY (tenant_id, created_at)
SETTINGS index_granularity = 8192;
```
A few details worth noting:
- `LowCardinality(String)` for `model`, `provider`, and `feature` — these columns have bounded value sets. ClickHouse dictionary-encodes them automatically, cutting storage by ~4x and improving scan speed.
- No UUID primary key — ClickHouse doesn't use a traditional primary key for row lookups. The `ORDER BY` clause determines sort order and the sparse index ClickHouse builds internally.
- Monthly partitions let us drop old data efficiently with `ALTER TABLE ... DROP PARTITION` instead of slow DELETE queries.
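The retention path is a metadata operation rather than a row-by-row delete. A minimal sketch, assuming the monthly `toYYYYMM` partitioning above; the partition ID `202403` is illustrative:

```sql
-- Drop all of March 2024 in one operation. The partition ID is the
-- toYYYYMM value of the partitioning expression (YYYYMM as a number).
ALTER TABLE llm_requests DROP PARTITION 202403;
```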
## Materialized Views That Pre-Aggregate at Insert Time
This is the highest-leverage feature in ClickHouse for a logging use case. Every time a row is inserted into llm_requests, the materialized view fires and updates a running aggregate in a separate summary table.
```sql
-- Summary table (append-only, aggregated)
CREATE TABLE cost_by_model_daily (
    tenant_id String,
    model LowCardinality(String),
    day Date,
    total_cost AggregateFunction(sum, Float64),
    total_reqs AggregateFunction(count),
    avg_latency AggregateFunction(avg, UInt32)
)
ENGINE = AggregatingMergeTree()
ORDER BY (tenant_id, model, day);

-- Materialized view that populates it
CREATE MATERIALIZED VIEW cost_by_model_mv
TO cost_by_model_daily AS
SELECT
    tenant_id,
    model,
    toDate(created_at) AS day,
    sumState(cost_usd) AS total_cost,
    countState() AS total_reqs,
    avgState(latency_ms) AS avg_latency
FROM llm_requests
GROUP BY tenant_id, model, day;
```
Dashboard queries now read from cost_by_model_daily — a pre-aggregated table with orders of magnitude fewer rows than the raw log. The *State/*Merge function pattern is ClickHouse's way of storing partial aggregation states so they can be merged at query time.
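Reading the summary table requires the matching `*Merge` combinators to finalize those stored partial states. A sketch of the dashboard query against `cost_by_model_daily` — `{tenant:String}` is ClickHouse's named-parameter syntax; bind however your client prefers:

```sql
SELECT
    model,
    sumMerge(total_cost)   AS cost_usd,
    countMerge(total_reqs) AS requests,
    avgMerge(avg_latency)  AS avg_latency_ms
FROM cost_by_model_daily
WHERE tenant_id = {tenant:String}
  AND day >= today() - 30
GROUP BY model
ORDER BY cost_usd DESC;
```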
Query latency before and after:

| Query target | p95 latency |
|---|---|
| PostgreSQL, 10M rows, GROUP BY model | 3.2s |
| ClickHouse raw table, same 10M rows | 180ms |
| ClickHouse pre-aggregated summary | 12ms |
## The Async Write Path: How Logging Stays Under 2ms
Getting the query latency down was the first problem. The second was making sure logging never adds meaningful latency to the LLM request path.
Our architecture in Go:
```go
// logCh buffers log entries in memory
var logCh = make(chan LogEntry, 10_000)

// In the request handler — fire and forget
func (p *Proxy) logRequest(entry LogEntry) {
	select {
	case logCh <- entry:
		// buffered successfully
	default:
		// channel full — drop rather than block the request
		metrics.Increment("log_dropped")
	}
}

// Background goroutine: batch flush every 500ms
func runLogFlusher(ch <-chan LogEntry, db *clickhouse.Conn) {
	ticker := time.NewTicker(500 * time.Millisecond)
	defer ticker.Stop()
	batch := make([]LogEntry, 0, 500)
	for {
		select {
		case entry := <-ch:
			batch = append(batch, entry)
			if len(batch) >= 500 {
				flushBatch(db, batch)
				batch = batch[:0]
			}
		case <-ticker.C:
			if len(batch) > 0 {
				flushBatch(db, batch)
				batch = batch[:0]
			}
		}
	}
}
```
The client gets its LLM response immediately. The log entry goes into a buffered channel. A background goroutine flushes to ClickHouse in batches of up to 500 entries every 500ms. If the channel is full (backpressure), we drop the log entry and increment a metric — we'd rather lose a log than slow down a request.
At Preto, this gives us p95 logging overhead under 2ms. The ClickHouse batch insert itself takes 10–30ms for 500 rows — invisible to any individual request.
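`flushBatch` is referenced above but not shown. A dependency-free sketch of its contract, with the ClickHouse insert stubbed behind an `Inserter` interface so the logic is testable without a server — all names here are illustrative, and a real implementation would wrap the driver's batch insert:

```go
package main

import "fmt"

// LogEntry holds the per-request fields to insert; this is a
// trimmed, illustrative subset of the schema above.
type LogEntry struct {
	RequestID string
	TenantID  string
	Model     string
	CostUSD   float64
}

// Inserter abstracts the ClickHouse batch insert so flush logic
// can be exercised in tests with a fake.
type Inserter interface {
	InsertBatch(entries []LogEntry) error
}

// flushBatch writes one batch, dropping it on error (the same
// drop-over-block policy as the channel send). Returns rows flushed.
func flushBatch(db Inserter, batch []LogEntry) int {
	if len(batch) == 0 {
		return 0
	}
	if err := db.InsertBatch(batch); err != nil {
		// In production: increment a dropped-batch metric here.
		return 0
	}
	return len(batch)
}

// fakeInserter counts rows instead of talking to ClickHouse.
type fakeInserter struct{ rows int }

func (f *fakeInserter) InsertBatch(entries []LogEntry) error {
	f.rows += len(entries)
	return nil
}

func main() {
	db := &fakeInserter{}
	n := flushBatch(db, []LogEntry{{RequestID: "a"}, {RequestID: "b"}})
	fmt.Println(n, db.rows)
}
```

Keeping the insert behind an interface also makes the failure policy explicit: a batch that can't be written is counted and discarded, never retried on the hot path.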
## ClickHouse vs. PostgreSQL: When to Switch
| Factor | PostgreSQL | ClickHouse |
|---|---|---|
| Write pattern | Mixed read/write, ACID | Append-only inserts |
| Query pattern | Point lookups, joins | Aggregations, scans, GROUP BY |
| Performance ceiling | ~5M rows before index degradation | Billions of rows, still sub-second |
| Operational complexity | Low — most teams know it | Medium — different mental model |
| Materialized views | Manual refresh required | Real-time, insert-triggered |
| Right choice when | <500K requests/month | >1M requests/month |
Don't over-engineer. If you're logging under 500K LLM requests per month, a Postgres table with a (tenant_id, created_at) index is fine. Add ClickHouse when dashboard queries start timing out — not before.
If you're evaluating LLM observability tools and curious how different products approach the storage question, see how Preto compares to Helicone and Langfuse.
See what your LLM spend looks like, broken down by model and feature.
Preto logs every request through a ClickHouse-backed pipeline and surfaces per-feature cost attribution in real time. One URL change — no code refactor required.
Start for Free — 10K Requests Included
Free forever for up to 10K requests. No credit card.