R2 SQL in Action: Statistical Analysis

Linear regression on AI Gateway telemetry — does token count predict latency?

latency-regression.sql
-- Per-model linear fit of latency on token count.
-- regr_slope(Y, X) takes the dependent variable first,
-- so regr_slope(duration_ms, total_tokens) reads as ms per token.
SELECT model,
       ROUND(corr(CAST(total_tokens AS DOUBLE),
                  CAST(duration_ms  AS DOUBLE)), 4)       AS token_latency_corr,
       ROUND(regr_slope(CAST(duration_ms  AS DOUBLE),
                        CAST(total_tokens AS DOUBLE)), 4) AS ms_per_token,
       ROUND(regr_r2(CAST(duration_ms  AS DOUBLE),
                     CAST(total_tokens AS DOUBLE)), 4)    AS r_squared,
       regr_count(CAST(duration_ms  AS DOUBLE),
                  CAST(total_tokens AS DOUBLE))           AS sample_size
FROM ai_agents.invocations
WHERE duration_ms > 0 AND total_tokens > 0   -- drop zero/missing measurements
GROUP BY model
HAVING regr_count(CAST(duration_ms AS DOUBLE),
                  CAST(total_tokens AS DOUBLE)) > 20      -- skip tiny samples
ORDER BY r_squared DESC;
Functions used: corr, regr_slope, regr_r2, regr_count
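By definition, regr_slope is the covariance of the two columns divided by the variance of the predictor. A minimal sanity-check sketch, assuming the engine also exposes the standard covar_pop and var_pop aggregates (not confirmed for R2 SQL):

-- Sketch: regr_slope(y, x) should equal covar_pop(y, x) / var_pop(x).
SELECT model,
       regr_slope(CAST(duration_ms  AS DOUBLE),
                  CAST(total_tokens AS DOUBLE))  AS builtin_slope,
       covar_pop(CAST(duration_ms  AS DOUBLE),
                 CAST(total_tokens AS DOUBLE))
         / var_pop(CAST(total_tokens AS DOUBLE)) AS manual_slope
FROM ai_agents.invocations
WHERE duration_ms > 0 AND total_tokens > 0
GROUP BY model;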
7 rows, top models by R². Predicted latency = ms_per_token × tokens + base.
model                              corr     ms_per_token   r_squared   samples
@cf/google/gemma-7b-it-lora        0.9499   65.85          0.9022      688       HIGH R²
@cf/mistral/mistral-7b-v0.2-lora   0.7972   54.96          0.6355      570       HIGH R²
@cf/baai/bge-large-en-v1.5         0.4604   0.03           0.2119      815
mistral-small-3.1-24b              -0.2662  -48.64         0.0709      329,637
llama-3.3-70b                      0.0566   3.11           0.0032      147,443
gemma-3-12b                        0.0083   0.26           0.0001      329,828
llama-3.1-8b                       0.0041   0.11           0.0000      883,512
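For a single predictor, regr_r2 is simply the square of corr, which the table lets you verify row by row: for mistral-small-3.1-24b, (-0.2662)² ≈ 0.0709.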
Insight: LoRA fine-tuned models show strong token-latency correlation (R² > 0.6), while large general models do not — latency is dominated by other factors.
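The prediction formula in the caption needs one more number, the base term. A minimal sketch, assuming the engine supports the standard regr_intercept aggregate from the same family as regr_slope (not confirmed for R2 SQL), evaluated here for a hypothetical 500-token request:

-- Sketch: predicted_ms = intercept + slope * tokens, here for 500 tokens.
SELECT model,
       ROUND(regr_intercept(CAST(duration_ms  AS DOUBLE),
                            CAST(total_tokens AS DOUBLE)), 1) AS base_ms,
       ROUND(regr_intercept(CAST(duration_ms  AS DOUBLE),
                            CAST(total_tokens AS DOUBLE))
             + 500 * regr_slope(CAST(duration_ms  AS DOUBLE),
                                CAST(total_tokens AS DOUBLE)), 1) AS predicted_ms_500_tokens
FROM ai_agents.invocations
WHERE duration_ms > 0 AND total_tokens > 0
GROUP BY model;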