Context: Why Re-rank at All?#
First-stage retrieval (BM25 or dense vectors) is built for speed: you need the top 1,000 candidates fast. Precision at the very top (positions 1–5) is a secondary concern.
Re-ranking flips the priority: given a small candidate set (~50–200 docs), score them accurately using a more expensive model. The question is: how much precision do you gain, and at what latency cost?
This post documents what we measured at NAIC (National AI Center).
Models Under Test#
We evaluated three re-ranking approaches on top of our hybrid BM25+dense first stage:
| Model | Type | Size | Notes |
|---|---|---|---|
| cross-encoder/ms-marco-MiniLM-L-6-v2 | Cross-Encoder | ~22M params | Fast, widely used |
| cross-encoder/ms-marco-electra-base | Cross-Encoder | ~110M params | Higher quality |
| LRT (Lightweight Re-ranking Transformer) | Custom | ~14M params | Fine-tuned in-house |
The LRT is a distilled cross-encoder trained on our domain data, using knowledge distillation from the electra-base teacher.
Benchmark Setup#
- Corpus: 500k+ domain-specific documents
- Test queries: 2,000 annotated queries (held-out)
- First stage: BM25+dense hybrid, top-100 candidates per query
- Re-ranking input: top-50 from first stage
- Hardware: 4-core CPU (no GPU in production)
- Metrics: MRR@10, NDCG@5, P95 latency per query
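For reference, the two quality metrics can be computed directly from per-query relevance labels. A minimal sketch (the function names and the 0/1 vs. graded label conventions are ours, not from a specific library):

```python
import math

def mrr_at_k(rankings, k=10):
    """Mean reciprocal rank of the first relevant hit within the top k.

    `rankings`: one list per query of 0/1 relevance labels in ranked order.
    """
    total = 0.0
    for labels in rankings:
        for i, rel in enumerate(labels[:k]):
            if rel:
                total += 1.0 / (i + 1)
                break  # only the first relevant result counts
    return total / len(rankings)

def ndcg_at_k(labels, k=5):
    """NDCG@k for a single query; `labels` are graded relevance scores
    in ranked order."""
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(labels[:k]))
    ideal = sorted(labels, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0
```

Averaging `ndcg_at_k` over all 2,000 test queries gives the NDCG@5 numbers reported below.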
Results#
Quality (MRR@10)#
| System | MRR@10 | NDCG@5 | Δ vs. First Stage |
|---|---|---|---|
| First stage only (BM25+dense) | 0.81 | 0.77 | baseline |
| + MiniLM-L6 cross-encoder | 0.86 | 0.83 | +6.2% |
| + Electra-base cross-encoder | 0.89 | 0.86 | +9.9% |
| + LRT (in-house distilled) | 0.87 | 0.84 | +7.4% |
Latency (P95, ms per query, CPU, re-rank top-50)#
| System | P95 Latency | Notes |
|---|---|---|
| First stage only | 12 ms | ONNX quantized encoders |
| + MiniLM-L6 | 31 ms | Acceptable for async use cases |
| + Electra-base | 148 ms | Too slow for <100ms SLA |
| + LRT (ONNX INT8) | 44 ms | Best quality/latency trade-off |
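P95 here is the 95th-percentile per-query latency. With raw timings in hand it reduces to a one-liner; this sketch uses the nearest-rank definition (an assumption — other percentile conventions interpolate):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample such that at least
    p percent of the samples are <= it."""
    ordered = sorted(samples)
    idx = math.ceil(p / 100 * len(ordered)) - 1
    return ordered[max(idx, 0)]

# Hypothetical per-query timings in ms (illustrative, not the benchmark data).
latencies_ms = [12, 15, 29, 31, 38, 40, 44, 44, 44, 61]
p95 = percentile(latencies_ms, 95)
```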
The Trade-off Curve#
```
Quality (MRR@10)
0.89 │                              ● Electra-base
0.87 │           ● LRT
0.86 │      ● MiniLM
0.81 │ ● First stage
     └──────────────────────────────  Latency (ms)
       12   31   44                148
```

For our 100ms end-to-end SLA, the LRT model was the only option that delivered a meaningful quality improvement while staying within budget.
Key Findings#
- The first stage is more important than the re-ranker. Improving hybrid retrieval MRR from 0.74 to 0.81 (by switching to reciprocal rank fusion, RRF) delivered a larger gain than adding any re-ranker.
- LRT > MiniLM at similar latency when fine-tuned on in-domain data. Domain adaptation matters more than parameter count.
- Electra-base is a research tool, not a production re-ranker on CPU. It belongs behind a GPU, or as a teacher for distillation.
- Quantize your re-ranker too. LRT INT8 (via ONNX) cut latency from ~70ms to 44ms with <0.5% MRR drop.
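The RRF switch in the first finding is worth spelling out. Reciprocal rank fusion combines ranked lists using only ranks, no score normalization; a minimal sketch (`k=60` is the commonly used constant, an assumption here — our production value may differ):

```python
def rrf_fuse(ranked_lists, k=60):
    """Reciprocal rank fusion: score(d) = sum over lists of 1 / (k + rank),
    where rank is the 1-based position of d in each list."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Documents appearing high in multiple lists accumulate the most score.
    return sorted(scores, key=scores.get, reverse=True)

bm25_top = ["d3", "d1", "d2"]
dense_top = ["d1", "d4", "d3"]
fused = rrf_fuse([bm25_top, dense_top])
```

Because RRF ignores raw scores, it sidesteps the calibration mismatch between BM25 and cosine similarities, which is a common failure mode of weighted score fusion.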
Deployment Decision#
We deployed the LRT re-ranker as a FastAPI microservice:
- Input: (query, [candidate_passages])
- Output: re-scored list, sorted descending
- Batch size: 50 passages / request
- P95 latency: 44ms; P99: 61ms
- Horizontal scaling: stateless, behind a load balancer
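The service's request handler reduces to a pure function over the query and its candidates. A sketch of that core loop, with the model call stubbed out (`score_pairs` stands in for the real LRT ONNX session; the names are illustrative):

```python
def rerank(query, passages, score_pairs):
    """Score all (query, passage) pairs in one batch; return (passage, score)
    tuples sorted by score descending."""
    pairs = [(query, p) for p in passages]
    scores = score_pairs(pairs)  # one forward pass over the whole batch
    return sorted(zip(passages, scores), key=lambda x: x[1], reverse=True)

def toy_scorer(pairs):
    # Stand-in for the ONNX session: longer passage -> higher score.
    return [float(len(p)) for _, p in pairs]

ranked = rerank("example query", ["a", "much longer passage", "medium one"],
                toy_scorer)
```

Keeping the handler stateless like this is what makes horizontal scaling behind a load balancer trivial.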
What I’d Do Differently#
- Collect click logs earlier. Implicit feedback is a goldmine for re-ranker fine-tuning.
- Test ColBERT-style late interaction as an alternative to full cross-attention — better precision/latency balance than bi-encoders at the cost of a larger index.
- Cache re-ranker scores for repeated queries. A simple Redis cache cut our average latency by 20% in production.
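The caching idea in the last bullet can be sketched without a live Redis instance; here a plain dict stands in for the Redis client, and the key covers both the query and the candidate ids, since the same query can arrive with different first-stage results (all names are illustrative, not our production code):

```python
import hashlib
import json

class RerankCache:
    """Cache re-ranked orderings keyed by (query, candidate ids)."""

    def __init__(self):
        self._store = {}  # a Redis client (get/set with a TTL) would slot in here

    def _key(self, query, doc_ids):
        # Sort ids so the key is insensitive to first-stage ordering.
        payload = json.dumps([query, sorted(doc_ids)])
        return hashlib.sha256(payload.encode()).hexdigest()

    def get(self, query, doc_ids):
        return self._store.get(self._key(query, doc_ids))

    def put(self, query, doc_ids, ranking):
        self._store[self._key(query, doc_ids)] = ranking
```

With Redis behind this interface you would also set an expiry on each key, so stale rankings age out as the corpus changes.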