Re-ranking LLMs in Production: Benchmarking Latency vs. Precision

Rauf Ibishov
Three years shipping search pipelines at scale. Incoming MSc @ TUM. I build retrieval, re-ranking, and quantization systems for production.

Context: Why Re-rank at All?

First-stage retrieval (BM25 or dense vectors) is built for speed: you need the top-1000 candidates fast. Precision at the very top (positions 1–5) is a secondary concern.

Re-ranking flips the priority: given a small candidate set (~50–200 docs), score them accurately using a more expensive model. The question is: how much precision do you gain, and at what latency cost?
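As a sketch, the second stage is just "score every pair with the expensive model, keep the best." Here `score_fn` is a hypothetical stand-in for any cross-encoder forward pass:

```python
from typing import Callable, List, Tuple

def rerank(query: str,
           candidates: List[str],
           score_fn: Callable[[str, str], float],
           top_k: int = 10) -> List[Tuple[str, float]]:
    """Score each (query, passage) pair with the expensive model,
    then return the top_k passages sorted by descending score."""
    scored = [(passage, score_fn(query, passage)) for passage in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

In production `score_fn` would be a single batched forward pass over all pairs, not a per-passage call.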

This post documents what we measured at NAIC (National AI Center).


Models Under Test

We evaluated three re-ranking approaches on top of our hybrid BM25+dense first stage:

| Model | Type | Size | Notes |
|---|---|---|---|
| cross-encoder/ms-marco-MiniLM-L-6-v2 | Cross-Encoder | ~22M params | Fast, widely used |
| cross-encoder/ms-marco-electra-base | Cross-Encoder | ~110M params | Higher quality |
| LRT (Lightweight Re-ranking Transformer) | Custom | ~14M params | Fine-tuned in-house |

The LRT is a distilled cross-encoder trained on our domain data using knowledge distillation from the electra-base teacher.
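A minimal sketch of that distillation objective, assuming a simple MSE blend of teacher scores and gold labels. The real training loop ran in a deep-learning framework; the loss shape and `alpha` weighting here are illustrative:

```python
def distillation_loss(student_scores, teacher_scores, labels, alpha=0.5):
    """Blend MSE against teacher scores (soft targets) with MSE
    against gold labels (hard targets). alpha weights the teacher
    term. Pure-Python stand-in for the framework loss."""
    n = len(student_scores)
    soft = sum((s - t) ** 2 for s, t in zip(student_scores, teacher_scores)) / n
    hard = sum((s - y) ** 2 for s, y in zip(student_scores, labels)) / n
    return alpha * soft + (1 - alpha) * hard
```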


Benchmark Setup

  • Corpus: 500k+ domain-specific documents
  • Test queries: 2,000 annotated queries (held-out)
  • First stage: BM25+dense hybrid, top-100 candidates per query
  • Re-ranking input: top-50 from first stage
  • Hardware: 4-core CPU (no GPU in production)
  • Metrics: MRR@10, NDCG@5, P95 latency per query
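For reference, the two quality metrics can be computed like this (binary relevance assumed; plug in your own judgments):

```python
import math

def mrr_at_k(ranked_relevance, k=10):
    """Mean reciprocal rank: for each query, 1 / rank of the first
    relevant result within the top k (0 if none), averaged."""
    total = 0.0
    for rels in ranked_relevance:
        for i, rel in enumerate(rels[:k], start=1):
            if rel:
                total += 1.0 / i
                break
    return total / len(ranked_relevance)

def ndcg_at_k(ranked_relevance, k=5):
    """NDCG with binary relevance: DCG of the ranking divided by
    the DCG of an ideal reordering, averaged over queries."""
    total = 0.0
    for rels in ranked_relevance:
        dcg = sum(r / math.log2(i + 1) for i, r in enumerate(rels[:k], start=1))
        ideal = sorted(rels, reverse=True)
        idcg = sum(r / math.log2(i + 1) for i, r in enumerate(ideal[:k], start=1))
        total += dcg / idcg if idcg > 0 else 0.0
    return total / len(ranked_relevance)
```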

Results

Quality (MRR@10)

| System | MRR@10 | NDCG@5 | Δ vs. First Stage |
|---|---|---|---|
| First stage only (BM25+dense) | 0.81 | 0.77 | baseline |
| + MiniLM-L6 cross-encoder | 0.86 | 0.83 | +6.2% |
| + Electra-base cross-encoder | 0.89 | 0.86 | +9.9% |
| + LRT (in-house distilled) | 0.87 | 0.84 | +7.4% |

Latency (P95, ms per query, CPU, re-rank top-50)

| System | P95 Latency | Notes |
|---|---|---|
| First stage only | 12 ms | ONNX quantized encoders |
| + MiniLM-L6 | 31 ms | Acceptable for async use cases |
| + Electra-base | 148 ms | Too slow for <100ms SLA |
| + LRT (ONNX INT8) | 44 ms | Best quality/latency trade-off |
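The latency figures are percentiles over per-query wall-clock samples; a minimal nearest-rank sketch of how such a P95 is read off:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample value such that
    at least p percent of samples are <= it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]
```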

The Trade-off Curve

Quality (MRR@10)
  0.89 │              ● Electra-base
  0.87 │         ● LRT
  0.86 │    ● MiniLM
  0.81 │ ● First stage
       └────────────────────────────── Latency (ms)
           12   31   44              148

For our 100ms end-to-end SLA, the LRT model was the only option that delivered meaningful quality improvement while staying within budget.


Key Findings

  1. The first stage is more important than the re-ranker. Improving hybrid retrieval MRR from 0.74 → 0.81 (by switching to RRF) yielded a larger gain than adding any re-ranker.
  2. LRT > MiniLM at similar latency when fine-tuned on in-domain data. Domain adaptation matters more than parameter count.
  3. Electra-base is a research tool, not a production re-ranker on CPU. It belongs behind a GPU, or as a teacher for distillation.
  4. Quantize your re-ranker too. LRT INT8 (via ONNX) cut latency from ~70ms to 44ms with <0.5% MRR drop.
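The RRF fusion from finding 1 is small enough to sketch in full. `k=60` is the conventional constant from the RRF literature and is illustrative here, not necessarily the value we tuned:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids: each doc accumulates
    sum(1 / (k + rank)) over every list that contains it, then
    docs are returned sorted by fused score, best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```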

Deployment Decision

We deployed the LRT re-ranker as a FastAPI microservice:

  • Input: (query, [candidate_passages])
  • Output: re-scored list, sorted descending
  • Batch size: 50 passages / request
  • P95 latency: 44ms; P99: 61ms
  • Horizontal scaling: stateless, behind a load balancer
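Stripped of the FastAPI plumbing, the service contract is roughly the following. The type names and the batched `score_batch` callable are illustrative, not the actual service code:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class RerankRequest:
    query: str
    candidate_passages: List[str]

@dataclass
class ScoredPassage:
    passage: str
    score: float

def handle_rerank(req: RerankRequest,
                  score_batch: Callable[[str, List[str]], List[float]]
                  ) -> List[ScoredPassage]:
    """Score all candidates in one batch and return them sorted by
    descending score. The handler holds no state, so instances can
    scale horizontally behind a load balancer."""
    scores = score_batch(req.query, req.candidate_passages)
    results = [ScoredPassage(p, s) for p, s in zip(req.candidate_passages, scores)]
    results.sort(key=lambda r: r.score, reverse=True)
    return results
```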

What I’d Do Differently

  • Collect click logs earlier. Implicit feedback is a goldmine for re-ranker fine-tuning.
  • Test ColBERT-style late interaction as an alternative to full cross-attention — better precision/latency balance than bi-encoders at the cost of a larger index.
  • Cache re-ranker scores for repeated queries. A simple Redis cache cut our average latency by 20% in production.
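The caching idea in the last bullet, sketched with an in-process dict standing in for Redis. The key scheme (query plus a hash of the sorted candidate ids, so a changed candidate set misses) is illustrative:

```python
import hashlib

class RerankCache:
    """In-process stand-in for a Redis score cache, keyed on the
    query plus a digest of the candidate id set."""

    def __init__(self):
        self._store = {}

    def _key(self, query, candidate_ids):
        digest = hashlib.sha1("|".join(sorted(candidate_ids)).encode()).hexdigest()
        return f"{query}:{digest}"

    def get_or_compute(self, query, candidate_ids, compute):
        """Return the cached re-rank result, or run compute() once
        and store it."""
        key = self._key(query, candidate_ids)
        if key not in self._store:
            self._store[key] = compute()
        return self._store[key]
```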

Repo on GitHub. Questions? Email me.