Engineering

Our reranker stack: Jina, Voyage, and the quantization speedup

Dmitrii Kuzmenkov

Software Engineer, IndexFox.ai

December 9, 2025 · 6 min read · Updated April 2, 2026

After hybrid retrieval gives you 50 candidates, the reranker is what decides which 5 the user actually sees. Get this stage wrong and the rest of the pipeline doesn't matter.

Why we rerank at all

Hybrid search scoring is a fusion of two signals, neither of which is the ground truth of "did this answer the user's question?" A cross-encoder reranker reads the query and the candidate together and outputs a relevance score. That's structurally a better question than a similarity calculation against an embedding the model produced once, three weeks ago.

The cost is that you can't precompute it. Every (query, candidate) pair is a forward pass, so 50 candidates means 50 forward passes per search. If each one takes 200ms and they run one after another, that's 10 seconds per search, and your widget feels slow.
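To make the per-pair cost concrete, here is a minimal sketch of a rerank step. The scorer is injected as a plain function so the example runs without a model; `PairScorer`, `rerank`, `sequentialLatencyMs`, and the toy `wordOverlap` scorer are all hypothetical names for illustration, not our production code, and a real cross-encoder call would be asynchronous and batched.

```typescript
// Hypothetical types and names, chosen for illustration only.
type PairScorer = (query: string, candidate: string) => number;

interface Ranked {
  candidate: string;
  score: number;
}

// One scorer call per (query, candidate) pair. This is the part that
// cannot be precomputed, because the score depends on the query.
function rerank(
  query: string,
  candidates: string[],
  score: PairScorer,
  keep: number
): Ranked[] {
  return candidates
    .map((candidate) => ({ candidate, score: score(query, candidate) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, keep);
}

// Worst-case latency when pairs are scored one at a time, with no batching.
function sequentialLatencyMs(pairs: number, msPerPass: number): number {
  return pairs * msPerPass;
}

// Toy stand-in for a cross-encoder: counts query words found in the candidate.
const wordOverlap: PairScorer = (query, candidate) => {
  const text = candidate.toLowerCase();
  return query
    .toLowerCase()
    .split(/\s+/)
    .filter((word) => word.length > 0 && text.includes(word)).length;
};

const top = rerank(
  "reset password",
  ["Billing FAQ", "How to reset your password", "Password rules"],
  wordOverlap,
  2
);
console.log(top.map((r) => r.candidate));
console.log(sequentialLatencyMs(50, 200)); // 50 passes at 200ms each
```

Batching pairs into a single forward pass changes the arithmetic, but the number of pairs still drives the total work.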

Two providers, two tradeoffs

We ship with two configurable reranker providers:

Local: `jina-reranker-v1-tiny-en` (quantized)

Runs on the same CPU box as the search engine — no network hop. We ship v1-tiny deliberately for its footprint; Jina's v3 reranker (listwise, larger) is more accurate but the v1-tiny tradeoff still wins on the CPU profile we run.
Quantized to INT8 via the standard @xenova/transformers pipeline; 2-3× faster than the FP32 baseline on modern x86, with the kind of small accuracy delta that's invisible on top-5 results.
Fixed cost — the box is already there. Marginal cost per query: essentially zero.
Best for: high-volume sites where the latency budget is tight and we control infrastructure.

Cloud: Voyage `rerank-2.5-lite`

Hosted, per-token pricing. Higher absolute accuracy on multilingual content and longer contexts.
Adds one network round-trip — fine inside the same region, painful across continents.
Best for: customers with multilingual or long-form content, where every percentage point of NDCG matters more than a few extra cents per thousand queries.
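One way to read the two profiles is as a small routing rule: the cloud model earns its round-trip when the content is multilingual or long, and only when that round-trip stays in-region. The sketch below encodes that reading. The `SiteProfile` fields and `pickProvider` are hypothetical names for illustration, not our actual config schema, and the rule is deliberately simpler than a real per-customer decision.

```typescript
// Hypothetical config shape; field names are illustrative only.
type RerankerProvider = "local" | "cloud";

interface SiteProfile {
  multilingual: boolean;      // content in more than one language
  longContexts: boolean;      // candidates that run to long passages
  sameRegionAsCloud: boolean; // cloud reranker reachable without crossing continents
}

function pickProvider(site: SiteProfile): RerankerProvider {
  // The cloud model's stated strengths: multilingual content and longer contexts.
  const needsCloudAccuracy = site.multilingual || site.longContexts;
  // A cross-continent round-trip eats the latency budget, so stay local in that case.
  return needsCloudAccuracy && site.sameRegionAsCloud ? "cloud" : "local";
}

console.log(
  pickProvider({ multilingual: true, longContexts: false, sameRegionAsCloud: true })
);
```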

Top-K is the cheapest dial

Before you optimize the model, optimize what you feed it. We rerank the top 20-30 candidates from hybrid retrieval, not all 50. The 50th candidate almost never becomes the top result after reranking, and cutting the list to 30 saves 40% of the reranker compute (cutting to 20 saves 60%). This is as close to a free lunch as the pipeline gets.

The right top-K is workload-dependent. Documentation sites with strong query-to-page concentration can drop to top-10. E-commerce with long-tail queries needs more breadth. We benchmark it per customer rather than guessing.
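The savings are simple arithmetic, and the cut itself is a sort and a slice. A minimal sketch, assuming each candidate carries a fused hybrid score in a field we call `fusedScore` here (a hypothetical name, as are `takeTopK` and `computeSaved`):

```typescript
// Keep the K best candidates by fused hybrid score before reranking.
// `fusedScore` is a hypothetical field name.
function takeTopK<T extends { fusedScore: number }>(candidates: T[], topK: number): T[] {
  return [...candidates]
    .sort((a, b) => b.fusedScore - a.fusedScore)
    .slice(0, topK);
}

// Fraction of reranker forward passes avoided by scoring only the top K.
function computeSaved(retrieved: number, topK: number): number {
  if (retrieved <= 0) return 0;
  const scored = Math.min(topK, retrieved);
  return (retrieved - scored) / retrieved;
}

console.log(computeSaved(50, 30)); // share of passes skipped at top-30
console.log(computeSaved(50, 20)); // share of passes skipped at top-20
console.log(computeSaved(50, 10)); // share of passes skipped at top-10
```

Sweeping `topK` against a benchmark of real queries shows where the hit rate starts to drop for a given workload.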

Text-length trimming

The reranker doesn't need the full document. It needs enough to judge relevance. We expose three knobs:

rerank_title_len — full title is usually right.
rerank_description_len — meta description, usually full.
rerank_highlight_len — the matched passage. This is the biggest lever.

Trimming the highlight from "everything we matched" to "the surrounding 256 chars" cuts forward-pass cost noticeably, because cross-encoder cost grows with input length, and it shows no measurable quality loss on our benchmark. The reranker is reading for relevance, not for comprehensiveness.
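A sketch of how the three knobs could shape the text sent to the reranker. It assumes all three lengths are character counts, that the highlight is windowed around a known match position, and that the parts are joined with newlines. `Candidate`, `matchOffset`, `windowAround`, and `buildRerankText` are hypothetical names, not our actual implementation.

```typescript
interface RerankTrimConfig {
  rerank_title_len: number;
  rerank_description_len: number;
  rerank_highlight_len: number;
}

// Hypothetical candidate shape for illustration.
interface Candidate {
  title: string;
  description: string;
  highlight: string;   // the full matched passage
  matchOffset: number; // character index of the match inside `highlight`
}

// Cut a window of at most `len` characters, centred on the match where possible.
function windowAround(text: string, offset: number, len: number): string {
  if (text.length <= len) return text;
  const centred = offset - Math.floor(len / 2);
  const start = Math.max(0, Math.min(centred, text.length - len));
  return text.slice(start, start + len);
}

function buildRerankText(c: Candidate, cfg: RerankTrimConfig): string {
  return [
    c.title.slice(0, cfg.rerank_title_len),
    c.description.slice(0, cfg.rerank_description_len),
    windowAround(c.highlight, c.matchOffset, cfg.rerank_highlight_len),
  ].join("\n");
}

const text = buildRerankText(
  {
    title: "Resetting your password",
    description: "Step-by-step guide to account recovery.",
    highlight: "x".repeat(2000),
    matchOffset: 1000,
  },
  { rerank_title_len: 200, rerank_description_len: 300, rerank_highlight_len: 256 }
);
console.log(text.length);
```

Clamping the window start keeps the slice inside the passage when the match sits near either end.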

The benchmark we trust

Public RAG benchmarks are useful for paper writing. They're terrible for product decisions. We maintain a private benchmark of ~1,200 (query, expected URL) pairs sampled from real customer logs, anonymized. Every reranker change runs against it before shipping. The single number we look at is "top-1 hit rate on the expected URL." Nothing else has correlated as well with customer satisfaction.
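The metric itself is only a few lines. A minimal sketch, assuming the search pipeline can be called as a function that returns ranked URLs (`SearchFn`, `BenchmarkCase`, and `top1HitRate` are hypothetical names, and the URLs below are made up for the demo):

```typescript
interface BenchmarkCase {
  query: string;
  expectedUrl: string;
}

// Hypothetical signature: returns result URLs, best first.
type SearchFn = (query: string) => string[];

// Share of queries whose first result is exactly the expected URL.
function top1HitRate(cases: BenchmarkCase[], search: SearchFn): number {
  if (cases.length === 0) return 0;
  const hits = cases.filter((c) => search(c.query)[0] === c.expectedUrl).length;
  return hits / cases.length;
}

// Fake search backend for the demo; a real run would call the full pipeline.
const fakeResults: Record<string, string[]> = {
  "reset password": ["/docs/reset", "/docs/login"],
  "pricing": ["/blog/launch", "/pricing"],
};
const fakeSearch: SearchFn = (query) => fakeResults[query] ?? [];

const rate = top1HitRate(
  [
    { query: "reset password", expectedUrl: "/docs/reset" },
    { query: "pricing", expectedUrl: "/pricing" },
  ],
  fakeSearch
);
console.log(rate);
```

Exact string comparison is the strictest choice; a real harness would likely normalize URLs (trailing slashes, query strings) before comparing.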

If you're picking a reranker for your own stack: build your own benchmark first. Pick the model second.