Why we wrote our own RAG on ManticoreSearch instead of buying one

We get asked this in nearly every demo call: "Why did you build the index layer yourself? Pinecone is fine." The short answer is unit economics. The long answer is what this post is about.
The pricing wall
The original IndexFox sketch ran on a managed vector database. Crawl → embed → push → query. It worked. Then we did the math on a customer with 80,000 pages and 30,000 monthly searches:
- Vector storage at typical per-vector pricing dwarfed the entire infrastructure budget for that single customer.
- Per-query reads, with reranking and metadata filtering, pushed the marginal cost of a query into the cents — fine for B2B SaaS, fatal for a $19/mo widget.
- Hybrid retrieval (BM25 + vectors) was either a paid add-on or required a second system glued on top.
Our pricing model only works if a search costs us fractions of a cent. That meant the storage and retrieval layer had to be ours.
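To make that constraint concrete, here is a back-of-the-envelope sketch using only the figures above: a $19/mo plan and 30,000 monthly searches. It assumes, purely for illustration, that the example customer is on that plan. The function name and the one-cent query cost are hypothetical, not measured numbers.

```python
def revenue_per_search_usd(plan_price_usd: float, monthly_searches: int) -> float:
    """Revenue available to cover one search, before any other costs."""
    return plan_price_usd / monthly_searches


# Figures from the example customer above; the plan assignment is an assumption.
PLAN_PRICE_USD = 19.0
MONTHLY_SEARCHES = 30_000

budget_usd = revenue_per_search_usd(PLAN_PRICE_USD, MONTHLY_SEARCHES)
budget_cents = budget_usd * 100

# Hypothetical marginal cost of one cent per query (the low end of "cents").
cost_per_query_usd = 0.01
monthly_query_cost = cost_per_query_usd * MONTHLY_SEARCHES

print(f"revenue per search: {budget_cents:.4f} cents")
print(f"query cost at 1 cent each: ${monthly_query_cost:.2f}/mo "
      f"against ${PLAN_PRICE_USD:.2f}/mo of revenue")
```

Under these assumptions the whole plan yields about 0.06 cents per search, so even a one-cent query costs more than an order of magnitude more than it earns.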
Why ManticoreSearch
We chose ManticoreSearch because it is the only open-source engine we found that gives us all three of these in one binary:
- BM25 keyword search with proximity, phrase, and quorum operators — the boring 1990s stuff that still wins on product-name and exact-phrase queries.
- Dense vector retrieval with cosine/L2 distance and configurable HNSW indices.
- Native hybrid scoring — one query, two signals, fused at the engine level instead of in our application code.
It's a C++ engine that speaks the MySQL protocol on the wire, and operationally it behaves more like a database than like Elasticsearch, with no JVM to size and tune. We host it ourselves on cheap CPU boxes. Vectors are 768-dimensional on workloads using Jina v2-base (our default for English-only customers); Jina v3 outputs 1024 dimensions, and we truncate those via Matryoshka representation when storage cost matters. Recall is what you'd expect from HNSW at sensible ef settings.
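The Matryoshka truncation mentioned above can be sketched in a few lines: keep the leading components of the vector and re-normalize. This is a minimal illustration, not our production code. The 256-dimension target, the helper names, and the one-vector-per-page storage estimate are all assumptions, and truncation like this is only meaningful for models trained with Matryoshka representation learning.

```python
import math


def truncate_embedding(vec: list[float], dims: int) -> list[float]:
    """Keep the first `dims` components and re-normalize to unit length."""
    if not 0 < dims <= len(vec):
        raise ValueError("dims must be between 1 and len(vec)")
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # zero vector: nothing to normalize
    return [x / norm for x in head]


def raw_vector_bytes(num_vectors: int, dims: int, bytes_per_float: int = 4) -> int:
    """Raw float32 payload only; ignores the HNSW graph and engine overhead."""
    return num_vectors * dims * bytes_per_float


# Illustration: one vector per page for 80,000 pages (real indexes chunk pages,
# so the true vector count is higher). The 256-dim target is hypothetical.
full_bytes = raw_vector_bytes(80_000, 1024)
small_bytes = raw_vector_bytes(80_000, 256)
print(f"1024 dims: {full_bytes / 1e6:.1f} MB, 256 dims: {small_bytes / 1e6:.1f} MB")
```

Re-normalizing matters for L2 and inner-product distance; cosine similarity is scale-invariant, so there it is only a convenience.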
The retrieval pipeline today
A query through IndexFox now looks like this:
q → query-rewrite (LLM) → N parallel sub-queries
    ↓
hybrid_search OR vector_search OR keyword_search (chosen per query-type classification)
    ↓
dedup by URL → rerank (Jina v1 / Voyage 2.5) → top-K → answer
Two things are worth flagging:
- Query rewrite is not optional. One user query expands into multiple variations — keyword, semantic, intent-shifted. We learned this is exactly what Google publicly calls query fan-out in their 2026 AI search guidance. We had been calling it "the parallel search thing." Same mechanic.
- Reranking is cheap if you quantize. A quantized Jina reranker (the tiny variant) runs locally on CPU in a few milliseconds per query-document pair. We cover the reranker tradeoffs in a separate post.
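The fan-out, dedup, and rerank stages can be sketched as follows. This is a simplified illustration rather than the IndexFox implementation: `fan_out_search`, the `Hit` shape, and both stub functions are hypothetical. The LLM rewrite and the query-type classification are left out, so the `search` callable stands in for whichever of the three search modes gets picked, and the stub reranker simply sorts by score where a real one would call a cross-encoder.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

Hit = dict  # assumed shape: {"url": str, "text": str, "score": float}


def fan_out_search(
    original_query: str,
    sub_queries: list[str],
    search: Callable[[str], list[Hit]],
    rerank: Callable[[str, list[Hit]], list[Hit]],
    top_k: int = 5,
) -> list[Hit]:
    # Run each rewritten sub-query in parallel.
    with ThreadPoolExecutor(max_workers=max(1, len(sub_queries))) as pool:
        result_lists = list(pool.map(search, sub_queries))

    # Dedup by URL, keeping the highest-scoring hit for each page.
    best: dict[str, Hit] = {}
    for hits in result_lists:
        for hit in hits:
            seen = best.get(hit["url"])
            if seen is None or hit["score"] > seen["score"]:
                best[hit["url"]] = hit

    # Rerank against the user's original query, then cut to top-K.
    return rerank(original_query, list(best.values()))[:top_k]


def stub_search(query: str) -> list[Hit]:
    """Stand-in for the search engine call; returns canned hits."""
    index = {
        "pricing": [{"url": "/pricing", "text": "Plans", "score": 0.9}],
        "how much does it cost": [
            {"url": "/pricing", "text": "Plans", "score": 0.7},
            {"url": "/faq", "text": "Billing FAQ", "score": 0.6},
        ],
    }
    return index.get(query, [])


def stub_rerank(query: str, hits: list[Hit]) -> list[Hit]:
    """Stand-in for a cross-encoder reranker: sorts by retrieval score."""
    return sorted(hits, key=lambda h: h["score"], reverse=True)


results = fan_out_search(
    "what does it cost?",
    ["pricing", "how much does it cost"],
    stub_search,
    stub_rerank,
)
print([h["url"] for h in results])
```

Deduplicating before reranking keeps the number of query-document pairs the reranker has to score as small as possible, and the reranker is usually the most expensive per-pair step.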
What we gave up
It's not free. Running your own search engine means:
- You own backups, replication, and the page that wakes you up at 3am.
- You write your own crawl-to-index pipeline, which in our case is Crawlee + Readability + a content-hash dedup layer. We cover that in our post on how we crawl.
- You give up the marketing slide that says "powered by [famous brand]."
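As a rough idea of what a content-hash dedup layer involves, here is a minimal sketch. The whitespace and case normalization, the choice of SHA-256, and the function names are assumptions made for illustration; the actual layer may hash differently or handle near-duplicates as well.

```python
import hashlib
import re
from typing import Iterable


def content_hash(text: str) -> str:
    """SHA-256 of the text after collapsing whitespace and lowercasing."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def dedup_pages(pages: Iterable[tuple[str, str]]) -> list[tuple[str, str]]:
    """Keep the first URL seen for each distinct content hash."""
    seen: set[str] = set()
    unique: list[tuple[str, str]] = []
    for url, text in pages:
        digest = content_hash(text)
        if digest in seen:
            continue  # same extracted content already queued under another URL
        seen.add(digest)
        unique.append((url, text))
    return unique


crawled = [
    ("/a", "Hello  World"),
    ("/a?ref=x", "hello world\n"),  # tracking-parameter duplicate of /a
    ("/b", "Something else"),
]
print([url for url, _ in dedup_pages(crawled)])
```

Skipping exact duplicates before embedding saves both the embedding call and the vector storage for pages that differ only by URL.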
What we got in exchange: a search call that costs us fractions of a cent, a hybrid query model that we control end-to-end, and a pricing page that doesn't start at $99.
Would we recommend this for everyone?
No. If your product is "AI answers on top of someone else's docs," a hosted vector DB is the right answer for the first 18 months. The buy-vs-build line moves the moment your unit economics depend on the cost of a single retrieval. For a search widget you embed on someone's marketing site, that line is roughly day one.