ML & AI · Vellum Robotics (fictional engagement)

A Semantic Search Prototype for an Internal Knowledge Base

How a six-week build replaced a struggling internal search with a hybrid retrieval pipeline, and why the most important deliverable was the evaluation harness rather than the index itself.

The challenge

Vellum’s internal knowledge base — about 24,000 documents, mostly engineering RFCs and incident write-ups — was indexed by a vanilla Elasticsearch BM25 setup that had not been tuned in four years. The complaint was familiar: “search returns nothing useful unless I already know the exact title.” Engineers had stopped using it; product managers were quietly maintaining a parallel Notion graveyard. Leadership wanted to know whether modern semantic search would help, and they wanted an answer in six weeks.

The risk was the usual one: ship a vector-search demo that wins on a handful of cherry-picked queries, declare success, and discover six months later that average retrieval quality across the long tail of real queries had gotten worse, not better.

The approach

We spent the first week building the evaluation harness, not the retrieval pipeline. We sampled 200 real search queries from the existing search logs, partitioned them into “navigational” (engineer typing a known title), “exploratory” (engineer browsing a topic), and “diagnostic” (engineer hunting a specific past incident), and had three subject-matter experts manually rank the top-10 documents per query. That gold set was the fixed point everything else would be measured against.
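As a rough sketch of what such a gold set can look like in code, assuming judged queries are stored one per line in a JSONL file with graded relevance labels; the file name, field names, and 0-3 grading scale here are illustrative rather than what Vellum actually used.

```python
import json
from collections import defaultdict
from pathlib import Path

# One judged query per line, for example:
# {"query": "postgres failover runbook", "category": "diagnostic",
#  "judgments": {"rfc-0142": 3, "incident-2021-07": 2, "rfc-0033": 0}}
# The field names and 0-3 relevance grades are illustrative placeholders.

def load_gold_set(path: str) -> dict[str, list[dict]]:
    """Load judged queries and group them by query category."""
    by_category: dict[str, list[dict]] = defaultdict(list)
    for line in Path(path).read_text().splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        by_category[record["category"]].append(record)
    return dict(by_category)

gold = load_gold_set("gold_queries.jsonl")
for category, queries in gold.items():
    print(f"{category}: {len(queries)} judged queries")
```

Keeping the judgments in a dumb, versionable file like this is what makes the harness cheap to re-run later.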

Then we built the pipeline. The choices were unglamorous: keep BM25 in the existing Elasticsearch index, add dense retrieval over sentence-transformer embeddings stored in Qdrant, chunk documents according to their type, and merge the two result lists into a single ranking (a sketch of that query path follows below).

The whole thing fit in 800 lines of Python plus a docker-compose file with Elasticsearch and Qdrant.
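A minimal sketch of the query path under those choices: fetch BM25 candidates from Elasticsearch, fetch nearest-neighbour candidates from Qdrant using a sentence-transformer embedding of the query, and fuse the two rankings. The write-up does not name the fusion method; reciprocal rank fusion is shown as one common, illustrative choice, and the index name, collection name, payload field, and embedding model are placeholders, not Vellum's actual configuration.

```python
from collections import defaultdict

from elasticsearch import Elasticsearch
from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer

# Connection details, index/collection names, and the model are placeholders.
es = Elasticsearch("http://localhost:9200")
qdrant = QdrantClient("localhost", port=6333)
encoder = SentenceTransformer("all-MiniLM-L6-v2")

def hybrid_search(query: str, k: int = 10, rrf_k: int = 60) -> list[str]:
    """Fuse BM25 and dense-vector results with reciprocal rank fusion."""
    # Lexical candidates from the existing Elasticsearch index.
    bm25_hits = es.search(
        index="kb-docs", query={"match": {"text": query}}, size=50
    )["hits"]["hits"]
    bm25_ids = [hit["_id"] for hit in bm25_hits]

    # Dense candidates from the Qdrant collection of chunk embeddings;
    # each point is assumed to carry its parent document ID in the payload.
    dense_hits = qdrant.search(
        collection_name="kb-chunks",
        query_vector=encoder.encode(query).tolist(),
        limit=50,
    )
    dense_ids = [str(hit.payload["doc_id"]) for hit in dense_hits]

    # Reciprocal rank fusion: a document ranked highly by either retriever
    # accumulates a large score; rrf_k dampens the impact of lower ranks.
    scores: dict[str, float] = defaultdict(float)
    for ranking in (bm25_ids, dense_ids):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] += 1.0 / (rrf_k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

Nothing in that sketch is novel, which is the point: the interesting engineering effort went into deciding whether each unglamorous choice actually moved the evaluation numbers.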

The outcome

The hybrid index improved nDCG@10 from 0.41 (the BM25 baseline) to 0.67 across the full evaluation set. Engineers noticed within the first week of the staged rollout — most of the early feedback was, paraphrased, “I forgot search could be useful.”
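For readers who want the metric made concrete, nDCG@10 is computed per query from the graded judgments and then averaged over the gold set; a minimal sketch, assuming relevance grades like those in the gold-set example earlier, with purely illustrative document IDs.

```python
import math

def dcg(relevances: list[float]) -> float:
    """Discounted cumulative gain of a ranked list of relevance grades."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg_at_10(ranked_doc_ids: list[str], judgments: dict[str, int]) -> float:
    """nDCG@10 for one query: achieved DCG divided by the ideal DCG."""
    gains = [judgments.get(doc_id, 0) for doc_id in ranked_doc_ids[:10]]
    ideal = sorted(judgments.values(), reverse=True)[:10]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

# Averaging ndcg_at_10 over all judged queries yields figures like the
# 0.41 baseline and 0.67 hybrid scores quoted above (IDs here are made up).
print(ndcg_at_10(["rfc-0142", "rfc-0033"], {"rfc-0142": 3, "incident-2021-07": 2}))
```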

The deliverable that mattered most, though, was the evaluation harness. Vellum could now ask retrieval-quality questions (“does the new sentence-transformer model help?”, “did the document-type chunking decision still hold a quarter later?”) and get a numeric answer in 20 minutes instead of arguing about anecdotes for a week. That harness outlived the prototype and is, as far as I know, still in use.

The lesson I keep returning to is that in retrieval work, the index is the easy part. The expensive, durable artifact is the way you measure whether the index is good.