Engineering · 2026-03-28 · 11 min

Hybrid Retrieval for Offline Medical Search at Sub-15ms

About half of the world's population lacks reliable access to essential health services. Most medical reference tools assume an internet connection. LocalRx is a deterministic search engine that runs entirely offline on one machine. The interesting part isn't that it runs offline; it's the retrieval architecture that makes the results trustworthy without an LLM in the loop.

The constraint

A nurse in a rural clinic with no internet types “child 5 years fever rash joint pain” and needs ranked diagnoses with treatment protocols, filtered by what medicines are actually on the shelf. The system has to:

Run on one machine with no network calls in the query path.
Return results in well under a second — clinical context, not casual browsing.
Be deterministic. Every result must be traceable back to a verified WHO record. No hallucinated dosages.
Boot from docker compose up with zero configuration on whatever ARM or x86 box the clinic happens to have.

Anti-goal: This is not a chatbot. There is no LLM generating treatment advice. The system retrieves and ranks. The clinician decides.

Why hybrid retrieval, not just semantic

Pure semantic search fails the medical domain in two specific ways:

Drug names are tokens, not concepts. “Paracetamol” needs to match exactly. Cosine similarity on embeddings will happily return “general pain reliever” ranked above the actual drug record. That's a clinical hazard.
Symptom descriptions are conceptual. “Burning chest feeling” should match GERD even though the literal token “GERD” never appears in the query. BM25 keyword search will miss this entirely.

Neither signal alone is sufficient. The architecture runs both in parallel and fuses them.
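
To make the keyword half of that argument concrete, here is a minimal sketch using plain token overlap as a crude stand-in for a BM25-style signal. The record ids and texts are invented for illustration and are not taken from the LocalRx dataset; real BM25 also weights terms by frequency and document length, which this toy ignores.

```python
import re

# Hypothetical mini-corpus; ids and texts are invented for illustration.
RECORDS = {
    "drug:paracetamol": "paracetamol 500 mg tablet analgesic antipyretic",
    "cond:gerd": "gastroesophageal reflux disease GERD heartburn acid regurgitation",
}


def tokens(text: str) -> set[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_overlap(query: str, record_text: str) -> int:
    """Count tokens shared by query and record (toy keyword signal)."""
    return len(tokens(query) & tokens(record_text))


# An exact drug name is found by literal matching...
exact = keyword_overlap("paracetamol dose", RECORDS["drug:paracetamol"])
# ...but a conceptual symptom description shares no token with the GERD record.
conceptual = keyword_overlap("burning chest feeling", RECORDS["cond:gerd"])

print(exact, conceptual)  # 1 0
```

The zero for the symptom query is the gap the semantic leg is meant to close, while the exact hit on the drug name is what the keyword leg protects.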

Why Actian VectorAI DB

The choice came down to one capability: native hybrid fusion inside the database engine, in a single API call.

Pinecone is cloud-only — disqualified by the offline requirement. pgvector does vector search, but you maintain a separate BM25 index and fuse in application code. Chroma and FAISS have similar gaps. VectorAI DB exposes actian_vectorai.reciprocal_rank_fusion() directly — and applies metadata filters server-side, during the search, not after.

The fusion math

Reciprocal Rank Fusion (RRF) merges two independently ranked lists without needing score calibration:

RRF(d) = Σ  weight_i / (k + rank_i(d))

where:
  rank_i(d) = 1-indexed position of document d in list i
  k         = smoothing constant (default: 60)
  weight_i  = per-signal weight
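
The formula is short enough to sketch directly in Python. This is an illustrative implementation of weighted RRF as defined above, not the code VectorAI DB runs internally; the function name `rrf_fuse` and the example record ids are assumptions made up for this sketch, and the two input rankings are invented rather than real query output.

```python
from collections import defaultdict


def rrf_fuse(ranked_lists, weights=None, k=60):
    """Fuse best-first lists of document ids with weighted Reciprocal Rank Fusion.

    A document missing from a list simply receives no contribution from it.
    """
    if weights is None:
        weights = [1.0] * len(ranked_lists)
    if len(weights) != len(ranked_lists):
        raise ValueError("need exactly one weight per ranked list")

    scores = defaultdict(float)
    for weight, ranking in zip(weights, ranked_lists):
        for rank, doc_id in enumerate(ranking, start=1):  # 1-indexed ranks
            scores[doc_id] += weight / (k + rank)

    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Invented rankings from the two retrieval legs (hypothetical ids).
semantic = ["cond:dengue", "cond:measles", "cond:chikungunya"]
keyword = ["cond:chikungunya", "cond:dengue", "cond:malaria"]

fused = rrf_fuse([semantic, keyword])
for doc_id, score in fused:
    print(f"{doc_id:20s} {score:.5f}")
```

With equal weights, a record that appears near the top of both lists outranks one that tops only a single list, which is the behaviour the fusion step relies on.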

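RRF beats Distribution-Based Score Fusion (DBSF) here for one reason: robustness to scale mismatch. Cosine similarity is bounded (within [-1, 1]); BM25 is unbounded (in [0, ∞)). If you fuse on raw scores, one strong BM25 match can dominate the result set, and score-based schemes such as DBSF still depend on how each signal's scores happen to be distributed. RRF discards the scores and uses only rank positions: a rank-1 hit contributes weight / (k + 1) no matter which signal produced it or how large its raw score was, so ranks are always comparable. The contribution does shrink down the list (with k = 60, the gap between ranks 1 and 2 is about 0.00026, between ranks 50 and 51 about 0.00008), but gently, so no single hit swamps the rest. For medical search, where “paracetamol” (exact match, BM25 rank 1) and “chest burning” (semantic rank 1) need to coexist in the same fused list, that property is load-bearing.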
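
A small numeric sketch of the scale-mismatch problem, assuming a query that both names a drug and describes a symptom. The raw scores below are invented for illustration; only the shape of the result matters.

```python
# Invented raw scores for two hypothetical records.
cosine = {"cond:gerd": 0.82, "drug:paracetamol": 0.35}
bm25 = {"cond:gerd": 0.0, "drug:paracetamol": 14.6}


def raw_sum(doc_id):
    """Naive fusion: add the raw scores from both signals."""
    return cosine[doc_id] + bm25[doc_id]


def rank_of(scores, doc_id):
    """1-indexed rank of doc_id when scores are sorted best-first."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered.index(doc_id) + 1


def rrf(doc_id, k=60):
    """Unweighted RRF over the two signals."""
    return 1 / (k + rank_of(cosine, doc_id)) + 1 / (k + rank_of(bm25, doc_id))


raw_ratio = raw_sum("drug:paracetamol") / raw_sum("cond:gerd")
rrf_ratio = rrf("drug:paracetamol") / rrf("cond:gerd")

print(round(raw_ratio, 1), round(rrf_ratio, 1))  # 18.2 1.0
```

On raw scores the keyword hit outweighs the semantic hit by more than an order of magnitude. Under RRF each record is rank 1 in one list and rank 2 in the other, so the two end up with identical fused scores.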

What the wire looks like

# query path — happy case

POST /search
{
  "query": "child 5 years fever rash joint pain",
  "filters": { "age_group": "child", "region": "tropical" }
}

# 1. FastAPI receives request
# 2. Embedding service encodes query → 384-dim vector
#    Model: sentence-transformers/all-MiniLM-L6-v2 (CPU, ~50ms)
# 3. Single API call to VectorAI DB:
#      - Cosine similarity on embeddings
#      - BM25-style keyword match on title / summary / keywords
#      - RRF fusion (k=60)
#      - Filters: age_group="child", region="tropical" applied server-side
# 4. Results returned with three scores per item: rrf, semantic, keyword
# 5. Frontend renders ranked cards with score breakdown

# Total round-trip: ~12ms after warmup
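
Steps 2 to 4 of that path can be sketched as a plain function with the embedding model and the database call injected as callables. Everything named here (`handle_search`, `Hit`, the stub functions, the `cond:example` id and its scores) is hypothetical and stands in for the real FastAPI handler, the MiniLM encoder, and the VectorAI DB client, none of which are reproduced here.

```python
import time
from dataclasses import asdict, dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Hit:
    """One result with the three scores the frontend shows."""
    record_id: str
    rrf: float
    semantic: float
    keyword: float


def handle_search(
    query: str,
    filters: dict,
    embed: Callable[[str], Sequence[float]],
    hybrid_search: Callable[[str, Sequence[float], dict], list],
) -> dict:
    """Embed the query, make one hybrid search call, return the score breakdown."""
    started = time.perf_counter()
    vector = embed(query)                         # step 2: query -> vector
    hits = hybrid_search(query, vector, filters)  # step 3: single call, filters included
    elapsed_ms = (time.perf_counter() - started) * 1000
    return {"results": [asdict(h) for h in hits], "elapsed_ms": elapsed_ms}  # step 4


# Stubs so the sketch runs without a model or a database.
def fake_embed(text: str) -> list:
    return [0.0] * 384  # same dimensionality as the model named above


def fake_hybrid_search(text: str, vector: Sequence[float], filters: dict) -> list:
    return [Hit("cond:example", rrf=0.03, semantic=0.7, keyword=5.1)]


response = handle_search(
    "child 5 years fever rash joint pain",
    {"age_group": "child", "region": "tropical"},
    fake_embed,
    fake_hybrid_search,
)
print(response["results"])
```

Keeping the two dependencies injectable also makes the query path easy to test offline, since no network or model download is needed to exercise the handler logic.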

Filtered search inside the DB engine

Another underrated VectorAI DB feature: filter composition with a typed builder, applied during the search rather than as a post-filter. Six dimensions are wired up: category, specialty, age_group, risk_level, region, and availability.

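# Assumes `client`, `FilterBuilder`, `Field`, and `query_embedding`
# are already in scope (client setup and imports omitted here).

# Both conditions must hold for a record to be considered at all.
f = (FilterBuilder()
       .must(Field("risk_level").eq("high"))
       .must(Field("age_group").eq("child"))
       .build())

# Vector search over the "drugs" collection, filtered inside the engine.
results = client.points.search(
    "drugs",
    vector=query_embedding,
    filter=f,
    limit=10,
)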

The clinic configures availability to match what's actually on the shelf. A diagnosis for malaria that requires artesunate IV doesn't rank if the clinic only stocks oral artemether-lumefantrine. The clinician sees what they can actually prescribe.
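
Why filtering during the search matters can be shown with a toy top-k in plain Python. This does not use the VectorAI DB client; the helper `top_k`, the record ids, the scores, and the `in_stock` / `out_of_stock` values are all invented for illustration.

```python
def top_k(candidates, k, predicate=None):
    """Return the k best-scoring candidates, keeping only those that pass predicate."""
    pool = [c for c in candidates if predicate is None or predicate(c)]
    return sorted(pool, key=lambda c: c["score"], reverse=True)[:k]


def in_stock(candidate):
    return candidate["availability"] == "in_stock"


# Hypothetical scored candidates for a malaria query.
candidates = [
    {"id": "artesunate-iv", "score": 0.91, "availability": "out_of_stock"},
    {"id": "quinine-iv", "score": 0.88, "availability": "out_of_stock"},
    {"id": "artemether-lumefantrine-oral", "score": 0.80, "availability": "in_stock"},
    {"id": "paracetamol-oral", "score": 0.55, "availability": "in_stock"},
]

# Post-filter: take the top 2 first, then drop what is not on the shelf.
post_filtered = [c["id"] for c in top_k(candidates, 2) if in_stock(c)]

# Filter during search: only stocked items compete for the top 2 slots.
filtered_in_search = [c["id"] for c in top_k(candidates, 2, in_stock)]

print(post_filtered)       # []
print(filtered_in_search)  # ['artemether-lumefantrine-oral', 'paracetamol-oral']
```

With post-filtering, the two best-scoring but unavailable items consume the whole result budget and the clinician gets an empty list. Filtering inside the search fills the same two slots with options that can actually be dispensed.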

The dataset

Four collections, 70 records total. Small enough to index in seconds; large enough to demonstrate retrieval quality across domains.

drugs — 30 WHO Essential Medicines with dosing, contraindications, interactions.
guidelines — 15 clinical protocols (malaria, pneumonia, PPH, TB, HIV, etc.).
conditions — 15 diseases with symptoms, diagnosis criteria, complications.
interactions — 10 critical drug-interaction pairs with mechanism and management.

What I'd build next

[Scaling the dataset to 5k+ records — the BM25/semantic balance changes with corpus size; what to tune.]
[Per-clinic availability sync — small CSV upload that updates the availability filter dimension.]
[Multilingual query — the embedding model is English-only; multilingual variants of MiniLM exist but the BM25 leg needs a tokenizer per language.]

For engineers building similar things

For domains where the cost of a wrong answer is high — medical, legal, financial compliance — resist the temptation to put an LLM in the response path. Retrieval-and-rank with traceable sources is a better primitive than “chatbot over your data.” The interesting engineering is choosing the right retrieval signals, fusing them robustly, and exposing the score breakdown to the user so they can interrogate the ranking themselves.