Semantic Drift Watcher
A retrieval observability experiment for catching semantic drift before degraded search turns into bad answers.
Project Brief
Summary
A monitoring layer that compares live retrieval output against stored intent baselines so drift shows up before the model starts sounding wrong.
Problem
RAG systems can degrade quietly after updates, and teams usually notice only after answer quality and trust have already slipped.
Hypothesis
If retrieval quality is measured against intent over time, semantic drift can be caught before it becomes a user-facing hallucination problem.
Outcome
Built a dashboard-first monitor that exposes similarity deltas, threshold breaches, and review states before retrieval quality collapses.
The build centers on a simple idea: retrieval quality should be observed like any other critical system, not treated as a hidden dependency behind model output. Instead of only judging final answers, this project compares retrieval results against the intent that the system was originally meant to satisfy.
That creates a cleaner failure signal. If retrieval quality drops, the dashboard shows the delta before a user ever sees a bad answer. The system becomes easier to debug because the issue is visible at the retrieval layer rather than disguised as a vague model-quality complaint.
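The core comparison is simple to state. A minimal sketch, assuming a stored baseline retrieval distance per query (function and variable names are illustrative, not the project's actual code):

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """1 - cosine similarity; 0.0 means the vectors point the same way."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def drift_delta(query_vec: np.ndarray,
                baseline_doc_vec: np.ndarray,
                live_doc_vec: np.ndarray) -> float:
    """Positive delta: live retrieval sits farther from the query's intent
    than the stored baseline did. Zero: nothing changed for this query."""
    return (cosine_distance(query_vec, live_doc_vec)
            - cosine_distance(query_vec, baseline_doc_vec))
```

Because the delta is anchored to the query's intent rather than to model output, an untouched query reads as exactly zero, which is what makes the control-group signal possible.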
I started this thinking the main risk in RAG was answer quality. The deeper lesson was that retrieval quality decays quietly, and observability is the only dependable way to catch it before people lose trust in the whole system.
01
Built a retrieval corpus of 19 Wikipedia AI articles embedded locally with all-MiniLM-L6-v2. Defined 8 benchmark queries, each mapped to a specific source document, to establish a repeatable baseline with no API dependency.
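A sketch of that baseline step. The toy `embed` below is a deterministic stand-in for `SentenceTransformer("all-MiniLM-L6-v2").encode`; all names and shapes are illustrative:

```python
import numpy as np

def embed(text: str, dim: int = 32) -> np.ndarray:
    # Toy embedder standing in for all-MiniLM-L6-v2; a real run would call
    # SentenceTransformer("all-MiniLM-L6-v2").encode(text) instead.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

def build_baseline(queries: dict[str, str], corpus: dict[str, str]) -> dict:
    """For each benchmark query, record the nearest doc and its distance.

    `queries` maps query text -> the source doc id it is meant to hit,
    so later runs can check both distance drift and target mismatch.
    """
    doc_vecs = {doc_id: embed(text) for doc_id, text in corpus.items()}
    baseline = {}
    for query, expected_doc in queries.items():
        qv = embed(query)
        dists = {d: 1.0 - float(qv @ v) for d, v in doc_vecs.items()}
        nearest = min(dists, key=dists.get)
        baseline[query] = {
            "expected": expected_doc,
            "nearest": nearest,
            "distance": dists[nearest],
        }
    return baseline
```

Storing the expected source alongside the distance is the part that makes the baseline repeatable with no API dependency: the whole check runs against local embeddings.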
02
Simulated corpus drift by silently replacing 4 of the 8 source documents with off-topic content and re-embedding the full corpus. The remaining 4 source documents were left untouched as a control group.
03
2 of the 4 swapped sources breached alert thresholds: 1 degraded, 1 review-required. The 4 untouched queries held at exactly zero delta. The other 2 swaps stayed within the healthy range, which is expected when the replacement content is topically adjacent. No false positives on the control group.
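The drift-and-classify steps above can be sketched end to end. The thresholds and per-query distances below are illustrative placeholders, not the project's actual alert levels or measurements:

```python
def classify(delta: float,
             review_threshold: float = 0.05,
             degraded_threshold: float = 0.15) -> str:
    """Map a distance delta to a review state. Thresholds are illustrative."""
    if delta >= degraded_threshold:
        return "degraded"
    if delta >= review_threshold:
        return "review-required"
    return "healthy"

def review_states(baseline: dict[str, float],
                  live: dict[str, float]) -> dict[str, str]:
    """Compare live per-query distances against the stored baseline.

    Untouched queries produce a zero delta and stay "healthy",
    which is what gives the control group its value.
    """
    return {q: classify(live[q] - baseline[q]) for q in baseline}

# One off-topic swap breaches the degraded threshold, one lands in
# review-required, one topically adjacent swap stays healthy, and one
# untouched query holds at exactly zero delta.
baseline = {"q1": 0.30, "q2": 0.30, "q3": 0.30, "q5": 0.30}
live     = {"q1": 0.50, "q2": 0.38, "q3": 0.32, "q5": 0.30}
states = review_states(baseline, live)
```

Keeping the classifier as a pure function of the delta is the design choice that makes it dashboard-friendly: every state shown on the chart is reproducible from two stored numbers.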
Analysis
Retrieval Distance — Baseline vs Drifted
Baseline scores captured against a known-good corpus of 19 Wikipedia AI articles. Drifted scores reflect the same 8 queries after 4 source articles were silently replaced. Q5–Q8 hold at zero delta: their source documents were untouched, confirming the monitor isolates only what actually changed.
[ Connect ]
If you are dealing with retrieval quality drift, eval blind spots, or production RAG observability, let's compare notes.
You are reaching
John Meyer
Security Engineer → AI