
RAG Drift Monitor

Semantic Drift Watcher

A retrieval observability experiment for catching semantic drift before degraded search turns into bad answers.

Project Brief

  • Proof surface: Live drift dashboard
  • Signal: BERT similarity
  • Why it matters: Catch failure early
01 - Project Brief

Problem, Hypothesis, Outcome.

Summary

A monitoring layer that compares live retrieval output against stored intent baselines so drift shows up before the model starts sounding wrong.

Problem

RAG systems can degrade quietly after updates, and teams usually notice only after answer quality and trust have already slipped.

Hypothesis

If retrieval quality is measured against intent over time, semantic drift can be caught before it becomes a user-facing hallucination problem.

Outcome

Built a dashboard-first monitor that exposes similarity deltas, threshold breaches, and review states before retrieval quality collapses.

02 - Goals & Stack

What the build was trying to do.

Goals

  • Detect silent retrieval drift early instead of waiting for support symptoms.
  • Compare live retrieval against saved intent baselines.
  • Surface a review workflow before answer quality degrades in production.

Technologies Used

  • Python
  • sentence-transformers (all-MiniLM-L6-v2)
  • ChromaDB
  • Streamlit
  • Plotly

03 - Breakdown & Notes

Implementation notes.

Breakdown

The build centers on a simple idea: retrieval quality should be observed like any other critical system, not treated as a hidden dependency behind model output. Instead of only judging final answers, this project compares retrieval results against the intent that the system was originally meant to satisfy.

That creates a cleaner failure signal. If retrieval quality drops, the dashboard shows the delta before a user ever sees a bad answer. The system becomes easier to debug because the issue is visible at the retrieval layer rather than disguised as a vague model-quality complaint.
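The comparison itself reduces to a similarity delta between the stored baseline and the live retrieval output. A minimal sketch in plain NumPy, assuming the vectors come from the MiniLM encoder (the project's actual helper names are not shown here, so these are hypothetical):

```python
import numpy as np

def drift_delta(baseline_vec: np.ndarray, live_vec: np.ndarray) -> float:
    """1 - cosine similarity: 0.0 means retrieval is unchanged, larger means more drift."""
    cos = float(np.dot(baseline_vec, live_vec)
                / (np.linalg.norm(baseline_vec) * np.linalg.norm(live_vec)))
    return 1.0 - cos

# An untouched source document yields an exact zero delta.
print(drift_delta(np.array([1.0, 0.0]), np.array([1.0, 0.0])))  # → 0.0
```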

Build notes

  • Intent baselines are stored per retrieval scenario.
  • Live retrieval output is measured against those baselines over time.
  • Thresholds mark healthy, degraded, and review-required states.
  • The review path is part of the design, not an afterthought.
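The three states in the notes above can be sketched as a simple threshold map. The cutoff values below are illustrative, not the project's tuned ones:

```python
# Illustrative cutoffs; real thresholds would be tuned per corpus and query set.
HEALTHY_MAX = 0.10
DEGRADED_MAX = 0.25

def drift_state(delta: float) -> str:
    """Map a similarity delta to healthy / degraded / review-required."""
    if delta <= HEALTHY_MAX:
        return "healthy"
    if delta <= DEGRADED_MAX:
        return "degraded"
    return "review-required"

print(drift_state(0.3))  # → review-required
```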

Lessons Learned

I started this thinking the main risk in RAG was answer quality. The deeper lesson was that retrieval quality decays quietly, and observability is the only dependable way to catch it before people lose trust in the whole system.

04 - Analysis

Findings.

01

Built a retrieval corpus of 19 Wikipedia AI articles embedded locally with all-MiniLM-L6-v2. Defined 8 benchmark queries, each mapped to a specific source document, to establish a repeatable baseline with no API dependency.
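The baseline capture can be illustrated with a toy nearest-neighbour lookup standing in for the ChromaDB + MiniLM pipeline (the vectors and doc ids are invented for illustration; real MiniLM embeddings are 384-dimensional):

```python
import numpy as np

# Toy 3-d stand-ins for MiniLM document embeddings.
corpus = {
    "doc_ai": np.array([1.0, 0.0, 0.0]),
    "doc_ml": np.array([0.0, 1.0, 0.0]),
}

def retrieve(query_vec: np.ndarray) -> tuple[str, float]:
    """Return the nearest document id and its cosine distance."""
    def dist(v: np.ndarray) -> float:
        return 1.0 - float(np.dot(query_vec, v)
                           / (np.linalg.norm(query_vec) * np.linalg.norm(v)))
    best = min(corpus, key=lambda d: dist(corpus[d]))
    return best, dist(corpus[best])

# Baseline: record which document each benchmark query should hit, and at what distance.
baseline = {"Q1": retrieve(np.array([0.9, 0.1, 0.0]))}
```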

02

Simulated corpus drift by silently replacing 4 of the 8 source documents with off-topic content and re-embedding the full corpus. The remaining 4 source documents were left untouched as a control group.
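The effect of a silent swap can be demonstrated in the same style: the query's distance to its source moves after the replacement while nothing else changes (all vectors illustrative):

```python
import numpy as np

def cos_dist(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine distance between two embedding vectors."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = np.array([1.0, 0.0])
original_doc = np.array([0.9, 0.1])   # on-topic source document
swapped_doc = np.array([0.2, 0.8])    # off-topic replacement

delta = cos_dist(query, swapped_doc) - cos_dist(query, original_doc)
print(delta > 0)  # → True: the swapped source scores measurably worse
```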

03

2 of the 4 swapped sources breached alert thresholds — 1 degraded, 1 review-required. The 4 untouched queries held at exact zero delta. The other 2 swaps stayed within the healthy range, which is expected when replacement content is topically adjacent. No false positives on the control group.

Analysis

Retrieval Distance — Baseline vs Drifted


Baseline scores captured against a known-good corpus of 19 Wikipedia AI articles. Drifted scores reflect the same 8 queries after 4 source articles were silently replaced. Q5–Q8 hold at zero delta — their source documents were untouched, confirming the monitor isolates only what actually changed.


Worth a conversation?

If you are dealing with retrieval quality drift, eval blind spots, or production RAG observability, let's compare notes.

All Projects →

You are reaching

John Meyer

Security Engineer → AI

  • Open to roles
  • Contract + consulting
  • Architecture advisory