Projects
Explorations in AI evaluation and quality engineering.
LLM-as-a-Judge – Hallucination Detection Sandbox
How do you know if your LLM-as-a-Judge is any good?
I built this to find out. It's a sandbox for running hallucination detection experiments—testing prompt strategies, measuring what actually works, and learning what metrics like F1 and Cohen's Kappa mean when applied to messy, real-world cases.
What I explored:
- 60+ hallucination cases across different failure modes
- Multiple prompt strategies—small changes, surprisingly different results
- Error breakdowns by slice to see where detection fails
- Token costs and API usage (evaluation isn't free)
- Every LLM prediction and rationale visible for inspection
What I found:
- Detection accuracy varies wildly depending on how you prompt
- Some hallucination types are easy to catch; others slip through consistently
- Metrics can look good while missing the failures that matter most
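As a minimal sketch of the metrics side, here is how F1 and Cohen's Kappa can be computed from scratch for a binary hallucination-detection judge. The labels and predictions are made-up illustrations, not cases from the sandbox; the point is that Kappa discounts chance agreement in a way raw accuracy does not.

```python
def f1_and_kappa(labels, preds):
    """Return (F1, Cohen's Kappa) for binary 0/1 labels and predictions."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    n = tp + fp + fn + tn
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    # Kappa compares observed agreement against agreement expected by chance
    p_o = (tp + tn) / n
    p_e = ((tp + fp) / n) * ((tp + fn) / n) + ((fn + tn) / n) * ((fp + tn) / n)
    kappa = (p_o - p_e) / (1 - p_e) if p_e != 1 else 1.0
    return f1, kappa

# Toy example: the judge catches 3 of 5 hallucinations, with 1 false alarm
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
preds  = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
f1, kappa = f1_and_kappa(labels, preds)  # f1 ≈ 0.667, kappa = 0.40
```

In practice a library like scikit-learn does this for you, but writing it out once makes it obvious why a judge can post a decent F1 while its Kappa stays mediocre.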
Built with a Streamlit frontend, a Python backend, the OpenAI API, SQL for results storage, and a custom metrics engine. Deployed on Streamlit Cloud.
This isn't a polished tool. It's a learning project—my way of building intuition for LLM evaluation.
Mapping NIST AI RMF to Evaluation Approaches
Policy documents are full of principles. "AI systems should be transparent." "Models should be tested for bias." "Risks should be documented."
I took frameworks like NIST AI RMF and asked: what would it look like to test for this? How do you turn "ensure robustness" into a concrete evaluation strategy?
What this prototype covers:
- Translates governance principles into evaluable questions and test ideas
- Maps NIST AI RMF requirements to candidate evaluation approaches in a Jupyter notebook
- Flags the gaps—where the framework is clear on intent but light on measurable checks
It's not comprehensive—it's a starting point and a way of thinking about the problem.
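The core data structure is simple. This is an illustrative sketch of the mapping idea, not the project's actual notebook: the principle texts and candidate evaluations below are my own examples, loosely paraphrasing NIST AI RMF's Measure and Manage functions.

```python
# Hypothetical mapping from governance principles to evaluable questions.
# Principle wording and evaluation ideas are illustrative examples.
RMF_TO_EVALS = {
    "MEASURE: systems are evaluated for trustworthy characteristics": [
        "Does F1 on the golden set exceed the release threshold?",
        "Is output variance across paraphrased inputs within tolerance?",
    ],
    "MANAGE: identified risks are documented and tracked": [
        "Does every known failure mode have an owner and a regression test?",
    ],
}

def coverage_gaps(mapping):
    """Return principles that have no candidate evaluation yet."""
    return [principle for principle, evals in mapping.items() if not evals]
```

The `coverage_gaps` helper captures the third bullet above: principles that end up with an empty list are exactly the places where the framework is clear on intent but light on measurable checks.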
AI Evaluation Monitoring System
How do you know if an AI assistant got worse yesterday?
At 10 million responses per day, you can't review everything manually. Evaluating all of them with a large LLM would cost ~$100K/day. This project designs a production-scale monitoring system that costs ~$105/day—a 99.9% reduction.
The approach: tiered evaluation.
- Rule-based checks run on all 10M responses (~$0)
- A fast LLM scores a 1% sample—100K responses ($30/day)
- A larger LLM reviews 5K flagged responses ($50/day)
- Humans handle 50 edge cases and calibrate the system ($25/day)
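The cost arithmetic above fits in a few lines. The tier volumes and per-tier costs come from the design; the baseline is the ~$100K/day all-LLM evaluation it replaces.

```python
# Back-of-envelope cost model for the tiered evaluation design.
TIERS = [
    # (tier, responses/day, cost/day in USD)
    ("rule-based checks",      10_000_000, 0),
    ("fast LLM, 1% sample",       100_000, 30),
    ("larger LLM, flagged",         5_000, 50),
    ("human review + calibration",     50, 25),
]

daily_cost = sum(cost for _, _, cost in TIERS)   # $105/day
baseline = 100_000                               # all-LLM evaluation, ~$100K/day
reduction = 1 - daily_cost / baseline            # ≈ 0.999, i.e. ~99.9%
```

Each tier handles roughly two orders of magnitude fewer responses than the one above it, which is why the expensive evaluators stay cheap in aggregate.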
The design covers:
- Tiered architecture with cost analysis at each level
- Quality metrics across five dimensions (correctness, completeness, conciseness, naturalness, safety)
- CI/CD pipelines with quality gates for both model and autograder deployment
- Alerting logic with severity levels and escalation paths
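To make the alerting bullet concrete, here is a hypothetical sketch of severity classification: a drop in an aggregate quality score against a rolling baseline maps to a severity level and escalation path. The thresholds and escalation wording are illustrative assumptions, not the design's actual values.

```python
def classify_alert(baseline_score: float, current_score: float) -> str:
    """Map a quality-score drop vs. the rolling baseline to a severity level.

    Thresholds are illustrative: a 10-point drop pages on-call, smaller
    drops escalate progressively less.
    """
    drop = baseline_score - current_score
    if drop >= 0.10:
        return "SEV1: page on-call, halt autograder-gated deploys"
    if drop >= 0.05:
        return "SEV2: notify team channel, open investigation"
    if drop >= 0.02:
        return "SEV3: log and watch the next evaluation window"
    return "OK"
```

A usage example: `classify_alert(0.90, 0.78)` crosses the 10-point threshold and returns the SEV1 path, while `classify_alert(0.90, 0.89)` stays within tolerance.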
For readers with a QE background: the design maps to familiar concepts—BVT (build verification testing) becomes golden set evaluation, regression testing becomes baseline comparison, and production monitoring becomes tiered evaluation.
This is a system design, not running code. It demonstrates how to think about AI evaluation as an operational problem at scale.