Projects
Explorations in AI evaluation and quality engineering.
LLM-as-a-Judge – Hallucination Detection Sandbox
How do you know if your LLM-as-a-Judge is any good?
I built this to find out. It's a sandbox for running hallucination detection experiments—testing prompt strategies, measuring what actually works, and learning what metrics like F1 and Cohen's Kappa mean when applied to messy, real-world cases.
What I explored:
- 60+ hallucination cases across different failure modes
- Multiple prompt strategies—small changes, surprisingly different results
- Error breakdowns by slice to see where detection fails
- Token costs and API usage (evaluation isn't free)
- Every LLM prediction and rationale visible for inspection
What I found:
- Detection accuracy varies wildly depending on how you prompt
- Some hallucination types are easy to catch; others slip through consistently
- Metrics can look good while missing the failures that matter most
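As a minimal sketch of the metrics side, here is how F1 and Cohen's Kappa can be computed from scratch for a binary hallucination-detection judge. The labels and predictions are made-up illustrations, not cases from the sandbox; the point is that Kappa discounts chance agreement in a way raw accuracy does not.

```python
def f1_and_kappa(labels, preds):
    """Return (F1, Cohen's Kappa) for binary 0/1 labels and predictions."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    n = tp + fp + fn + tn
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    # Kappa compares observed agreement against agreement expected by chance
    p_o = (tp + tn) / n
    p_e = ((tp + fp) / n) * ((tp + fn) / n) + ((fn + tn) / n) * ((fp + tn) / n)
    kappa = (p_o - p_e) / (1 - p_e) if p_e != 1 else 1.0
    return f1, kappa

# Toy example: the judge catches 3 of 5 hallucinations, with 1 false alarm
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
preds  = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
f1, kappa = f1_and_kappa(labels, preds)  # f1 ≈ 0.667, kappa = 0.40
```

In practice a library like scikit-learn does this for you, but writing it out once makes it obvious why a judge can post a decent F1 while its Kappa stays mediocre.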
Built with a Streamlit frontend, a Python backend, the OpenAI API, SQL for results storage, and a custom metrics engine. Deployed on Streamlit Cloud.
This isn't a polished tool. It's a learning project—my way of building intuition for LLM evaluation.
Mapping NIST AI RMF to Evaluation Approaches
Policy documents are full of principles. "AI systems should be transparent." "Models should be tested for bias." "Risks should be documented."
I took frameworks like NIST AI RMF and asked: what would it look like to test for this? How do you turn "ensure robustness" into a concrete evaluation strategy?
What this prototype covers:
- Translates governance principles into evaluable questions and test ideas
- Maps NIST AI RMF requirements to candidate evaluation approaches in a Jupyter notebook
- Flags the gaps—where the framework is clear on intent but light on measurable checks
It's not comprehensive—it's a starting point and a way of thinking about the problem.
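The core data structure is simple. This is an illustrative sketch of the mapping idea, not the project's actual notebook: the principle texts and candidate evaluations below are my own examples, loosely paraphrasing NIST AI RMF's Measure and Manage functions.

```python
# Hypothetical mapping from governance principles to evaluable questions.
# Principle wording and evaluation ideas are illustrative examples.
RMF_TO_EVALS = {
    "MEASURE: systems are evaluated for trustworthy characteristics": [
        "Does F1 on the golden set exceed the release threshold?",
        "Is output variance across paraphrased inputs within tolerance?",
    ],
    "MANAGE: identified risks are documented and tracked": [
        "Does every known failure mode have an owner and a regression test?",
    ],
}

def coverage_gaps(mapping):
    """Return principles that have no candidate evaluation yet."""
    return [principle for principle, evals in mapping.items() if not evals]
```

The `coverage_gaps` helper captures the third bullet above: principles that end up with an empty list are exactly the places where the framework is clear on intent but light on measurable checks.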
AI Evaluation Monitoring System
How do you know if an AI assistant got worse yesterday?
At 10 million responses per day, you can't review everything manually. Evaluating all of them with a large LLM would cost ~$100K/day. This project designs a production-scale monitoring system that costs ~$105/day—a 99.9% reduction.
The approach: tiered evaluation.
- Rule-based checks run on all 10M responses (~$0)
- A fast LLM scores a 1% sample—100K responses ($30/day)
- A larger LLM reviews 5K flagged responses ($50/day)
- Humans handle 50 edge cases and calibrate the system ($25/day)
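The cost arithmetic above fits in a few lines. The tier volumes and per-tier costs come from the design; the baseline is the ~$100K/day all-LLM evaluation it replaces.

```python
# Back-of-envelope cost model for the tiered evaluation design.
TIERS = [
    # (tier, responses/day, cost/day in USD)
    ("rule-based checks",      10_000_000, 0),
    ("fast LLM, 1% sample",       100_000, 30),
    ("larger LLM, flagged",         5_000, 50),
    ("human review + calibration",     50, 25),
]

daily_cost = sum(cost for _, _, cost in TIERS)   # $105/day
baseline = 100_000                               # all-LLM evaluation, ~$100K/day
reduction = 1 - daily_cost / baseline            # ≈ 0.999, i.e. ~99.9%
```

Each tier handles roughly two orders of magnitude fewer responses than the one above it, which is why the expensive evaluators stay cheap in aggregate.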
The design covers:
- Tiered architecture with cost analysis at each level
- Quality metrics across five dimensions (correctness, completeness, conciseness, naturalness, safety)
- CI/CD pipelines with quality gates for both model and autograder deployment
- Alerting logic with severity levels and escalation paths
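To make the alerting bullet concrete, here is a hypothetical sketch of severity classification: a drop in an aggregate quality score against a rolling baseline maps to a severity level and escalation path. The thresholds and escalation wording are illustrative assumptions, not the design's actual values.

```python
def classify_alert(baseline_score: float, current_score: float) -> str:
    """Map a quality-score drop vs. the rolling baseline to a severity level.

    Thresholds are illustrative: a 10-point drop pages on-call, smaller
    drops escalate progressively less.
    """
    drop = baseline_score - current_score
    if drop >= 0.10:
        return "SEV1: page on-call, halt autograder-gated deploys"
    if drop >= 0.05:
        return "SEV2: notify team channel, open investigation"
    if drop >= 0.02:
        return "SEV3: log and watch the next evaluation window"
    return "OK"
```

A usage example: `classify_alert(0.90, 0.78)` crosses the 10-point threshold and returns the SEV1 path, while `classify_alert(0.90, 0.89)` stays within tolerance.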
For readers with a QE background: the design maps to familiar concepts—BVT (build verification testing) becomes golden set evaluation, regression testing becomes baseline comparison, and production monitoring becomes tiered evaluation.
This is a system design, not running code. It demonstrates how to think about AI evaluation as an operational problem at scale.