AI Evals & Discovery

Show notes

What you’ll learn in this episode:

  • What “evals” actually mean in the AI/ML world
  • Why evals are more than just quality assurance
  • The difference between golden datasets, synthetic data, and real-world traces
  • How to identify error modes and turn them into evals
  • When to use code-based evals vs. LLM-as-judge evals (see the sketch after this list)
  • How discovery practices inform every step of AI product evaluation
  • Why evals require continuous maintenance (and what “criteria drift” means for your product)
  • The relationship between evals, guardrails, and ongoing human oversight
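
To make the code-based vs. LLM-as-judge distinction concrete, here is a minimal Python sketch. It is not from the episode; every function name, prompt, and field is illustrative, and the judge call is stubbed out so you can wire it to whatever model client you use.

```python
# Minimal sketch contrasting the two eval styles discussed in the episode.
# All names here are illustrative, not from the episode.

import re


def code_based_eval(output: str) -> bool:
    """Code-based eval: a deterministic check you can express as code.
    Example: the response must contain a well-formed order ID."""
    return re.search(r"\bORD-\d{6}\b", output) is not None


JUDGE_PROMPT = """You are grading an AI assistant's reply.
Question: {question}
Reply: {reply}
Does the reply answer the question politely and accurately? Answer PASS or FAIL."""


def llm_as_judge_eval(question: str, reply: str, call_llm) -> bool:
    """LLM-as-judge eval: subjective criteria (tone, helpfulness) that are
    hard to encode as rules, so another model grades the output.
    `call_llm` is whatever client function you use to reach your model."""
    verdict = call_llm(JUDGE_PROMPT.format(question=question, reply=reply))
    return verdict.strip().upper().startswith("PASS")


if __name__ == "__main__":
    # Code-based evals are cheap and exact; run them on every trace.
    print(code_based_eval("Your order ORD-123456 has shipped."))  # True

    # LLM-as-judge evals cost a model call; a stub stands in here.
    fake_llm = lambda prompt: "PASS"
    print(llm_as_judge_eval("Where is my order?", "It shipped yesterday!", fake_llm))
```

A rough rule of thumb from the episode's framing: reach for code-based evals when the pass/fail criterion is objective and checkable (format, presence of required fields, policy violations you can pattern-match), and reserve LLM-as-judge evals for fuzzier qualities like tone or helpfulness, where you still spot-check the judge's verdicts against human labels.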

Resources & Links:

Mentioned in the episode:

Coming soon from Teresa:

  • Weekly Monday posts sharing lessons learned while building AI products
  • A new podcast interviewing cross-functional teams about real-world AI product development stories
