AI Evals for Product Managers: Building Better Feedback Loops for AI Products
AI evals for product managers are rapidly becoming the defining skill for building trustworthy AI products. Why? Because unlike traditional software, AI systems behave probabilistically. They hallucinate, drift, and break in subtle ways that standard QA testing can't catch.
At the recent AI Product Summit, two leading experts—Aman Khan (Head of Product at Arize AI) and Ian Cairns (Co-Founder & CEO at Freeplay)—shared their battle-tested frameworks for how product managers can master AI evaluations. Both are deeply embedded in the world of AI evals and observability, and their sessions offered a practical blueprint for navigating the uncertainty of AI development.
Done properly, AI evals help product managers create better feedback loops and iterate faster to build more trustworthy AI products.
Aman focused on the foundations: how to define what "good" looks like in an AI product, how to encode that judgment into AI evals, and how to collaborate with engineering to build evaluations that actually reflect product expectations. Ian picked up the thread and walked through how AI evals operate across the full AI feedback loop: from experimentation to testing to live production.
We’ll cover:
- What AI evals are and why they matter for product managers
- How to write effective AI evaluations using the 4-part framework
- How to get started with your first AI eval in 5 steps
- Best practices for PM-engineering collaboration on AI testing
- Where AI evals fit in your product development cycle
What Are AI Evals and Why Do Product Managers Need Them?
AI evaluations (commonly called AI evals) are systematic tests that help teams measure the quality of AI systems. But unlike traditional software testing with clear pass/fail results, AI evals assess nuanced, probabilistic outputs across multiple dimensions like accuracy, tone, relevance, and safety.
The Problem: AI Systems Don't Behave Like Traditional Software
Here's what makes AI products different (and risky):
- Non-deterministic behavior: Large Language Models (LLMs) can generate different responses to identical inputs.
- Subtle failure modes: A prompt that works perfectly today might subtly break tomorrow with a model update.
- Judgment-based quality: "Good" isn't binary. It depends on tone, context, brand alignment, and user intent.
- Model drift: Performance degrades over time as data distributions and user behaviors shift.
- Hallucinations: AI systems confidently generate plausible but incorrect information.
Without systematic AI evaluation, product managers are flying blind. That variability introduces risk, and without a system to catch regressions, you can't ship with confidence.
"The reality is that LLMs hallucinate. Our job is to make sure that they don't embarrass us, our company, or our brand. You don't have to take my word for it, because the CPOs of the companies that are selling you LLMs are telling you that evals matter." — Aman Khan, Head of Product at Arize AI
What Makes AI Evals Different from Traditional Testing?
In the most practical sense, an AI eval is a structured test that helps you measure the quality of your AI product. But unlike software unit tests, AI evals aren't about binary outcomes. Instead:
- AI evals require judgment – You're defining subjective quality criteria
- AI evals rely on nuance – Context, tone, and user intent all matter
- AI evals force you to define "good" – What works for your product, users, and business
- AI evals must evolve continuously – They adapt as models change, use cases expand, and users behave unpredictably
Traditional software unit testing is like checking if a train stays on its tracks: straightforward, deterministic, clear pass/fail scenarios. AI evals for LLM-based systems, on the other hand, feel more like evaluating a driver navigating a busy city. The environment is variable, and the system is non-deterministic.
AI Evals Are About Decision-Making, Not Just Quality Assurance
AI evals aren't just about quality assurance (QA); they're about decision-making. Whether it's a chatbot issuing a refund or a summarizer parsing legal documents, every AI feature is making decisions on your behalf.
To protect your product's reputation, AI evaluations give you a framework to assess those decisions at every layer—retrieval, generation, tone, compliance—and improve them over time.
"We're basically running this loop over and over again, finding issues and then continuously building up a better set of evals—a better representative data set that lets you swap models quickly, change prompts, and add functionality for customers as needs emerge."— Ian Cairns, Co-Founder & CEO at Freeplay
The AI Feedback Loop: Where Evals Fit in Your Product Development
Ian Cairns' overview of the AI feedback loop.
The Basics:
- Inspect production data to find issues or errors
- Experiment with changes to improve quality
- Test any change before shipping to production
- Evals inform each stage, in different ways
Winning AI product teams treat AI evals as the nervous system of their product, not a one-time QA step. AI evaluations should be:
- Integrated across the build, test, and observe stages
- Tied not only to user satisfaction, but to business outcomes
- Aligned with how customers actually use the product—not just how teams imagine they will
Reducing surprises for AI products is everything.
How to Write an AI Eval: The 4-Part Framework
Building AI evals starts with asking a deceptively simple question: What should this AI product do and how do we know it's doing it well?
AI evaluations formalize that judgment. They convert product expectations into testable criteria that help teams measure performance, catch regressions, and iterate with confidence.
But writing a good AI eval is not solely a technical exercise, as most people assume. It's also a product exercise, because evals fundamentally optimize for what's considered a “good user experience.” Product managers bring the customer and business context that their technical peers may not have.
The Standard AI Eval Prompt Structure
According to Aman Khan, a standard AI eval prompt includes four essential parts:
1. Setting the Judge's Role
Define who is evaluating and from what perspective. This establishes the lens through which quality will be assessed.
Example:
"You are an expert customer service evaluator with 10+ years of experience at Fortune 500 companies."
2. Supplying Context
Provide background about the product, users, business constraints, and what matters to your brand.
Example:
"Our AI chatbot serves enterprise software users ranging from beginners to technical experts. We prioritize clear, jargon-free explanations and a professional yet friendly tone. Response time matters—users expect answers within 30 seconds."
3. Defining a Clear Goal
Specify exactly what you're evaluating. Be explicit about the criteria that matter.
Example:
"Evaluate whether the AI response:
- Accurately understood the user's technical problem
- Provided a clear, actionable solution
- Maintained a helpful, professional tone
- Avoided making promises about features we don't have"
4. Establishing What Each Label Means
Create a scoring rubric with specific definitions. This is where consistency comes from.
Example:
- 5 = Excellent: All criteria met, user likely satisfied, would recommend to others
- 3 = Acceptable: Core problem addressed but clarity or tone needs improvement
- 1 = Failed: Misunderstood problem, gave incorrect information, or used inappropriate tone
Example: Complete AI Eval for a Customer Service Chatbot
Role: You are an expert evaluator of customer service interactions with 10+ years of experience at enterprise software companies.
Context: Our AI chatbot helps users troubleshoot technical issues with our B2B software product. Users range from beginners (first-time users) to experts (IT administrators). Our brand voice is professional yet approachable. We prioritize accuracy over speed, and we never promise features that don't exist.
Goal: Evaluate whether the AI response:
- Correctly identified the user's technical issue
- Provided step-by-step guidance that's actionable
- Used clear language appropriate for the user's expertise level
- Maintained a helpful, patient tone
- Set accurate expectations about what's possible
Labels:
- 5 = Excellent: All criteria met. User can immediately solve their problem. Professional and empathetic tone throughout.
- 4 = Good: Solution is correct and clear, minor tone or clarity improvements possible.
- 3 = Acceptable: Core issue addressed but solution lacks clarity or assumes too much knowledge.
- 2 = Poor: Partially correct but missing key steps, or tone is too technical/condescending.
- 1 = Failed: Misunderstood the problem, gave incorrect information, or made false promises.
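To make the hand-off to engineering concrete, here's a minimal Python sketch of how those four parts could be assembled into a single judge prompt. The constant names, the build_eval_prompt helper, and the final instruction line are illustrative assumptions, not something prescribed in either talk.

```python
# Sketch: assembling the 4-part eval prompt (role, context, goal, labels)
# into one judge prompt. All names here are illustrative placeholders.

EVAL_ROLE = (
    "You are an expert evaluator of customer service interactions with 10+ "
    "years of experience at enterprise software companies."
)

EVAL_CONTEXT = (
    "Our AI chatbot helps users troubleshoot technical issues with our B2B "
    "software product. Users range from beginners to IT administrators. Our "
    "brand voice is professional yet approachable, and we never promise "
    "features that don't exist."
)

EVAL_GOAL = """Evaluate whether the AI response:
- Correctly identified the user's technical issue
- Provided step-by-step, actionable guidance
- Used clear language appropriate for the user's expertise level
- Maintained a helpful, patient tone
- Set accurate expectations about what's possible"""

EVAL_LABELS = """Score the response on this rubric:
5 = Excellent: all criteria met, user can immediately solve their problem.
4 = Good: correct and clear, minor tone or clarity improvements possible.
3 = Acceptable: core issue addressed but solution lacks clarity.
2 = Poor: partially correct, missing key steps, or condescending tone.
1 = Failed: misunderstood the problem, incorrect info, or false promises."""


def build_eval_prompt(user_query: str, ai_response: str) -> str:
    """Combine role, context, goal, and labels with the interaction to judge."""
    return "\n\n".join([
        EVAL_ROLE,
        EVAL_CONTEXT,
        EVAL_GOAL,
        EVAL_LABELS,
        f"User query:\n{user_query}",
        f"AI response to evaluate:\n{ai_response}",
        "Return only the numeric score (1-5) and a one-sentence justification.",
    ])
```

The benefit of a structure like this is that the PM-owned pieces (role, context, goal, labels) live in one reviewable place, separate from the plumbing that calls the judge model.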
Key Principles for Writing Effective AI Evals
You're not looking for a numeric rating alone. You're defining what "friendly" means for your tone model, or what counts as "correct" for your summarizer. The goal is consistency.
"You're defining the terminology and the label on top of that because what is considered 'good' in your business is not going to be the same as another company… AI evals apply at every level of the product funnel. They function like success metrics that show whether the end product is performing as intended." — Aman Khan, Head of Product at Arize AI
💡 Key Insight: For product managers, this means you don't have to be the engineer who implements every test, but you must help articulate the business context and user experience that matter. What does a good answer look like? What errors are unacceptable? What tone aligns with the brand?
Done well, AI evals both measure and shape performance. They influence how models behave, how prompts are written, and how engineering teams define "done." And they aren't static. As real-world usage evolves, so should your AI evaluations.
Best Practices: Building AI Evals From Real-World Failures
Ground Evals in Messy Data, Not Assumptions
"The goal of AI evals is to figure out what's not working in the system. I would recommend defining evals based on the failure states that you find rather than sitting in a room and coming up with a list of evals that you think are going to be right. Otherwise, all of your evals are passing, but customers still aren't happy." — Ian Cairns, Co-Founder & CEO at Freeplay
The best AI evals are grounded in messy and imperfect real-world data. They reflect how your product behaves under pressure, and they help you make informed choices about what to fix, what to ship, and when to keep digging.
How to Ground Your AI Evals:
- Start with production data: Don't create synthetic test cases in a vacuum. Use real user queries.
- Identify actual failure modes: What broke? Where did the AI hallucinate? When did users report issues?
- Create evals for each failure type: If users complain about tone, write a tone eval. If the AI misses context, write a context-awareness eval.
- Continuously update your eval set: As you discover new edge cases, add them to your testing dataset.
- Make evals challenging: Include ambiguous queries, unusual formatting, and edge cases that represent real usage.
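As a lightweight illustration of the first two points, here's a hypothetical Python sketch that pulls negative-feedback interactions out of an exported production log and tallies manually assigned failure tags. The JSONL format and the field names (user_feedback, failure_tag) are assumptions for the example, not any specific tool's schema.

```python
import json
from collections import Counter

# Sketch: seed an eval dataset from real production logs rather than
# synthetic cases. Assumes a JSONL export with one interaction per line,
# each containing "query", "response", and "user_feedback" fields.

def load_failures(log_path: str) -> list[dict]:
    """Collect interactions users flagged as bad; these seed the eval set."""
    failures = []
    with open(log_path) as f:
        for line in f:
            record = json.loads(line)
            if record.get("user_feedback") == "thumbs_down":
                failures.append(record)
    return failures


def summarize_failure_modes(failures: list[dict]) -> Counter:
    """Count manually assigned failure tags (tone, hallucination, etc.)."""
    return Counter(r.get("failure_tag", "untagged") for r in failures)


if __name__ == "__main__":
    failures = load_failures("production_logs.jsonl")  # assumed export path
    print(f"{len(failures)} flagged interactions to review")
    print(summarize_failure_modes(failures))
```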
How to Get Started with AI Evals: A 5-Step Process
Ready to write your first AI evaluation? Here's a practical process to get started today.
Step 1: Identify Your AI Product's Critical Failure Modes
Before writing any AI evals, understand what could go wrong with your specific AI product.
Common failure modes to consider:
- Hallucinations: Making up facts, citing non-existent sources
- Tone errors: Being too casual/formal, inappropriate humor, condescending language
- Context misses: Ignoring user intent, missing conversation history
- Compliance issues: Sharing sensitive data, violating regulations
- Scope creep: Answering questions outside intended domain
- Inconsistency: Different responses to similar queries
Action: Review your last 50-100 production interactions. What went wrong? What did users complain about?
Step 2: Define What "Good" Means for Your Specific Use Case
"Good" varies wildly by product, industry, and user base. A legal AI has different quality standards than a creative writing assistant.
Questions to answer:
- What does success look like for our users?
- What tone and style align with our brand?
- What level of accuracy is acceptable? (100% may not be possible)
- What tradeoffs are we willing to make? (Speed vs. completeness)
- What are the "unforgivable" errors? (Safety, legal, factual)
Action: Document your quality criteria in writing. Get alignment from stakeholders, engineering, and customer success teams.
Step 3: Write Your First AI Eval Using the 4-Part Framework
Use Aman's framework:
- Role: Who's evaluating?
- Context: What's the product, user, business situation?
- Goal: What are you measuring?
- Labels: How do you score it?
Start simple. Pick one failure mode (like tone or accuracy) and write a focused eval for just that dimension.
Action: Draft your first eval prompt. Test it on 10-20 real examples. Does it catch the issues you care about? Refine.
Step 4: Test on Real Production Data (Not Just Synthetic)
Synthetic test cases are useful, but they don't capture the chaos of real users.
Why real data matters:
- Users phrase questions unpredictably
- Real queries have typos, ambiguity, and context
- Edge cases you didn't imagine will appear
- Distribution shifts reveal model weaknesses
Action: Run your eval on 50-100 production samples. Look for patterns in failures. What's your eval missing?
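One way to run that check is sketched below in Python. The judge function is passed in as a parameter so this works with whatever LLM-as-a-judge call your team already has; the field names and sample size are illustrative.

```python
import random
from collections import Counter
from typing import Callable

# Sketch: score a random sample of production interactions with your eval
# and tally the score distribution to spot failure patterns.

def score_sample(
    interactions: list[dict],
    judge: Callable[[str, str], int],
    sample_size: int = 100,
) -> Counter:
    """Run the judge over a sample and return how often each score appears."""
    sample = random.sample(interactions, min(sample_size, len(interactions)))
    scores = Counter()
    for item in sample:
        scores[judge(item["query"], item["response"])] += 1
    return scores

# Example usage (with any judge function that returns a 1-5 score):
# distribution = score_sample(production_interactions, judge=my_llm_judge)
# print(distribution)  # e.g. Counter({5: 61, 4: 20, 3: 12, 2: 5, 1: 2})
```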
Step 5: Iterate Based on Observed Failures
AI evals are never "done." They're living documents that evolve with your product.
As you observe failures:
- Add new test cases to your eval dataset
- Refine your scoring rubric
- Create new evals for new failure modes
- Archive evals that no longer apply
Action: Schedule weekly "eval review" sessions with your team. What broke this week? How do we catch it next time?
How Product Managers and Engineers Should Collaborate on AI Evals
Product managers don't need to write eval frameworks or build testing pipelines. But they do need to be close to them. The most effective teams are the ones where product and engineering share the responsibility for validating quality.
That means more than reporting bugs. It means shaping the way the product defines correctness in the first place.
Why PMs Can't Stay Removed from the AI Eval Process
AI evals are a perfect place to collaborate. Product managers bring the understanding of what the user experience should feel like, while engineers bring the tools to test and observe it at scale.
When PMs stay too far removed:
- Quality becomes subjective
- Regressions go unnoticed
- Engineering optimizes for the wrong metrics
- Product-market fit suffers
The key is context. Not just in the eval prompt, but across the team.
What Product Managers Bring (and What Engineers Bring)
Product Managers provide:
✓ User context and pain points
✓ Business priorities and tradeoffs
✓ Brand voice and tone guidelines
✓ Definition of "good enough"
✓ Prioritization of which failures matter most
Engineers provide:
✓ Technical constraints and possibilities
✓ Testing infrastructure and automation
✓ Data pipelines and observability
✓ Model behavior insights
✓ Implementation of eval frameworks
How to Collaborate Effectively
PMs should be involved in:
- Defining what each eval measures – What quality dimensions matter for this feature?
- Reviewing eval outputs – Are the scores aligned with actual user experience?
- Interpreting results – What do these numbers mean for the product roadmap?
- Prioritizing fixes – Which failures are urgent vs. nice-to-fix?
That shared visibility builds confidence in not just the product, but the process behind it.
No single team owns quality. The best AI products are shaped by many hands. But when PMs and engineers anchor their collaboration in AI evals, the results speak for themselves: faster iteration, fewer blind spots, and more trust in what gets shipped.
Types of AI Evaluations Every PM Should Know
Different evaluation approaches serve different purposes in your AI product development cycle:
1. Human Evaluations
Actual users or expert evaluators provide direct feedback on AI outputs.
Implementation: Add feedback mechanisms like thumbs up/down buttons or rating scales.
Best for: High-stakes decisions, creative content, establishing "gold standard" examples.
Example: Spotify's Podcast AI Summary feature used human evaluators to rate summary quality across accuracy, comprehensiveness, and readability before building automated systems.
2. LLM-as-a-Judge Evaluations
Use another LLM to evaluate your primary LLM's outputs.
Implementation: Create prompts that instruct a "judge" LLM to evaluate specific aspects.
Best for: Scaling evaluations cost-effectively, consistent application of criteria.
Note: LLM-based evals are themselves natural language prompts, so they require the same care as your product prompts.
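A minimal sketch of the pattern, assuming a generic call_llm helper (a stand-in for whichever model client your team uses) and an eval prompt built with the role/context/goal/labels structure described earlier:

```python
import re

# Sketch of the LLM-as-a-judge pattern. call_llm() is a placeholder for
# your team's model client; eval_prompt is a role/context/goal/labels
# prompt that asks the judge to return a 1-5 score.

def judge_response(call_llm, eval_prompt: str) -> int | None:
    """Send the eval prompt to the judge model and parse out a 1-5 score."""
    raw_verdict = call_llm(eval_prompt)
    match = re.search(r"\b[1-5]\b", raw_verdict)
    return int(match.group()) if match else None  # None = unparseable verdict
```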
3. Grounded Evaluations
Compare AI outputs against objective sources of truth.
Implementation: Check outputs against known-correct answers, source documents, or databases.
Best for: Factual accuracy, data retrieval, summarization tasks.
Example: Verify that a legal AI's citations actually exist in the source documents.
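A simplified sketch of that kind of check, assuming citations and source documents can be compared by document ID (the IDs and return structure are illustrative):

```python
# Sketch of a grounded eval: verify that every citation the model produced
# actually points at a document in the source set.

def verify_citations(cited_ids: list[str], source_doc_ids: set[str]) -> dict:
    """Return whether all citations are grounded, and list any that aren't."""
    missing = [doc_id for doc_id in cited_ids if doc_id not in source_doc_ids]
    return {
        "grounded": len(missing) == 0,
        "missing_citations": missing,
    }

# Example usage:
# verify_citations(["case-101", "case-999"], {"case-101", "case-202"})
# -> {"grounded": False, "missing_citations": ["case-999"]}
```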
4. Automated Testing / Assertions
Code-based checks on specific properties of outputs.
Implementation: Unit tests that verify format, length, presence of required elements.
Best for: Structured outputs, API responses, data validation.
Example: Assert that generated SQL queries are syntactically valid before execution.
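A few assertion-style checks of the kind described here, sketched in Python. The expected schema (JSON with summary and next_steps keys) and the length limit are assumptions for illustration, not a general standard:

```python
import json

# Sketch of code-based assertions on an AI output that is expected to be
# JSON with "summary" and "next_steps" fields and to stay under a length cap.

def check_output(raw_output: str) -> list[str]:
    """Return a list of assertion failures (an empty list means all checks pass)."""
    failures = []
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]

    for required_key in ("summary", "next_steps"):
        if required_key not in data:
            failures.append(f"missing required field: {required_key}")

    if len(raw_output) > 1200:
        failures.append("output exceeds 1,200-character limit")

    return failures
```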
Frequently Asked Questions About AI Evals
Do I need to be technical to write AI evals?
No. AI evals are more about product judgment than technical implementation. Product managers define what "good" looks like based on user needs and business context, while engineers implement the testing infrastructure.
The 4-part framework (role, context, goal, labels) is written in plain English, not code. Your PM intuition about user experience is exactly what's needed.
How are AI evals different from traditional QA testing?
Traditional QA tests for binary pass/fail outcomes with deterministic software. If you input X, you always get Y.
AI evals assess probabilistic, nuanced outputs across multiple dimensions like tone, relevance, and accuracy. The same input can generate different outputs, so evals define what "acceptable" looks like rather than what's "correct."
How often should I run AI evals?
AI evaluations should run continuously:
- Automated evals: On every code/prompt change (like unit tests)
- Human evaluation: Weekly on production samples
- Comprehensive reviews: Before major releases
- Ad-hoc testing: Whenever you observe unusual behavior
The goal is to catch issues before they reach users, then continuously monitor in production.
What's the difference between evals and monitoring?
Evals test AI systems before deployment (like unit tests). You define quality criteria and test against them during development.
Monitoring tracks performance after deployment. You observe how the AI behaves with real users in production.
Both are essential. Evals catch issues early; monitoring ensures they don't degrade over time.
Can AI evals catch hallucinations?
Yes, but only if you design specific AI evals for factual accuracy. This often requires:
- Grounded evals that check outputs against source documents
- Citation verification to ensure references actually exist
- Fact-checking prompts in your LLM-as-judge evals
- Human review for high-stakes claims
Hallucination detection is one of the most important (and challenging) aspects of AI evaluation.
How do I get buy-in from engineering to invest in evals?
Frame AI evals as:
- Risk mitigation: Catching embarrassing failures before customers see them
- Velocity: Enabling faster iteration with confidence
- Product quality: The difference between a demo and a production-ready product
- Career development: A skill that will define successful AI PMs and engineers
Share examples of competitors' AI failures that evals would have caught. Show how systematic testing reduces firefighting and enables shipping faster.
What if my AI eval scores don't match user feedback?
This is a signal that your eval criteria don't align with user expectations.
Actions to take:
- Review recent user complaints and support tickets
- Conduct user interviews about what quality means to them
- Revise your eval rubric based on user priorities
- Test the updated eval on samples where users complained
- Iterate until eval scores predict user satisfaction
Your AI evals should be validated against real user outcomes, not just internal assumptions.
Learn more about AI Evals for Product Teams
To learn more about common mistakes with AI evals and how modern product teams are incorporating them into their processes, watch Ian and Aman’s talks at AI Product Summit.