How to evaluate AI Agent to be robust Intelligence

Evaluation Reliability Metrics Visualization

Sho Tanaka, Snowflake

When and where

Thursday, June 18, 12:40-13:05

Room 4B

Description

Building AI agents has become accessible, but ensuring their reliability in production is a major engineering challenge. Unlike deterministic software, agents are probabilistic, so a binary "Pass/Fail" test is often insufficient to capture the nuances of an agent's reasoning process.
In this talk, we explore "Evaluation-Driven Development", a paradigm shift for Python engineers building AI systems. We will focus on measuring the quality of agent trajectories using Python tools and visualizations.
The session covers:

From Testing to Evaluation: Why we need to move beyond standard assertions to probabilistic scoring (0.0 to 1.0) for Generative AI.
Metrics as Code: Implementing specific evaluation metrics in Python:
1. Faithfulness: Scoring whether the answer is grounded in the retrieved context to detect hallucinations.
2. Tool Selection Accuracy: Evaluating whether the agent chose the correct tool (e.g., search vs. calculation) for the user's intent.
3. Answer Relevancy: Using embedding similarity to measure if the response actually answers the prompt.
Visualizing the Black Box: A live demo using Streamlit. We will showcase a custom dashboard that runs these evaluations, allowing developers to visualize the "reasoning trace" and identify exactly where the agent failed (Retrieval layer vs. Generation layer).
The Feedback Loop: How to use these evaluation scores to iteratively improve prompts and context retrieval logic.
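
To make the "Metrics as Code" idea concrete, here is a minimal sketch of the three metrics as plain Python functions that each return a score between 0.0 and 1.0. This is not the speaker's implementation: the function names, the 0.6 threshold, and the sample data are assumptions made for illustration. To stay self-contained it swaps in simple token overlap for faithfulness and bag-of-words cosine similarity for answer relevancy, where a real system would more likely use an LLM judge or an embedding model.

```python
import math
import re
from collections import Counter


def _tokens(text: str) -> list[str]:
    """Lowercase word tokens; a crude stand-in for a real tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def faithfulness(answer: str, context: str, threshold: float = 0.6) -> float:
    """Fraction of answer sentences whose words mostly appear in the context.

    Assumption: token overlap approximates "grounded in the context".
    The 0.6 threshold is arbitrary and chosen only for this sketch.
    """
    context_vocab = set(_tokens(context))
    sentences = [s for s in re.split(r"[.!?]+", answer) if _tokens(s)]
    if not sentences:
        return 0.0
    supported = 0
    for sentence in sentences:
        words = _tokens(sentence)
        overlap = sum(w in context_vocab for w in words) / len(words)
        if overlap >= threshold:
            supported += 1
    return supported / len(sentences)


def tool_selection_accuracy(chosen: list[str], expected: list[str]) -> float:
    """Fraction of expected steps where the agent picked the expected tool.

    Steps the agent never reached count as misses.
    """
    if not expected:
        return 0.0
    hits = sum(c == e for c, e in zip(chosen, expected))
    return hits / len(expected)


def answer_relevancy(question: str, answer: str) -> float:
    """Cosine similarity of bag-of-words counts (stand-in for embeddings)."""
    q, a = Counter(_tokens(question)), Counter(_tokens(answer))
    dot = sum(q[w] * a[w] for w in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(
        sum(v * v for v in a.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical sample data, invented for this sketch.
context = "The Eiffel Tower is in Paris. It was completed in 1889."
answer = "The Eiffel Tower is in Paris. It was built by aliens."

# One of the two answer sentences is supported by the context.
print(faithfulness(answer, context))  # 0.5

# The agent picked the expected tool in two of three steps.
print(tool_selection_accuracy(
    ["search", "calculator", "search"],
    ["search", "calculator", "calculator"],
))
```

Because every metric returns a float rather than a boolean, the scores can be logged per run, averaged over repeated runs, and compared against a threshold chosen by the team, which is the shift from assertions to scoring that the abstract describes.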

visualization

reliability

metrics

evaluation

Speakers

Sho Tanaka

Snowflake

Japan

Sho Tanaka is a Lead Developer Advocate at Snowflake, focused on AI/ML and data engineering. He previously worked at Google (gTech), delivering ML and data solutions across Japan, APAC, and globally. He is a Google Developer Expert (AI/ML) and a co-founder of the MLOps community in Japan. He enjoys turning messy real-world ML projects into reproducible, production-minded architectures.