Data & AI
Conference - Short (25 min)
INTERMEDIATE

How to Evaluate AI Agents for Robust Intelligence

Evaluation Reliability Metrics Visualization

Sho Tanaka, Snowflake

Thursday, June 18, 12:40-13:05
Room 4B
Building AI agents has become accessible, but ensuring their reliability in production remains a major engineering challenge. Unlike deterministic software, agents are probabilistic, so a binary "Pass/Fail" test is often insufficient to capture the nuances of an agent's reasoning process.
In this talk, we explore "Evaluation-Driven Development", a paradigm shift for Python engineers building AI systems. We will focus on measuring the quality of agent trajectories using Python tools and visualizations.
The session covers:
  1. From Testing to Evaluation: Why we need to move beyond standard assertions to probabilistic scoring (0.0 to 1.0) for Generative AI.
  2. Metrics as Code: Implementing specific evaluation metrics in Python:
    1. Faithfulness: Scoring whether the answer is grounded in the retrieved context, in order to detect hallucinations.
    2. Tool Selection Accuracy: Evaluating whether the agent chose the correct tool (e.g., search vs. calculation) for the user's intent.
    3. Answer Relevancy: Using embedding similarity to measure whether the response actually answers the prompt.
  3. Visualizing the Black Box: A live demo using Streamlit. We will showcase a custom dashboard that runs these evaluations, allowing developers to visualize the "reasoning trace" and identify exactly where the agent failed (Retrieval layer vs. Generation layer).
  4. The Feedback Loop: How to use these evaluation scores to iteratively improve prompts and context retrieval logic.
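The three metrics listed above can be sketched as plain Python functions that each return a score between 0.0 and 1.0. This is only a minimal illustration under stated assumptions, not the speaker's implementation: the function names are hypothetical, faithfulness is approximated by token overlap with the retrieved context, and a bag-of-words vector stands in for a real embedding model.

```python
import math
import re
from collections import Counter


def _tokens(text: str) -> list[str]:
    """Lowercase alphanumeric tokens; a crude stand-in for real tokenization."""
    return re.findall(r"[a-z0-9]+", text.lower())


def faithfulness(answer: str, context: str) -> float:
    """Share of distinct answer tokens that also occur in the retrieved context.

    Assumption: token overlap is only a rough proxy for groundedness.
    """
    answer_tokens = set(_tokens(answer))
    if not answer_tokens:
        return 0.0
    context_tokens = set(_tokens(context))
    return len(answer_tokens & context_tokens) / len(answer_tokens)


def tool_selection_accuracy(chosen: list[str], expected: list[str]) -> float:
    """Share of trajectory steps where the chosen tool matches the expected one.

    Missing or extra steps count as misses.
    """
    steps = max(len(chosen), len(expected))
    if steps == 0:
        return 1.0
    hits = sum(1 for c, e in zip(chosen, expected) if c == e)
    return hits / steps


def answer_relevancy(prompt: str, answer: str) -> float:
    """Cosine similarity between prompt and answer vectors.

    Assumption: word-count vectors replace a real embedding model here.
    """
    a, b = Counter(_tokens(prompt)), Counter(_tokens(answer))
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    # Hypothetical single-turn example; all strings are made up for illustration.
    context = "The capital of France is Paris."
    answer = "Lyon is the capital."
    print("faithfulness:", faithfulness(answer, context))
    print("tool accuracy:", tool_selection_accuracy(["search"], ["search"]))
    print("relevancy:", answer_relevancy("What is the capital of France?", answer))
```

Because every metric returns a float rather than a boolean, scores can be logged per run and compared across prompt or retrieval changes instead of being reduced to a single assertion.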
Sho Tanaka

Snowflake

Japan

Sho Tanaka is a Lead Developer Advocate at Snowflake, focused on AI/ML and data engineering. He previously worked at Google (gTech), delivering ML and data solutions across Japan, APAC, and globally. He is a Google Developer Expert (AI/ML) and a co-founder of the MLOps community in Japan. He enjoys turning messy real-world ML projects into reproducible, production-minded architectures.