How to Use Azure AI Foundry’s Evaluation Tools Before You Ship

Craig Chiffers

2 months ago

TL;DR: Azure AI Foundry’s evaluation tools went GA in March 2026 and they’re genuinely good. There are three categories of built-in evaluator – quality, safety, and agent-specific – and they now feed directly into Azure Monitor for continuous production monitoring. This post walks through what each evaluator does, how to run your first evaluation in the portal, and why the shift from “pre-ship gate” to “live signal” matters more than you might think.

The Problem With Vibing Your Way to Production

Most AI projects I see go through roughly the same evaluation process: the builder asks the model a few questions, the answers look reasonable, and it ships. That’s not evaluation. That’s optimism.

The thing is, AI outputs fail in ways that are hard to spot manually. A RAG pipeline that seems fine in testing can quietly start returning low-groundedness responses when new documents get indexed. An agent that handles happy-path tool calls perfectly can fall apart when a user’s intent is slightly ambiguous. A model that passes all your hand-crafted test questions might still produce unsafe content on edge cases you didn’t think to test.

Structured evaluation catches this. Manual spot-checking doesn’t – at least not reliably, and not at scale.

The good news is that Azure AI Foundry now has a proper, GA evaluation framework that covers this well. Here’s how to use it.

The Three Evaluator Categories

Foundry’s built-in evaluators fall into three buckets. Understanding which one applies to your workload is the first step.

Quality Evaluators

These measure the overall quality of generated responses. There are two sub-types:

AI-assisted quality metrics use a model as a judge – they require an Azure OpenAI deployment to score your outputs. The evaluators in this group include coherence (does the response make logical sense?), fluency (is the language natural and well-formed?), relevance (does the response actually address the question asked?), and groundedness (is the response grounded in the source material, or is the model making things up?).

Groundedness is the one I’d tell anyone building a RAG system to set up first. It’s the single most important signal for whether your retrieval pipeline is doing its job.

NLP-based quality metrics are mathematical rather than model-judged, and they typically require ground truth data to calculate. ROUGE is the most common – it measures n-gram overlap between your model’s output and a reference answer. These are useful if you have a well-defined expected output, less useful if you’re evaluating open-ended generation.
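To make "n-gram overlap" concrete, here is a minimal sketch of ROUGE-1 (unigram overlap) in plain Python. It is a simplified illustration, not Foundry's implementation: the whitespace tokenisation and the `rouge_1` function name are my own assumptions, and real ROUGE implementations usually add stemming and other normalisation.

```python
from collections import Counter


def rouge_1(candidate: str, reference: str) -> dict:
    """Simplified ROUGE-1: unigram precision, recall and F1."""
    # Naive tokenisation (assumption): lowercase, split on whitespace.
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    if not cand_tokens or not ref_tokens:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}

    # Clipped overlap: a token counts at most as often as it appears
    # in both the candidate and the reference.
    overlap = sum((Counter(cand_tokens) & Counter(ref_tokens)).values())
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge_1(
    candidate="the invoice is due in 30 days",
    reference="the invoice is due within 30 days of issue",
)
# 6 of 7 candidate tokens match, against 9 reference tokens: f1 is about 0.75
```

Note how the score rewards word overlap rather than meaning, which is why these metrics struggle with open-ended generation where many different wordings are equally correct.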

Safety Evaluators

Safety evaluators identify potential content risks in generated output. They cover areas including hate speech and unfairness, violence, sexual content, self-harm, and protected material detection. Importantly, safety evaluators don’t require you to provide a model deployment – Foundry provisions its own GPT-4 instance to run the scoring.

If you’re building anything customer-facing, these are non-negotiable to run before go-live. Running them post-deployment on sampled traffic is even better, which I’ll come to shortly.

Agent Evaluators

This is the category most people aren’t running yet, and it’s where the real value is for anyone building agentic workflows.

When a user sends a message to an agent, a lot happens before they get a response: the agent has to understand what the user actually wants, decide which tools to call, call them correctly, use the results appropriately, and produce a final response that matches its instructions. Each of those steps can fail independently. The final response can look fine even when several intermediate steps were wrong.

Foundry’s agent evaluators give you unit-test-style coverage over each step:

Intent resolution – did the agent correctly identify what the user was asking for?
Tool call accuracy – did the agent call the right tool, with the right parameters?
Task adherence – did the final response follow the agent’s system prompt and assigned tasks?
Response completeness – did the agent actually answer the question, or did it dodge?

These evaluators return Pass/Fail scores with reasoning, which makes them genuinely actionable. If tool call accuracy is consistently failing on a particular intent, you know exactly where to look.

Note: Agent evaluation is still in public preview as of April 2026, while model and dataset evaluation are GA.

Running Your First Evaluation in the Portal

The quickest way to get started is through the Foundry portal. You can kick off an evaluation from three places: the Evaluation page (left nav → Evaluation → Create), the Models page (go to your model → Evaluation tab → Create), or the Agents page (go to your agent → Evaluation tab → Create). The entry point you use doesn’t change what’s available – it just pre-populates some fields for you.

What You’ll Need

Before you start, make sure you have:

A test dataset in CSV or JSONL format, or a plan to generate one synthetically
An Azure OpenAI connection with a deployed GPT model that supports chat completion (required for AI-assisted quality evaluations; not needed for safety-only runs)
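If you haven't worked with JSONL before, it is simply one JSON object per line. The sketch below writes and re-reads a tiny test dataset using only the standard library. The `query`, `response` and `context` field names follow the common schema; `ground_truth`, the file name and the sample rows are illustrative assumptions rather than anything Foundry mandates.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical sample rows; replace with your own test cases.
rows = [
    {
        "query": "When is the invoice due?",
        "response": "The invoice is due in 30 days.",
        "context": "Invoices are payable within 30 days of issue.",
        "ground_truth": "Within 30 days of issue.",
    },
    {
        "query": "Can I pay by card?",
        "response": "Yes, card payments are accepted.",
        "context": "We accept bank transfer and card payments.",
        "ground_truth": "Yes.",
    },
]


def write_jsonl(records, path):
    """Write one JSON object per line."""
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


dataset_path = Path(tempfile.mkdtemp()) / "eval_dataset.jsonl"
write_jsonl(rows, dataset_path)
loaded = read_jsonl(dataset_path)
```

Reading the file back before you upload it is a cheap way to catch malformed lines early.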

If you don’t have a test dataset yet, Foundry has a synthetic dataset generation feature worth knowing about. You specify the number of rows, describe the type of data you want, and optionally upload reference documents to ground the generation. It’s not a substitute for real user traffic data, but it’s a reasonable starting point when you have nothing.

Selecting Your Evaluators

Once you’ve configured your data source, you select which evaluators to run. The portal gives you checkboxes across the three categories. A few practical notes from experience:

The portal automatically maps your dataset fields to the fields each evaluator expects. This works well for standard schemas (query/response/context), but if your data has non-standard field names you may need to adjust the mapping manually.

Different evaluators have different data requirements. Groundedness needs the source context as well as the query and response. ROUGE needs ground truth reference answers. Tool call accuracy needs the full agent message trace including tool call steps, not just the final text response. Check the data requirements for each evaluator before you run – otherwise you’ll get scoring errors rather than useful results.
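One way to catch these problems before you upload anything is a quick local pre-flight check. The sketch below is a rough illustration: the `REQUIRED_FIELDS` table is my own simplification of the requirements described above, and the field names (`ground_truth`, `tool_calls`) are hypothetical, so check each evaluator's documentation for the real schema.

```python
# Assumed field requirements per evaluator (simplified, hypothetical names).
REQUIRED_FIELDS = {
    "groundedness": {"query", "response", "context"},
    "rouge": {"response", "ground_truth"},
    "tool_call_accuracy": {"query", "tool_calls"},
}


def missing_fields(rows, evaluators):
    """Return {evaluator: {row_index: [missing fields]}} for rows that lack data."""
    problems = {}
    for name in evaluators:
        required = REQUIRED_FIELDS[name]
        for index, row in enumerate(rows):
            # Treat empty strings, empty lists and None as missing.
            present = {k for k, v in row.items() if v not in (None, "", [])}
            missing = sorted(required - present)
            if missing:
                problems.setdefault(name, {})[index] = missing
    return problems


sample_rows = [
    {"query": "When is it due?", "response": "In 30 days.", "context": "Due in 30 days."},
    {"query": "Can I pay by card?", "response": "Yes.", "context": ""},
]
problems = missing_fields(sample_rows, ["groundedness", "rouge"])
# {'groundedness': {1: ['context']}, 'rouge': {0: ['ground_truth'], 1: ['ground_truth']}}
```

An empty result means every row has the fields the selected evaluators need under these assumptions.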

Reading the Results

Results appear in the Foundry portal as a scored dataset. For AI-assisted metrics you’ll see a score per row plus an explanation of the reasoning. For Pass/Fail agent metrics you’ll see the pass rate across your dataset plus per-row reasoning for failures.

The thing to look for isn’t a single row failing – it’s patterns. If groundedness is failing consistently for responses that cite a particular document type, that points to a retrieval pipeline issue. If tool call accuracy is failing on queries with certain phrasing, that points to a prompt engineering issue. The per-row reasoning makes these patterns much easier to surface than raw scores alone.
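If you export the scored rows, pattern hunting can start with a simple group-by. This is a minimal sketch that assumes each exported row carries a hypothetical `doc_type` tag and a `groundedness` result of `"pass"` or `"fail"`; your export will likely use different column names.

```python
from collections import defaultdict


def failure_rate_by(rows, key, metric):
    """Failure rate of `metric` for each distinct value of row[key]."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for row in rows:
        group = row.get(key, "unknown")
        totals[group] += 1
        if row[metric] == "fail":
            failures[group] += 1
    return {group: failures[group] / totals[group] for group in totals}


# Hypothetical exported results.
results = [
    {"doc_type": "pdf", "groundedness": "fail"},
    {"doc_type": "pdf", "groundedness": "fail"},
    {"doc_type": "pdf", "groundedness": "fail"},
    {"doc_type": "pdf", "groundedness": "pass"},
    {"doc_type": "html", "groundedness": "pass"},
    {"doc_type": "html", "groundedness": "pass"},
    {"doc_type": "wiki", "groundedness": "pass"},
    {"doc_type": "wiki", "groundedness": "fail"},
]
rates = failure_rate_by(results, key="doc_type", metric="groundedness")
# {'pdf': 0.75, 'html': 0.0, 'wiki': 0.5}
```

A skew like this one would send you to the PDF ingestion path first, and the per-row reasoning then tells you what is actually going wrong there.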

Beyond the Pre-Ship Gate: Continuous Monitoring

This is where the March 2026 GA announcement changes the game, and it’s worth spending some time on because it represents a genuine shift in how evaluation is supposed to work.

The traditional evaluation model is: build → evaluate → pass/fail gate → ship. The problem with that model is that the world doesn’t stop changing when you deploy. Models get updated. Your retrieval corpus changes. New user traffic surfaces intents that never appeared in your test dataset. Quality in production is a moving target, not a state you achieve once.

Evaluations, Monitoring, and Tracing in Microsoft Foundry are now GA through Foundry Control Plane, and they’re deeply integrated with Azure Monitor. This means your AI quality signals live in the same operational plane as the rest of your infrastructure – not in a separate AI-specific dashboard you have to remember to check.

In practical terms, this enables a few things that weren’t cleanly possible before:

Continuous evaluation on sampled production traffic. You can configure Foundry to evaluate a percentage of real production requests against your chosen evaluators on an ongoing basis. You don’t have to wait for a complaint or a manually triggered test run to know your groundedness score has dropped.
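To illustrate what "a percentage of requests" means in practice, here is a sketch of deterministic, hash-based sampling. This is not how Foundry implements sampling (you configure that in the service); the `should_evaluate` function and the `req-…` identifiers are hypothetical and only show the general idea.

```python
import hashlib


def should_evaluate(request_id: str, sample_rate: float) -> bool:
    """Deterministically select roughly `sample_rate` of requests."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate


# Roughly 10% of these hypothetical request IDs are selected.
sampled = [f"req-{i}" for i in range(10_000) if should_evaluate(f"req-{i}", 0.10)]
```

Hashing the request ID rather than rolling a random number means the same request is always in or out of the sample, which makes failures easier to reproduce later.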

Scheduled evaluation runs. For workloads where you want to test against a fixed dataset regularly (useful for detecting model drift after an upstream model update), you can schedule evaluation runs and get results piped into Azure Monitor automatically.

Azure Monitor alerts on evaluation metrics. Configure alert rules against any evaluation metric. Groundedness drops below 0.75? Trigger a PagerDuty incident. Safety violations spike above threshold? Send a Teams notification. These are standard Azure Monitor alert rules – same tooling you’d use for CPU or latency alerts.
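In practice the alert rule itself lives in Azure Monitor, but the underlying logic is simple enough to sketch locally. The class below is a hypothetical illustration of a rolling-average threshold check; it assumes scores normalised to the 0 to 1 range to match the 0.75 example, and the `RollingThresholdAlert` name and window size are my own choices.

```python
from collections import deque


class RollingThresholdAlert:
    """Fire when the mean of the last `window` scores drops below `threshold`."""

    def __init__(self, threshold: float, window: int):
        self.threshold = threshold
        self.scores = deque(maxlen=window)

    def observe(self, score: float) -> bool:
        """Record a score; return True if the alert condition is met."""
        self.scores.append(score)
        # Wait for a full window so a single bad score does not page anyone.
        if len(self.scores) < self.scores.maxlen:
            return False
        return sum(self.scores) / len(self.scores) < self.threshold


alert = RollingThresholdAlert(threshold=0.75, window=3)
fired = [alert.observe(score) for score in [0.9, 0.8, 0.8, 0.6, 0.5]]
# [False, False, False, True, True]
```

Averaging over a window is a design choice: it trades a little detection latency for far fewer false alarms than alerting on every individual low score.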

Cross-stack correlation. When a quality metric degrades, you often don’t know immediately whether it’s a model issue, a retrieval issue, or an infrastructure issue affecting latency and truncating context. With AI quality signals and infrastructure telemetry in the same Azure Monitor Application Insights workspace, you can correlate across them in minutes rather than spending hours manually comparing disconnected dashboards.

Evaluation Results + Traces: The Gap That’s Now Closed

One of the most practically useful things in the March GA release is something that sounds minor but makes a real operational difference: evaluation results are now linked to the underlying agent trace.

Previously, if you had a failing evaluation result, you knew that a response had scored poorly on task adherence or groundedness. What you didn’t have directly from the evaluation view was the full trace of what the agent actually did – which tools it called, what the retrieval returned, where in the reasoning chain things went wrong.

Now, a failing evaluation result links directly to the trace for that interaction. You can go from “task adherence failed” to “here’s exactly what the agent did, step by step” in a single click. That’s the difference between knowing something is broken and being able to fix it quickly.

This also changes how useful continuous production monitoring is. Sampling production traffic and running evaluators against it is only valuable if you can act on failures. With trace linking, a spike in failing safety evaluations is now directly debuggable – you can inspect the actual interactions that triggered the failures rather than just knowing the rate went up.

Where to Start

If you haven’t run any structured evaluation against your AI workload yet, the path I’d suggest is:

Start with one evaluator. If you’re running a RAG system, start with groundedness. If you’re running an agent, start with task adherence. Get a baseline score against a representative dataset.
Fix the obvious failures first. Use the per-row reasoning to identify patterns in what’s failing. Usually there are a small number of root causes responsible for most of the failures.
Add continuous monitoring. Once you have a baseline you’re happy with, set up continuous evaluation on sampled production traffic and configure Azure Monitor alerts on your key metrics. This is what prevents you from shipping a fix and then silently regressing two weeks later.
Layer in additional evaluators. Add safety evaluators if you haven’t already. Add agent evaluators if your workload involves tool calling. Build custom evaluators for any domain-specific quality criteria that the built-ins don’t cover.

The tooling is genuinely solid now that it’s GA. The main thing stopping most teams from using it properly isn’t the tooling – it’s the test dataset. If you don’t have one, the synthetic generation feature is worth trying, but investing time in capturing and labelling real user interactions will pay back quickly in evaluation quality.

Questions or comments below – happy to go deeper on custom evaluator configuration or the Azure Monitor alert setup if that’s useful.
