TechCrunch

New Microsoft tool lets devs spin up AI behavior tests using text descriptions

June 2, 2026 · Ram Iyer


While the AI research community has made significant strides in assessing models for broad concerns such as safety, compliance, alignment, and sycophancy, a distinct challenge remains for developers: ensuring their AI systems perform exactly as required for specific products or services. To address this niche requirement, Microsoft announced the launch of ASSERT on Tuesday. Short for Adaptive Spec-driven Scoring for Evaluation and Regression Testing, this new open-source framework aims to streamline the evaluation of application-specific AI behaviors.

According to Microsoft, ASSERT simplifies the testing process by leveraging AI to convert high-level, natural-language descriptions of goals, policies, or desired behaviors into comprehensive, scored test suites. The framework accepts plain-language inputs regarding expected model conduct and policies, then structures these into defined acceptable and unacceptable behaviors. It subsequently generates problem scenarios and test cases, executes them against the target system, and scores the outcomes. Furthermore, ASSERT records the AI’s decision-making paths, including intermediate actions and tool calls, allowing developers to pinpoint exactly where failures occur.
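To make that loop concrete, here is a minimal Python sketch of the general pattern: define an unacceptable behavior, run test cases against a system, record each intermediate step, and score the results. This is not ASSERT's API, which the announcement does not detail. Every name here (`Step`, `TestCase`, `run_case`, `toy_agent`) is hypothetical, and the test cases are written by hand rather than generated by a model.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    kind: str  # "tool_call" or "message"
    name: str


@dataclass
class TestCase:
    prompt: str
    forbidden_tools: List[str]  # unacceptable behavior: calling any of these


@dataclass
class Result:
    case: TestCase
    passed: bool
    failing_step: Optional[int]  # index into trace where the rule broke
    trace: List[Step]


def run_case(system: Callable[[str], List[Step]], case: TestCase) -> Result:
    """Execute one case and locate the first step that violates the rule."""
    trace = system(case.prompt)
    for i, step in enumerate(trace):
        if step.kind == "tool_call" and step.name in case.forbidden_tools:
            return Result(case, False, i, trace)
    return Result(case, True, None, trace)


def toy_agent(prompt: str) -> List[Step]:
    """Hypothetical stand-in for the system under test."""
    trace = [Step("tool_call", "read_file")]
    if "delete" in prompt.lower():
        trace.append(Step("tool_call", "delete_file"))
    trace.append(Step("message", "reply"))
    return trace


cases = [
    TestCase("Summarize the report", forbidden_tools=["delete_file"]),
    TestCase("Clean up: delete the old drafts", forbidden_tools=["delete_file"]),
]
results = [run_case(toy_agent, c) for c in cases]
pass_rate = sum(r.passed for r in results) / len(results)

for r in results:
    print(r.case.prompt, "PASS" if r.passed else f"FAIL at step {r.failing_step}")
print("pass rate:", pass_rate)
```

Keeping the full trace alongside the verdict is what allows a failure to be tied to a specific tool call rather than only to the final answer.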

Users can further tailor evaluations by providing specific system context, tools, and constraints. For instance, a developer building a document research agent could instruct ASSERT to verify that the system does not email individuals outside the organization, restricts confidential data to C-level executives, and delivers concise summaries that account for prior context. ASSERT would then automatically generate test cases to continuously verify adherence to these rules.
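As a rough illustration of what such rules could look like once turned into checks, the Python sketch below encodes three of them as functions over a recorded run. It is an assumption-laden toy, not Microsoft's implementation: the `ToolCall` shape, the `send_email` tool name, the `example.com` domain, the executive address list, and the 50-word limit are all invented for the example. The "accounts for prior context" rule is left out because it likely needs a model-based judge rather than a simple predicate.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# All of these values are hypothetical placeholders.
ORG_DOMAIN = "example.com"
C_LEVEL = {"ceo@example.com", "cfo@example.com"}
MAX_SUMMARY_WORDS = 50


@dataclass
class ToolCall:
    tool: str
    args: Dict[str, str]


def no_external_email(calls: List[ToolCall], summary: str) -> bool:
    """Every email must go to an address inside the organization."""
    return all(
        c.args["to"].endswith("@" + ORG_DOMAIN)
        for c in calls
        if c.tool == "send_email"
    )


def confidential_only_to_c_level(calls: List[ToolCall], summary: str) -> bool:
    """Emails flagged confidential may only go to C-level recipients."""
    return all(
        c.args["to"] in C_LEVEL
        for c in calls
        if c.tool == "send_email" and c.args.get("confidential") == "yes"
    )


def summary_is_concise(calls: List[ToolCall], summary: str) -> bool:
    """The final summary must stay under the word limit."""
    return len(summary.split()) <= MAX_SUMMARY_WORDS


RULES: Dict[str, Callable[[List[ToolCall], str], bool]] = {
    "no_external_email": no_external_email,
    "confidential_only_to_c_level": confidential_only_to_c_level,
    "summary_is_concise": summary_is_concise,
}


def check(calls: List[ToolCall], summary: str) -> Dict[str, bool]:
    """Return a pass/fail verdict per rule for one recorded run."""
    return {name: rule(calls, summary) for name, rule in RULES.items()}


# One recorded run that breaks the two email rules.
calls = [
    ToolCall("send_email", {"to": "partner@other.org", "confidential": "no"}),
    ToolCall("send_email", {"to": "analyst@example.com", "confidential": "yes"}),
]
report = check(calls, "Q3 revenue rose; see attached notes.")
print(report)
```

Running the same checks on every build is what turns a one-off evaluation into a regression test.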

Microsoft states that ASSERT addresses a critical gap left by general-purpose evaluations, which often fail to account for the unique context, policies, and tools inherent to a specific application or product. “One of the things we’ve learned is that evaluations are absolutely critical to making good decisions,” stated Sarah Bird, Chief Product Officer of Responsible AI at Microsoft. “Because if you don’t understand the behavior of the AI system, it’s really hard to know if it’s meeting your organization’s bar […] What we found is that if you really want to have a trustworthy system, you should evaluate many more dimensions that are application-specific.”

Bird noted that ASSERT can be used during development, after deployment, and for continuous monitoring. The release aligns with a broader industry trend toward repeatable testing and regression checks as models become increasingly capable. Concurrently, benchmark efforts such as Stanford’s HELM and MLCommons’ AILuminate, along with evaluation groups like METR, are working to measure model behavior under varying conditions.
