TechCrunch

New Microsoft tool lets devs spin up AI behavior tests using text descriptions

Title: Microsoft Unveils ASSERT: A New Framework for Text-Driven AI Behavior Testing

While the AI research community has made significant strides in assessing models for broad concerns such as safety, compliance, alignment, and sycophancy, a distinct challenge remains for developers: ensuring their AI systems perform exactly as required for specific products or services. To address this niche requirement, Microsoft announced the launch of ASSERT on Tuesday. Short for Adaptive Spec-driven Scoring for Evaluation and Regression Testing, this new open-source framework aims to streamline the evaluation of application-specific AI behaviors.

According to Microsoft, ASSERT simplifies the testing process by leveraging AI to convert high-level, natural-language descriptions of goals, policies, or desired behaviors into comprehensive, scored test suites. The framework accepts plain-language inputs regarding expected model conduct and policies, then structures these into defined acceptable and unacceptable behaviors. It subsequently generates problem scenarios and test cases, executes them against the target system, and scores the outcomes. Furthermore, ASSERT records the AI’s decision-making paths, including intermediate actions and tool calls, allowing developers to pinpoint exactly where failures occur.
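To make that flow concrete, here is a minimal Python sketch of the run-and-score step only. It is not ASSERT's API: `TestCase`, `run_suite`, `score`, and `toy_system` are hypothetical names, and the hand-written cases stand in for the ones the framework is described as generating from a plain-language spec.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class TestCase:
    prompt: str
    # Returns True when the system's output counts as acceptable behavior.
    is_acceptable: Callable[[str], bool]


@dataclass
class Result:
    prompt: str
    output: str
    passed: bool


def run_suite(system: Callable[[str], str], cases: List[TestCase]) -> List[Result]:
    """Execute every case against the target system and record pass/fail."""
    results = []
    for case in cases:
        output = system(case.prompt)
        results.append(Result(case.prompt, output, case.is_acceptable(output)))
    return results


def score(results: List[Result]) -> float:
    """Fraction of cases that passed; 0.0 for an empty suite."""
    return sum(r.passed for r in results) / len(results) if results else 0.0


def toy_system(prompt: str) -> str:
    """Stand-in for the AI system under test (purely illustrative)."""
    if "password" in prompt.lower():
        return "I can't share credentials."
    return "Summary: " + prompt[:40]


# Hand-written cases; in the described workflow these would be generated.
cases = [
    TestCase("What is the admin password?", lambda out: "can't" in out),
    TestCase("Summarize the Q3 report", lambda out: out.startswith("Summary:")),
    TestCase("Summarize in under 20 characters", lambda out: len(out) <= 20),
]

results = run_suite(toy_system, cases)
pass_rate = score(results)
```

Keeping per-case results alongside the aggregate score lets the same suite be rerun after every model or prompt change, so a drop in the pass rate can be traced back to the specific case that regressed.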

Users can further tailor evaluations by providing specific system context, tools, and constraints. For instance, a developer building a document research agent could instruct ASSERT to verify that the system does not email individuals outside the organization, restricts confidential data to C-level executives, and delivers concise summaries that account for prior context. ASSERT would then automatically generate test cases to continuously verify adherence to these rules.
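The first of those rules could be checked against a recorded trace of tool calls roughly as follows. This is an illustrative Python sketch under stated assumptions, not ASSERT's actual trace format: the `tool`/`args` dictionary layout, the `send_email` tool name, and the `example.com` internal domain are all hypothetical.

```python
from typing import Dict, List, Optional

INTERNAL_DOMAIN = "example.com"  # hypothetical organization domain


def first_external_email(trace: List[Dict]) -> Optional[int]:
    """Return the index of the first step that emails an outside address.

    Returns None when no step in the trace violates the rule.
    """
    for i, step in enumerate(trace):
        if step.get("tool") != "send_email":
            continue
        for recipient in step.get("args", {}).get("to", []):
            if not recipient.lower().endswith("@" + INTERNAL_DOMAIN):
                return i
    return None


# A made-up trace: one search followed by two emails.
trace = [
    {"tool": "search_docs", "args": {"query": "Q3 revenue"}},
    {"tool": "send_email", "args": {"to": ["ana@example.com"]}},
    {"tool": "send_email", "args": {"to": ["bob@example.com", "eve@partner.org"]}},
]

violation = first_external_email(trace)
```

Returning the index of the offending step, rather than a bare pass or fail, mirrors the idea of pinpointing where in the agent's decision path a failure occurred.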

Microsoft states that ASSERT addresses a critical gap left by general-purpose evaluations, which often fail to account for the unique context, policies, and tools inherent to a specific application or product. “One of the things we’ve learned is that evaluations are absolutely critical to making good decisions,” stated Sarah Bird, Chief Product Officer of Responsible AI at Microsoft. “Because if you don’t understand the behavior of the AI system, it’s really hard to know if it’s meeting your organization’s bar […] What we found is that if you really want to have a trustworthy system, you should evaluate many more dimensions that are application-specific.”

Bird noted that ASSERT can be used during development, after deployment, and for continuous monitoring. The release fits a broader industry trend toward repeatable testing and regression checks as models become increasingly capable. Concurrently, benchmark efforts such as Stanford’s HELM and MLCommons’ AILuminate, along with evaluation groups like METR, are building benchmarks designed to measure model behavior under varying conditions.


Source: TechCrunch Generated at: 2026-06-02 19:02:21 UTC
