NIST Seeks Input on Draft AI Benchmark Evaluation Guidance

The National Institute of Standards and Technology is asking industry, government and research stakeholders to weigh in on a new draft framework aimed at improving how language models are evaluated through automated benchmarking.

NIST said Friday that its Center for AI Standards and Innovation, or CAISI, released an initial public draft of NIST AI 800-2, “Practices for Automated Benchmark Evaluations of Language Models,” and is accepting public comments through March 31.




Why Is NIST Issuing Guidance on Automated Benchmark Evaluations?

Automated benchmark evaluations are increasingly used to support AI procurement and deployment decisions, particularly when organizations face limited time or resources. However, NIST cautions that benchmarks are not suitable for every evaluation need. This reflects a growing concern that while these tests have become essential tools for assessing artificial intelligence performance, consistent standards for ensuring valid, reproducible and transparent results are still in their infancy.

The draft organizes guidance around three areas: defining evaluation objectives and selecting benchmarks, implementing and running evaluations, and analyzing and reporting results. It notes that automated benchmarks work best when tasks are structured, verifiable and stable over time, but are less effective for subjective, dynamic or human-in-the-loop evaluations.

What Does CAISI Recommend for Benchmark Design and Reporting?

One of the central recommendations is that evaluators should begin by clearly documenting what they are trying to measure and how results will be used.

CAISI emphasizes that evaluation objectives should specify both the intended use of the measurements and the underlying capability or construct being assessed. It also urges organizations to carefully select benchmarks, documenting what each benchmark actually measures and whether it directly aligns with the evaluation goal or serves only as a proxy.

Beyond benchmark selection, CAISI highlights the importance of evaluation protocol design — the operational procedures that shape results.

The draft identifies several emerging principles, including:

Comparability across models
External validity tied to real-world use
Cost control, since a higher reasoning effort can inflate performance
Safeguards against evaluation “cheating,” such as models searching for answers online

CAISI notes that providing internet access during evaluations is a particularly consequential decision, since it can introduce contamination and undermine benchmark integrity.

The draft also calls for stronger norms around statistical analysis and reporting. It recommends that evaluators quantify uncertainty through confidence intervals or standard errors, rather than treating benchmark scores as absolute measures. CAISI further advises that organizations should make qualified claims and avoid overgeneralizing benchmark outcomes beyond their intended scope.
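As a rough illustration of what reporting uncertainty alongside a score can look like, the Python sketch below computes a standard error and an approximate 95% confidence interval for a benchmark accuracy. It is not taken from the NIST draft: the function name `accuracy_with_interval` and the example counts are hypothetical, and the calculation assumes benchmark items are scored independently as correct or incorrect and that a normal approximation is acceptable.

```python
import math


def accuracy_with_interval(num_correct, num_items, z=1.96):
    """Return (accuracy, standard_error, (low, high)) for a pass/fail benchmark.

    Hypothetical helper for illustration only. Uses the normal (Wald)
    approximation for a binomial proportion; z=1.96 gives roughly 95% coverage.
    """
    if num_items <= 0:
        raise ValueError("num_items must be positive")
    if not 0 <= num_correct <= num_items:
        raise ValueError("num_correct must be between 0 and num_items")

    accuracy = num_correct / num_items
    # Standard error of a proportion: sqrt(p * (1 - p) / n)
    standard_error = math.sqrt(accuracy * (1 - accuracy) / num_items)
    # Clamp the interval to the valid range for a proportion.
    low = max(0.0, accuracy - z * standard_error)
    high = min(1.0, accuracy + z * standard_error)
    return accuracy, standard_error, (low, high)


# Made-up example: a model answers 780 of 1,000 benchmark items correctly.
acc, se, (low, high) = accuracy_with_interval(780, 1000)
print(f"accuracy={acc:.3f}, standard error={se:.4f}, 95% CI=({low:.3f}, {high:.3f})")
```

Reported this way, two models whose intervals overlap substantially would not support a strong claim that one outperforms the other. The normal approximation is a simplification; for small benchmarks or scores near 0 or 1, a different interval method may be more appropriate.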

The draft reflects CAISI’s growing mission as the federal government’s primary industry-facing hub for testing frontier AI models. Recent CAISI initiatives include seeking AI experts to work on national security risk evaluations, AI red-teaming and secure deployment guidance as part of the Trump administration’s AI Action Plan.

NIST has also separately requested industry input on security risks and safeguards for agentic AI systems, highlighting threats such as backdoor attacks and data poisoning.
