The Department of War, in coordination with the Office of the Director of National Intelligence, is seeking industry proposals for an evaluation harness and government-defined benchmarks that would enable rigorous, reproducible and vendor-agnostic testing of artificial intelligence systems against criteria specified by the government.
What Features Are Required in the Evaluation Harness?
According to the commercial solutions opening notice published by the Defense Innovation Unit, the War Department is pursuing an evaluation harness that connects to AI models, facilitates evaluation workflows and measures model performance against benchmarks. The harness should support human-in-the-loop, agentic and adversarial evaluations. It should simulate an integrated environment to continuously test and monitor an AI model's performance in challenging settings. Furthermore, the harness should generate evaluation reports and manage benchmark execution.
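The workflow the notice describes — connect to a model, execute benchmark tasks, score the responses and emit a report — can be sketched in miniature. All of the names and interfaces below are illustrative assumptions; the CSO does not prescribe an API.

```python
from dataclasses import dataclass, field
from typing import Callable

# Illustrative types only; the actual solicitation does not specify these.

@dataclass
class BenchmarkTask:
    prompt: str
    scorer: Callable[[str], float]  # maps a model response to a score in [0, 1]

@dataclass
class Benchmark:
    name: str
    tasks: list[BenchmarkTask] = field(default_factory=list)

def run_benchmark(model: Callable[[str], str], benchmark: Benchmark) -> dict:
    """Connect a model to a benchmark, run every task, and emit a report."""
    scores = [task.scorer(model(task.prompt)) for task in benchmark.tasks]
    return {
        "benchmark": benchmark.name,
        "num_tasks": len(scores),
        "mean_score": sum(scores) / len(scores) if scores else 0.0,
    }

# Usage with a stand-in "model" (an uppercasing function) and exact-match scorers.
echo_model = lambda prompt: prompt.upper()
bench = Benchmark("demo", [
    BenchmarkTask("hello", lambda r: 1.0 if r == "HELLO" else 0.0),
    BenchmarkTask("world", lambda r: 1.0 if r == "WORLD" else 0.0),
])
report = run_benchmark(echo_model, bench)
print(report)  # {'benchmark': 'demo', 'num_tasks': 2, 'mean_score': 1.0}
```

A production harness would additionally manage benchmark execution at scale and support the human-in-the-loop, agentic and adversarial modes the notice calls for; this sketch only shows the core connect-evaluate-report loop.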
What Standards Must the New Benchmarks Meet?
Vendors must provide methodologies for creating benchmarks across unclassified, secret and top secret workflows that are resistant to gaming, adaptable as requirements and AI models evolve, and supported by training materials. These benchmarks should identify capabilities for particular missions, break those capabilities into measurable tasks and create realistic evaluation scenarios. They should also define clear scoring criteria, establish fair performance baselines using open models and ensure benchmarks are valid, reliable and capable of distinguishing different levels of performance.
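Two of the requirements above — clear scoring criteria and baselines built from open models — amount to comparing a candidate's score against a reference score with an explicit decision rule. The function and margin below are assumptions for illustration, not drawn from the notice.

```python
# Illustrative decision rule; the threshold and labels are assumptions.

def grade_against_baseline(candidate_score: float, baseline_score: float,
                           margin: float = 0.05) -> str:
    """Compare a candidate model's benchmark score to an open-model baseline.

    A benchmark whose scores cannot separate models beyond the margin is not
    "capable of distinguishing different levels of performance."
    """
    if candidate_score >= baseline_score + margin:
        return "above baseline"
    if candidate_score <= baseline_score - margin:
        return "below baseline"
    return "indistinguishable from baseline"

print(grade_against_baseline(0.82, 0.70))  # above baseline
print(grade_against_baseline(0.71, 0.70))  # indistinguishable from baseline
```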
Why Is the Government Expanding AI Evaluation Capabilities?
The government is pursuing new evaluation systems to address the rapid advancement of AI technologies. The new infrastructure should be able to evaluate newly released AI models against mission-specific benchmarks. In addition, the system should assess human-machine collaboration to determine whether joint operations yield better mission outcomes than either humans or automated systems alone.
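The human-machine assessment described above reduces to comparing mission-outcome scores across three configurations. The metric, scores and configuration names below are hypothetical, used only to make the comparison concrete.

```python
# Hypothetical mission-outcome comparison; all values are illustrative.

def best_configuration(outcomes: dict[str, float]) -> str:
    """Return the configuration (human-only, machine-only, or joint)
    with the highest mission-outcome score."""
    return max(outcomes, key=outcomes.get)

# Example where joint human-machine teams outperform either alone.
results = {"human_only": 0.61, "machine_only": 0.58, "human_machine": 0.74}
print(best_configuration(results))  # human_machine
```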
The effort, dubbed “Mystic Depot,” follows calls by Pentagon leadership to accelerate the adoption of AI across warfighting and administrative operations, DefenseScoop reported. Interested vendors can submit their responses to the CSO by March 24.

