MITRE, FAA Launch Aerospace LLM Evaluation Benchmark

The Federal Aviation Administration and MITRE have unveiled a benchmark to facilitate the assessment of large language models, or LLMs, for aerospace tasks.

ALUE Benchmark

MITRE said Wednesday the Aerospace Language Understanding Evaluation, or ALUE, benchmark is designed to streamline the inference and evaluation of LLMs using information specific to the aerospace domain.

ALUE supports open-source and domain-specific LLMs, custom datasets, user-defined prompts and various quantitative performance metrics. LLM evaluations are important in assessing a model’s performance and understanding its potential risks and limitations, including biases, hallucinations and privacy concerns.

“ALUE allows the FAA and the aerospace community to create a definitive library of diverse and specific aviation nomenclature and terms that will enable the agency to harness the power of AI for tools and tasks that will continuously improve safety and efficiency today and into the future,” said Kerry Buckley, vice president at MITRE and director of the Center for Advanced Aviation System Development.

MITRE noted that ALUE will help ensure artificial intelligence tools are fit to improve the safety of the National Airspace System.

Ongoing & Future Work Related to ALUE Benchmark

According to MITRE, ongoing work will expand the ALUE benchmark’s scope to address more complex aerospace challenges, for example by adding tasks for extracting complex data from charts.

For future work, the nonprofit organization said it expects the benchmark to integrate tasks that require LLMs to consult aircraft operational manuals and other external data sources to determine parameters such as thrust and flap settings under specific conditions.