Model Evaluations & Red Teaming

Updated January 16, 2025

 

A key technical process for ensuring AI safety is to evaluate AI models after they have been developed, and ideally before they are put into use (or at least before they are put into widespread use).

 

Red Teaming

One broadly applicable approach to model evaluations that has received substantial recent attention is "red teaming". The term has not always been used precisely. For example, OpenAI has written: "The term red teaming has been used to encompass a broad range of risk assessment methods for AI systems, including qualitative capability discovery, stress testing of mitigations, automated red teaming using language models, providing feedback on the scale of risk for a particular vulnerability, etc."

 

Some resources on red teaming include:

 

Government Resources

Governments are asserting significant roles in AI testing, for example:

  • September 2024 - The Japan AI Safety Institute released a Guide to Evaluation Perspectives on AI Safety and Guide to Red Teaming Methodology on AI Safety.

  • August 2024 - The US AI Safety Institute, part of the US National Institute of Standards and Technology (NIST), announced collaboration agreements with Anthropic and OpenAI on AI safety research, testing and evaluation, including pre- and post-release testing of AI models from both companies. 

  • July 2024 - NIST released Dioptra, an open-source software package "aimed at helping AI developers and customers determine how well their AI software stands up to a variety of adversarial attacks".

  • May 2024 - NIST launched the Assessing Risks and Impacts of AI (ARIA) program, intended to "assess the societal risks and impacts of artificial intelligence systems (i.e., what happens when people interact with AI regularly in realistic settings)".
  • April 2024 - The US and UK signed a Memorandum of Understanding on AI safety, committing to cooperation between the US and UK AI Safety Institutes on testing of advanced AI models.
  • February 2024 - The UK AI Safety Institute published a notice on its approaches to model evaluations.
  • February 2024 - The UK Department for Science, Innovation and Technology published guidance on AI assurance. The guidance defines “AI assurance” as the process for “measur[ing], evaluat[ing] and communicat[ing] the trustworthiness of AI systems”, and outlines a “toolkit” of capabilities and processes for delivering AI assurance. However, the UK AI Safety Institute has reportedly encountered difficulty in obtaining access to major AI models for pre-release testing.

 

Testing of AI systems through cooperation between government and the private sector is also a central feature of the EU AI Act.

 

Other Approaches

A 2023 article by Mökander et al. proposes a three-pronged approach to auditing large language models that combines governance and technical measures, involving (1) governance audits, (2) model audits and (3) application audits.

 

Private Entities

See the section "Red Teaming" above for some resources from OpenAI, Google and Microsoft. In addition:

  • July 2024 - Anthropic (developer of the Claude chatbot) invited proposals for funding of third-party model evaluation solutions, in three categories:

    • AI Safety Level assessments

    • advanced capability and safety metrics

    • infrastructure, tools, and methods for developing evaluations.

  • November 2024 - OpenAI published a blog post on its evolving human and automated red teaming methods, used for testing OpenAI's models.
  • December 2024 - MLCommons released the AILuminate benchmark for safety of LLM text interactions.

 

A significant number of start-ups are developing model evaluation and observability solutions to assess issues such as bias and hallucination, including:

 

We are particularly interested in open source solutions for AI model evaluation, including: