Saihub.info
Model Evaluations & Red Teaming
Updated December 16, 2024
A key technical process for ensuring AI safety is to evaluate AI models after they have been developed, and ideally before they are put into use (or at least before they are put into widespread use).
Red Teaming
One broadly-applicable approach to model evaluations that has received substantial recent attention is "red teaming". This term has not always been used in a clear way. For example, OpenAI has written: "The term red teaming has been used to encompass a broad range of risk assessment methods for AI systems, including qualitative capability discovery, stress testing of mitigations, automated red teaming using language models, providing feedback on the scale of risk for a particular vulnerability, etc."
Some resources on red teaming include:
Government Resources
Governments are asserting significant roles in AI testing, for example:
September 2024 - The Japan AI Safety Institute released a Guide to Evaluation Perspectives on AI Safety and Guide to Red Teaming Methodology on AI Safety.
August 2024 - The US AI Safety Institute, part of the US National Institute of Standards and Technology (NIST), announced collaboration agreements with Anthropic and OpenAI on AI safety research, testing and evaluation, including pre- and post-release testing of AI models from both companies.
July 2024 - NIST released Dioptra, an open-source software package "aimed at helping AI developers and customers determine how well their AI software stands up to a variety of adversarial attacks".
Testing of AI systems through government / private sector cooperation is also a central feature of the EU AI Act.
Other Approaches
A 2023 article by Mokander et al proposes a three-pronged combined governance / technical approach to auditing large language models, involving (1) governance audits, (2) model audits and (3) application audits.
Private Entities
See the section "Red Teaming" above for some resources from OpenAI, Google and Microsoft. In addition:
July 2024 - Anthropic (developer of the Claude chatbot) solicited proposals for funding for third-party model evaluation solutions, in three categories:
AI Safety Level assessments
advanced capability and safety metrics
infrastructure, tools, and methods for developing evaluations.
December 2024 - MLCommons release the AILuminate benchmark for safety of LLM text interactions.
There are a significant number of start-ups that are developing model evaluation and observability solutions, to assess issues like bias, hallucination and others, including:
We are particularly interested in open source solutions for AI model evaluation, including: