Technical Solutions

Saihub.info

Updated February 5, 2025

Technical solutions for safe and responsible AI are nascent. In addition to setting out widely applicable technical solutions on this page (and its sub-pages), Saihub identifies technical solutions that are specific to the harms listed in our harms register (see sub-pages of the Harms page).

There are many emerging technical approaches to AI safety. In its October 2023 policy paper Emerging processes for frontier AI safety, published in advance of the first AI Safety Summit, the UK Department for Science, Innovation and Technology (DSIT) identified nine processes (with both technical and governance elements) for companies to apply with the aim of ensuring AI safety:

Responsible Capability Scaling
Model Evaluations and Red Teaming
Model Reporting and Information Sharing
Security Controls including Securing Model Weights
Reporting Structure for Vulnerabilities
Identifiers of AI-generated Material
Prioritising Research on Risks Posed by AI
Preventing and Monitoring Model Misuse
Data Input Controls and Audits.

The interim International Scientific Report on the Safety of Advanced AI, published in May 2024 in connection with the AI Seoul Summit, includes a useful classification of technical approaches to AI safety:

Risk management and safety engineering -- including (a) risk assessment and (b) risk management
Training more trustworthy models -- including (a) alignment, (b) reducing hallucinations, (c) improving robustness, (d) removing hazardous capabilities and (e) analyzing and editing the inner workings of models
Monitoring and intervention -- including (a) detecting AI-generated content, (b) detecting anomalies and attacks, (c) explaining model actions and (d) building safeguards into AI systems
Technical approaches to fairness and representation -- including (a) mitigation of bias and discrimination and (b) associated challenges of avoiding bias
Privacy techniques.

This report was followed by a more detailed International AI Safety Report in January 2025. We will provide further detail on this report soon.

In the June 2024 paper Open-Endedness is Essential for Artificial Superhuman Intelligence, a group of researchers from Google DeepMind proposed a definition of "open-ended" AI (they assert that open-endedness is a fundamental attribute of superintelligent AI systems) and provided general thinking on safety approaches for controlling open-ended AI systems.

Specific Safety Techniques

The following are a few details on specific safety techniques. Over time, we will develop further detail on these techniques and add new ones.

Model Evaluations and Red Teaming -- see separate page.

Digital Watermarking -- see separate page.

Interpretability of large language models (LLMs) and other AI models is an important technique for managing safety issues. Research on LLM interpretability includes:

Anthropic, Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet (May 2024)
OpenAI, Extracting Concepts from GPT-4 (June 2024)
thesephist.com, Prism: mapping interpretable concepts and features in a latent space of language (June 2024)
Nature, Detecting hallucinations in large language models using semantic entropy (June 2024).

Standards. Technical approaches to AI governance can be set out in technical standards, which are evolving. As part of the UK National AI Strategy, the UK government has established an AI Standards Hub.

Toolkits and Frameworks. Google has launched:

a Responsible Generative AI Toolkit together with its Gemma model in February 2024
a Frontier Safety Framework -- "a set of protocols for proactively identifying future AI capabilities that could cause severe harm and putting in place mechanisms to detect and mitigate them" -- in May 2024.

Provably Secure AI. Early work has begun on approaches that use formal mathematical methods to prove (to a selected degree of probabilistic certainty) that AI models are safe against known harms. A 2021 article Trustworthy AI by Jeannette Wing provides an excellent summary of the approaches and challenges of such formal methods. In early 2024, the UK Advanced Research and Invention Agency initiated Safeguarded AI, a major research program on formal methods for AI safety. Such approaches are technically complex, and are likely to be primarily useful in applications where there are strong reasons, and a willingness, to sacrifice functionality for security (e.g. certain government / military applications).

An important question for all of these approaches is whether technical safety measures should be (a) a feature of AI models and/or (b) external to AI models. It is our strong intuition that successful safety approaches will combine model-based and non-model-based measures, with the balance depending upon the specific AI harm. An interesting blog post, AI safety is not a model property, suggests that approaches external to AI models are more important for certain kinds of misuse of AI models.

Government Entities

Numerous countries are establishing AI safety institutes (AISIs) to address technical issues for AI safety, beginning with the UK and US AISIs announced in November 2023 in connection with the first AI Safety Summit:

U.S. AI Safety Institute (USAISI)
- The US Department of Commerce announced the leadership of USAISI in February and April 2024.
- The U.S. AI Safety Institute Consortium includes over 200 entities that will contribute technical expertise to the work of USAISI.
UK AI Safety Institute.

At the AI Seoul Summit in May 2024, an International Network of AI Safety Institutes was announced, agreed by Australia, Canada, the EU, France, Germany, Italy, Japan, Korea, Singapore, the UK and the US. The network met for the first time in November 2024 in San Francisco, with representatives from Australia, Canada, the EU, France, Japan, Kenya, Korea, Singapore, the UK and the US. The Future Society published a report in November 2024 on the AISI network, examining cooperation (and potential for improved cooperation) among the various AISIs and between AISIs and major AI model developers.