Technical - Adversarial Attacks

Saihub.info

Category: Technical

Potential Source of Harm: Adversarial Attacks

Updated February 5, 2025

Nature of Harm

Adversarial attacks involve the use of specially-structured data that causes AI models to perform other than as expected or intended. Adversarial data is usually presented to a trained model that is in use, but may also be included in training data. Research shows that many AI models are inherently vulnerable to such attacks.

There are many types of adversarial attacks, which are constantly evolving (similar to the way that other computer malware evolves). Example of attacks include:

large language model (LLM) "jailbreaks" and other prompt injection attacks -- This involves specially-designed prompts that avoid the "guardrails" that are included in many LLMs to avoid toxic, dangerous and/or otherwise unwanted content. Some resources on these attacks are:
- Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs through a Global Scale Prompt Hacking Competition (arXiv paper March 2024)
- Jailbreaking Black Box Large Language Models in Twenty Queries (arXiv paper October 2023)
- Lakera, Jailbreaking Large Language Models: Techniques, Examples, Prevention Methods
- Stanford Human-Centered Artificial Intelligence, Can Foundation Models Be Safe When Adversaries Can Customize Them?
- Gary Marcus, When it comes to security, LLMs are like Swiss cheese — and that’s going to cause huge problems (October 2024).
manipulation of images or physical objects to defeat image recognition -- This involves adding specially-structured data to an image or applying additional features to a physical object, in a way that changes the classification of an object by an AI model without fundamentally changing how the object is perceived by a human. For example, this article from Carnegie Mellon University gives well-known examples including misclassification of a stop sign as a speed limit sign and a panda as a gibbon.
data exfiltration, e.g.
- Data exfiltration from Writer.com with indirect prompt injection (December 2023)
- Hacker plants false memories in ChatGPT to steal user data in perpetuity (September 2024).

Regulatory and Governance Solutions

Regulation governing the robustness of AI systems is beginning to emerge. For example:

The EU AI Act includes requirements on risk management and data goverance for AI systems.
The Chinese Regulations on the Administration of Internet Information Service Recommendation Algorithms require that algorithms be assessed for efficacy and security.

Requirements like these fairly clearly include an obligation to avoid dangerous vulnerability to adversarial attacks.

The need to deal with adversarial techniques is more explicit in emerging AI governance initiatives. For example, the Guidelines for secure AI system development (released in November 2023 by the UK National Cyber Security Centre, the US Cybersecurity and Infrastructure Security Agency and cybersecurity bodies of about 20 other countries) identify "adversarial machine learning" as a key challenge of AI security and recommend steps to address it. The US National Institute of Standards and Technology has published Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations.

Technical Solutions

Development of AI models that are robust against adversarial threat is a core area of AI research and development.

Preventing adversarial attacks on AI models is very challenging, particularly because the large size and "black box" nature of deep neural networks makes them inherently vulnerable to such attacks. There is extensive research literature on these issues, some of which is summarized in this December 2022 survey of adversarial attacks and defenses for image recognition.

The Adversarial Robustness Toolbox (ART) is an open-source Python library that "provides tools that enable developers and researchers to evaluate, defend, certify and verify Machine Learning models and applications against the adversarial threats of Evasion, Poisoning, Extraction, and Inference."

Resistance of leading LLMs to adversarial attacks appears to be gradually increasing. For example, in early 2025, Anthropic announced new techniques for its Claude LLMs that substantially reduce their vulnerability to jailbreaks.

Government Entities

Chinese regulators, including the Cyberspace Administration of China (CAC), already have significant authority for regulation of AI models, as mentioned above. EU and EU member state regulators will eventually have analogous authority under the EU AI Act; and the US has given some authority to various government agencies.

Government AI research institutions may also play a significant role in developing solutions for AI model mismatch.

Private Entities

The Coalition for Secure AI (CoSAI) is "an open ecosystem of AI and security experts from industry leading organizations dedicated to sharing best practices for secure AI deployment and collaborating on AI security research and product development", announced in July 2024 with participation by many leading AI companies.

Many private companies and other entities are working on improved AI models and applications using them. Detailing this work is beyond the scope of Saihub.info, but we may later add more detailed summaries of this work.