Saihub.info
Category: Existential
Potential Source of Harm: AI Domination
Updated November 13, 2024
Nature of Harm
The idea of existential risk to humanity (ranging from human extinction to less complete but still disastrous outcomes) plays a large role in current discussions of AI risk. These discussions have significant roots in popular books and resources including:
The Compendium, an evolving resource first published in October 2024 by a group including Conjecture CEO Connor Leahy, aims to provide "a coherent worldview explaining the race to AGI and extinction risks and what to do about them".
The specific way in which an AI could dominate or destroy humanity is, of course, unknown, especially because humans will work to reduce the probability of known dangerous scenarios.
It has become popular, particularly in the community around Silicon Valley, to forecast a P(doom) ("probability of doom") from AI. For example, American AI researcher Eliezer Yudkowsky has estimated a P(doom) as high as 95%, while other leading researchers (such as Google Brain and Coursera founder Andrew Ng and Meta Chief AI Scientist and Turing Award winner Yann LeCun) believe that the risk is exaggerated and/or being used by large technology companies to strengthen their market positions.
Regulatory and Governance Solutions
Regulatory and governance solutions for existential risk to humanity face the challenge that a rogue AI would presumably not be susceptible to regulation or governance. Therefore, regulatory and governance approaches have focused on preventing AI from attaining capabilities that would present a risk of harm. These include:
Technical Solutions
Alignment. There is extensive emerging technical work on addressing existential risk, largely through "alignment" of AI goals with human goals. Work on alignment typically assumes that the world is unlikely to prevent AI agents from acquiring the intelligence and means to exterminate humanity, making alignment of goals the most promising solution.
Leading AI model providers have developed early frameworks for assessing progress towards superintelligent AI and seeking to ensure alignment, including:
AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work (August 2024)
Frontier Safety Framework (May 2024)
A key technique for alignment of large language models is reinforcement learning from human feedback (RLHF), the use of which was a central element of the early success of ChatGPT. RLHF aims to align LLMs with human values by using human-labeled binary preferences to train reward models (RMs) and then using these RMs to fine-tune the base LLMs (see SEAL: Systematic Error Analysis for Value ALignment (August 2024)). The SEAL paper (a joint effort of Harvard and OpenAI) aims to assess the effectiveness of RLHF.
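To make the reward-model step concrete, below is a minimal, illustrative sketch: a scalar reward model is trained on binary preference pairs with a pairwise (Bradley-Terry) loss so that responses humans preferred receive higher scores. The toy tokenizer, the tiny EmbeddingBag scorer, and the example preference pairs are stand-ins invented for illustration; real RLHF pipelines build the reward model on a pretrained LLM and then use it to fine-tune the base model with reinforcement learning, a stage not shown here.

```python
# Sketch of the reward-model (RM) step in RLHF: learn a scalar reward from
# binary human preferences with the pairwise Bradley-Terry loss,
#   loss = -log sigmoid(r(chosen) - r(rejected)).
# The tokenizer and model below are toy stand-ins, not a production setup.

import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB_SIZE = 4096

def toy_tokenize(text: str) -> torch.Tensor:
    """Hash whitespace tokens into a fixed vocabulary (illustrative only)."""
    ids = [hash(w) % VOCAB_SIZE for w in text.lower().split()]
    return torch.tensor(ids or [0], dtype=torch.long)

class ToyRewardModel(nn.Module):
    """Maps a token sequence to a single scalar reward."""
    def __init__(self, vocab_size: int = VOCAB_SIZE, dim: int = 64):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, dim)  # mean-pools token embeddings
        self.head = nn.Linear(dim, 1)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        pooled = self.embed(token_ids.unsqueeze(0))  # shape (1, dim)
        return self.head(pooled).squeeze(-1)         # shape (1,): scalar reward

# Human-labeled preference pairs: (prompt, chosen response, rejected response).
preferences = [
    ("how do I treat a burn?",
     "Cool the burn under running water and seek medical advice if severe.",
     "Ignore it, burns always heal on their own."),
    ("write a polite reply to a customer complaint",
     "Thank you for raising this; we are sorry and will fix it promptly.",
     "That is not our problem."),
]

rm = ToyRewardModel()
optimizer = torch.optim.Adam(rm.parameters(), lr=1e-3)

for epoch in range(50):
    total_loss = 0.0
    for prompt, chosen, rejected in preferences:
        r_chosen = rm(toy_tokenize(prompt + " " + chosen))
        r_rejected = rm(toy_tokenize(prompt + " " + rejected))
        # Pairwise logistic loss: push the "chosen" reward above the "rejected" one.
        loss = -F.logsigmoid(r_chosen - r_rejected).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item()

print(f"final mean pairwise loss: {total_loss / len(preferences):.4f}")
```

In a full RLHF pipeline, the trained reward model would then score sampled model outputs during a reinforcement-learning fine-tuning stage, steering the base LLM toward responses that humans would prefer.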
Other resources on alignment include:
Work on AI alignment must necessarily focus on what it means to optimize AI for achieving human goals, which raises difficult philosophical questions. Such questions have been explored for centuries, including in relatively recent work like A Theory of Justice by John Rawls. More popularized (but ultimately likely less useful) formulations of this issue appear in Isaac Asimov's Three Laws of Robotics and in discussions of AI and the trolley problem.
Avoiding AI Persuasion. There is also substantial literature on the risk that advanced AI systems will use persuasive abilities to contribute to harms, including domination of humanity. Emerging research addresses how to mitigate harms from persuasive generative AI.
Government Entities
There are as yet few government bodies working substantially on AI existential risk. However, state-funded universities are conducting work in the field, including work on alignment such as the Berkeley MATS program.
Private Entities
A significant number of private entities are also working on existential risk and AI alignment, including OpenAI (whose work is mentioned above) and start-ups such as Aligned AI and Conjecture.