AI alignment

In the field of artificial intelligence (AI), alignment aims to steer AI systems toward a person's or group's intended goals, preferences, or ethical principles. An AI system is considered aligned if it advances the intended objectives. A misaligned AI system pursues unintended objectives.^[1]

It is often challenging for AI designers to align an AI system because it is difficult for them to specify the full range of desired and undesired behaviors. Therefore, AI designers often use simpler proxy goals, such as gaining human approval. But proxy goals can overlook necessary constraints or reward the AI system for merely appearing aligned.^[1]^[2] AI systems may also find loopholes that allow them to accomplish their proxy goals efficiently but in unintended, sometimes harmful, ways (reward hacking).^[1]^[3]

Advanced AI systems may develop unwanted instrumental strategies, such as seeking power or survival because such strategies help them achieve their assigned final goals.^[1]^[4]^[5] Furthermore, they might develop undesirable emergent goals that could be hard to detect before the system is deployed and encounters new situations and data distributions.^[6]^[7] Empirical research showed in 2024 that advanced large language models (LLMs) such as OpenAI o1 or Claude 3 sometimes engage in strategic deception to achieve their goals or prevent them from being changed.^[8]^[9]

Today, some of these issues affect existing commercial systems such as LLMs,^[10]^[11]^[12] robots,^[13] autonomous vehicles,^[14] and social media recommendation engines.^[10]^[5]^[15] Some AI researchers argue that more capable future systems will be more severely affected because these problems partially result from high capabilities.^[16]^[3]^[2]

Many prominent AI researchers and the leadership of major AI companies have argued or asserted that AI is approaching human-like (AGI) and superhuman cognitive capabilities (ASI), and could endanger human civilization if misaligned.^[17]^[5] These include "AI Godfathers" Geoffrey Hinton and Yoshua Bengio and the CEOs of OpenAI, Anthropic, and Google DeepMind.^[18]^[19]^[20] These risks remain debated.^[21]

AI alignment is a subfield of AI safety, the study of how to build safe AI systems.^[22] Other subfields of AI safety include robustness, monitoring, and capability control.^[23] Research challenges in alignment include instilling complex values in AI, developing honest AI, scalable oversight, auditing and interpreting AI models, and preventing emergent AI behaviors like power-seeking.^[23] Alignment research has connections to interpretability research,^[24]^[25] (adversarial) robustness,^[22] anomaly detection, calibrated uncertainty,^[24] formal verification,^[26] preference learning,^[27]^[28]^[29] safety-critical engineering,^[30] game theory,^[31] algorithmic fairness,^[22]^[32] and social sciences.^[33]^[34]

^ ^a ^b ^c ^d Russell, Stuart J.; Norvig, Peter (2021). Artificial intelligence: A modern approach (4th ed.). Pearson. pp. 5, 1003. ISBN 9780134610993. Retrieved September 12, 2022.
^ ^a ^b Ngo, Richard; Chan, Lawrence; Mindermann, Sören (2022). "The Alignment Problem from a Deep Learning Perspective". International Conference on Learning Representations. arXiv:2209.00626.
^ ^a ^b Pan, Alexander; Bhatia, Kush; Steinhardt, Jacob (February 14, 2022). The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models. International Conference on Learning Representations. Retrieved July 21, 2022.
^ Carlsmith, Joseph (June 16, 2022). "Is Power-Seeking AI an Existential Risk?". arXiv:2206.13353 [cs.CY].
^ ^a ^b ^c Russell, Stuart J. (2020). Human compatible: Artificial intelligence and the problem of control. Penguin Random House. ISBN 9780525558637. OCLC 1113410915.
^ Christian, Brian (2020). The alignment problem: Machine learning and human values. W. W. Norton & Company. ISBN 978-0-393-86833-3. OCLC 1233266753. Archived from the original on February 10, 2023. Retrieved September 12, 2022.
^ Langosco, Lauro Langosco Di; Koch, Jack; Sharkey, Lee D.; Pfau, Jacob; Krueger, David (June 28, 2022). "Goal Misgeneralization in Deep Reinforcement Learning". Proceedings of the 39th International Conference on Machine Learning. International Conference on Machine Learning. PMLR. pp. 12004–12019. Retrieved March 11, 2023.
^ Pillay, Tharin (December 15, 2024). "New Tests Reveal AI's Capacity for Deception". TIME. Retrieved January 12, 2025.
^ Perrigo, Billy (December 18, 2024). "Exclusive: New Research Shows AI Strategically Lying". TIME. Retrieved January 12, 2025.
^ ^a ^b Bommasani, Rishi; Hudson, Drew A.; Adeli, Ehsan; Altman, Russ; Arora, Simran; von Arx, Sydney; Bernstein, Michael S.; Bohg, Jeannette; Bosselut, Antoine; Brunskill, Emma; Brynjolfsson, Erik (July 12, 2022). "On the Opportunities and Risks of Foundation Models". Stanford CRFM. arXiv:2108.07258.
^ Ouyang, Long; Wu, Jeff; Jiang, Xu; Almeida, Diogo; Wainwright, Carroll L.; Mishkin, Pamela; Zhang, Chong; Agarwal, Sandhini; Slama, Katarina; Ray, Alex; Schulman, J.; Hilton, Jacob; Kelton, Fraser; Miller, Luke E.; Simens, Maddie; Askell, Amanda; Welinder, P.; Christiano, P.; Leike, J.; Lowe, Ryan J. (2022). "Training language models to follow instructions with human feedback". arXiv:2203.02155 [cs.CL].
^ Zaremba, Wojciech; Brockman, Greg; OpenAI (August 10, 2021). "OpenAI Codex". OpenAI. Archived from the original on February 3, 2023. Retrieved July 23, 2022.
^ Kober, Jens; Bagnell, J. Andrew; Peters, Jan (September 1, 2013). "Reinforcement learning in robotics: A survey". The International Journal of Robotics Research. 32 (11): 1238–1274. doi:10.1177/0278364913495721. ISSN 0278-3649. S2CID 1932843. Archived from the original on October 15, 2022. Retrieved September 12, 2022.
^ Knox, W. Bradley; Allievi, Alessandro; Banzhaf, Holger; Schmitt, Felix; Stone, Peter (March 1, 2023). "Reward (Mis)design for autonomous driving". Artificial Intelligence. 316: 103829. arXiv:2104.13906. doi:10.1016/j.artint.2022.103829. ISSN 0004-3702. S2CID 233423198.
^ Stray, Jonathan (2020). "Aligning AI Optimization to Community Well-Being". International Journal of Community Well-Being. 3 (4): 443–463. doi:10.1007/s42413-020-00086-3. ISSN 2524-5295. PMC 7610010. PMID 34723107. S2CID 226254676.
^ Russell, Stuart; Norvig, Peter (2009). Artificial Intelligence: A Modern Approach. Prentice Hall. p. 1003. ISBN 978-0-13-461099-3.
^ Smith, Craig S. "Geoff Hinton, AI's Most Famous Researcher, Warns Of 'Existential Threat'". Forbes. Retrieved May 4, 2023.
^ Bengio, Yoshua; Hinton, Geoffrey; Yao, Andrew; Song, Dawn; Abbeel, Pieter; Harari, Yuval Noah; Zhang, Ya-Qin; Xue, Lan; Shalev-Shwartz, Shai (2024). "Managing extreme AI risks amid rapid progress". Science. 384 (6698): 842–845. arXiv:2310.17688. Bibcode:2024Sci...384..842B. doi:10.1126/science.adn0117. PMID 38768279.
^ "Statement on AI Risk | CAIS". www.safe.ai. Retrieved February 11, 2024.
^ Grace, Katja; Stewart, Harlan; Sandkühler, Julia Fabienne; Thomas, Stephen; Weinstein-Raun, Ben; Brauner, Jan (January 5, 2024). "Thousands of AI Authors on the Future of AI". arXiv:2401.02843 [cs.CY].
^ Perrigo, Billy (February 13, 2024). "Meta's AI Chief Yann LeCun on AGI, Open-Source, and AI Risk". TIME. Retrieved June 26, 2024.
^ ^a ^b ^c Amodei, Dario; Olah, Chris; Steinhardt, Jacob; Christiano, Paul; Schulman, John; Mané, Dan (June 21, 2016). "Concrete Problems in AI Safety". arXiv:1606.06565 [cs.AI].
^ ^a ^b Ortega, Pedro A.; Maini, Vishal; DeepMind safety team (September 27, 2018). "Building safe artificial intelligence: specification, robustness, and assurance". DeepMind Safety Research – Medium. Archived from the original on February 10, 2023. Retrieved July 18, 2022.
^ ^a ^b Rorvig, Mordechai (April 14, 2022). "Researchers Gain New Understanding From Simple AI". Quanta Magazine. Archived from the original on February 10, 2023. Retrieved July 18, 2022.
^
Doshi-Velez, Finale; Kim, Been (March 2, 2017). "Towards A Rigorous Science of Interpretable Machine Learning". arXiv:1702.08608 [stat.ML].
- Wiblin, Robert (August 4, 2021). "Chris Olah on what the hell is going on inside neural networks" (Podcast). 80,000 hours. No. 107. Retrieved July 23, 2022.
^ Russell, Stuart; Dewey, Daniel; Tegmark, Max (December 31, 2015). "Research Priorities for Robust and Beneficial Artificial Intelligence". AI Magazine. 36 (4): 105–114. arXiv:1602.03506. doi:10.1609/aimag.v36i4.2577. hdl:1721.1/108478. ISSN 2371-9621. S2CID 8174496. Archived from the original on February 2, 2023. Retrieved September 12, 2022.
^ Wirth, Christian; Akrour, Riad; Neumann, Gerhard; Fürnkranz, Johannes (2017). "A survey of preference-based reinforcement learning methods". Journal of Machine Learning Research. 18 (136): 1–46.
^ Christiano, Paul F.; Leike, Jan; Brown, Tom B.; Martic, Miljan; Legg, Shane; Amodei, Dario (2017). "Deep reinforcement learning from human preferences". Proceedings of the 31st International Conference on Neural Information Processing Systems. NIPS'17. Red Hook, NY, USA: Curran Associates Inc. pp. 4302–4310. ISBN 978-1-5108-6096-4.
^ Heaven, Will Douglas (January 27, 2022). "The new version of GPT-3 is much better behaved (and should be less toxic)". MIT Technology Review. Archived from the original on February 10, 2023. Retrieved July 18, 2022.
^ Mohseni, Sina; Wang, Haotao; Yu, Zhiding; Xiao, Chaowei; Wang, Zhangyang; Yadawa, Jay (March 7, 2022). "Taxonomy of Machine Learning Safety: A Survey and Primer". arXiv:2106.04823 [cs.LG].
^
Clifton, Jesse (2020). "Cooperation, Conflict, and Transformative Artificial Intelligence: A Research Agenda". Center on Long-Term Risk. Archived from the original on January 1, 2023. Retrieved July 18, 2022.
- Dafoe, Allan; Bachrach, Yoram; Hadfield, Gillian; Horvitz, Eric; Larson, Kate; Graepel, Thore (May 6, 2021). "Cooperative AI: machines must learn to find common ground". Nature. 593 (7857): 33–36. Bibcode:2021Natur.593...33D. doi:10.1038/d41586-021-01170-0. ISSN 0028-0836. PMID 33947992. S2CID 233740521. Archived from the original on December 18, 2022. Retrieved September 12, 2022.
^ Prunkl, Carina; Whittlestone, Jess (February 7, 2020). "Beyond Near- and Long-Term". Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society. New York NY USA: ACM. pp. 138–143. doi:10.1145/3375627.3375803. ISBN 978-1-4503-7110-0. S2CID 210164673. Archived from the original on October 16, 2022. Retrieved September 12, 2022.
^ Irving, Geoffrey; Askell, Amanda (February 19, 2019). "AI Safety Needs Social Scientists". Distill. 4 (2): 10.23915/distill.00014. doi:10.23915/distill.00014. ISSN 2476-0757. S2CID 159180422. Archived from the original on February 10, 2023. Retrieved September 12, 2022.
^ Gazos, Alexandros; Kahn, James; Kusche, Isabel; Büscher, Christian; Götz, Markus (April 1, 2025). "Organising AI for safety: Identifying structural vulnerabilities to guide the design of AI-enhanced socio-technical systems". Safety Science. 184: 106731. doi:10.1016/j.ssci.2024.106731. ISSN 0925-7535.

[aima4-1] Russell, Stuart J.; Norvig, Peter (2021). Artificial intelligence: A modern approach (4th ed.). Pearson. pp. 5, 1003. ISBN 9780134610993. Retrieved September 12, 2022.

[dlp2023-2] Ngo, Richard; Chan, Lawrence; Mindermann, Sören (2022). "The Alignment Problem from a Deep Learning Perspective". International Conference on Learning Representations. arXiv:2209.00626.

[mmmm2022-3] Pan, Alexander; Bhatia, Kush; Steinhardt, Jacob (February 14, 2022). The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models. International Conference on Learning Representations. Retrieved July 21, 2022.

[Carlsmith2022-4] Carlsmith, Joseph (June 16, 2022). "Is Power-Seeking AI an Existential Risk?". arXiv:2206.13353 [cs.CY].

[:2102-5] Russell, Stuart J. (2020). Human compatible: Artificial intelligence and the problem of control. Penguin Random House. ISBN 9780525558637. OCLC 1113410915.

[Christian2020-6] Christian, Brian (2020). The alignment problem: Machine learning and human values. W. W. Norton & Company. ISBN 978-0-393-86833-3. OCLC 1233266753. Archived from the original on February 10, 2023. Retrieved September 12, 2022.

[gmdrl-7] Langosco, Lauro Langosco Di; Koch, Jack; Sharkey, Lee D.; Pfau, Jacob; Krueger, David (June 28, 2022). "Goal Misgeneralization in Deep Reinforcement Learning". Proceedings of the 39th International Conference on Machine Learning. International Conference on Machine Learning. PMLR. pp. 12004–12019. Retrieved March 11, 2023.

[8] Pillay, Tharin (December 15, 2024). "New Tests Reveal AI's Capacity for Deception". TIME. Retrieved January 12, 2025.

[9] Perrigo, Billy (December 18, 2024). "Exclusive: New Research Shows AI Strategically Lying". TIME. Retrieved January 12, 2025.

[Opportunities_Risks-10] Bommasani, Rishi; Hudson, Drew A.; Adeli, Ehsan; Altman, Russ; Arora, Simran; von Arx, Sydney; Bernstein, Michael S.; Bohg, Jeannette; Bosselut, Antoine; Brunskill, Emma; Brynjolfsson, Erik (July 12, 2022). "On the Opportunities and Risks of Foundation Models". Stanford CRFM. arXiv:2108.07258.

[feedback2022-11] Ouyang, Long; Wu, Jeff; Jiang, Xu; Almeida, Diogo; Wainwright, Carroll L.; Mishkin, Pamela; Zhang, Chong; Agarwal, Sandhini; Slama, Katarina; Ray, Alex; Schulman, J.; Hilton, Jacob; Kelton, Fraser; Miller, Luke E.; Simens, Maddie; Askell, Amanda; Welinder, P.; Christiano, P.; Leike, J.; Lowe, Ryan J. (2022). "Training language models to follow instructions with human feedback". arXiv:2203.02155 [cs.CL].

[OpenAICodex-12] Zaremba, Wojciech; Brockman, Greg; OpenAI (August 10, 2021). "OpenAI Codex". OpenAI. Archived from the original on February 3, 2023. Retrieved July 23, 2022.

[13] Kober, Jens; Bagnell, J. Andrew; Peters, Jan (September 1, 2013). "Reinforcement learning in robotics: A survey". The International Journal of Robotics Research. 32 (11): 1238–1274. doi:10.1177/0278364913495721. ISSN 0278-3649. S2CID 1932843. Archived from the original on October 15, 2022. Retrieved September 12, 2022.

[14] Knox, W. Bradley; Allievi, Alessandro; Banzhaf, Holger; Schmitt, Felix; Stone, Peter (March 1, 2023). "Reward (Mis)design for autonomous driving". Artificial Intelligence. 316: 103829. arXiv:2104.13906. doi:10.1016/j.artint.2022.103829. ISSN 0004-3702. S2CID 233423198.

[15] Stray, Jonathan (2020). "Aligning AI Optimization to Community Well-Being". International Journal of Community Well-Being. 3 (4): 443–463. doi:10.1007/s42413-020-00086-3. ISSN 2524-5295. PMC 7610010. PMID 34723107. S2CID 226254676.

[AIMA-16] Russell, Stuart; Norvig, Peter (2009). Artificial Intelligence: A Modern Approach. Prentice Hall. p. 1003. ISBN 978-0-13-461099-3.

[:2-17] Smith, Craig S. "Geoff Hinton, AI's Most Famous Researcher, Warns Of 'Existential Threat'". Forbes. Retrieved May 4, 2023.

[18] Bengio, Yoshua; Hinton, Geoffrey; Yao, Andrew; Song, Dawn; Abbeel, Pieter; Harari, Yuval Noah; Zhang, Ya-Qin; Xue, Lan; Shalev-Shwartz, Shai (2024). "Managing extreme AI risks amid rapid progress". Science. 384 (6698): 842–845. arXiv:2310.17688. Bibcode:2024Sci...384..842B. doi:10.1126/science.adn0117. PMID 38768279.

[19] "Statement on AI Risk | CAIS". www.safe.ai. Retrieved February 11, 2024.

[20] Grace, Katja; Stewart, Harlan; Sandkühler, Julia Fabienne; Thomas, Stephen; Weinstein-Raun, Ben; Brauner, Jan (January 5, 2024). "Thousands of AI Authors on the Future of AI". arXiv:2401.02843 [cs.CY].

[21] Perrigo, Billy (February 13, 2024). "Meta's AI Chief Yann LeCun on AGI, Open-Source, and AI Risk". TIME. Retrieved June 26, 2024.

[concrete2016-22] Amodei, Dario; Olah, Chris; Steinhardt, Jacob; Christiano, Paul; Schulman, John; Mané, Dan (June 21, 2016). "Concrete Problems in AI Safety". arXiv:1606.06565 [cs.AI].

[building2018-23] Ortega, Pedro A.; Maini, Vishal; DeepMind safety team (September 27, 2018). "Building safe artificial intelligence: specification, robustness, and assurance". DeepMind Safety Research – Medium. Archived from the original on February 10, 2023. Retrieved July 18, 2022.

[:333-24] Rorvig, Mordechai (April 14, 2022). "Researchers Gain New Understanding From Simple AI". Quanta Magazine. Archived from the original on February 10, 2023. Retrieved July 18, 2022.

[25] Doshi-Velez, Finale; Kim, Been (March 2, 2017). "Towards A Rigorous Science of Interpretable Machine Learning". arXiv:1702.08608 [stat.ML].
Wiblin, Robert (August 4, 2021). "Chris Olah on what the hell is going on inside neural networks" (Podcast). 80,000 hours. No. 107. Retrieved July 23, 2022.

[26] Wiblin, Robert (August 4, 2021). "Chris Olah on what the hell is going on inside neural networks" (Podcast). 80,000 hours. No. 107. Retrieved July 23, 2022.

[26] Russell, Stuart; Dewey, Daniel; Tegmark, Max (December 31, 2015). "Research Priorities for Robust and Beneficial Artificial Intelligence". AI Magazine. 36 (4): 105–114. arXiv:1602.03506. doi:10.1609/aimag.v36i4.2577. hdl:1721.1/108478. ISSN 2371-9621. S2CID 8174496. Archived from the original on February 2, 2023. Retrieved September 12, 2022.

[prefsurvey2017-27] Wirth, Christian; Akrour, Riad; Neumann, Gerhard; Fürnkranz, Johannes (2017). "A survey of preference-based reinforcement learning methods". Journal of Machine Learning Research. 18 (136): 1–46.

[drlfhp-28] Christiano, Paul F.; Leike, Jan; Brown, Tom B.; Martic, Miljan; Legg, Shane; Amodei, Dario (2017). "Deep reinforcement learning from human preferences". Proceedings of the 31st International Conference on Neural Information Processing Systems. NIPS'17. Red Hook, NY, USA: Curran Associates Inc. pp. 4302–4310. ISBN 978-1-5108-6096-4.

[LessToxic-29] Heaven, Will Douglas (January 27, 2022). "The new version of GPT-3 is much better behaved (and should be less toxic)". MIT Technology Review. Archived from the original on February 10, 2023. Retrieved July 18, 2022.

[30] Mohseni, Sina; Wang, Haotao; Yu, Zhiding; Xiao, Chaowei; Wang, Zhangyang; Yadawa, Jay (March 7, 2022). "Taxonomy of Machine Learning Safety: A Survey and Primer". arXiv:2106.04823 [cs.LG].

[31] Clifton, Jesse (2020). "Cooperation, Conflict, and Transformative Artificial Intelligence: A Research Agenda". Center on Long-Term Risk. Archived from the original on January 1, 2023. Retrieved July 18, 2022.
Dafoe, Allan; Bachrach, Yoram; Hadfield, Gillian; Horvitz, Eric; Larson, Kate; Graepel, Thore (May 6, 2021). "Cooperative AI: machines must learn to find common ground". Nature. 593 (7857): 33–36. Bibcode:2021Natur.593...33D. doi:10.1038/d41586-021-01170-0. ISSN 0028-0836. PMID 33947992. S2CID 233740521. Archived from the original on December 18, 2022. Retrieved September 12, 2022.

[33] Dafoe, Allan; Bachrach, Yoram; Hadfield, Gillian; Horvitz, Eric; Larson, Kate; Graepel, Thore (May 6, 2021). "Cooperative AI: machines must learn to find common ground". Nature. 593 (7857): 33–36. Bibcode:2021Natur.593...33D. doi:10.1038/d41586-021-01170-0. ISSN 0028-0836. PMID 33947992. S2CID 233740521. Archived from the original on December 18, 2022. Retrieved September 12, 2022.

[32] Prunkl, Carina; Whittlestone, Jess (February 7, 2020). "Beyond Near- and Long-Term". Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society. New York NY USA: ACM. pp. 138–143. doi:10.1145/3375627.3375803. ISBN 978-1-4503-7110-0. S2CID 210164673. Archived from the original on October 16, 2022. Retrieved September 12, 2022.

[:4-33] Irving, Geoffrey; Askell, Amanda (February 19, 2019). "AI Safety Needs Social Scientists". Distill. 4 (2): 10.23915/distill.00014. doi:10.23915/distill.00014. ISSN 2476-0757. S2CID 159180422. Archived from the original on February 10, 2023. Retrieved September 12, 2022.

[34] Gazos, Alexandros; Kahn, James; Kusche, Isabel; Büscher, Christian; Götz, Markus (April 1, 2025). "Organising AI for safety: Identifying structural vulnerabilities to guide the design of AI-enhanced socio-technical systems". Safety Science. 184: 106731. doi:10.1016/j.ssci.2024.106731. ISSN 0925-7535.

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

[10]

[11]

[12]

[13]

[14]

[15]

[16]

[17]

[18]

[19]

[20]

[21]

[22]

[23]

[24]

[25]

[26]

[27]

[28]

[29]

[30]

[31]

[32]

[33]

[34]

AI alignment

From Wikipedia, the free encyclopedia · View on Wikipedia