Social interaction-inspired AI alignment
Conjecture: Understanding the psychology of human social interaction will help with AI alignment.
Personally, I think this will likely form a large but relatively neglected component of alignment research.
Below: one potential past example and one potential future example.
Potential past example: boundaries
Boundaries seem to be a useful concept in both human psychology and AI safety.
Boundaries codify the autonomy of individuals. When the boundaries between individuals are preserved, unnecessary conflict between them is largely avoided. I think boundaries could help specify the safety in AI safety. (See claim 4 here, #9 here.)
My work in this area has mostly been writing distillations and convening researchers [CBW retrospective, MBW retrospective]. I became interested in this topic after I started thinking about the natural boundaries between humans; that interest in psychological boundaries led me to think about the boundaries/causal distance between agents in general. I reasoned that these ideas would be helpful for understanding the boundaries between humans and AIs, and, as it turned out, other researchers were already thinking about this.
[FAQ: How might boundaries be applied in AI safety? The most near-term way is as a formal spec for provably safe AI in davidad’s £59m ARIA programme.]
My interest in, and intuitive understanding of, boundaries in AI safety came from psychology. I don’t know whether this was the case for others interested in boundaries, but I do wonder. For example, how much was Andrew Critch’s thinking about boundaries between agents and boundaries in AI safety inspired by thinking about boundaries in human social interaction?
Potential future example: Goodness
Conjecture: Understanding ‘Goodness’ in human social interaction will help with AI alignment — potentially greatly.
Context: One way I like to think about what I want from ‘full alignment’ is in terms of two (somewhat-independent) properties:
I want Goodness to be present and unsafety to be absent.
(Is there a better term than “Goodness”?)
Also, notice that I’m not smooshing Goodness and Safety into one axis (one that might more commonly be called “Utility”). I think these can’t be cleanly placed on the same spectrum.
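A rough sketch of what I mean, with made-up notation (none of these symbols are formally defined): for a world-state $S$,

$$\text{FullyAligned}(S) \;\approx\; \text{Safe}(S) \,\wedge\, \text{Good}(S), \qquad \text{rather than} \qquad U(S) \ge \theta,$$

i.e., two separate conditions that both have to hold, rather than one scalar “Utility” $U$ crossing some threshold $\theta$.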
Recall that I see boundaries as a way to (mostly) specify Safety. However, even if you’re safe, that doesn’t mean Goodness is present. So boundaries don’t, on their own, specify Goodness. Open question: How can Goodness be specified?
At the same time, in my psychology thinking, I’ve been wondering: What causes joy, connection, and collaboration? What generates Goodness?
In my own life, once I learned to do boundaries well, I became much less concerned about social conflicts. And while I was glad to feel less anxious and more safe, I also wasn’t immediately and automatically connecting with other people / being actively happy / feeling Goodness.
What can I do to create Goodness?
I don’t expect what I’m about to say to convince anyone who isn’t already convinced, but I currently suspect that the most common missing factor for Goodness, in both psychology and AI alignment, is actually collective intelligence. I’ll leave the explanation to another post.
But if that’s right, I think the best feedback loop we have for understanding collective intelligence in general is to understand the collective intelligence that already exists in human social interactions.
Why might alignment be like social interaction?
As Michael Levin says, “all intelligence is collective intelligence”. There is no such thing as a significant intelligence that is truly centralized: every intelligence worth worrying about is made of smaller parts, and those parts must figure out how to coordinate and align with each other.
If that’s the case, I think alignment is really a question of: How do you align parts to a greater whole? How do you avoid internal conflicts (e.g., cancer)?
Social interaction psychology deals with the same questions.
Thanks to thinking partners Alex Zhu, Adam Goldstein, Ivan Vendrov, and David Spivak. Thanks to Stag Lynn for help editing.