Update: A new post makes this more concrete.
Hypothesis: Understanding human alignment is just like understanding AI alignment because the underlying principles (alignment) are the same.
Example: Boundaries
Boundaries provide autonomy. When individuals’ boundaries are preserved, unnecessary conflict between those individuals is minimized. Boundaries could help specify the safety in AI safety. [Claim #4 here, Claim #9 here.]
My work in this area has mostly been writing distillations and convening researchers.
Conceptual Boundaries Workshop (3 days)
Mathematical Boundaries Workshop (5 days)
I became interested in this after I started thinking about human boundaries, which required understanding the causal distance between agents in general. As it turned out, other researchers were already thinking about this too.[1]
How might boundaries be concretely applied to AI safety? One way is as a formal spec for provably safe AI in davidad’s £59m ARIA programme.
Conjecture: “Goodness”?
One way I like to think about what we want from ‘full alignment’ is as two somewhat-independent properties: Goodness and Safety.
(Also, notice that I haven’t smooshed Goodness and Safety into one axis. Usually when people do this they call it “Utility”. Goodness and Safety have different causes!)
And while boundaries and safety are nice, they don’t actively provide Goodness. So the question is: How can Goodness be specified?
Similarly, for human alignment: learning to do boundaries minimizes unnecessary social conflict, but… What causes joy? Connection? Collaboration? What causes Goodness?
I suspect the answer is whatever causes collective intelligence / synchronous social interaction.
Why might AI alignment be like human alignment?
As Michael Levin says, “all intelligence is collective intelligence”. Every intelligence worth hoping for or worrying about is made of smaller parts.
In which case, alignment itself can be defined as “How do smaller parts build bigger agents and avoid internal cancers?” Human alignment deals with the same questions:
How do groups of humans make a decision?
How do parallel predictions in the mind decide what’s right?
More here:
Thanks to Alex Zhu, Adam Goldstein, Ivan Vendrov, and David Spivak.
Were the other researchers inspired by social boundaries too? For example, I haven’t directly asked Andrew Critch how much his thinking about boundaries for agents and for AI safety was inspired by his thinking about social boundaries, but it does seem likely.