A path to safe AGI through Constructability
In April this year, a post landed on LessWrong / the AI Alignment Forum: Constructability: Plainly-coded AGIs may be feasible in the near future, by Épiphanie Gédéon and Charbel-Raphaël Segerie. Today’s models are almost ubiquitously powered by deep learning architectures: black boxes whose predictive performance is improved by backpropagation over model weights. (Of course, those researching Interpretability (Interpretists?) are keen on removing that veil.) This work makes the case for what is, in my own words, a more inherently interpretable architecture, achieved through “Constructability”1.
The heart of this approach is to compose AI systems from simpler narrow-AI or non-AI elements, stitched together with plain code. Training/improvement of models happens through explicit plain-code changes, à la pull requests, which would be reviewed, and naturally also written, by LLM-esque developers.
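To make that concrete, here is a minimal sketch of what a small “constructed” system could look like: each component is ordinary, inspectable code with a narrow contract, and the system is just their composition. This is my own toy illustration (a hypothetical ticket-triage pipeline), not an example from the post.

```python
# Toy sketch of a "constructed" system: small, transparent components stitched
# together in plain code. Every name here is hypothetical, for illustration.
from dataclasses import dataclass


@dataclass
class Ticket:
    text: str
    language: str | None = None
    category: str | None = None
    priority: int | None = None


def detect_language(ticket: Ticket) -> Ticket:
    """Trivially inspectable heuristic; a smarter component could swap in later."""
    ticket.language = "fr" if " le " in f" {ticket.text.lower()} " else "en"
    return ticket


def categorize(ticket: Ticket) -> Ticket:
    """Plain-code rules instead of learned weights; a PR can refine them."""
    keywords = {"refund": "billing", "crash": "bug", "password": "account"}
    ticket.category = next(
        (cat for kw, cat in keywords.items() if kw in ticket.text.lower()),
        "other",
    )
    return ticket


def prioritize(ticket: Ticket) -> Ticket:
    """An explicit, auditable policy rather than an opaque score."""
    ticket.priority = 1 if ticket.category == "bug" else 3
    return ticket


def pipeline(ticket: Ticket) -> Ticket:
    # The "stitching" is ordinary function composition: nothing hidden.
    for step in (detect_language, categorize, prioritize):
        ticket = step(ticket)
    return ticket


if __name__ == "__main__":
    print(pipeline(Ticket("The app crashes when I reset my password")))
```

An improvement to the system, in this framing, is just a diff to one of these functions: exactly the kind of change that arrives as a reviewable pull request.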
Many things in life embody this constructivist/component framework: cars, computers, printers, humans, etc. Even societies are formed of components working cooperatively (on paper) to orchestrate a functioning superordinate entity. What’s nice about these systems:
Components are transparent and understandable.
Issues and malignancies can be pinpointed and edited more easily.
With code reviews and unit tests of isolated modules or macro operations, these systems are more observable and controllable (see the toy tests after this list).
Version control through pull requests allows sophisticated model auditing and rollback.
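Sticking with the toy pipeline sketched above, the observability and version-control points cash out as ordinary software practice: each component gets its own unit tests, and every behavioral change arrives as a reviewable, revertible diff. A pytest-style illustration, again with hypothetical names (assuming the earlier sketch lives in a ticket_pipeline module):

```python
# Toy pytest-style tests over isolated components from the earlier sketch.
# The module name `ticket_pipeline` is hypothetical. Because each component is
# plain code with a narrow contract, a failing test points straight at the
# module (and the pull request) responsible for the change in behavior.
from ticket_pipeline import Ticket, categorize, prioritize


def test_categorize_flags_crashes_as_bugs():
    ticket = categorize(Ticket("The app crashes on startup"))
    assert ticket.category == "bug"


def test_bugs_get_top_priority():
    ticket = prioritize(Ticket("anything", category="bug"))
    assert ticket.priority == 1
```

None of this is exotic; it is just standard software engineering applied to the model itself.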
A tantalizing shortlist of features, to be sure, casting a pretty substantial shadow over the necessity of interpretability research in such a paradigm.
But before we get carried away, it’s worth poking at these claims a bit.
As the authors describe, AI-in-the-loop is fully intended to be a part of this, at basically every stage of the game: from making PRs with improvements, to generating the unit tests, to reviewing the PRs, etc. There’s a large part of me that feels like this is just kicking the alignment can down the road:
What if you want the LLM-of-the-day in your application? Or any other black box component? You necessarily adopt all the inscrutabilities and potential troubles it holds (though some gains could be had by intermediating its usage).
Can this framework scale up to take the lead in AGI development?
The authors refer to this latter scenario as the “Liberation path”: the point at which a Construction AI can white-box plain-code a sophisticated enough AI (e.g., GPT-7) to basically bootstrap the recursive process, since the newly coded AI is also a Construction AI. It’s fine conceptually, but the feasibility needs some ironing out and research. The following describes part of the crux of all this:
Whether it is even possible to code a system that beats AlphaZero or GPT-2 with plain-code or hybrid system, as opposed to systems that are fully connected like transformers, seems like a central crux that we name “non-connectionism scalability”: How necessary is it for models to be connectionists for their performance to be general and human-like, as opposed to something more modular and explainable.
I would agree this is crucial. Can ~* The Magic *~ of emergent capabilities from connectionism be supplanted? I’m a little skeptical. Obviously, I don’t have the goods on whether emergence is real or not for deep networks (though something compels me toward that notion; I think I’m a Connectionist!), but one advantage is that we can pack a whole lot of predictive power and “flexibility” into deep networks through sheer scaling of neurons and data. It’s daunting to try to imagine what this would look like as an array of plain-coded components with individual purposes.
On the other hand, it’s like Mech-interp in reverse. We know a priori what each component of the system does and how the components are connected, instead of having to reverse-engineer a massive black-box network. And so, though it may be a longer route to competitive capabilities, there is a lot to like from an interpretability/transparency standpoint.
In closing, I recommend checking out the post. I was pretty skeptical on my first read, but sitting with it, I think it has some things going for it, especially in terms of interpretability, transparency, and safety. I’m less certain of its comparative capabilities ceiling, though: first, how feasible is it to identify and construct enough components, and second, even approaching optimal construction, can it match the possibly emergent capabilities of scaled connectionism?
I think “don’t let perfection stand in the way of good” applies here, though, and that this approach should stay on the table. Hybrid systems that use powerful(-ish) deep nets for constrained tasks while compartmentalizing as much of the system as possible could also have a lot to offer here, combining the best of both worlds.
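To gesture at what that hybrid might look like in miniature: confine the black box to one narrow, well-specified subtask, and keep the routing, validation, and fallback logic around it in plain, auditable code. The sketch below is my own illustration under those assumptions; black_box_classify is a stand-in for whatever opaque model you would plug in, not anything proposed in the post.

```python
# Hybrid sketch: a deep-net black box handles one constrained subtask, while
# plain code owns the interface, validates the output, and falls back safely.
# `black_box_classify` is a hypothetical stand-in for an opaque model call.
from typing import Callable

ALLOWED_CATEGORIES = {"billing", "bug", "account", "other"}


def classify_with_guardrails(
    text: str,
    black_box_classify: Callable[[str], str],
) -> str:
    """Intermediate the black box: constrain its input, whitelist its output."""
    # Plain-code pre-processing: the opaque model only ever sees a bounded input.
    snippet = text[:500]

    try:
        label = black_box_classify(snippet).strip().lower()
    except Exception:
        return "other"  # auditable fallback if the opaque component misbehaves

    # Plain-code post-processing: anything outside the contract is rejected.
    return label if label in ALLOWED_CATEGORIES else "other"
```

The point of the design is that the opaque part is contractually boxed in: plain code decides what it sees and what, if anything, of its output is allowed through.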
It is worth bearing in mind that at least one of the authors has been publicly doubtful about interpretability.