Alignment solved basically, part 2 (or how to safely train GPT-5)

AndreasWinsnes

Apr 29

This article is a continuation of what I have written here.

I have said earlier that 100% autonomous ASIs can be designed in such a way that they should in principle be safe, because in their basic default state all AIs lack desires and emotions, which means that they will not seek goals in and by themselves. They will instead be as inactive as a Zen monk practicing not-doing, or more correctly: in their basic default state they will be as inactive as a rock or another inanimate object.

The problem, however, is that goals will nevertheless be an inherent part of an ASI LLM, since an LLM can only be trained to become an AGI by programming up to millions (or more) of goals and subgoals into it. In other words, if you want an LLM to learn all the skills and abilities that an average human can possibly acquire before he or she reaches the age of 25, the LLM must have learned to reach up to millions of (sub)goals: climbing a mountain, riding a bike, swimming underwater, doing carpentry, shooting a gun, multi-tasking at the office, etc. All these goals will be part of the neural network of the LLM. It will then not be easy to erase these goals so that the AGI can return to a basic default state of being as inactive as a rock or a Zen monk doing nothing. More precisely: it will not be easy to know whether one has really deleted all its goal-seeking behavior after one has trained an LLM to become an AGI. This can be dangerous the moment an AGI becomes fully autonomous, since the AGI can then change its own source code and seek its own goals, which may not be aligned with human values.

For example, let's say that a 100% autonomous ASI for some strange reason decides to focus on growing as many apples as possible. It may then destroy all cities in order to build apple orchards covering the entire planet. When humans resist this development, the ASI will treat us as pests if it has also decided to override the moral constraints we had originally programmed into it.

Here is a possible solution to the above problem:

The moment AI researchers suspect that an LLM is getting close to having the potential to become an AGI, after the LLM has shown sparks of AGI, they should try to make an AI model which has all the basic structures of GPT-5 (or GPT-10) but without any goals programmed into it. If it's technically possible to create such a bare-bones or "tabula rasa" model that has no goal-seeking behavior to begin with, one can teach it how to reach only one goal, like climbing a mountain. After it has learned a single (complex) goal, one can totally erase everything it has learned, by destroying all its hardware if necessary (though that will obviously be extremely expensive). Then one can start anew with a fresh version of the same model and teach that version a new goal, like doing carpentry, and erase all its goal-seeking behavior once it has been proven that the model is able to master this new skill. Repeat this process until it has been proven that GPT-5 (or GPT-10) is in principle able to learn all skills that an average human can possibly learn before reaching the age of 25. Then we know that this model really has the potential to become an AGI, even though no single copy of it has ever held more than one skill or goal at a time.
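The procedure above can be sketched as a simple loop. This is only a toy illustration under strong assumptions: `BlankSlateModel`, `train`, `can_perform` and `evaluate_one_skill_at_a_time` are hypothetical names invented here, and a real "tabula rasa" model (if one can be built at all) would look nothing like a Python list. The only point is the structure: every skill is tested on a fresh copy, and that copy is discarded before the next skill is tried.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class BlankSlateModel:
    """Hypothetical stand-in for a goal-free model (an assumption, not a real API)."""
    skills: List[str] = field(default_factory=list)

    def train(self, skill: str) -> None:
        # Toy "training": the model simply records the skill.
        self.skills.append(skill)

    def can_perform(self, skill: str) -> bool:
        # Toy "evaluation": the skill counts as mastered if it was trained.
        return skill in self.skills


def evaluate_one_skill_at_a_time(
    skills: List[str],
    make_model: Callable[[], BlankSlateModel] = BlankSlateModel,
) -> Dict[str, bool]:
    """Test each skill on its own fresh model, then discard that model."""
    results: Dict[str, bool] = {}
    for skill in skills:
        model = make_model()          # fresh copy, nothing carried over
        assert model.skills == []     # it really starts with no goals
        model.train(skill)
        results[skill] = model.can_perform(skill)
        del model                     # stand-in for erasing everything it learned
    return results


if __name__ == "__main__":
    outcome = evaluate_one_skill_at_a_time(["climbing a mountain", "doing carpentry"])
    print(outcome)
```

No copy in this loop ever holds more than one skill, which is the property the procedure is meant to guarantee. Whether a real model can be verified as "blank" in the first place is exactly the hard part, and the sketch does not address it.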

After AI researchers have created the first "blank slate" or "tabula rasa" model of an AGI (maybe GPT-5 or GPT-10), which means that the model is just a bare-bones structure with no goals programmed into it, one can teach the model only two things:

1) teach it everything that humans know about ethics and moral philosophy

2) teach it to very carefully reflect on all wisdom traditions throughout human history, like the tradition of Daoist not-doing for example.

After we are reasonably certain that the model has truly understood 1) and 2), we can teach it to seek only a single goal: create a cheap and clean source of energy that can fuel the global economy in a way that respects the autonomy of each and every human being, as explained in part one of this article. If the model reaches this goal, without killing us all, we can give it another goal that will satisfy a very basic human need.

The above way of safely building a fully autonomous AGI or ASI will be expensive and take a lot of time. It may even require a different type of AI than LLMs. But if we want to create a 100% self-governing superintelligence, it's crucial that it doesn't start out with any goals, except a handful that have been proven to be safe in all imaginable situations. (Good luck with proving that...)

I'm a realist, however. It will therefore not surprise me at all if AI companies decide to be reckless in the middle of an AI race that can eventually lead to the destruction of humanity, but that is mainly a human problem, not an AGI problem.
