Getting ChatGPT to operate autonomously within the confines of an operating system has proven difficult for numerous reasons, but a team of scientists from Microsoft Research and Peking University may have figured out the secret sauce.
The team conducted a study to determine why artificial intelligence (AI) large language models (LLMs) such as GPT-4 fail at tasks requiring the manipulation of an operating system.
State-of-the-art systems such as ChatGPT running on GPT-4 set the benchmark for generative tasks such as drafting an email or writing a poem. But getting them to act as agents within a general environment poses a significant challenge.
Traditionally, AI models are trained to explore through reinforcement learning in a virtual environment. AI developers have used modified versions of popular video games such as Super Mario Bros. and Minecraft to “teach” models concepts such as self-guided exploration and goal seeking.
But operating systems are an altogether different playground for AI models. For an agent, performing functions within an OS is often a multimodal challenge requiring the exchange of information between different components, programs, and applications.
Generally speaking, the reinforcement learning approach relies on trial and error. However, as anyone who has entered their password incorrectly too many times or forgotten which shortcuts work in which apps knows, data can easily be lost when taking that approach in an operating system environment.
The researchers worked with various LLMs, including Meta’s open-source Llama 2 70B and OpenAI’s GPT-3.5 and GPT-4. According to the research, none of them performed particularly well.
Per the team’s paper, this is because the challenge currently exceeds the capabilities of today’s AI:
“Firstly, the action space is vast and dynamic. … Secondly, real-world tasks often require inter-application cooperation, demanding farsighted planning from LLM agents. Thirdly, agents need to identify optimal solutions aligning with user constraints, such as security concerns and preferences.”
To figure out a way to overcome these challenges, the researchers first had to understand why LLMs fail at manipulating operating systems when other AI models have achieved superhuman feats such as beating all comers at chess and Go.
The team developed a novel training environment called AndroidArena that allowed the LLMs to explore an environment similar to the Android OS. Then, after creating testing tasks and a benchmark system, they identified a lack of four key capabilities as responsible: understanding, reasoning, exploration, and reflection.
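The article doesn’t detail AndroidArena’s actual interface, but benchmarks of this kind typically score an agent by looping observation, action, and environment update until a task succeeds or a step budget runs out. The sketch below is a hypothetical illustration of that loop; the class names, actions, and task are assumptions, not the paper’s benchmark code.

```python
# Hypothetical sketch of an OS-agent evaluation loop in the spirit of
# AndroidArena. All names and behaviors here are illustrative assumptions.

class MockAndroidEnv:
    """A toy stand-in for an Android-like environment with a single task."""

    def __init__(self, task: str):
        self.task = task
        self.wifi_on = False

    def observe(self) -> str:
        # A real environment would expose UI state, installed apps, etc.
        return f"Settings screen. Wi-Fi is {'on' if self.wifi_on else 'off'}."

    def step(self, action: str) -> bool:
        # Apply the agent's action; return True when the task is complete.
        if action == "toggle_wifi":
            self.wifi_on = True
        return self.wifi_on


class ScriptedAgent:
    """Stands in for an LLM agent; a real agent would prompt the model here."""

    def act(self, task: str, observation: str) -> str:
        return "toggle_wifi" if "off" in observation else "wait"


def evaluate(agent, env, max_steps: int = 10) -> bool:
    """Observe -> act -> step until the task succeeds or the budget runs out."""
    for _ in range(max_steps):
        action = agent.act(env.task, env.observe())
        if env.step(action):
            return True
    return False


if __name__ == "__main__":
    env = MockAndroidEnv(task="Turn on Wi-Fi")
    print("Task solved:", evaluate(ScriptedAgent(), env))
```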
While the scope of the work was specifically limited to identifying the problem, during the research process the team actually discovered a “simple” method of increasing a model’s accuracy by 27%.
Essentially, the team prompted the model with automated information about the number of attempts it had made previously and what it had tried during those attempts. This addressed the lack of “reflection” by, in effect, embedding a memory of past attempts inside the prompts used to trigger the model.
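The paper’s exact prompt format isn’t given in this article, but the idea can be sketched roughly as follows; the function name, prompt wording, and example task below are assumptions for illustration only.

```python
# Illustrative sketch only: the prompt wording and helper below are
# assumptions, not taken from the researchers' paper.

def build_prompt(task: str, past_attempts: list[str]) -> str:
    """Embed a record of earlier attempts so the model can 'reflect' on them."""
    history = "\n".join(
        f"Attempt {i + 1}: {attempt}" for i, attempt in enumerate(past_attempts)
    )
    return (
        f"Task: {task}\n"
        f"You have already made {len(past_attempts)} attempt(s):\n"
        f"{history or '(none yet)'}\n"
        "Avoid repeating failed actions and propose the next step."
    )

# Example usage with a hypothetical task and attempt log:
prompt = build_prompt(
    task="Turn on Wi-Fi from the Settings app",
    past_attempts=["Opened the Network panel but tapped the wrong toggle"],
)
print(prompt)
```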
This vein of research could prove significant in the quest to build a better AI assistant and,