A better way to plan complex visual tasks | Massachusetts Institute of Technology News



MIT researchers have developed a generative artificial intelligence-driven approach that is more than twice as effective as existing techniques at planning long-term visual tasks such as robot navigation.

Their method uses a specialized vision-language model to recognize the scenario in an image and simulate the actions needed to achieve a goal. A second model then translates those simulations into a standard planning language, which is used to state the problem formally and refine the solution.

Ultimately, the system automatically generates a set of files that can be fed into traditional planning software, which calculates a plan to achieve the goal. This two-stage system produced plans with an average success rate of about 70 percent, outperforming the best baseline method, which achieved only about 30 percent.

Importantly, the system can solve new problems it has not encountered before, making it suitable for real-world environments where conditions can change unexpectedly.

“Our framework combines the benefits of vision-language models, such as the ability to understand images, with the powerful planning capabilities of formal solvers,” said Yilun Hao, a graduate student in MIT’s Department of Aeronautics and Astronautics (AeroAstro) and lead author of an open-access paper on the technique. “We can take a single image, reason about it through simulation, and turn it into a reliable long-term plan that is useful in many real-world applications.”

She is joined on the paper by Yongchao Chen, a graduate student in MIT’s Laboratory for Information and Decision Systems (LIDS); Chuchu Fan, an associate professor in AeroAstro and principal investigator at LIDS; and Yang Zhang, a research scientist at the MIT-IBM Watson AI Lab. The paper will be presented at the International Conference on Learning Representations.

Tackling visual tasks

In recent years, Hao and her colleagues have been exploring the use of generative AI models to perform complex inference and planning, often using large language models (LLMs) to process text input.

Many real-world planning problems, such as robot assembly and autonomous driving, involve visual input that an LLM alone cannot handle well. The researchers sought to expand into the visual domain by leveraging vision-language models (VLMs), powerful AI systems that can process both images and text.

However, VLMs struggle to understand the spatial relationships between objects in a scene, and they often fail when correct reasoning requires many steps. This makes it difficult to use VLMs for long-term planning.

Meanwhile, scientists have developed robust, formal planners that can generate effective long-term plans for complex situations. However, these software systems cannot process visual input and require specialized knowledge to encode the problem into a language that the solver understands.

Hao and her team built an automated planning system that incorporates the best of both approaches. The system, called VLM-Guided Formal Planning (VLMFP), utilizes two specialized VLMs that work together to transform visual planning problems into files that formal planning software can use directly.

The researchers first carefully trained a small model called SimVLM, which specializes in using natural language to describe scenarios in images and simulating sequences of actions in those scenarios. A much larger model called GenVLM then uses the descriptions from SimVLM to generate a set of initial files in a formal planning language known as Planning Domain Definition Language (PDDL).

These files can then be fed into a traditional PDDL solver, which computes a step-by-step plan to solve the task. GenVLM compares the solver’s results with the simulator’s results and iteratively adjusts the PDDL files where the two disagree.

“The generator and the simulator work together until they reach exactly the same result: an action sequence that, in simulation, achieves the goal,” Hao said.
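The describe/generate/solve/simulate loop outlined above can be sketched in a few lines of Python. Every function name below is a hypothetical stand-in (the paper does not publish this API), and the stubs return canned values purely to make the control flow runnable:

```python
# Illustrative sketch of the VLMFP refinement loop. The helper names are
# invented stand-ins, not real APIs: SimVLM describes scenes and replays
# plans, GenVLM drafts and revises PDDL, and a classical solver plans.

def sim_vlm_describe(image):
    # SimVLM stand-in: turn an image into a natural-language scene description.
    return f"scene described from {image}"

def gen_vlm_generate(description, feedback=None):
    # GenVLM stand-in: draft PDDL files, or revise them given feedback.
    revision = 0 if feedback is None else feedback["revision"] + 1
    return {"domain": "(define (domain demo) ...)",
            "problem": "(define (problem demo-1) ...)",
            "revision": revision}

def formal_solver(pddl):
    # Classical-planner stand-in: a real system would invoke a PDDL solver.
    # Toy behavior: the first PDDL draft yields an incomplete plan.
    return ["pick", "move", "place"] if pddl["revision"] >= 1 else ["pick"]

def sim_vlm_simulate(description, plan):
    # SimVLM stand-in: replay the plan step by step and check the goal.
    return "place" in plan  # toy goal test

def vlmfp_plan(image, max_rounds=5):
    description = sim_vlm_describe(image)
    pddl = gen_vlm_generate(description)
    for _ in range(max_rounds):
        plan = formal_solver(pddl)
        if sim_vlm_simulate(description, plan):
            return plan  # solver and simulator agree: goal reached
        pddl = gen_vlm_generate(description, feedback=pddl)  # revise the PDDL
    return None  # no agreed-upon plan within the round budget

print(vlmfp_plan("kitchen.png"))  # → ['pick', 'move', 'place']
```

The key design point the loop captures is that the solver never sees the image: it only sees PDDL, and the simulator’s disagreement is what drives each revision.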

Because GenVLM is a large generative AI model, it saw many examples of PDDL during training and learned how this formal language can encode a wide range of problems. This prior knowledge helps the model generate accurate PDDL files.

A flexible approach

VLMFP generates two separate PDDL files. The first, the domain file, defines the environment, the valid actions, and the rules those actions must follow. The second, the problem file, specifies the initial state and the goal for the particular problem at hand.

“One of the benefits of PDDL is that the domain file is the same for all instances within that environment. This makes our framework good at generalizing to unseen instances within the same domain,” Hao explains.
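To make the two-file split concrete, here is a toy PDDL pair in the spirit of what VLMFP emits. The domain and problem below are illustrative inventions, not files from the paper:

```pddl
;; Domain file: shared by every instance in this environment.
(define (domain table-world)
  (:requirements :strips)
  (:predicates (at ?obj ?loc) (holding ?obj) (hand-empty))
  (:action pick-up
    :parameters (?obj ?loc)
    :precondition (and (at ?obj ?loc) (hand-empty))
    :effect (and (holding ?obj) (not (at ?obj ?loc)) (not (hand-empty))))
  (:action put-down
    :parameters (?obj ?loc)
    :precondition (holding ?obj)
    :effect (and (at ?obj ?loc) (hand-empty) (not (holding ?obj)))))

;; Problem file: the initial state and goal for one specific scene.
(define (problem move-cup)
  (:domain table-world)
  (:objects cup table shelf)
  (:init (at cup table) (hand-empty))
  (:goal (at cup shelf)))
```

A solver given these two files would return the plan (pick-up cup table), (put-down cup shelf); a new scene in the same environment needs only a new problem file, which is exactly why the shared domain file supports generalization.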

To enable the system to generalize effectively, the researchers had to carefully design just enough training data for SimVLM that the model could understand each problem and goal without memorizing scenario patterns. In tests, SimVLM correctly described scenarios, simulated actions, and detected whether the goal was achieved in approximately 85 percent of the experiments.

Overall, the VLMFP framework achieved approximately a 60 percent success rate on six 2D planning tasks and over 80 percent on two 3D tasks involving multi-robot collaboration and robotic assembly. It also generated valid plans for more than 50 percent of never-before-seen scenarios, significantly outperforming baseline methods.

“Our framework generalizes even when the rules change across situations. This gives our system the flexibility to solve many different types of vision-based planning problems,” Hao adds.

In the future, the researchers hope to enable VLMFP to handle more complex scenarios and to explore ways to identify and reduce VLM-induced hallucinations.

“Longer term, generative AI models could act as agents, leveraging the right tools to solve more complex problems. But what does it mean to have the right tools, and how do we incorporate those tools? We still have a long way to go, but vision-based planning is an important piece of that puzzle,” says Hao.

This research was partially funded by the MIT-IBM Watson AI Lab.


