Survey on Multimodal Model-driven Embodied AI Robots

The quest for artificial intelligence that can perceive, reason, and act in the physical world as seamlessly as humans do has been a long-standing ambition. While traditional AI has excelled in constrained digital domains, bridging the gap to the messy, open-ended reality requires a paradigm shift. This shift is embodied intelligence, where an agent’s intelligence is grounded in its physical interactions with the environment. The recent, explosive progress in multimodal models, particularly Vision-Language Models (VLMs) and Large Language Models (LLMs), has injected unprecedented momentum into this field. These models, trained on vast, diverse datasets, possess remarkable capabilities in understanding, reasoning, and generating content across modalities. This article reviews how these powerful multimodal models are driving advances in embodied AI robots, transforming them from pre-programmed machines into adaptive, task-oriented agents. We will structure our exploration around the core cognitive loop of an embodied agent: Environmental Perception & Understanding and Task Planning & Execution.

The fundamental architecture of an embodied AI robot can be conceptualized as a continuous perception-planning-action cycle. This architecture is centered on enabling the robot to make sense of its surroundings and act purposefully within them.

This cycle involves three core modules: Perception, Planning/Reasoning, and Execution/Control. The Perception module is where multimodal models shine. It fuses raw sensory data—primarily visual (from cameras) and linguistic (from commands or transcribed speech)—into a coherent, actionable representation of the world. The Planning/Reasoning module uses this representation, along with task instructions and potentially an internal memory or knowledge base, to decompose goals, reason about sequences, and generate a plan. The Execution/Control module translates the high-level plan into low-level motor commands or actions that the robot’s actuators can perform. The entire process is iterative, with the consequences of actions feeding back into new perceptual inputs, allowing the embodied AI robot to adapt and recover from errors.
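The cycle described above can be made concrete with a toy example. The following is a minimal sketch, not a real robot stack: the one-dimensional "environment", the function names `perceive`, `plan`, `execute`, and `run_episode`, and the action strings are all hypothetical and chosen only to show how feedback from each action re-enters perception.

```python
def perceive(env):
    """Perception: turn raw (toy) sensor readings into a task-relevant state."""
    return {"offset": env["cup"] - env["gripper"], "holding": env["holding"]}


def plan(state, instruction):
    """Planning: return the remaining high-level steps.

    The toy planner ignores the instruction text; a real system would
    condition on it.
    """
    if state["holding"]:
        return []  # goal reached, nothing left to do
    if state["offset"] > 0:
        return ["move_right"] * state["offset"] + ["grasp"]
    if state["offset"] < 0:
        return ["move_left"] * -state["offset"] + ["grasp"]
    return ["grasp"]


def execute(action, env):
    """Execution: apply one low-level action and return the new environment."""
    env = dict(env)  # copy so the caller's dict is not mutated
    if action == "move_right":
        env["gripper"] += 1
    elif action == "move_left":
        env["gripper"] -= 1
    elif action == "grasp" and env["gripper"] == env["cup"]:
        env["holding"] = True
    return env


def run_episode(env, instruction, max_steps=20):
    """Iterate perceive -> plan -> execute, re-perceiving after every action."""
    log = []
    for _ in range(max_steps):
        state = perceive(env)
        steps = plan(state, instruction)
        if not steps:
            break
        env = execute(steps[0], env)  # act once, then observe again
        log.append(steps[0])
    return env, log


final_env, log = run_episode(
    {"cup": 3, "gripper": 0, "holding": False}, "pick up the cup"
)
print(log)
```

Only the first planned step is executed before the loop perceives again, which is what lets such a loop recover when an action does not have its intended effect.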

Environmental Perception and Understanding

The foundation of intelligent action is a rich and accurate understanding of the environment. For an embodied AI robot, this is a multimodal challenge. We analyze the key model families that enable this capability.

2.1 Foundational Visual Models

Visual perception is paramount. Models for processing 2D and 3D visual data form the backbone of an embodied AI robot’s sight.

2D Visual Models: These models process RGB images. Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) are dominant. Key tasks include object detection (locating and classifying objects) and semantic segmentation (labeling each pixel). A summary is provided in Table 1.

Model	Key Principle	Primary Task
Faster R-CNN	Region Proposal Network (RPN) + CNN detector	Object Detection
Mask R-CNN	Extends Faster R-CNN with a mask head	Instance Segmentation
YOLO Series	Single-shot detection via CNN, highly efficient	Real-time Object Detection
DETR	Transformer-based end-to-end set prediction	Object Detection
Swin Transformer	Hierarchical ViT with shifted windows	Detection, Segmentation

Table 1: Summary of Key 2D Visual Models for Embodied AI Robots.

3D Visual Models: To interact physically, an embodied AI robot often needs 3D spatial understanding. Models like PointNet++ process point cloud data from LiDAR or depth cameras, learning features directly from unordered point sets to perform 3D classification and segmentation. The core idea involves learning functions on point sets that are invariant to permutations:
$$ f(\{x_1, \ldots, x_n\}) \approx \gamma \left( \max_{i=1,\ldots,n} \{ \phi(x_i) \} \right) $$
where $\{x_i\}$ are the 3D points, $\phi$ is a shared multi-layer perceptron applied to each point, $\max$ is an element-wise maximum over the per-point features, and $\gamma$ is another network that produces the final output.
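The permutation invariance of this construction can be checked numerically. The sketch below is an illustration only: it uses a single random linear layer with ReLU for $\phi$ and a single linear layer for $\gamma$ (the real networks are deeper and trained), and the names `phi`, `global_feature`, `gamma`, and `pointnet_like` are hypothetical.

```python
import random


def phi(point, W, b):
    """Shared per-point layer: ReLU(W @ point + b)."""
    return [
        max(0.0, sum(w * x for w, x in zip(row, point)) + bias)
        for row, bias in zip(W, b)
    ]


def global_feature(points, W, b):
    """Element-wise max over all per-point features (the symmetric function)."""
    features = [phi(p, W, b) for p in points]
    return [max(column) for column in zip(*features)]


def gamma(feature, V):
    """Output layer applied to the pooled feature."""
    return [sum(v * f for v, f in zip(row, feature)) for row in V]


def pointnet_like(points, W, b, V):
    return gamma(global_feature(points, W, b), V)


rng = random.Random(0)
W = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(8)]  # 3 -> 8
b = [rng.uniform(-1, 1) for _ in range(8)]
V = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(2)]  # 8 -> 2

cloud = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(16)]
shuffled = cloud[:]
rng.shuffle(shuffled)

# Reordering the points leaves the output unchanged.
print(pointnet_like(cloud, W, b, V) == pointnet_like(shuffled, W, b, V))
```

Because the maximum of a set does not depend on the order in which its elements are visited, the two outputs agree exactly, not just approximately.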

2.2 Foundational Language Models

Understanding and grounding natural language instructions is critical for human-robot interaction. The evolution from statistical models to neural networks, and finally to LLMs, has been transformative.

Model Type	Example	Core Architecture	Capability for Embodied AI
Statistical	N-gram	Markov assumption	Very limited, no context
Contextual Embeddings	BERT	Bidirectional Transformer Encoder	Good for encoding commands
Autoregressive LLMs	GPT-3, LLaMA	Transformer Decoder	Strong reasoning, instruction following, plan generation
Large Multimodal LLMs	GPT-4, Claude 3.5	Multimodal Transformer	Integrated vision-language understanding

Table 2: Evolution of Language Models for Embodied AI Robots.

The power of modern LLMs lies in their scale and the Transformer architecture’s self-attention mechanism:
$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$
where $Q$ (Query), $K$ (Key), and $V$ (Value) are matrices derived from the input. This allows the model to weigh the importance of different parts of the input sequence when generating an output, enabling complex reasoning and instruction understanding crucial for an embodied AI robot.
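A direct, unoptimized transcription of the attention formula into plain Python is sketched below. It handles a single head with no masking and no learned projections, and the function names are chosen here for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention for lists of row vectors.

    Q: n_q x d_k, K: n_k x d_k, V: n_k x d_v.
    Returns (output of shape n_q x d_v, attention weights of shape n_q x n_k).
    """
    d_k = len(K[0])
    scores = [
        [sum(q * k for q, k in zip(q_row, k_row)) / math.sqrt(d_k) for k_row in K]
        for q_row in Q
    ]
    weights = [softmax(row) for row in scores]
    d_v = len(V[0])
    output = [
        [sum(w * v_row[j] for w, v_row in zip(w_row, V)) for j in range(d_v)]
        for w_row in weights
    ]
    return output, weights


# One query that matches the first key far better than the second.
out, w = attention(Q=[[10.0, 0.0]], K=[[1.0, 0.0], [0.0, 1.0]], V=[[1.0, 2.0], [3.0, 4.0]])
print(w[0], out[0])
```

In the example, almost all of the weight falls on the first key, so the output is close to the first value vector `[1.0, 2.0]`.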

2.3 Vision-Language Models (VLMs)

VLMs are the critical bridge, aligning visual and textual information into a shared semantic space. This allows an embodied AI robot to understand phrases like “the red cup on the left table” by linking the words to visual features.

Contrastive Pre-training (e.g., CLIP): These models learn by associating images with their text descriptions. The training objective is a contrastive loss that maximizes the similarity between correct image-text pairs and minimizes it for incorrect ones. For an image encoder $I$ and text encoder $T$, the similarity score for the $i$-th image and $j$-th text in a batch is:
$$ s_{ij} = I(\text{image}_i)^T \cdot T(\text{text}_j) $$
Training pushes the matched-pair scores $s_{ii}$ up and the mismatched scores $s_{ij}$ with $i \neq j$ down.
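The contrastive objective can be sketched as a symmetric cross-entropy over the similarity matrix. Two details in the code go beyond the formula above and are assumptions taken from common CLIP practice: embeddings are L2-normalized before the dot product, and scores are divided by a temperature (fixed here at 0.07, whereas it is typically learned). The embeddings are hand-written toy vectors, not encoder outputs.

```python
import math


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def similarity_matrix(image_emb, text_emb):
    """s[i][j] = <normalized image_i, normalized text_j>."""
    images = [normalize(v) for v in image_emb]
    texts = [normalize(v) for v in text_emb]
    return [[sum(a * b for a, b in zip(i, t)) for t in texts] for i in images]


def diagonal_cross_entropy(logits):
    """Mean of -log softmax(row_i)[i]: row i should pick column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(image_emb, text_emb, temperature=0.07):
    s = similarity_matrix(image_emb, text_emb)
    logits = [[x / temperature for x in row] for row in s]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits_t))


images = [[1.0, 0.0], [0.0, 1.0]]
matched_texts = [[2.0, 0.0], [0.0, 3.0]]     # text_i points the same way as image_i
mismatched_texts = [[0.0, 3.0], [2.0, 0.0]]  # pairs swapped

print(contrastive_loss(images, matched_texts))
print(contrastive_loss(images, mismatched_texts))
```

The loss is near zero when each image is most similar to its own caption and large when the pairs are swapped, which is the behavior the training objective rewards.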

Generative VLMs (e.g., BLIP-2, LLaVA): These models connect a pre-trained visual encoder (such as a ViT) to an LLM through a lightweight trainable module: a Q-Former in BLIP-2, a projection layer in LLaVA. BLIP-2 keeps both the encoder and the LLM frozen, whereas LLaVA also fine-tunes the LLM during visual instruction tuning. Both are trained on image-text data to generate descriptions or answers conditioned on the image, which lets an embodied AI robot describe scenes or answer questions about its environment.

2.4 Multimodal Large Models

The frontier is moving towards true multimodal foundation models that natively accept and reason over multiple input modalities (text, image, audio, video) and can output in multiple forms. These models promise to serve as a unified “brain” for an embodied AI robot.

Model	Input Modalities	Output Modalities	Relevance to Embodied AI
GPT-4o	Text, Image, Audio	Text, Audio	Real-time audiovisual reasoning and dialogue
Gemini 2.0	Text, Image, Audio, Video	Text, Image, Audio	Comprehensive world model, video understanding
Claude 3.5 Sonnet	Text, Image	Text	Advanced visual reasoning and plan critique

Table 3: Modern Multimodal Large Models as Potential Controllers for Embodied AI Robots.

Task Planning and Execution

With a grounded understanding of the world, the embodied AI robot must decide what to do and how to do it. Multimodal models are revolutionizing this high-level cognitive function.

3.1 Vision-Language-Action (VLA) Models

VLA models directly map visual observations and language instructions to actionable policies or low-level motor commands. This is the essence of end-to-end control for an embodied AI robot.

Architectural Paradigms:

LLM as Planner + Low-level Policy: Models like PaLM-E use a large multimodal LLM to consume visual embeddings and text, then output a high-level plan or code that calls pre-defined low-level skill APIs (e.g., `pick_up(red_cup)`).
Fine-tuned Generative Policy: Models like RT-2 fine-tune a pre-trained VLM (e.g., PaLM-E) on robot action sequences. The model’s output vocabulary is extended to include discretized action tokens (e.g., end-effector coordinates), allowing it to directly generate actions: $\text{Action} \sim P(\cdot | \text{Image}, \text{Instruction})$.
Embodied Reasoning with Chain-of-Thought: Models like EmbodiedGPT or CognitiveDog first generate a textual reasoning trace (“I see a cup. I need to pick it up. I will move the gripper above it…”) before outputting the action, improving transparency and reliability.
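The discretized action tokens used by fine-tuned generative policies can be illustrated with uniform binning. This is a sketch under stated assumptions: each action dimension is clipped to a known range and split into 256 equal bins, and the function names, the ranges, and the example action are hypothetical rather than taken from any specific model's code.

```python
def discretize(action, low, high, n_bins=256):
    """Map each continuous action dimension to an integer token in [0, n_bins)."""
    tokens = []
    for a, lo, hi in zip(action, low, high):
        a = min(max(a, lo), hi)            # clip to the valid range
        fraction = (a - lo) / (hi - lo)    # position within the range, 0..1
        tokens.append(min(int(fraction * n_bins), n_bins - 1))
    return tokens


def undiscretize(tokens, low, high, n_bins=256):
    """Map tokens back to the centre of their bins."""
    return [
        lo + (t + 0.5) * (hi - lo) / n_bins
        for t, lo, hi in zip(tokens, low, high)
    ]


low, high = [-1.0, -1.0, -1.0], [1.0, 1.0, 1.0]
action = [0.1, -0.25, 0.5]            # e.g. a small end-effector displacement
tokens = discretize(action, low, high)
recovered = undiscretize(tokens, low, high)
print(tokens, recovered)
```

The round trip loses at most half a bin width per dimension (here 2/256/2, about 0.004), which is the precision cost of letting a language-model head emit actions as ordinary tokens.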

Model	Core Methodology	Action Space	Key Innovation
PaLM-E	Multimodal LLM (ViT+PaLM) as planner	High-level skill API calls	Positive transfer from web-scale training
RT-2	Fine-tuned VLM on robot data	Discretized end-effector poses	Co-fine-tuning on web vision-language and robot action data
RoboMamba	Mamba SSM + Visual Encoder	Joint velocities / poses	Efficient long-sequence modeling for control
MultiPLY	Multisensory (vision, touch, audio) fusion into LLM	Skill sequence	Rich, object-centric multisensory grounding

Table 4: Vision-Language-Action Models for Embodied AI Robot Control.

The policy in a fine-tuned VLA model can be seen as maximizing the likelihood of the demonstrated actions given the current observation and instruction. In a reinforcement learning context, this is akin to optimizing the advantage-weighted objective:
$$ J(\theta) = \mathbb{E}_{(o_t, l, a_t) \sim \mathcal{D}} \left[ \log \pi_\theta(a_t | o_t, l) \cdot A_t \right] $$
where $\pi_\theta$ is the VLA policy, $o_t$ is the visual observation, $l$ is the language instruction, $a_t$ is the action, and $A_t$ is an advantage estimate from the demonstration dataset $\mathcal{D}$.
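To make the objective concrete, the sketch below evaluates $J(\theta)$ on a tiny batch for a policy over discrete action tokens. The policy is represented only by its output logits, and the batch values are made up for illustration. Note that setting every $A_t = 1$ reduces the objective to the plain log-likelihood used in behavior cloning.

```python
import math


def log_softmax(logits):
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum_exp for x in logits]


def weighted_log_likelihood(batch):
    """Mean of log pi(a_t | o_t, l) * A_t over a batch.

    Each item is (logits, action_index, advantage), where `logits` stands in
    for the policy's output given the observation and instruction.
    """
    total = 0.0
    for logits, action, advantage in batch:
        total += log_softmax(logits)[action] * advantage
    return total / len(batch)


# Two demonstrated steps; the policy already prefers the demonstrated action.
cloning_batch = [([2.0, 0.0], 0, 1.0), ([0.0, 2.0], 1, 1.0)]
print(weighted_log_likelihood(cloning_batch))

# A sharper policy assigns more probability to the same actions, so J rises.
sharper_batch = [([4.0, 0.0], 0, 1.0), ([0.0, 4.0], 1, 1.0)]
print(weighted_log_likelihood(sharper_batch))
```

Gradient ascent on this quantity raises the log-probability of actions with positive advantage and lowers it for actions with negative advantage.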

3.2 Vision-Language-Navigation (VLN) Models

A critical subclass of tasks for mobile embodied AI robots is navigation by natural language instruction (e.g., “Go to the kitchen and bring me the mug on the counter”).

Classic vs. LLM-driven Approaches: Traditional VLN models used an encoder (for vision and language) coupled with a recurrent (LSTM) or cross-modal transformer policy trained via imitation or reinforcement learning. Modern approaches leverage the powerful reasoning of LLMs.

LLM as Navigator: Frameworks like NavGPT, VELMA, and VLN-ICV use the LLM as the core planner. The typical process is:
1. Perceptual Grounding: The current panoramic view is processed by a VLM to generate a textual description of the scene.
2. Prompt Construction: The description, the navigation instruction, and the history of past actions/observations are formatted into a prompt for the LLM.
3. Reasoning and Action Prediction: The LLM reasons over the prompt and outputs the next navigation action (e.g., “turn left 30 degrees,” “move forward 1 meter,” “stop”).
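The three steps above can be sketched as a prompt builder plus a reply parser. Everything here is hypothetical: the prompt wording, the action vocabulary, and `fake_llm`, which returns a canned reply in place of a real model call. The systems named above each use their own prompt formats.

```python
ACTIONS = ["turn left", "turn right", "move forward", "stop"]


def build_prompt(instruction, history, scene_description, actions=ACTIONS):
    """Step 2: format instruction, history, and current view as one prompt."""
    lines = [f"Instruction: {instruction}", "History:"]
    if not history:
        lines.append("  (none)")
    for step, (seen, did) in enumerate(history, start=1):
        lines.append(f"  {step}. saw: {seen} | did: {did}")
    lines.append(f"Current view: {scene_description}")
    lines.append("Choose exactly one action from: " + ", ".join(actions))
    return "\n".join(lines)


def parse_action(reply, actions=ACTIONS, fallback="stop"):
    """Step 3: map free-form text to an allowed action, else fall back."""
    reply = reply.lower()
    for action in actions:
        if action in reply:
            return action
    return fallback


def fake_llm(prompt):
    """Stand-in for an LLM call; always returns the same reply."""
    return "The door is on the left, so I will turn left."


# Step 1 (VLM scene description) is represented by a hand-written string.
prompt = build_prompt(
    instruction="Go to the kitchen",
    history=[("a long hallway", "move forward")],
    scene_description="a door on the left, a wall ahead",
)
print(prompt)
print(parse_action(fake_llm(prompt)))
```

Falling back to `stop` when the reply contains no allowed action is one conservative design choice; it keeps a malformed reply from turning into an arbitrary movement.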

Model	Scene Representation	Navigator Core	Mechanism
Classic VLN (e.g., PREVALENT)	Visual CNN features	LSTM / Transformer Policy	End-to-end RL/IL training
NavGPT / VELMA	Textual scene description from VLM	Large Language Model (GPT-4, Claude)	In-context reasoning from prompt
NaVid	Video frame embeddings (EVA-CLIP)	Fine-tuned LLM (Vicuna)	Direct generation from video-text sequence

Table 5: Evolution of Vision-Language-Navigation Models for Embodied AI Robots.

The action selection in an LLM-based navigator can be formalized as:
$$ a_t = \underset{a \in \mathcal{A}}{\operatorname{argmax}} \, P_{\text{LLM}}(a | \mathcal{P}_t) $$
where $\mathcal{P}_t = [D(o_1), a_1, \ldots, D(o_{t-1}), a_{t-1}, D(o_t), I]$ is the prompt at time $t$, $D(\cdot)$ is the scene description function (via VLM), $o_i$ are observations, $a_i$ are past actions, and $I$ is the original instruction. The LLM’s prior knowledge of spatial concepts and common sense dramatically improves generalization.
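The selection rule can be written as a short function over a finite action set. In this sketch, `toy_score` is a hypothetical stand-in for $P_{\text{LLM}}$: it only counts word overlap with the latest scene description, so it illustrates the argmax mechanics and not the quality of a real model's scores.

```python
def build_context(descriptions, past_actions, instruction):
    """Assemble P_t = [D(o_1), a_1, ..., D(o_{t-1}), a_{t-1}, D(o_t), I]."""
    if len(descriptions) != len(past_actions) + 1:
        raise ValueError("need exactly one more description than past actions")
    context = []
    for description, action in zip(descriptions, past_actions):
        context += [description, action]
    context += [descriptions[-1], instruction]
    return context


def select_action(context, actions, score):
    """Return the action with the highest score given the context."""
    return max(actions, key=lambda a: score(a, context))


def toy_score(action, context):
    """Stand-in scorer: word overlap between the action and D(o_t)."""
    latest_description = context[-2].split()
    return sum(word in latest_description for word in action.split())


actions = ["turn left", "turn right", "move forward", "stop"]
context = build_context(
    descriptions=["hallway", "door on the left"],
    past_actions=["move forward"],
    instruction="enter the door",
)
print(context)
print(select_action(context, actions, toy_score))
```

With a real model, `score` would return the log-probability the LLM assigns to each candidate action string given the prompt; the argmax over log-probabilities is the same as the argmax over probabilities.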

Conclusion and Future Directions

The integration of multimodal models is transforming the capabilities of embodied AI robots, moving them from narrow, scripted behaviors towards general agents that can be instructed in natural language. We have reviewed how these models form the core of the perception-planning-action loop. However, significant challenges remain on the path to robust and scalable real-world deployment.

Key Challenges:

The Simulation-to-Reality (Sim2Real) Gap: Models trained largely on internet data or simplified simulations struggle with the complexity, noise, and long-tailed distributions of the real world. A spilled drink, unusual lighting, or a slightly deformed object can confuse an otherwise capable model.
Data Scarcity and Generalization: High-quality, large-scale robot interaction data is expensive to collect. While web-scale pre-training provides a strong prior, adapting efficiently to new tasks, environments, and robot morphologies with minimal in-domain data is a critical challenge.
Safety and Reliability: LLMs and VLMs can hallucinate or make incorrect reasoning steps. For an embodied AI robot operating around humans, such failures can be dangerous. Developing verifiable, predictable, and safe control frameworks that incorporate these powerful but stochastic models is essential.
Temporal Reasoning and Memory: Most current models are heavily focused on the current observation. Effective long-term operation requires sophisticated memory to track object state changes over time, remember past failures, and build persistent environment maps.

Promising Future Directions:

World Models and Internal Simulation: Future systems may employ learned world models that allow the embodied AI robot to “imagine” the consequences of actions before executing them, enabling safer and more efficient planning. The objective is to learn a dynamics model $p(s_{t+1} | s_t, a_t)$ and a reward model $r(s_t, a_t)$ in a compact latent space.
Multisensory Integration Beyond Vision: True embodiment involves touch, force, sound, and proprioception. Models like MultiPLY point the way towards richer, multisensory grounding, which is crucial for dexterous manipulation and understanding occluded or noisy environments.
Lifelong and Foundation Agent Learning: The goal is to develop foundation models for action that can continuously learn from online interaction, adapting their policies and expanding their skill repertoire over a lifetime of experience, much like humans do.
Neuro-Symbolic Hybrids: Combining the pattern recognition and generative power of neural multimodal models with the rigor, transparency, and constraint-satisfaction capabilities of symbolic reasoning systems could lead to more reliable and interpretable embodied AI robots.
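The world-model direction listed above, in which a robot "imagines" outcomes before acting, can be illustrated with a toy planner. This is a sketch under strong assumptions: the dynamics and reward functions are hand-written, deterministic stand-ins for learned models $p(s_{t+1} | s_t, a_t)$ and $r(s_t, a_t)$, the state is a single number rather than a latent vector, and the planner exhaustively enumerates short action sequences, which only works for tiny action sets.

```python
from itertools import product

ACTIONS = (-1.0, 0.0, 1.0)
GOAL = 5.0


def dynamics(state, action):
    """Stand-in for a learned dynamics model (deterministic here)."""
    return state + action


def reward(state, action):
    """Stand-in for a learned reward model: closeness to GOAL after acting."""
    return -abs(dynamics(state, action) - GOAL)


def imagined_return(state, action_sequence):
    """Roll the model forward without touching the real environment."""
    total = 0.0
    for action in action_sequence:
        total += reward(state, action)
        state = dynamics(state, action)
    return total


def plan(state, horizon=4):
    """Pick the action sequence with the best imagined return."""
    best_sequence, best_return = None, float("-inf")
    for candidate in product(ACTIONS, repeat=horizon):
        value = imagined_return(state, candidate)
        if value > best_return:
            best_sequence, best_return = candidate, value
    return best_sequence


def run_mpc(state, steps=8):
    """Execute only the first planned action each step, then replan."""
    for _ in range(steps):
        state = dynamics(state, plan(state)[0])  # real env == model in this toy
    return state


print(plan(0.0))
print(run_mpc(0.0))
```

Replanning after every executed action is the usual model-predictive-control pattern: errors in the imagined rollout are corrected as soon as a new real observation arrives.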

In conclusion, multimodal models are not just another tool in the robotics toolbox; they represent a fundamental shift towards cognitive architectures for embodied AI robots. By providing a unified substrate for perception, reasoning, and communication, they are paving the way for a new generation of intelligent machines that can understand our world and assist within it in truly natural and effective ways. The journey from impressive lab demonstrations to ubiquitous, helpful embodied AI robots in our homes and workplaces will be driven by overcoming the aforementioned challenges through continued research at the intersection of machine learning, robotics, and human-computer interaction.