The emergence of Multimodal Large Language Models (MLLMs) marks a paradigm shift in artificial intelligence, endowing machines with unprecedented capabilities in cross-modal understanding and reasoning. This advancement has injected formidable momentum into the field of embodied intelligence, bringing us closer to the long-envisioned goal of creating general-purpose embodied AI robots. An embodied AI robot refers to an intelligent entity that can perceive a physical environment, reason about tasks based on instructions, and execute actions through a physical form to accomplish goals. Unlike traditional AI confined to virtual domains, these agents emphasize the crucial interplay between a “body” (sensors and actuators) and the “environment” for the emergence of intelligent behavior. This survey aims to systematically review the rapid progress, core methodologies, and future trajectories of embodied AI robot research fueled by MLLMs.
The foundational prowess of modern embodied AI robots stems from sophisticated Vision-Language Models (VLMs). These models bridge the semantic gap between visual perception and linguistic understanding. The core architecture typically involves a visual encoder (e.g., a Vision Transformer or ViT), a large language model (LLM) backbone, and a projection or alignment module to fuse modalities. A seminal work, CLIP, pioneered contrastive pre-training on massive image-text pairs, enabling robust zero-shot visual categorization. Subsequent models like Flamingo and BLIP-2 introduced more sophisticated cross-attention mechanisms (e.g., Gated Cross-Attention, Q-Former) to integrate visual tokens with language models effectively, often using a two-stage training process: feature alignment followed by instruction tuning. The training objectives combine various losses, as summarized in the table below, to align and generate coherent cross-modal representations.
| Model | Visual Encoder | Alignment Method | Key Loss Functions | LLM Backbone |
|---|---|---|---|---|
| CLIP | ViT/ResNet | Contrastive Learning | Contrastive Loss | None (Encoder-only) |
| Flamingo | NFNet | Perceiver Resampler & Gated X-Attention | Contrastive Loss, LM Loss | Chinchilla |
| BLIP-2 | ViT | Q-Former + Linear Projection | Image-Text Contrastive (ITC), Image-Text Matching (ITM), LM Loss | FlanT5 / OPT |
| LLaVA-1.5 | CLIP-ViT-L | Linear Projection / MLP | Language Modeling (LM) Loss | Vicuna |
The alignment process can be formalized. Let $I$ be an input image encoded into visual features $V = \text{Enc}_v(I)$, and $T$ be a text instruction encoded into text features $L = \text{Enc}_t(T)$. The alignment module $A$ learns a mapping to a joint space: $Z = A(V, L)$. For generative VLMs, the objective is often an autoregressive language modeling loss conditioned on the aligned visual context:
$$ \mathcal{L} = -\sum_{t} \log P(w_t | w_{<t}, Z; \Theta) $$
where $w_t$ is the $t$-th token, and $\Theta$ represents the model parameters. This foundational capability of MLLMs to “see” and “understand” provides the perceptual cornerstone for embodied AI robots.
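The autoregressive loss above can be made concrete with a small numerical sketch. The function below computes $\mathcal{L} = -\sum_t \log P(w_t \mid w_{<t}, Z)$ given per-step predictive distributions; the toy probabilities are illustrative, and in a real VLM each row would come from the LLM head after conditioning on the aligned visual context $Z$:

```python
import numpy as np

def autoregressive_nll(token_probs, target_ids):
    """Negative log-likelihood of a target token sequence.

    token_probs: (T, V) array whose row t is the model's predicted
    distribution P(w_t | w_<t, Z), with the visual context Z assumed
    to be already folded into each row's conditioning.
    target_ids: length-T sequence of ground-truth token ids w_t.
    """
    nll = 0.0
    for t, w_t in enumerate(target_ids):
        nll -= np.log(token_probs[t, w_t])
    return nll

# Toy example: vocabulary of 4 tokens, a 3-step caption.
probs = np.array([
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.8, 0.05, 0.05],
    [0.25, 0.25, 0.25, 0.25],
])
loss = autoregressive_nll(probs, [0, 1, 2])
```

Note that the loss grows as the model assigns lower probability to the ground-truth token, which is exactly the signal used to align visual and language representations during instruction tuning.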
The training and evaluation of embodied AI robots rely heavily on diverse, large-scale datasets that combine perception, action, and language. These datasets serve as the crucial substrate for learning generalizable policies. Early datasets focused on specific domains like indoor navigation (Matterport3D) or robotic grasping (Dex-Net, RoboNet). A significant leap came with datasets that explicitly link natural language instructions to action sequences in embodied settings, such as ALFRED for simulated household tasks. The recent drive toward generalist embodied AI robots has spurred the creation of massive, cross-embodiment datasets. Open X-Embodiment aggregates data from 22 different robot platforms, encompassing over 500 skills. Similarly, emerging large-scale datasets from various research consortia aim to provide the breadth and depth needed to train robust, general-purpose models. The table below highlights key datasets instrumental in this evolution.
| Dataset | Key Characteristics | Modalities | Primary Use |
|---|---|---|---|
| Matterport3D | Large-scale indoor 3D scans, panoramas | RGB-D, 3D Mesh, Poses | Navigation, 3D Scene Understanding |
| ALFRED | Language-instructed daily tasks in simulation | Language, Action Trajectories, RGB | High-Level Task Planning & Execution |
| Ego4D | Massive first-person (egocentric) video | RGB Video, Audio, Transcripts | Egocentric Perception, Action Anticipation |
| Open X-Embodiment | Aggregated data from 22 robot types | RGB, Proprioception, Actions, Language | Training Generalist Robot Policies (e.g., RT-X) |
| RoboMIND / ARIO | Large-scale, multi-embodiment manipulation data | Multi-view RGB, States, Actions, Language | Training and Benchmarking General Manipulation |
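To make the modality columns of the table concrete, the following is a hypothetical record schema for one episode of a cross-embodiment dataset; the field names and shapes are illustrative, not the actual Open X-Embodiment format:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class TrajectoryStep:
    """One timestep of a robot trajectory (hypothetical schema)."""
    rgb: bytes                   # encoded camera frame
    proprioception: List[float]  # e.g., joint positions/velocities
    action: List[float]          # commanded action (e.g., end-effector delta)

@dataclass
class Episode:
    instruction: str                              # natural-language task
    embodiment: str                               # robot platform identifier
    steps: List[TrajectoryStep] = field(default_factory=list)

ep = Episode(instruction="pick up the red block", embodiment="franka_panda")
ep.steps.append(TrajectoryStep(rgb=b"", proprioception=[0.0] * 7, action=[0.0] * 7))
```

Pairing every step with language and embodiment metadata like this is what lets a single policy be trained across heterogeneous robot platforms.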
The physical instantiation of an embodied AI robot is as critical as its algorithmic brain. The choice of embodiment dictates the agent’s operational domain and capabilities. Common embodiments form a spectrum of complexity and application:
- Robotic Arms: The workhorses of manipulation, ideal for precise, repetitive tasks in structured environments like manufacturing and lab automation. Their control often involves solving inverse kinematics: finding joint angles $\vec{\theta}$ that position the end-effector at a desired pose $T_{desired}$. Given forward kinematics $FK(\vec{\theta})$, the problem is:
$$ \text{Find } \vec{\theta} \text{ such that } FK(\vec{\theta}) \approx T_{desired} $$
MLLMs can now generate these target poses or directly output joint controls.
- Mobile Bases (Wheeled/UAV/Quadruped): These platforms provide mobility. Wheeled robots excel on flat terrain for logistics and services. UAVs offer unparalleled aerial perspectives for surveillance and delivery. Quadrupeds, with their dynamic stability, traverse complex, uneven ground for inspection and rescue. Integrating MLLMs enhances their autonomous navigation and task-level decision-making in unknown environments.
- Dexterous Hands & Humanoid Robots: Representing the apex of embodied AI robot design. Dexterous hands, with high degrees of freedom (DoF), aim to replicate human-like manipulation for delicate tasks. Humanoid robots combine mobility and manipulation in a human-form factor, intended to operate seamlessly in human-centric environments. Controlling these high-DoF systems is immensely challenging, often requiring hierarchical policies or advanced reinforcement learning, with MLLMs providing the high-level task guidance.
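The inverse-kinematics problem stated for robotic arms can be solved numerically. Below is a minimal damped-least-squares sketch for a planar two-link arm; the link lengths, damping factor, and iteration count are illustrative choices, not from any cited system:

```python
import numpy as np

def fk(theta, l1=1.0, l2=1.0):
    """Forward kinematics: joint angles -> end-effector position (x, y)."""
    x = l1 * np.cos(theta[0]) + l2 * np.cos(theta[0] + theta[1])
    y = l1 * np.sin(theta[0]) + l2 * np.sin(theta[0] + theta[1])
    return np.array([x, y])

def jacobian(theta, l1=1.0, l2=1.0):
    """Analytic Jacobian of fk with respect to the joint angles."""
    s1, s12 = np.sin(theta[0]), np.sin(theta[0] + theta[1])
    c1, c12 = np.cos(theta[0]), np.cos(theta[0] + theta[1])
    return np.array([[-l1 * s1 - l2 * s12, -l2 * s12],
                     [ l1 * c1 + l2 * c12,  l2 * c12]])

def ik(target, theta0, iters=200, damping=1e-2):
    """Damped least-squares iteration toward FK(theta) ~= target."""
    theta = np.array(theta0, dtype=float)
    for _ in range(iters):
        err = target - fk(theta)
        J = jacobian(theta)
        # Solve (J^T J + lambda * I) dtheta = J^T err; damping keeps the
        # update stable near singular configurations.
        dtheta = np.linalg.solve(J.T @ J + damping * np.eye(2), J.T @ err)
        theta += dtheta
    return theta

theta = ik(np.array([1.2, 0.8]), [0.1, 0.5])
```

In an MLLM-driven pipeline, the model would supply the target pose and a solver like this (or a full motion planner) would handle the geometry.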

The most integrated approach is the development of Embodied Foundation Models or “Embodied LMMs.” These are end-to-end models that ingest raw sensor data (images, proprioception) and language instructions, and directly output low-level action commands for the robot. This paradigm treats robot control as a sequence modeling problem. The canonical architecture, exemplified by models like RT-1 and RT-2, involves encoding multi-modal inputs into a sequence of tokens processed by a Transformer decoder, which autoregressively predicts action tokens. The action prediction at step $t$ can be formulated as:
$$ a_t \sim P(a_t | o_{\leq t}, i, a_{<t}; \Theta) $$
where $o_{\leq t}$ are observations, $i$ is the instruction, and $a_{<t}$ are past actions. RT-1 demonstrated this effectively on a mobile manipulator. RT-2 leveraged a pre-trained VLM (PaLM-E) for improved generalization by representing actions as tokens in a vocabulary that includes both language and control outputs. Recent models like GR-1 pre-train on large-scale video-text data before fine-tuning on action sequences, showing the benefit of web-scale visual knowledge. These models represent a direct translation of MLLM capabilities into actuator commands, striving for a unified “brain” for the embodied AI robot.
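Treating control as sequence modeling requires mapping continuous actions into the token vocabulary. A minimal sketch of uniform-bin action tokenization in the spirit of RT-1/RT-2 follows; the action range and bin count here are illustrative, not the exact values used by those systems:

```python
import numpy as np

def action_to_tokens(action, low=-1.0, high=1.0, bins=256):
    """Discretize each continuous action dimension into one of `bins` token ids."""
    clipped = np.clip(action, low, high)
    # Map [low, high] onto integer bin ids in [0, bins - 1].
    ids = np.floor((clipped - low) / (high - low) * bins).astype(int)
    return np.minimum(ids, bins - 1)

def tokens_to_action(ids, low=-1.0, high=1.0, bins=256):
    """Invert tokenization to bin centers (lossy by at most half a bin width)."""
    return low + (ids + 0.5) * (high - low) / bins

a = np.array([0.0, 0.5, -1.0, 1.0])     # e.g., end-effector deltas + gripper
ids = action_to_tokens(a)                # integer tokens the Transformer predicts
recovered = tokens_to_action(ids)        # decoded back to continuous commands
```

The key design point is that once actions are tokens, the same autoregressive decoder and loss used for language apply unchanged to control.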
| Model Paradigm | Key Idea | Example Models | Advantages | Challenges |
|---|---|---|---|---|
| End-to-End Embodied LMM | Directly map observations & instructions to low-level actions via a single Transformer. | RT-1, RT-2, GR-1 | Unified model, elegant, benefits from large-scale pre-training. | Requires massive robot data, sim2real gaps, data efficiency. |
| Modular Planning & Control | Decompose problem into high-level planning (by LLM) and low-level execution (by dedicated controller). | SayCan, Code as Policies (CaP) | Leverages LLM reasoning, uses existing robust controllers, more interpretable. | Integration complexity, error propagation between modules. |
| Hybrid (LLM + RL/IL) | Use LLM for sub-goal generation or reward shaping, train RL/Imitation Learning policy to execute. | LLM-guided RL, Inner Monologue | Combines strong reasoning with adaptive control, can learn from feedback. | Complex training pipelines, reward design. |
For complex, long-horizon tasks, a singular end-to-end policy may be insufficient. Here, High-Level Task Planning with MLLMs shines. The core idea is to leverage the strong reasoning and commonsense knowledge of LLMs/VLMs to decompose a complex user instruction into a feasible sequence of sub-tasks. For example, “Make me a cup of coffee” might be decomposed into [Find cup, Find coffee machine, Pick up cup, Navigate to machine, Place cup under spout, …]. This decomposition is highly context-dependent, requiring perception of the current environment. Several dominant methodologies have emerged:
- Prompt Engineering & Code Generation: Framing the robot as a code-executing agent. Methods like “Code as Policies” (CaP) pre-define an API of primitive skills (e.g., `move_to()`, `grasp()`). The LLM, prompted with examples, writes code in a domain-specific language that calls these primitives in sequence. The code is then executed. This approach effectively grounds the LLM’s output in executable actions.
- Feedback-Driven Refinement: Creating closed-loop systems where the embodied AI robot uses environmental feedback to correct its plan. A framework like Inner Monologue incorporates success/failure signals, human feedback, and new visual observations back into the LLM’s context, allowing it to re-plan or adjust its sub-task sequence dynamically. This mimics a form of reactive deliberation.
- Memory & Knowledge Augmentation: Equipping the agent with external memory (e.g., vector databases of past experiences) or pre-built scene representations (e.g., 3D scene graphs). When planning, the LLM can retrieve relevant past episodes or query a semantic map of the environment (e.g., “where is the nearest coffee mug?”) to make more informed decisions. This is often implemented using Retrieval-Augmented Generation (RAG) techniques adapted for embodiment.
- Multi-Agent Collaboration: Distributing the planning and execution roles among multiple LLM-based “agents.” For instance, one agent specializes in visual perception and description, another in task planning, and a third in critiquing or verifying the plan’s feasibility. Their dialogue leads to a more robust final plan for the embodied AI robot.
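The code-generation pattern described above can be sketched in a few lines. The primitive names below are a hypothetical API (Code as Policies defines its own primitives), and the “LLM output” is hard-coded here for illustration; in a real system it would be generated by the prompted model:

```python
# Hypothetical primitive-skill API; a real system binds these to robot controllers.
log = []

def move_to(name):
    log.append(("move_to", name))

def grasp(name):
    log.append(("grasp", name))

def place_on(name):
    log.append(("place_on", name))

# In a Code-as-Policies-style pipeline, an LLM prompted with the API and
# few-shot examples emits a plan as source code; this string stands in for
# a plausible model output.
llm_generated_plan = """
move_to("cup")
grasp("cup")
move_to("coffee_machine")
place_on("drip_tray")
"""

# Executing the generated code grounds the plan in the primitive skills.
exec(llm_generated_plan, {"move_to": move_to, "grasp": grasp, "place_on": place_on})
```

Restricting the execution namespace to the primitive API, as done here, is also a simple safety measure: the generated code can only call skills the robot actually exposes.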
The sub-tasks generated by the high-level planner must be translated into precise motor commands. This is the domain of Low-Level Action Control. The strategies here vary significantly with the embodiment’s complexity.
- Robotic Arm Control: For manipulation, control can be posed as predicting end-effector poses (position + orientation) or joint velocities. Imitation Learning (IL) from demonstration data is a popular approach within end-to-end models. Alternatively, the LLM planner can output sub-goals (e.g., a target 6D pose for a cup) that are achieved by a dedicated, possibly classical, motion planner and controller. The dynamics of a robotic arm can be described by the Lagrangian formulation:
$$ M(q)\ddot{q} + C(q, \dot{q})\dot{q} + g(q) = \tau $$
where $q$ denotes the joint angles, $M(q)$ the inertia matrix, $C(q, \dot{q})$ the Coriolis/centrifugal matrix, $g(q)$ the gravity vector, and $\tau$ the joint torques. End-to-end models do not solve this equation explicitly; they learn policies that approximate optimal control.
- Bipedal Locomotion: Controlling a humanoid or bipedal embodied AI robot to walk and balance is a high-dimensional, unstable control problem. Deep Reinforcement Learning (DRL) in simulation has produced remarkable results. Here, the MLLM’s role is often higher-level, specifying navigational goals (“walk to the kitchen”) while a specialized, robust DRL policy handles the intricate dynamics of stepping, balancing, and recovering from pushes. The policy $\pi$ is trained to maximize cumulative reward $R$:
$$ \pi^* = \arg\max_{\pi} \mathbb{E}_{\tau \sim \pi} \left[ \sum_{t} \gamma^t R(s_t, a_t) \right] $$
where $\tau = (s_0, a_0, s_1, \dots)$ is a trajectory of states and actions.
- Dexterous Hand Control: This is one of the most challenging frontiers. The high DoF and complex contact dynamics make direct MLLM control very difficult. Current approaches often use MLLMs for semantic understanding (object identification, grasp type selection) which then parameterizes a lower-level grasping policy. This policy might be trained via DRL or IL from human demonstrations. The MLLM effectively provides the “what” and “why,” while the specialized policy handles the “how.”
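The discounted-return objective that the locomotion policy maximizes is simple to compute for a given trajectory. The toy reward scheme below (1 per step while balanced) is illustrative only:

```python
def discounted_return(rewards, gamma=0.99):
    """Compute sum_t gamma^t * r_t, the quantity the DRL policy
    maximizes in expectation over trajectories."""
    g = 0.0
    for t, r in enumerate(rewards):
        g += (gamma ** t) * r
    return g

# A short toy trajectory: reward 1 per step while the robot stays balanced.
G = discounted_return([1.0, 1.0, 1.0], gamma=0.9)
```

The discount factor $\gamma < 1$ trades off immediate against future reward, which for locomotion encourages policies that keep the robot upright indefinitely rather than maximizing a single step.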
Despite exhilarating progress, the path toward truly capable and ubiquitous embodied AI robots is fraught with significant challenges that present rich opportunities for future research.
- Unified Evaluation and Benchmarking: The field lacks a comprehensive, standardized benchmark to holistically assess an embodied AI robot’s capabilities in perception, reasoning, planning, and control across diverse embodiments and tasks. Creating such benchmarks, especially those that bridge simulation and reality, is crucial for measuring progress and guiding research directions.
- Data Scarcity & Scalability: While datasets are growing, collecting large-scale, diverse, real-world robot interaction data remains expensive and slow. Promising directions include better simulation-to-reality (sim2real) transfer techniques, using MLLMs to automatically generate synthetic training data or instructions, and developing algorithms for efficient few-shot and meta-learning that allow robots to adapt quickly with minimal new data.
- 3D Spatial Reasoning & World Models: Current VLMs are predominantly 2D-image based. For an embodied AI robot to interact effectively, it needs a deep understanding of 3D geometry, object permanence, physics, and affordances. Research into 3D-aware VLMs and neural “world models” that can predict the outcomes of actions is essential. A world model learns a latent dynamics function:
$$ \hat{s}_{t+1}, \hat{r}_t = f(s_t, a_t) $$
allowing the agent to plan and reason internally.
- Long-Horizon Planning & Robust Execution: While MLLMs excel at initial decomposition, they often fail at deep causal reasoning and recovering from unexpected failures during long task sequences. Enhancing their planning robustness through iterative refinement, better integration with symbolic reasoning, and learning from interactive failures are key areas. This involves creating tighter feedback loops between the planner and the environment.
- Full-Body Coordination & Control: Controlling a high-DoF system like a humanoid robot end-to-end remains a massive challenge. Hierarchical control architectures that combine MLLM-based high-level planning with mid-level behavior generators and low-level stabilizers seem a necessary path forward. Improving the efficiency and safety of these hierarchical systems is critical.
- Efficiency & On-Device Deployment: The computational demand of large MLLMs is at odds with the need for real-time, low-latency control on a mobile robot platform. Research into model distillation, quantization, efficient architectures (e.g., State Space Models), and specialized hardware acceleration is vital for practical deployment of sophisticated embodied AI robots.
- Continuous Learning & Adaptation: An ideal embodied AI robot should learn continuously from its interactions without catastrophic forgetting. Developing lifelong learning algorithms that allow the agent to safely acquire new skills, update its world model, and refine its policies in a changing environment is a fundamental challenge for achieving general intelligence.
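The world-model dynamics function $\hat{s}_{t+1}, \hat{r}_t = f(s_t, a_t)$ introduced among the challenges above can be sketched with a toy linear model. Here $A$, $B$, and the reward weights are hand-set for illustration; a learned world model would fit them (or a nonlinear network) from interaction data:

```python
import numpy as np

class LinearWorldModel:
    """Toy latent dynamics f(s, a) -> (s_next, r_hat)."""
    def __init__(self, A, B, w):
        self.A, self.B, self.w = A, B, w   # dynamics matrices and reward weights

    def step(self, s, a):
        s_next = self.A @ s + self.B @ a   # predicted next latent state
        r_hat = float(self.w @ s_next)     # reward predicted from that state
        return s_next, r_hat

    def rollout(self, s, actions):
        """Imagine a trajectory internally and sum its predicted reward,
        which is the basis for planning without touching the real world."""
        total = 0.0
        for a in actions:
            s, r = self.step(s, a)
            total += r
        return s, total

wm = LinearWorldModel(A=np.eye(2), B=np.eye(2), w=np.array([1.0, 0.0]))
s_final, R = wm.rollout(np.zeros(2), [np.array([1.0, 0.0]), np.array([1.0, 0.0])])
```

Because rollouts are imagined rather than executed, an agent can compare many candidate action sequences cheaply before committing to one on the physical robot.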
The integration of Multimodal Large Language Models with robotics has catalyzed a revolution in embodied artificial intelligence. We are witnessing the transition from robots programmed for specific tasks in structured environments to more general embodied AI robots that can interpret ambiguous instructions, perceive complex scenes, reason about actions, and adapt their behavior. This convergence has been powered by advances in foundation models, the collection of large-scale multi-embodiment datasets, and innovative architectural designs that bridge reasoning and control. While formidable challenges in evaluation, data efficiency, spatial understanding, robust planning, and whole-body control lie ahead, the trajectory is clear. The future points towards increasingly capable, adaptable, and autonomous embodied AI robots that can collaborate with humans in our everyday physical world, ultimately blurring the line between artificial intelligence and physical agency. The journey to truly intelligent embodied AI robots is well underway, and its next chapters will be written at the intersection of large-scale learning, cognitive reasoning, and interactive embodiment.
