In the rapidly evolving field of artificial intelligence, the integration of large models with embodied intelligence is revolutionizing how robots navigate complex environments. Embodied intelligence, which emphasizes the role of physical interaction in cognitive processes, is now being supercharged by large language models and multimodal models, enabling robots to understand natural language commands, perceive dynamic surroundings, and make precise navigation decisions without prior maps. This synergy addresses long-standing limitations of traditional navigation methods, such as poor adaptability in unstructured scenarios and limited human-robot interaction capabilities. As industries from industrial automation to disaster relief seek more autonomous systems, embodied robots equipped with large model-driven navigation are emerging as a transformative technology, promising enhanced generalization, real-time responsiveness, and deeper environmental understanding.

The concept of embodied intelligence dates back to Alan Turing’s early ideas in the 1950s, but it gained prominence in the 1990s through cognitive science research, highlighting that intelligence arises from the interaction between an agent and its environment. In robotics, this has translated into systems capable of autonomous perception, decision-making, and physical execution. Traditional navigation methods relied on sequential processes like mapping, path planning, and motion control, which often faltered in dynamic settings due to their reliance on predefined rules and symbolic representations. The advent of deep reinforcement learning in the 2010s introduced a paradigm shift, allowing robots to learn navigation policies through trial-and-error interactions. However, these approaches struggled with sample inefficiency and poor generalization to real-world scenarios. The recent breakthroughs in large models, such as ChatGPT and Gemini, have opened new avenues for embodied intelligence, enabling robots to leverage natural language processing, multimodal fusion, and commonsense reasoning for navigation tasks. This review delves into the technical evolution, current methodologies, datasets, and future directions of large model-driven embodied intelligent robot navigation, underscoring its potential to bridge the gap between theoretical research and practical applications.
Background: The Convergence of AI and Embodied Navigation
Artificial intelligence has undergone significant transformations since the 1950s, evolving from symbolic logic-based systems to data-driven approaches rooted in connectionism. Deep learning, with architectures like convolutional neural networks and recurrent neural networks, has achieved remarkable success in image processing and natural language tasks. For instance, models like DeepLab improved semantic segmentation accuracy, while YOLO enhanced real-time object detection. In natural language processing, the introduction of Transformer architectures in 2017, followed by models like BERT and GPT-3, enabled unprecedented language understanding and generation capabilities. Vision-language models such as CLIP and GLIP further bridged the gap between visual and textual data, facilitating zero-shot learning and cross-modal alignment. These advancements laid the groundwork for integrating AI into robotics, particularly in navigation, where embodied intelligence requires seamless interaction with the physical world.
Embodied intelligent navigation has progressed from fixed strategies to adaptive learning-based methods. Early approaches used deep reinforcement learning to enable robots to navigate unknown environments by interacting with their surroundings. For example, methods like SRL-ORCA combined reinforcement learning with dynamic obstacle avoidance, while other frameworks allowed robots to make decisions based on raw sensor data. The introduction of vision-language navigation tasks, such as those in the R2R dataset, marked a shift toward human-robot interaction, where robots follow natural language instructions in photorealistic indoor environments. Despite these innovations, reinforcement learning-based methods remained constrained by manually designed reward functions and limited generalization. The rise of large models has addressed these issues by embedding commonsense knowledge and multimodal reasoning into navigation systems, paving the way for more robust and versatile embodied robots.
Large Models in Navigation: Applications and Architectures
Large models, including large language models and multimodal large language models, are being increasingly applied to embodied intelligent navigation, enhancing robots’ ability to interpret instructions, perceive environments, and plan paths. These models empower embodied robots to handle tasks like zero-shot object navigation, where they locate objects described in natural language without prior training, and social navigation, where they predict human movements for safer interactions. For instance, systems like DRAGON assist visually impaired users by interpreting dialogue-based commands, while VLM-Social-Nav improves collision avoidance in dynamic settings. The core of these applications lies in leveraging the pretrained knowledge of large models, which allows embodied intelligence to transcend the limitations of task-specific training.
Current embodied intelligent navigation systems typically comprise four modules: perception, decision-making, action execution, and a local knowledge base. The perception module processes multimodal inputs, such as RGB images, depth maps, and language instructions, using encoders like CLIP or BLIP-2 to convert data into a unified format. The decision module, often powered by LLMs or MLLMs, generates navigation plans by integrating environmental cues and task goals. The action module translates these plans into physical movements, and the local knowledge base stores interaction outcomes for continuous learning. Based on architecture, these systems are categorized into two paradigms: end-to-end models and hierarchical models, each with distinct advantages for embodied robots.
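To make the division of labor concrete, the minimal Python sketch below wires the four modules together. The `MultimodalEncoder`, `LLMPlanner`, and `MotionController` interfaces are hypothetical placeholders for illustration, not the API of any particular system.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the four-module layout described above; the encoder,
# planner, and controller interfaces are illustrative placeholders.

@dataclass
class NavigationAgent:
    perception: "MultimodalEncoder"    # e.g., a CLIP- or BLIP-2-style encoder
    decision: "LLMPlanner"             # LLM/MLLM that turns observations into a plan
    action: "MotionController"         # converts plan steps into motion commands
    memory: list = field(default_factory=list)  # local knowledge base of past interactions

    def step(self, rgb, depth, instruction):
        # 1. Perception: fuse image, depth, and language into one representation.
        obs = self.perception.encode(rgb=rgb, depth=depth, text=instruction)
        # 2. Decision: plan the next sub-goal, conditioned on stored experience.
        plan = self.decision.plan(obs, history=self.memory)
        # 3. Action: execute the sub-goal and observe the outcome.
        outcome = self.action.execute(plan)
        # 4. Knowledge base: store the interaction for continual adaptation.
        self.memory.append({"obs": obs, "plan": plan, "outcome": outcome})
        return outcome
```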
End-to-End Embodied Navigation Models
End-to-end models directly map multimodal inputs to navigation actions, eliminating intermediate representations and reducing error accumulation. These models, often built on Transformer or graph neural network architectures, offer faster inference and lower deployment costs by processing raw data streams. A prominent example is PaLM-E, developed by Google, which integrates language, vision, and robot control into a single model. By encoding sensor data and text into a shared embedding space, PaLM-E enables embodied robots to perform tasks like navigation and manipulation based on natural language commands. However, such models require extensive training data and computational resources, limiting their practicality in diverse environments.
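The following PyTorch sketch illustrates the end-to-end idea in the spirit of PaLM-E: vision features and instruction tokens are projected into a shared token sequence, and a single network maps them directly to a discrete action. The dimensions, the four-action space, and the use of pre-extracted 512-dimensional image features are assumptions made for brevity, not details of PaLM-E itself.

```python
import torch
import torch.nn as nn

class EndToEndNavPolicy(nn.Module):
    """Toy end-to-end policy: (image features, instruction tokens) -> action logits."""

    def __init__(self, vocab_size=32000, d_model=256, n_actions=4):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.image_proj = nn.Linear(512, d_model)  # assumes 512-d image patch features
        encoder_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.action_head = nn.Linear(d_model, n_actions)  # forward / turn-left / turn-right / stop

    def forward(self, image_feats, instruction_ids):
        # image_feats: (B, N_patches, 512); instruction_ids: (B, L)
        tokens = torch.cat([self.image_proj(image_feats),
                            self.text_embed(instruction_ids)], dim=1)
        fused = self.backbone(tokens)
        # Pool the fused sequence and predict the next low-level action.
        return self.action_head(fused.mean(dim=1))

policy = EndToEndNavPolicy()
logits = policy(torch.randn(1, 16, 512), torch.randint(0, 32000, (1, 12)))
print(logits.shape)  # torch.Size([1, 4])
```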
Another notable end-to-end approach is NavGPT, which uses large language models to convert visual observations into textual descriptions and generate explicit navigation plans. This method enhances interpretability by producing human-readable reasoning traces, but it may overlook spatial details and suffer from memory loss over long trajectories. Its successor, NavGPT-2, addresses these issues by incorporating vision-language models and topological graphs, improving spatial understanding and dynamic path adjustment. Other models, like LVLM-OGN and VLFM, integrate semantic mapping and boundary-based exploration to boost efficiency in unknown environments. Despite their generalization benefits, end-to-end models often necessitate fine-tuning with large datasets, hindering real-world adoption for embodied intelligence applications.
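A hedged sketch of the NavGPT-style loop looks roughly as follows: each candidate view is verbalized, the LLM is prompted for the next move, and its reply doubles as a human-readable reasoning trace. Here `caption_view` and `query_llm` are hypothetical stand-ins for a captioning model and an LLM API, and the prompt format is illustrative rather than NavGPT's actual template.

```python
def choose_next_viewpoint(instruction, candidate_views, history, query_llm, caption_view):
    # Verbalize each navigable candidate so the LLM can reason over text.
    described = [f"({i}) {caption_view(v)}" for i, v in enumerate(candidate_views)]
    prompt = (
        "You are navigating a building.\n"
        f"Instruction: {instruction}\n"
        f"Steps taken so far: {'; '.join(history) or 'none'}\n"
        "Candidate directions:\n" + "\n".join(described) + "\n"
        "Reply with the index of the best candidate and a one-line reason."
    )
    reply = query_llm(prompt)              # e.g., "(2) the hallway matches 'go past the kitchen'"
    index = int(reply.split(")")[0].strip("( "))
    history.append(reply)                  # keep the reasoning trace as navigation memory
    return candidate_views[index]
```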
Hierarchical Embodied Navigation Models
Hierarchical models decompose navigation into modular layers, such as perception, planning, and execution, enhancing explainability and maintainability. These systems use standardized intermediate representations to facilitate information flow between modules, allowing for targeted updates and human oversight. Although they risk cascading errors and increased computational load, their structured approach ensures more stable outputs compared to end-to-end methods. Hierarchical frameworks can be grouped into three types based on how large models are utilized.
First, some methods employ vision-language models for perceptual alignment, converting multimodal inputs into consistent representations for downstream navigation policies. For example, ADAPT uses CLIP to align language instructions with visual cues through action prompts, while LM-Nav combines GPT-3, CLIP, and ViNG for outdoor navigation. PoSE leverages BLIP and CLIP-Seg to assess object presence and guide exploration, reducing perceptual noise. However, these approaches still rely on predefined rules for decision-making, constraining the autonomy of embodied robots.
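As a rough illustration of such perceptual alignment, the snippet below uses the Hugging Face CLIP interface to score candidate views against an instruction. It is a minimal sketch of the general idea, not the specific pipelines of ADAPT, LM-Nav, or PoSE.

```python
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def rank_views_against_instruction(instruction, view_images):
    # view_images: list of PIL images, one per candidate direction.
    inputs = processor(text=[instruction], images=view_images,
                       return_tensors="pt", padding=True)
    outputs = model(**inputs)
    # logits_per_image gives the similarity of each view to the instruction text;
    # a softmax over views yields a rough "which way matches the language" score.
    return outputs.logits_per_image.squeeze(-1).softmax(dim=0)
```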
Second, other frameworks leverage large language models for high-level reasoning, transforming navigation into text-based planning tasks. L3MVN uses LLMs to infer scene priors for visual target navigation, and SayNav constructs 3D scene graphs to generate multi-step exploration plans. E2BA incorporates semantic maps and boundary points, with LLMs selecting optimal paths to minimize redundant actions. While these methods excel in complex task decomposition, their performance depends heavily on prompt engineering and may produce unstable outputs.
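A minimal sketch of this style of LLM-guided frontier selection is shown below; the textual map summary and the `query_llm` helper are assumptions for illustration, not the prompts used by L3MVN, SayNav, or E2BA.

```python
def select_frontier(goal_object, frontiers, query_llm):
    # frontiers: list of dicts like {"id": 0, "nearby_objects": ["sofa", "tv"], "distance_m": 3.2},
    # i.e., a text summary of the semantic map around each boundary point.
    lines = [
        f"Frontier {f['id']}: near {', '.join(f['nearby_objects'])}, {f['distance_m']:.1f} m away"
        for f in frontiers
    ]
    prompt = (
        f"Target object: {goal_object}\n" + "\n".join(lines) +
        "\nWhich frontier is most likely to lead to the target? Answer with its number only."
    )
    answer = query_llm(prompt)
    return next(f for f in frontiers if f["id"] == int(answer.strip()))
```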
Third, hybrid models integrate both VLM and LLM capabilities, assigning distinct roles to each module. ESC applies GLIP for scene understanding and LLMs for commonsense reasoning, using probabilistic soft logic to guide exploration. Lang2LTL and its enhanced version, Lang2LTL-2, translate natural language commands into linear temporal logic specifications by combining GPT-4V for image description and CLIP for semantic alignment. These systems achieve robust multimodal reasoning but face challenges in computational efficiency and real-time response due to multiple model interactions.
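To illustrate the language-to-specification step in such hybrid pipelines, the sketch below uses a few-shot prompt to translate a command into a linear temporal logic formula. The example formulas, proposition names, and `query_llm` interface are illustrative and should not be read as Lang2LTL's actual grammar or prompts.

```python
# F = eventually, G = always, & = and, ! = not (standard LTL operators).
FEW_SHOT = """\
Command: go to the kitchen, then the bedroom
LTL: F (kitchen & F bedroom)
Command: always avoid the stairs while heading to the office
LTL: F office & G !stairs
"""

def command_to_ltl(command, query_llm):
    prompt = FEW_SHOT + f"Command: {command}\nLTL:"
    # The LLM completes the pattern established by the few-shot examples.
    return query_llm(prompt).strip()   # e.g., "F (lab & F lounge)"
```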
| Model Type | Key Features | Advantages | Disadvantages |
|---|---|---|---|
| End-to-End Models | Direct mapping from inputs to actions; integrated perception-decision-execution | Reduced error accumulation; faster inference; better generalization | High training data requirements; poor interpretability; unstable outputs |
| Hierarchical Models | Modular architecture; separate perception, planning, and execution layers | Enhanced explainability; stable outputs; easier maintenance | Cascading errors; increased computational cost; slower real-time performance |
Datasets and Evaluation Metrics for Embodied Navigation
Training and evaluating embodied intelligent navigation systems require diverse datasets that simulate real-world scenarios. These datasets support the development of generalization capabilities and task-specific performance for embodied robots. MatterPort3D (MP3D) provides 3D indoor scans with RGB images, depth data, and semantic annotations, ideal for indoor navigation research. TOUCHDOWN, built from Google Street View, offers panoramic images for outdoor navigation and spatial reasoning. Room-to-Room (R2R) focuses on vision-language navigation, using natural language instructions in indoor environments, while CVDN emphasizes human-robot dialogue with long conversational sequences. REVERIE introduces complex tasks involving object retrieval, testing memory and understanding in embodied intelligence. ProcTHOR generates customizable residential scenes for large-scale training, and X-Embodiment aggregates over 1 million robot trajectories across 527 skills, enabling cross-platform policy learning. These datasets drive progress in embodied robots by providing rich, multimodal training resources.
Evaluation metrics for embodied navigation assess both traditional performance and instruction adherence. Key indicators include Success Rate (SR), which measures the proportion of tasks completed successfully; Path Length (PL), the actual trajectory distance; and Navigation Error (NE), the distance from the stopping point to the goal. Success Weighted by Path Length (SPL) combines success rate and path efficiency, offering a comprehensive measure of navigation quality. These metrics ensure that embodied intelligence systems balance efficiency, accuracy, and compliance with human instructions.
| Metric | Definition | Significance |
|---|---|---|
| Path Length (PL) | Total trajectory length from start to end | Reflects the cost of the route taken; shorter paths indicate more efficient navigation |
| Navigation Error (NE) | Distance between stop point and goal | Measures how precisely the agent reaches the target; lower is better |
| Success Rate (SR) | Probability of stopping within a threshold of the goal | Measures task completion; higher is better |
| Success Weighted by Path Length (SPL) | Combines success rate and path optimality | Rewards agents that succeed via near-optimal paths, giving a comprehensive measure of navigation quality |
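The sketch below computes these metrics over a batch of episodes using the standard SPL definition (success weighted by the ratio of the shortest-path length to the longer of the path taken and the shortest path). The 3 m success threshold is a common convention in indoor benchmarks but is only an assumed default here.

```python
def navigation_metrics(episodes, success_threshold=3.0):
    # Each episode: {"path_length": float, "shortest_path": float, "final_dist_to_goal": float}
    sr_terms, spl_terms, ne_terms, pl_terms = [], [], [], []
    for ep in episodes:
        success = 1.0 if ep["final_dist_to_goal"] <= success_threshold else 0.0
        sr_terms.append(success)
        # SPL_i = S_i * l_i / max(p_i, l_i), with l_i the shortest path and p_i the path taken.
        spl_terms.append(success * ep["shortest_path"] /
                         max(ep["path_length"], ep["shortest_path"]))
        ne_terms.append(ep["final_dist_to_goal"])
        pl_terms.append(ep["path_length"])
    n = len(episodes)
    return {
        "SR": sum(sr_terms) / n,
        "SPL": sum(spl_terms) / n,
        "NE": sum(ne_terms) / n,
        "PL": sum(pl_terms) / n,
    }
```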
Conclusion and Future Directions
Large model-driven embodied intelligent robot navigation represents a paradigm shift from specialized algorithms to general-purpose intelligence, leveraging multimodal reasoning and natural language understanding to overcome traditional limitations. By integrating large models, embodied robots achieve unprecedented adaptability in dynamic environments, comprehend ambiguous instructions, and perform long-horizon tasks with minimal training. However, several challenges remain, including the high computational cost of model fine-tuning, real-time performance issues in hierarchical systems, and generalization gaps in unseen scenarios. Future research should focus on lightweight training strategies, such as knowledge distillation, to reduce deployment barriers. Meta-reinforcement learning could enhance environmental adaptation, while optimized hierarchical architectures may improve response times through efficient data transmission. As these technologies mature, embodied intelligence will unlock new applications in healthcare, logistics, and smart cities, ultimately fostering more intuitive and reliable human-robot collaboration.
The evolution of embodied robots underscores the transformative potential of large models in robotics. From end-to-end systems that streamline decision-making to hierarchical frameworks that ensure controllability, these advances are paving the way for autonomous agents that navigate, reason, and interact with human-like proficiency. As datasets grow and evaluation metrics refine, the synergy between embodied intelligence and large models will continue to drive innovation, making robots more capable partners in our daily lives.