Embodied Large Model for Home Service Robot Task Planning

In recent years, the integration of large language models (LLMs) into embodied robotics has opened new avenues for developing intelligent home service robots capable of understanding and executing complex human instructions. However, a significant challenge remains: LLMs often generate task plans that are not executable in real-world environments due to a lack of alignment with physical scene information. This misalignment, often referred to as the “hallucination” problem, leads to plans that involve objects not present in the deployment scene or actions that violate physical constraints. To address this, we propose the TaPA framework, an embodied large model designed for home service robot task planning. TaPA effectively bridges the gap between LLMs and real-world scenes by synthesizing a multimodal instruction-tuning dataset and leveraging scene perception, enabling the generation of feasible and context-aware action sequences. Our approach not only enhances the practicality of embodied robot systems but also advances the field of embodied intelligence by ensuring that task plans are grounded in physical reality.

The core of our work lies in the development of a multimodal dataset that combines visual scene information, human instructions, and corresponding action plans. This dataset is synthesized using advanced foundation models, such as GPT-3.5, to generate diverse and realistic task scenarios. Each data sample is a triple $X = (X_v, X_q, X_a)$, where $X_v$ represents the visual scene information (e.g., object lists from RGB images), $X_q$ is the human instruction, and $X_a$ is the step-by-step action plan. By fine-tuning pre-trained LLMs like LLaMA-7B on this dataset, we empower the embodied robot with the ability to generate executable plans that consider the actual objects available in the environment. During inference, the embodied robot collects multi-view RGB images by navigating through accessible areas, and an open-vocabulary object detector identifies all present objects, providing a comprehensive scene representation. This process ensures that the generated plans are not only logically sound but also physically feasible, addressing the critical issue of object hallucination in traditional LLM-based planners.
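To make the data layout concrete, each triple can be represented as a simple record. The following Python sketch uses field names of our own choosing; they mirror the notation above rather than any released TaPA data format:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class PlanningSample:
    """One multimodal training triple X = (X_v, X_q, X_a).

    Field names are illustrative and follow the notation in the text,
    not an official TaPA schema.
    """
    scene_objects: List[str]   # X_v: object list perceived from multi-view RGB images
    instruction: str           # X_q: natural-language human instruction
    action_plan: List[str]     # X_a: ordered, step-by-step action sequence

# Example in the spirit of Table 1 (contents are illustrative):
sample = PlanningSample(
    scene_objects=["Pot", "Plate", "Bread", "Sink", "Tomato"],
    instruction="Can you make me a sandwich?",
    action_plan=["Step 1: Grasp a plate", "Step 2: Grasp the knife"],
)
```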

In the following sections, we delve into the related work, methodology, experimental evaluation, and conclusions. We begin by discussing the advancements in pre-trained large models and embodied task planning, highlighting the limitations of existing approaches. Subsequently, we detail the TaPA framework, including the synthetic dataset creation and the scene information perception pipeline. We then present extensive experimental results that demonstrate the superiority of TaPA over state-of-the-art models, followed by ablation studies that explore different scene perception strategies. Finally, we summarize our contributions and suggest directions for future research in embodied robotics.

Related Work

The field of embodied intelligence has seen rapid progress, driven by advancements in pre-trained large models and robotic systems. Pre-trained large models, such as LLMs and vision-language models (VLMs), have demonstrated remarkable capabilities in natural language understanding, image recognition, and multimodal reasoning. Models like GPT-3, LLaMA, and CLIP have been widely adopted for tasks ranging from visual question answering to tool usage, owing to their extensive knowledge base and generalization abilities. However, when applied to embodied robotics, these models often fall short due to their inability to perceive and adapt to dynamic physical environments. For instance, while LLMs can generate plausible task plans based on textual instructions, they frequently overlook scene-specific constraints, leading to unexecutable actions. This limitation underscores the need for embodied models that integrate visual perception with linguistic reasoning.

In embodied task planning, researchers have explored various methods to ground LLMs in physical worlds. Early approaches relied on heuristic search algorithms or predefined templates in simulated environments like ALFRED or VirtualHome. While these methods achieve some success in constrained settings, they struggle with the complexity and diversity of real household scenarios. More recent works, such as LLM-Planner, incorporate scene object information through visual perception to generate feasible plans. However, these methods are often limited to simple tasks and fail to scale to the intricate demands of home service robots. Our TaPA framework builds upon these ideas by introducing a scalable data synthesis process and a robust scene perception mechanism, enabling the embodied robot to handle a wide range of tasks in varied environments.

Another key area is multimodal instruction tuning, where models like LLaVA and MiniGPT-4 have shown promise in combining visual and linguistic data for tasks like dialogue and detailed image description. These models typically use single-image inputs, which are insufficient for capturing entire scene contexts in embodied robotics. TaPA addresses this by employing multi-view image collection and open-vocabulary detection, ensuring a holistic scene representation. Furthermore, our framework emphasizes the alignment of scene information with task plans, reducing instances of hallucination and counterfactual actions. By leveraging synthetic data and fine-tuning, TaPA enhances the embodied robot’s ability to generate context-aware plans, marking a significant step toward practical home service applications.

Methodology

The TaPA framework is designed to enable embodied robots to generate executable task plans by aligning large language models with physical scene information. This section details the two core components: the synthesis of the multimodal embodied planning dataset and the embodied task planning process that integrates scene perception. We begin by describing the dataset creation, followed by the planning mechanism that uses visual inputs to guide plan generation.

Embodied Planning Dataset Synthesis

To train the embodied robot for task planning, we construct a multimodal dataset comprising triples of scene information, human instructions, and action plans. This dataset is synthesized using the AI2-THOR simulator, which provides realistic 3D home environments. Each scene $N_s$ is characterized by a list of object categories $X_l$ present in the environment. Instead of relying on handcrafted task templates, as in prior works like ALFRED, we utilize the reasoning capabilities of foundation models like GPT-3.5 to generate diverse instructions $X_q$ and corresponding action plans $X_a$. The synthesis process involves prompt engineering that simulates a dialogue between a service robot and a human, ensuring that the generated instructions are executable and the actions are constrained to the objects in $X_l$. This approach mitigates object hallucination by grounding the plans in the actual scene context.

The dataset is structured as a set of triples $X = (X_v, X_q, X_a)$, where $X_v$ includes visual observations such as RGB images and object lists. For training, we use ground-truth object lists to avoid inaccuracies from visual perception, while during inference, predicted object lists from open-vocabulary detectors are employed. We partition the AI2-THOR scenes into 80 for training and 20 for evaluation, and through data augmentation techniques, we expand the training set to 6,400 scenes by modifying object lists with plausible replacements from room-type-specific sets. This results in a training dataset of 15,000 samples and an evaluation set of 60 triples, ensuring diversity in task complexity and scene layout. The synthetic data generation can be formalized as follows: given a scene $N_s$ with object list $X_l$, we generate an instruction $X_q$ and plan $X_a$ through a prompt-based function $G$:

$$X_q, X_a = G(X_l, P_{prompt})$$

where $P_{prompt}$ is the designed prompt that includes context examples and constraints. This process ensures that the embodied robot learns to generate plans that are both semantically correct and physically feasible.
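As an illustration, the generator $G$ can be sketched as a single prompt call to a chat-capable LLM. The snippet below assumes the OpenAI Python SDK and the `gpt-3.5-turbo` endpoint; the prompt wording and output parsing are our own simplifications, not the exact prompt used in our data synthesis:

```python
from openai import OpenAI  # assumes the OpenAI Python SDK; any chat-capable LLM client works

client = OpenAI()

PROMPT_TEMPLATE = """You are a home service robot talking with a human.
The room contains ONLY these objects: {objects}.
Propose one task the human might request, then give a numbered,
step-by-step plan that uses only the listed objects.
Format:
Instruction: <task>
Plan: Step 1 ... Step 2 ..."""

def generate_triple(object_list):
    """Sketch of G(X_l, P_prompt): turn a ground-truth object list into (X_q, X_a)."""
    prompt = PROMPT_TEMPLATE.format(objects=", ".join(object_list))
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.9,  # encourage diverse instructions across scenes
    )
    text = response.choices[0].message.content
    instruction, _, plan = text.partition("Plan:")
    return instruction.replace("Instruction:", "").strip(), plan.strip()
```

Constraining the prompt to the object list $X_l$ is what keeps the synthesized plans grounded in the scene and limits object hallucination at the data level.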

Table 1: Examples of Synthetic Data Triples

| Scene Information (Object List) | Human Instruction | Action Plan |
|---|---|---|
| [Cup, Sink, Clean Bottle, …] | Can you clean the sink and the toilet? | Step 1: Grasp a sponge; Step 2: Move to the sink; … |
| [Pot, Plate, Bread, Sink, Tomato, …] | Can you make me a sandwich? | Step 1: Grasp a plate; Step 2: Grasp the knife; … |
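The object-list augmentation described above (replacing objects with plausible alternatives drawn from room-type-specific sets) admits a straightforward implementation. In the sketch below, the replacement pools and swap ratio are illustrative placeholders, not the sets used to build the actual training data:

```python
import random
from typing import Dict, List

# Illustrative room-type-specific replacement pools (not the actual sets used in training).
REPLACEMENTS: Dict[str, List[str]] = {
    "kitchen": ["Kettle", "Toaster", "Mug", "Pan", "Spatula"],
    "bathroom": ["Towel", "Soap Bar", "Toothbrush", "Plunger"],
}

def augment_object_list(objects: List[str], room_type: str,
                        swap_ratio: float = 0.3, seed: int = 0) -> List[str]:
    """Create a plausible new scene by swapping a fraction of the objects
    with alternatives from the same room type."""
    rng = random.Random(seed)
    pool = [o for o in REPLACEMENTS.get(room_type, []) if o not in objects]
    augmented = list(objects)
    n_swaps = min(len(pool), len(augmented), max(1, int(swap_ratio * len(objects))))
    for idx in rng.sample(range(len(augmented)), n_swaps):
        augmented[idx] = pool.pop(rng.randrange(len(pool)))
    return augmented
```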

Embodied Task Planning

During inference, the embodied robot must perceive the deployment scene to generate executable plans. This involves collecting RGB images from various accessible positions and using an open-vocabulary object detector to compile a list of present objects. The scene perception strategy $S$ is defined as a set of positions and camera orientations:

$$S = \{(x, y, \theta) \mid (x, y) \in L(\lambda, A), \theta = k\theta_0\}$$

where $(x, y)$ are coordinates in the reachable area $A$, $L(\lambda, A)$ is a position selection criterion with hyperparameters $\lambda$, and $\theta_0$ is the unit camera rotation angle with $k$ as an integer. We explore several position selection strategies, including traversal points, random points, overall center point, and partitioned center points. The partitioned center point strategy, which uses K-means clustering to divide the scene into subregions and selects centroids, proves most effective by balancing coverage and computational cost. For each collected image $I_i$, the object detector $D$ identifies objects, and the union of detections across all images is processed to remove duplicates, yielding the predicted object list $X_l$:

$$X_l = R_d \left( \bigcup_i D(I_i) \right)$$

where $R_d$ is the duplicate removal operation. This object list, along with the human instruction $X_q$, is then input to the fine-tuned LLM to generate the action plan $X_a$:

$$X_a = T_a(P_{in}, X_l, X_q)$$

Here, $T_a$ represents the task planning function of the LLM, and $P_{in}$ is the input prompt that guides the model to produce feasible steps. By integrating scene perception with LLM reasoning, TaPA ensures that the embodied robot generates plans that are aligned with the physical environment, thereby enhancing executability.
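The inference pipeline above can be summarized in a compact sketch. Here `capture_rgb`, `detect_objects`, and `planner_llm` are placeholders for the robot (or simulator) camera, an open-vocabulary detector, and the fine-tuned LLaMA planner; scikit-learn's K-means stands in for the partitioned center point strategy:

```python
import numpy as np
from sklearn.cluster import KMeans  # used for the partitioned center point strategy

def partitioned_centers(reachable_xy: np.ndarray, n_regions: int = 8) -> np.ndarray:
    """Cluster the reachable area (an N x 2 array of (x, y) positions) with K-means
    and place the camera at each cluster centroid."""
    km = KMeans(n_clusters=n_regions, n_init=10, random_state=0).fit(reachable_xy)
    return km.cluster_centers_

def perceive_scene(centers, capture_rgb, detect_objects, theta_0=60):
    """Build the predicted object list X_l = R_d( union_i D(I_i) ).

    `capture_rgb(x, y, theta)` and `detect_objects(image)` are placeholders for the
    camera interface and an open-vocabulary detector, respectively.
    """
    detections = set()  # the set union doubles as the duplicate-removal step R_d
    for x, y in centers:
        for k in range(360 // theta_0):            # rotate the camera by theta_0 each step
            image = capture_rgb(x, y, k * theta_0)
            detections.update(detect_objects(image))
    return sorted(detections)

def plan(planner_llm, object_list, instruction):
    """X_a = T_a(P_in, X_l, X_q): prompt the fine-tuned LLM with scene and instruction."""
    p_in = (f"Objects in the room: {', '.join(object_list)}.\n"
            f"Instruction: {instruction}\n"
            f"Give a step-by-step plan using only these objects.")
    return planner_llm(p_in)
```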

Table 2: Position Selection Strategies for Scene Perception

| Strategy | Hyperparameters | Number of Images | Average Success Rate (%) |
|---|---|---|---|
| Traversal | Grid size G=0.75m, D=120° | 40.4 | 44.78 |
| Random | N=1%, D=120° | 3.0 | 47.95 |
| Overall Center | G=0.75m, D=60° | 6.0 | 47.16 |
| Partitioned Center | G=0.75m, D=60° | 23.1 | 61.11 |

Experimental Evaluation

We conduct extensive experiments to evaluate the performance of the TaPA framework in generating executable task plans for embodied robots. The experiments are carried out in the AI2-THOR simulator, using the synthesized dataset for fine-tuning and a separate set for testing. We compare TaPA with state-of-the-art models, including GPT-3.5, LLaVA, and LLaMA, and assess the impact of different scene perception strategies. The evaluation focuses on the success rate of generated plans, with failure cases categorized into hallucination (interacting with non-existent objects) and counterfactual (violating physical rules).

Evaluation Methodology

To measure the success rate, we engage 30 volunteers who are researchers in large multimodal models. Each generated action plan is evaluated by three volunteers, who determine whether the plan is executable given the ground-truth object list and instruction. A plan is deemed successful if at least two volunteers approve it. Failures are annotated as either hallucination or counterfactual, providing insights into the model’s limitations. The evaluation metric is the percentage of successful plans across different room types: kitchen, living room, bedroom, and bathroom. This human evaluation approach ensures a realistic assessment of the embodied robot’s task planning capabilities.
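Since each plan receives exactly three votes, the success criterion reduces to a simple majority count. A minimal sketch, with a data layout of our own choosing:

```python
from typing import Dict, List

def plan_success(votes: List[bool]) -> bool:
    """A plan counts as successful if at least two of its three annotators approve it."""
    return sum(votes) >= 2

def success_rate_by_room(annotations: Dict[str, List[List[bool]]]) -> Dict[str, float]:
    """`annotations` maps a room type to a list of per-plan vote triples.

    Returns the percentage of successful plans per room type.
    """
    return {
        room: 100.0 * sum(plan_success(v) for v in votes) / len(votes)
        for room, votes in annotations.items()
    }

# Illustrative usage (vote values are made up):
rates = success_rate_by_room({
    "kitchen": [[True, True, False], [False, False, True]],
    "bathroom": [[True, True, True]],
})
# -> {"kitchen": 50.0, "bathroom": 100.0}
```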

Results and Comparison

Table 3 presents the success rates of TaPA and baseline models on the embodied task planning benchmark. TaPA achieves an average success rate of 61.11%, outperforming GPT-3.5 (54.73%), GPT-4o (60.61%), and other models. The improvement is consistent across all room types, with the highest performance in living rooms (84.21%) and the lowest in kitchens (28.57%), which we attribute to the complexity of cooking tasks. LLaVA shows the lowest success rate among the multimodal baselines (22.43%), highlighting the inadequacy of single-image inputs for scene representation. LLaMA, without fine-tuning, performs poorly (5.96%), emphasizing the importance of instruction tuning for embodied robotics. These results demonstrate that TaPA effectively reduces hallucination and counterfactual cases by aligning plans with scene information.

Table 3: Success Rates of Different Models on Embodied Task Planning

| Model | Kitchen (%) | Living Room (%) | Bedroom (%) | Bathroom (%) | Average (%) |
|---|---|---|---|---|---|
| LLaVA | 14.29 | 42.11 | 33.33 | 0.00 | 22.43 |
| GPT-3.5 | 28.57 | 73.68 | 66.67 | 50.00 | 54.73 |
| GPT-4o | 35.71 | 77.32 | 73.68 | 55.74 | 60.61 |
| DeepSeek | 14.53 | 63.15 | 34.85 | 63.72 | 44.06 |
| Qwen2.5 | 30.65 | 50.43 | 70.23 | 66.31 | 54.41 |
| LLaMA | 0.00 | 10.52 | 13.33 | 0.00 | 5.96 |
| TaPA | 28.57 | 84.21 | 73.33 | 58.33 | 61.11 |

Figure 1 illustrates the percentage of failure cases for each model. TaPA has the lowest rate of counterfactual cases (13.3%) and reduces hallucination by 26.7% compared to LLaVA and 5.0% compared to GPT-3.5. This indicates that the synthetic dataset and scene perception mechanism effectively ground the LLM in physical reality, making TaPA a robust solution for embodied robot task planning.

Ablation Studies

We further investigate the impact of scene perception strategies on planning success. Table 2 compares different position selection methods, showing that the partitioned center point strategy achieves the highest success rate (61.11%) with a moderate number of images (23.1). In contrast, traversal strategies with fine grid sizes collect excessive images (e.g., 782.4 for G=0.25m, D=60°) but do not improve performance, due to increased noise from object detection. Random sampling and overall center point strategies yield similar results, but partitioned centers provide better coverage by leveraging room layout priors. This ablation study underscores the importance of efficient scene perception for embodied robots, as it balances information completeness and computational efficiency.

Additionally, we examine the effect of training parameters on TaPA’s performance. Table 4 shows that larger batch sizes and more iterations lead to higher success rates, with the best configuration (batch size 16, 400k iterations) achieving 52.44%. This highlights the need for substantial computational resources for fine-tuning embodied models, ensuring that the LLM acquires the necessary expertise for task planning.

Table 4: Impact of Training Parameters on Success Rate

| Batch Size | Max Iterations | Weight Decay | Success Rate (%) |
|---|---|---|---|
| 1 | 100k | 0.01 | 23.71 |
| 1 | 400k | 0.01 | 33.29 |
| 1 | 400k | 0.02 | 34.13 |
| 8 | 100k | 0.01 | 35.41 |
| 8 | 400k | 0.01 | 50.00 |
| 8 | 400k | 0.02 | 48.87 |
| 16 | 100k | 0.02 | 48.32 |
| 16 | 400k | 0.01 | 52.44 |

Conclusion

In this work, we introduced TaPA, an embodied large model framework for home service robot task planning that addresses the critical challenge of aligning LLMs with physical scene information. By synthesizing a multimodal dataset and integrating scene perception through open-vocabulary detection, TaPA enables embodied robots to generate executable action plans that respect environmental constraints. Our experimental results demonstrate that TaPA outperforms state-of-the-art models in success rate, with significant reductions in hallucination and counterfactual cases. The ablation studies further validate the effectiveness of the partitioned center point strategy for scene perception and the importance of extensive fine-tuning for embodied intelligence.

Future work will focus on extending TaPA to more complex environments and real-world deployments, incorporating dynamic object interactions and long-horizon planning. We also plan to explore the integration of reinforcement learning for adaptive plan execution, enhancing the robustness of embodied robots in unpredictable scenarios. Ultimately, TaPA represents a step forward in bridging the gap between virtual reasoning and physical embodiment, paving the way for more intelligent and reliable home service robots.
