In the rapidly evolving field of artificial intelligence, embodied intelligence represents a pivotal pathway toward creating agents that can perceive, interact, and accomplish tasks in the physical world. Central to this endeavor are Vision-Language-Action (VLA) models, which integrate multimodal perception and understanding with physical execution. These models serve as the backbone for embodied robots, enabling them to translate visual and linguistic inputs into precise actions. However, the core challenge lies in action representation and generation strategies—the critical bridge between perception and execution—which must navigate high-dimensional continuous spaces, diverse action modalities, and the demands of real-time control. This review systematically explores the evolution, technologies, and future directions of action representation and generation in VLA models, highlighting their role in advancing embodied intelligence.

The development of VLA models marks a significant leap in embodied intelligence, moving away from fragmented pipelines where vision, language, and action systems operated in isolation. Early models like RT-1 pioneered the use of Transformer architectures to discretize robot actions into tokens, treating control as a sequence generation task. This approach laid the groundwork for leveraging large-scale pre-trained models in robotics. Subsequent advancements, such as RT-2, demonstrated that internet-scale knowledge from vision-language models could be transferred to robotic control, enabling capabilities like visual chain-of-thought reasoning. More recently, models like Octo and OpenVLA have emphasized openness and efficiency, integrating diverse robot datasets and employing generative strategies like diffusion for improved generalization. The latest trends, seen in systems like GR00T N1, involve hybrid architectures that combine high-level planning with low-level control, pushing the boundaries of what embodied robots can achieve in complex environments.
The remainder of this review is organized around four themes:
- Action Representation in VLA Models
- Generation Strategies for Action in VLA Models
- Model Evaluation and Performance Insights
- Challenges and Future Opportunities in Embodied Intelligence
Action representation is a cornerstone of VLA models, defining how abstract understanding is converted into physical commands for embodied robots. It addresses the complexities of high-dimensional, continuous action spaces and the multiplicity of valid solutions for tasks. Research has coalesced around two primary approaches: discrete and continuous action representation, each with distinct implications for embodied intelligence.
Discrete action representation involves quantizing continuous robot actions—such as joint angles or end-effector poses—into a finite set of tokens, akin to words in a language model. This method allows VLA models to leverage powerful sequence models like Transformers for action prediction. For instance, RT-1 discretized action dimensions into 256 bins, enabling a Transformer to handle diverse kitchen tasks. Gato extended this by tokenizing all inputs—images, text, and actions—into a unified sequence, showcasing generalizability across tasks. RT-2 further advanced this by fine-tuning vision-language models on robotic data, facilitating the transfer of web knowledge to physical control. However, discrete representation suffers from precision loss due to quantization, which can be critical for tasks requiring millimeter-level accuracy, such as assembly. This limitation has spurred interest in continuous representation methods for embodied intelligence.
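As a concrete illustration of this kind of binning (a minimal sketch rather than RT-1's actual implementation; the normalization range and 7-D action layout are assumptions), continuous actions can be mapped to integer tokens and back, with the quantization error bounding the achievable precision:

```python
import numpy as np

# Assumed example: a 7-D action (xyz delta, rpy delta, gripper), each dimension
# normalized to [-1, 1] and quantized into 256 bins, in the spirit of
# RT-1-style discrete action tokens.
NUM_BINS = 256
LOW, HIGH = -1.0, 1.0

def tokenize_action(action: np.ndarray) -> np.ndarray:
    """Map each continuous dimension to an integer token in [0, NUM_BINS - 1]."""
    clipped = np.clip(action, LOW, HIGH)
    normalized = (clipped - LOW) / (HIGH - LOW)           # -> [0, 1]
    return np.minimum((normalized * NUM_BINS).astype(int), NUM_BINS - 1)

def detokenize_action(tokens: np.ndarray) -> np.ndarray:
    """Map tokens back to bin centers; the rounding here is the precision loss."""
    centers = (tokens + 0.5) / NUM_BINS                   # -> (0, 1)
    return centers * (HIGH - LOW) + LOW

action = np.array([0.12, -0.40, 0.83, 0.0, 0.05, -0.99, 1.0])
tokens = tokenize_action(action)
recovered = detokenize_action(tokens)
print(tokens, np.abs(action - recovered).max())  # max error is about half a bin width
```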
Continuous action representation, in contrast, models actions as probability distributions in a continuous space, capturing the multimodality of valid trajectories. This approach avoids the averaging effect of regression models, which can lead to mode collapse. Models like ACT employ conditional variational autoencoders (CVAE) to learn latent spaces that encompass diverse action possibilities, enabling high-precision manipulation. Diffusion-based strategies, as seen in Diffusion Policy, reframe action generation as an iterative denoising process, producing smooth and varied trajectories. Octo and later models like π0 use diffusion or flow matching to generate actions that adapt to different robot morphologies and tasks. Continuous representation excels in handling fine-grained operations but often incurs higher computational costs, posing challenges for real-time control in embodied robots.
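A minimal sketch of the CVAE training objective behind models such as ACT can make this concrete; the tensors below are stand-ins for encoder/decoder outputs, and the L1 reconstruction term and KL weight are illustrative choices rather than any specific model's exact configuration:

```python
import torch
import torch.nn.functional as F

def cvae_elbo_loss(pred_actions, expert_actions, mu, logvar, kl_weight=1.0):
    """Negative ELBO for a conditional VAE over action chunks.

    pred_actions / expert_actions: (batch, chunk_len, action_dim)
    mu, logvar: (batch, latent_dim) parameters of the approximate posterior
    kl_weight: assumed hyperparameter balancing reconstruction vs. regularization
    """
    # Reconstruction term: decoded actions should match the expert demonstration.
    recon = F.l1_loss(pred_actions, expert_actions)
    # KL regularization toward a standard normal prior over the latent "style" variable.
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl_weight * kl

# Toy usage with random tensors standing in for model outputs.
b, t, d, z = 8, 16, 7, 32
loss = cvae_elbo_loss(torch.randn(b, t, d), torch.randn(b, t, d),
                      torch.randn(b, z), torch.randn(b, z))
print(loss.item())
```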
Generation strategies are the decision-making engines of VLA models, determining how multimodal inputs map to action sequences. These strategies involve trade-offs between precision, diversity, efficiency, and robustness, which are crucial for the effective deployment of embodied intelligence in dynamic environments.
Autoregressive generation strategies produce action sequences step-by-step, with each token dependent on previous outputs. This method, rooted in Transformer decoders, treats robot control as a sequence prediction problem. For example, VIMA uses an autoregressive Transformer to generate actions based on object-centric visual and language tokens, while ChatVLA integrates planning and dialogue within a unified model. Autoregressive approaches benefit from the scalability of sequence models but face speed limitations due to serial decoding, often operating at low frequencies (e.g., 3-5 Hz) that fall short of the 100 Hz required for real-time embodied robot control.
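To make the serial-decoding bottleneck concrete, the sketch below shows a generic autoregressive action-token loop; the `policy_step` interface is hypothetical (not VIMA's or ChatVLA's API), but it illustrates why latency grows with the number of action tokens:

```python
import numpy as np

def decode_action_tokens(policy_step, context, horizon, action_dim, num_bins=256):
    """Generic autoregressive decoding: one forward pass per action token.

    policy_step: hypothetical callable (context, prev_tokens) -> logits over bins
    context: encoded vision and language features
    Returns horizon * action_dim integer tokens, produced strictly serially.
    """
    tokens = []
    for _ in range(horizon * action_dim):
        logits = policy_step(context, tokens)          # full forward pass per token
        tokens.append(int(np.argmax(logits)))          # greedy choice; sampling also common
    return np.array(tokens).reshape(horizon, action_dim)

# Toy stand-in for a trained model: random logits.
rng = np.random.default_rng(0)
fake_policy = lambda ctx, prev: rng.normal(size=256)
actions = decode_action_tokens(fake_policy, context=None, horizon=8, action_dim=7)
print(actions.shape)  # (8, 7): 56 sequential forward passes for a single action chunk
```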
Non-autoregressive generation strategies break serial dependencies to accelerate inference, addressing the latency issues of autoregressive methods. These include probabilistic, iterative, and flow-based approaches:
- Probability-based strategies model action distributions with conditional variational autoencoders (CVAEs). ACT exemplifies this by employing a CVAE-Transformer to predict chunks of continuous actions, enhancing diversity and precision for tasks like bimanual manipulation. Training maximizes the evidence lower bound (ELBO), combining a reconstruction loss with a KL regularization term, ensuring actions align with expert demonstrations while maintaining generalization.
- Iterative strategies using diffusion models generate actions via a denoising process, starting from noise and refining it over iterations. Diffusion Policy pioneered this by training a noise prediction network to iteratively produce smooth trajectories. Models like MDT and RDT-1B adopt Diffusion Transformer (DiT) architectures for better scalability, while CogACT applies diffusion to industrial tasks, achieving high success rates. Despite superior action quality, diffusion strategies are computationally intensive, limiting their use in high-frequency scenarios for embodied intelligence.
- Flow-based strategies using flow matching offer an efficient alternative by learning vector fields that transform noise into actions through ordinary differential equations (see the sampling sketch after this list). π0 utilizes flow matching within a VLA framework, enabling direct generation of continuous control signals. GraspVLA combines flow matching with progressive action generation for zero-shot generalization in grasping tasks. This approach promises faster training and inference but is still emerging in the context of embodied robots.
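A minimal sampling sketch for the flow-matching case, assuming a trained velocity field is available as a callable (the `velocity_field` interface and step count are illustrative assumptions, not any published model's configuration); diffusion-style sampling follows the same iterative-refinement pattern, typically with more steps:

```python
import numpy as np

def sample_actions_flow(velocity_field, cond, action_dim, horizon, steps=10, seed=0):
    """Integrate a learned velocity field from noise to an action chunk.

    velocity_field: hypothetical callable (x, t, cond) -> dx/dt, with x of shape
                    (horizon, action_dim); stands in for the trained network.
    Euler integration of the ODE x' = v(x, t, cond) from t=0 (noise) to t=1 (actions).
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((horizon, action_dim))    # start from Gaussian noise
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = x + velocity_field(x, t, cond) * dt       # one Euler step along the flow
    return x

# Toy stand-in: a velocity field that pulls every sample toward zero.
toy_field = lambda x, t, cond: -x
chunk = sample_actions_flow(toy_field, cond=None, action_dim=7, horizon=16)
print(chunk.shape, np.abs(chunk).max())  # values shrink toward 0 as the flow is integrated
```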
Hybrid generation strategies combine multiple approaches to leverage their strengths. For instance, HybridVLA integrates autoregressive and diffusion strategies within a single model, allowing high-level planning and low-level control to coexist. This architecture enables adaptive action ensembles that enhance robustness in complex tasks, such as dual-arm manipulation. Hybrid strategies represent a forward-looking direction for embodied intelligence, aiming to balance speed and quality, though they introduce challenges in aligning symbolic and continuous representations.
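The division of labor can be sketched as follows; this is an illustrative control pattern under assumed interfaces, not HybridVLA's actual architecture:

```python
import numpy as np

class HybridController:
    """Illustrative pattern only: a slow planner proposes a subgoal at low frequency,
    while a fast continuous head generates actions conditioned on it every tick."""

    def __init__(self, planner, action_head, replan_every=10):
        self.planner = planner            # hypothetical: (obs) -> subgoal tokens or latent
        self.action_head = action_head    # hypothetical: (obs, subgoal) -> continuous action
        self.replan_every = replan_every
        self._step, self._subgoal = 0, None

    def act(self, obs):
        if self._step % self.replan_every == 0:
            self._subgoal = self.planner(obs)          # slow path, runs occasionally
        self._step += 1
        return self.action_head(obs, self._subgoal)    # fast path, runs every control tick

controller = HybridController(planner=lambda o: "reach_handle",
                              action_head=lambda o, g: np.zeros(7))
print(controller.act(obs=None).shape)  # (7,)
```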
Evaluating VLA models is essential for assessing their scalability, generalization, and applicability to embodied intelligence. Benchmarks like LIBERO and Open X-Embodiment provide standardized datasets to measure performance across diverse tasks and robot platforms.
LIBERO focuses on lifelong learning for robot manipulation, testing knowledge transfer across spatial, procedural, and conceptual domains. It reports task success rates, with models like OpenVLA-OFT achieving up to 95.4% average success, highlighting the superiority of continuous action representations and advanced generative strategies. In contrast, discrete models like OpenVLA reach 76.5%, indicating limitations in precision. These results underscore the importance of action representation in enhancing embodied intelligence for complex tasks.
Open X-Embodiment aggregates data from multiple robots to foster generalist policies. Evaluations show that continuous models, such as π0, achieve higher success rates (70.1%) compared to discrete models like RT-2-X (60.7%), though performance varies with architecture and training. This dataset emphasizes the potential of VLA models to transcend specific robot morphologies, advancing the goal of universal embodied robots. However, computational efficiency remains a concern, as high-performing models often require substantial resources, impacting real-world deployment.
Despite progress, VLA models face significant hurdles in action representation and generation, which must be overcome to realize robust embodied intelligence. Key challenges include real-time control, safety, and generalization, while opportunities lie in integration with world models and cross-robot adaptability.
One major challenge is the computational demand of advanced generative strategies, such as diffusion models, which hinders high-frequency control. Solutions like parallel decoding in GR00T N1 and action token compression in FAST aim to reduce latency, enabling 100 Hz operation for embodied robots. Additionally, safety and reliability in open-world settings are critical; current systems rely on rigid thresholds and exhibit perception vulnerabilities, necessitating adaptive safety mechanisms that can handle dynamic environments.
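The intuition behind frequency-space action compression can be sketched as follows; the discrete cosine transform, coefficient count, and quantization scale here are illustrative assumptions, and the byte-pair-encoding step described for FAST over the quantized coefficients is omitted:

```python
import numpy as np
from scipy.fft import dct, idct

def compress_action_chunk(chunk, keep=8, scale=100):
    """Transform each action dimension's trajectory to frequency space, quantize,
    and keep only low-frequency coefficients (illustrative parameters).

    chunk: (horizon, action_dim) array of continuous actions.
    """
    coeffs = dct(chunk, axis=0, norm="ortho")               # per-dimension frequency content
    return np.round(coeffs[:keep, :] * scale).astype(int)   # keep low frequencies, quantize

def decompress_action_chunk(tokens, horizon, scale=100):
    coeffs = np.zeros((horizon, tokens.shape[1]))
    coeffs[: tokens.shape[0], :] = tokens / scale
    return idct(coeffs, axis=0, norm="ortho")                # back to a smooth trajectory

t = np.linspace(0, 1, 50)
chunk = np.stack([np.sin(2 * np.pi * t), 0.5 * t], axis=1)   # smooth 50-step, 2-D chunk
tokens = compress_action_chunk(chunk)
recon = decompress_action_chunk(tokens, horizon=50)
print(tokens.shape, np.abs(chunk - recon).max())  # 8x2 integers, small reconstruction error
```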
Future opportunities include merging VLA models with world models to enable predictive planning, allowing embodied robots to simulate outcomes and optimize long-horizon tasks. Another promising direction is developing universal action representations that are robot-agnostic, facilitating skill transfer across diverse platforms like humanoids or quadrupeds. Furthermore, enhancing VLA models with reinforcement learning could bridge the gap between imitation and exploration, fostering more autonomous embodied intelligence. As research advances, addressing these aspects will be crucial for creating general-purpose embodied robots that operate safely and efficiently in real-world scenarios.
In summary, the evolution of action representation and generation in VLA models reflects a broader trajectory toward more capable and efficient embodied intelligence. From discrete tokenization to continuous probabilistic modeling, and from autoregressive to hybrid strategies, these advancements are paving the way for embodied robots that can navigate complex environments with greater autonomy. By addressing challenges in real-time control, safety, and generalization, future VLA models hold the potential to transform industries ranging from healthcare to manufacturing, solidifying the role of embodied intelligence in the next generation of AI systems.