Embodied intelligence represents a transformative pathway toward achieving general artificial intelligence, where intelligent agents perceive, interact, and accomplish tasks within the physical world. Central to this paradigm is the Vision-Language-Action (VLA) model, which seamlessly integrates multimodal perception with actionable outputs, enabling embodied robots to translate abstract understanding into concrete physical actions. This integration overcomes the fragmented pipelines of traditional robotics, where vision, language, and control modules operated in isolation, thereby enhancing adaptability and task execution fluency. The core of VLA models lies in their action representation and generation strategies, which serve as the critical bridge between perception and execution. These components face significant challenges, including high-dimensional continuous spaces, diverse action modalities, and the stringent demands of real-time robotic control. This article provides a comprehensive review of the evolution, key technologies, and future directions in action representation and generation for VLA models, highlighting their intrinsic trade-offs in precision, diversity, and efficiency. Emerging strategies, such as hybrid architectures, are also explored to address the needs of real-time control in embodied intelligence applications.

The development of VLA models has accelerated rapidly in recent years, driven by advances in large language models and vision-language models. Early models like RT-1 pioneered the use of Transformer architectures for large-scale real-world robot control, discretizing continuous actions into tokens to leverage sequence modeling capabilities. This approach unified robotics with powerful sequence models, allowing internet-scale knowledge to transfer to physical control tasks. Subsequent models, such as RT-2, further demonstrated that pre-trained vision-language models could directly generate action tokens through techniques like symbol-tuning, enabling the emergence of advanced reasoning abilities such as visual chain-of-thought. The field has since expanded to include open-source initiatives such as OpenVLA and diffusion-based policies such as Octo, both of which improve generalization across diverse robotic platforms. The progression toward more complex systems, such as humanoid robots, underscores the growing emphasis on embodied intelligence in creating versatile, collaborative agents. The increasing publication volume on VLA models, as evidenced by academic search trends, reflects the burgeoning interest and investment in this domain, positioning embodied robots at the forefront of AI research.
VLA models typically comprise three core components: a visual encoder, a language encoder, and an action decoder. The visual encoder processes pixel inputs from sensors using pre-trained foundation models like Vision Transformers, converting scene information into structured features that capture object categories, positions, and geometric relationships. The language encoder, often based on large language models, interprets natural language instructions and encodes them into vector representations. The action decoder, often paired with an action tokenizer, generates robot control commands by fusing the visual and linguistic features, frequently employing autoregressive decoding to produce sequences of actions such as joint angles or end-effector poses. This end-to-end architecture mitigates the inefficiencies of disjointed modules, fostering greater flexibility and robustness in unstructured environments. For instance, in tasks like object manipulation or navigation, VLA models enable embodied robots to adapt dynamically to changing conditions, leveraging multimodal inputs to guide precise actions. The synergy between these components is crucial for advancing embodied intelligence, as it allows agents not only to understand their surroundings but also to execute tasks with human-like dexterity and reasoning.
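For concreteness, the sketch below shows one way these three components can be wired together in a PyTorch-style module, with a Transformer decoder autoregressively emitting discrete action tokens. All names and dimensions (VLAPolicy, num_action_bins, the linear projections standing in for pre-trained encoders) are illustrative assumptions rather than the architecture of any specific model.

```python
# Minimal VLA skeleton: vision encoder + language encoder + autoregressive action decoder.
# Class, layer, and parameter names are illustrative; real systems plug in pre-trained
# ViT/LLM backbones where the projection layers appear here.
import torch
import torch.nn as nn

class VLAPolicy(nn.Module):
    def __init__(self, vision_dim=768, text_dim=768, hidden=512,
                 num_action_bins=256, num_layers=4):
        super().__init__()
        self.vision_proj = nn.Linear(vision_dim, hidden)   # stand-in for a ViT encoder
        self.text_proj = nn.Linear(text_dim, hidden)       # stand-in for an LLM text encoder
        decoder_layer = nn.TransformerDecoderLayer(d_model=hidden, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=num_layers)
        self.action_embed = nn.Embedding(num_action_bins, hidden)
        self.action_head = nn.Linear(hidden, num_action_bins)

    def forward(self, vision_feats, text_feats, prev_action_tokens):
        # vision_feats: (B, Nv, vision_dim); text_feats: (B, Nt, text_dim)
        # prev_action_tokens: (B, T) discrete action tokens generated so far
        memory = torch.cat([self.vision_proj(vision_feats),
                            self.text_proj(text_feats)], dim=1)
        tgt = self.action_embed(prev_action_tokens)
        T = tgt.size(1)
        causal_mask = torch.triu(torch.full((T, T), float("-inf"), device=tgt.device),
                                 diagonal=1)
        h = self.decoder(tgt, memory, tgt_mask=causal_mask)
        return self.action_head(h)   # (B, T, num_action_bins) logits per action token
```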
Action Representation in VLA Models
Action representation is a pivotal aspect of VLA models, defining how abstract perceptions are transformed into executable physical commands. The inherent challenges of high-dimensional, continuous action spaces and the multiplicity of valid solutions for a single task have led to two primary approaches: discrete and continuous action representations. Discrete representations involve quantizing continuous actions into a finite set of tokens, enabling compatibility with sequence models like Transformers. This method facilitates the integration of pre-trained vision-language models but introduces quantization errors that can compromise precision in tasks requiring fine motor skills. In contrast, continuous representations model actions as probability distributions, capturing the full spectrum of possible trajectories without discretization. This approach, often implemented through techniques like conditional variational autoencoders or diffusion models, excels in generating smooth, diverse actions but demands greater computational resources. The choice between these representations involves trade-offs between accuracy, efficiency, and adaptability, directly influencing the performance of embodied robots in real-world scenarios.
Discrete Action Representation
Discrete action representation recasts robot control as a sequence generation problem, akin to language modeling. By dividing continuous action dimensions, such as XYZ coordinates or gripper states, into discrete bins, models like RT-1 and Gato convert control signals into token sequences predictable by Transformer networks. This unification allows embodied intelligence systems to leverage vast pre-trained knowledge from internet data, enhancing generalization across tasks. For example, RT-1 demonstrated success across hundreds of kitchen operations by tokenizing actions into 256 intervals per dimension, while Gato extended this to a universal agent handling inputs ranging from images to robot commands. RT-2 marked a further milestone by co-fine-tuning vision-language models on robotic data, enabling direct knowledge transfer for semantic-driven manipulation. However, discrete methods suffer from inherent precision loss, making them less suitable for high-accuracy tasks like assembly, and their sequential decoding can limit real-time performance. Despite these drawbacks, recent advances, such as Humanoid-VLA for full-body pose control and JARVIS-VLA for virtual environments, continue to expand the applicability of discrete representations in embodied robots.
| Year | Model | Core Paradigm | Platform | Task Domain | Action Space Type | Action Dimensions | Discrete Intervals | Issues |
|---|---|---|---|---|---|---|---|---|
| 2022 | RT-1 | Imitation Learning | Mobile Manipulator | Mobile Manipulation | End-effector Pose + Base | 11 | 256 | Imitation Learning Limit, Generalization Constraints |
| 2022 | Gato | General Supervised Learning | Sawyer Arm, etc. | Robot Operation, Games, Dialogue | End-effector Velocity + Gripper | 5 | 1024 | Context Length Limits, Slow Inference |
| 2023 | RT-2 | VLM Co-fine-tuning | Mobile Manipulator | Semantic-driven Manipulation | End-effector Pose + Base | 11 | 256 | Physical Skill Limits, High Computational Cost |
| 2023 | Q-Transformer | Offline Reinforcement Learning | Mobile Manipulator | Multi-task Operation | End-effector Pose, Gripper | 8 | 256 | Reward Function Limits, High-dimensional Action Constraints |
| 2024 | OpenVLA | VLM Fine-tuning | Various Manipulators | Cross-embodiment Manipulation | End-effector Pose + Gripper | 7 | 256 | Single Image Support Only, Low Inference Efficiency |
| 2025 | Humanoid-VLA | Language-Motion Alignment | Humanoid Robot | Mobile Manipulation | Full-body Pose | 24 | 1024 | Limited Data Quality and Quantity, Dependency on Underlying RL Policy |
| 2025 | JARVIS-VLA | ActVLP | Virtual Agent | Game Operation | Keyboard and Mouse | — | 51 | Slow Inference Speed, Gap from Top Human Players |
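As a concrete illustration of the per-dimension binning summarized in the table, the sketch below tokenizes a 7-dimensional end-effector action into 256 bins and converts it back, exposing the quantization error discussed above. The action bounds are placeholder assumptions; deployed systems use calibrated per-robot limits.

```python
# Uniform per-dimension action discretization/detokenization with 256 bins.
# The action bounds below are placeholders, not the limits of any real robot.
import numpy as np

NUM_BINS = 256
ACTION_LOW = np.array([-0.1, -0.1, -0.1, -np.pi, -np.pi, -np.pi, 0.0])   # xyz, rpy, gripper
ACTION_HIGH = np.array([0.1, 0.1, 0.1, np.pi, np.pi, np.pi, 1.0])

def tokenize(action):
    """Map a continuous action vector to integer tokens in [0, NUM_BINS - 1]."""
    scaled = (action - ACTION_LOW) / (ACTION_HIGH - ACTION_LOW)        # normalize to [0, 1]
    return np.clip((scaled * NUM_BINS).astype(int), 0, NUM_BINS - 1)

def detokenize(tokens):
    """Map tokens back to bin centers; the gap to the original is the quantization error."""
    centers = (tokens + 0.5) / NUM_BINS
    return ACTION_LOW + centers * (ACTION_HIGH - ACTION_LOW)

action = np.array([0.037, -0.052, 0.081, 0.4, -1.2, 2.9, 1.0])
tokens = tokenize(action)
recovered = detokenize(tokens)
print(tokens, np.abs(recovered - action).max())   # worst-case per-dimension error
```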
Continuous Action Representation
Continuous action representation addresses the limitations of discretization by modeling actions as probability distributions, thereby preserving precision and accommodating multiple valid trajectories. This approach is essential for tasks requiring fine-grained control, such as dexterous manipulation, where quantization errors could lead to failure. Models like ACT employ conditional variational autoencoders to learn a latent space of actions, sampling from this space to generate diverse and accurate sequences. Diffusion Policy further advanced this by reformulating action generation as an iterative denoising process, where noise is gradually removed from a random initial state to produce smooth trajectories. This method excels in capturing complex, multimodal distributions but incurs high computational costs due to its iterative nature. Recent models, such as Octo and π0, have scaled continuous representations to generalist robots, using diffusion and flow matching techniques to enhance generalization across different embodiments. For instance, DexVLA incorporates a billion-parameter diffusion expert for delicate hand coordination, while HybridVLA integrates multiple generation strategies for robust control. Despite their advantages, continuous methods face challenges in real-time deployment, as their iterative processes can introduce latency, and they often rely on large, high-quality datasets for training.
| Year | Model | Core Paradigm | Platform | Task Domain | Action Dimensions | Representation Method Type | Issues |
|---|---|---|---|---|---|---|---|
| 2023 | ACT | Imitation Learning | Manipulator | Fine Bimanual Operation | 14 | Conditional Variational Autoencoder | Hardware Limits, Perception Challenges |
| 2024 | Octo | Imitation Learning | Manipulator | Cross-embodiment General Operation | 7/14 | Conditional Diffusion | Poor Wrist Camera Handling, Reliance on Demonstration Data |
| 2024 | π0 | VLM Fine-tuning | Manipulator, Mobile Robot | High Dexterity, Long-horizon Operation | 18 | Conditional Flow Matching | Heavy Reliance on Large-scale, Partially Closed High-quality Data |
| 2025 | HybridVLA | Collaborative Training | Manipulator | General Tabletop Operation | 7/14 | Hybrid Generation | Inference Speed Constraints |
| 2025 | DexVLA | Embodied Curriculum Learning | Manipulator, Dexterous Hand Robot | Cross-embodiment Dexterous Operation | — | Multi-head Diffusion | Limitations in Contact-rich Complex Scenes |
Action Generation Strategies in VLA Models
Action generation strategies are the engine of VLA models, determining how multimodal inputs are mapped to action sequences. These strategies involve critical trade-offs between precision, efficiency, diversity, and stability, directly impacting the performance of embodied robots. Autoregressive generation, a foundational approach, produces actions sequentially by predicting each token based on previous outputs, leveraging Transformer decoders with masked self-attention. This method benefits from the robustness of sequence models but suffers from slow inference due to its sequential nature. Non-autoregressive strategies, including those based on CVAEs, diffusion models, and flow matching, generate actions in parallel or through iterative refinement, improving speed and diversity at the cost of increased complexity. Hybrid strategies combine multiple approaches, such as using autoregressive models for high-level planning and diffusion for low-level control, to balance long-horizon reasoning with precise execution. The evolution of these strategies reflects the ongoing pursuit of embodied intelligence that can operate reliably in dynamic environments, with recent innovations focusing on real-time efficiency and cross-platform generalization.
Autoregressive Generation Strategy
The autoregressive generation strategy formulates action generation as a sequence prediction task, where each action token is produced conditioned on previous tokens and the multimodal inputs. This mirrors language modeling, enabling VLA models to handle complex, multi-step tasks by decomposing the joint probability of an action sequence into a product of conditional probabilities. For example, models like VIMA and ChatVLA use Transformer-based decoders to generate actions from mixed visual and textual prompts, facilitating tasks such as object manipulation and dialogue-guided planning. The autoregressive framework's strength lies in its ability to leverage pre-trained sequence models for embodied intelligence, providing a unified structure for diverse inputs. However, its sequential decoding imposes latency constraints, often limiting control frequencies to 3-5 Hz, which falls well short of real-time robotic applications requiring 100 Hz or higher. Additionally, the discretization inherent in this strategy can lead to precision loss, particularly in tasks demanding millimeter-level accuracy. Despite these limitations, autoregressive methods remain prevalent due to their simplicity and compatibility with large-scale training, contributing to advances in embodied robots capable of semantic reasoning and task decomposition.
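A minimal greedy decoding loop makes the latency argument explicit: one forward pass is needed per action token, so inference time grows with the number of action dimensions. The policy interface follows the VLAPolicy sketch given earlier and is an assumption, not the decoding routine of any particular model.

```python
# Greedy autoregressive decoding of one action chunk: each token depends on all previous
# tokens, so wall-clock latency grows linearly with the number of action dimensions.
# `policy` is any callable returning per-token logits, e.g. the VLAPolicy sketch above.
import torch

@torch.no_grad()
def decode_action_tokens(policy, vision_feats, text_feats, action_dims=7, bos_token=0):
    tokens = torch.full((vision_feats.size(0), 1), bos_token,
                        dtype=torch.long, device=vision_feats.device)
    for _ in range(action_dims):                    # one forward pass per action dimension
        logits = policy(vision_feats, text_feats, tokens)
        next_token = logits[:, -1].argmax(dim=-1, keepdim=True)   # greedy token choice
        tokens = torch.cat([tokens, next_token], dim=1)
    return tokens[:, 1:]                            # drop the BOS placeholder
```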
Non-autoregressive Generation Strategies
Non-autoregressive generation strategies aim to overcome the speed bottlenecks of autoregressive decoding by enabling parallel or iterative action generation. They fall into probability-based (CVAE), diffusion-based, and flow matching approaches, each offering distinct advantages for continuous action spaces.
Probability-based Generation Strategy with CVAE
The probability-based generation strategy built on conditional variational autoencoders (CVAEs) models action sequences as probability distributions to address multimodality and mode collapse. CVAEs learn a latent space that captures the diversity of valid actions and sample from it to generate varied trajectories. Training optimizes an evidence lower bound (ELBO) that combines a reconstruction term, which ensures accurate action prediction, with a KL regularization term that keeps the learned posterior close to the prior distribution. ACT exemplifies this approach, combining a CVAE with action chunking to predict segments of future actions, which improves performance in dexterous manipulation tasks. While CVAE-based strategies produce precise, diverse actions, their architectural complexity and training demands pose challenges for real-time deployment in embodied intelligence systems.
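The sketch below illustrates this ELBO-style objective for an action-chunking policy: a reconstruction term over the predicted chunk plus a KL term against a standard normal prior. The encoder and decoder callables and the kl_weight value are assumptions of the sketch, not ACT's exact implementation.

```python
# CVAE-style training loss for an action-chunking policy: reconstruction + KL regularization.
# `encoder` and `decoder` are placeholders for the actual networks; kl_weight is a tuning
# assumption for this sketch.
import torch
import torch.nn.functional as F

def cvae_loss(encoder, decoder, obs, action_chunk, kl_weight=10.0):
    # Encoder infers a Gaussian posterior over the latent variable z
    mu, logvar = encoder(obs, action_chunk)
    z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)       # reparameterization trick
    # Decoder reconstructs the future action chunk from the observation and z
    pred_chunk = decoder(obs, z)
    recon = F.l1_loss(pred_chunk, action_chunk)                   # reconstruction term
    # KL divergence between the posterior N(mu, sigma^2) and the standard normal prior
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl_weight * kl
```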
Iterative Generation Strategy with Diffusion Models
Diffusion-based strategies reframe action generation as a denoising process, iteratively refining random noise into coherent action sequences conditioned on the current observation and instruction. Models like Diffusion Policy train a noise prediction network with a mean squared error loss to remove noise over multiple steps, producing smooth, multimodal trajectories. This approach has been scaled in models such as MDT and RDT-1B, which use Diffusion Transformer architectures to improve performance in bimanual and industrial tasks. CogACT further demonstrates the applicability of diffusion models to high-precision assembly, achieving superior success rates. However, the iterative nature of diffusion models incurs significant computational overhead, limiting their use in high-frequency control scenarios for embodied robots. Research into distillation techniques and hardware acceleration is ongoing to mitigate these efficiency issues.
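A DDPM-style sampling loop of the kind this family of methods relies on is sketched below; the noise schedule, step count, and the noise_pred_net callable are illustrative assumptions. The loop makes the efficiency concern visible: every control step requires tens to hundreds of network evaluations.

```python
# DDPM-style sampling of an action trajectory: start from Gaussian noise and iteratively
# denoise it, conditioned on the observation. Schedule and step count are illustrative.
import torch

@torch.no_grad()
def sample_action_trajectory(noise_pred_net, obs_cond, horizon=16, action_dim=7, steps=100):
    betas = torch.linspace(1e-4, 0.02, steps)          # linear noise schedule (assumption)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(1, horizon, action_dim)            # pure-noise trajectory
    for t in reversed(range(steps)):
        eps = noise_pred_net(x, t, obs_cond)           # predicted noise at step t
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        x = (x - coef * eps) / torch.sqrt(alphas[t])   # posterior mean estimate
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)   # add sampling noise
    return x                                            # denoised action sequence
```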
Generation Strategy with Flow Matching
Flow matching offers an efficient alternative to diffusion models by learning a vector field that transforms noise directly into action sequences through ordinary differential equation solving. This method regresses the velocity between the noise and data distributions, minimizing a flow matching loss for stable training. Models like π0 and GraspVLA utilize flow matching for robot control, with π0 combining it with a vision-language backbone for generalist performance and GraspVLA integrating progressive action generation for zero-shot generalization. Flow matching promises faster inference and more stable training than diffusion, though its full potential in embodied intelligence is still being explored, particularly in complex, contact-rich environments.
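The following sketch contrasts with the diffusion loop above: a velocity network is trained to regress the constant velocity of a straight noise-to-data path, and actions are then sampled with a handful of Euler ODE steps. The velocity_net interface and the linear interpolation path are assumptions of this illustration, not the training recipe of any specific model.

```python
# Conditional flow matching, sketched: train v_theta to regress the velocity of a
# straight-line path from noise to data, then integrate the ODE with a few Euler steps.
import torch
import torch.nn.functional as F

def flow_matching_loss(velocity_net, obs_cond, actions):
    noise = torch.randn_like(actions)
    t = torch.rand(actions.size(0), 1, 1)              # random time in [0, 1]
    x_t = (1.0 - t) * noise + t * actions              # point on the noise-to-data path
    target_velocity = actions - noise                  # constant velocity of that path
    return F.mse_loss(velocity_net(x_t, t, obs_cond), target_velocity)

@torch.no_grad()
def sample_actions(velocity_net, obs_cond, horizon=16, action_dim=7, steps=10):
    x = torch.randn(1, horizon, action_dim)
    dt = 1.0 / steps
    for i in range(steps):                              # simple Euler ODE integration
        t = torch.full((1, 1, 1), i * dt)
        x = x + dt * velocity_net(x, t, obs_cond)
    return x
```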
Hybrid Generation Strategy
Hybrid generation strategies integrate multiple generation paradigms to leverage their complementary strengths, such as combining autoregressive models for high-level planning with diffusion models for low-level control. This architecture addresses the temporal mismatch between slow, reasoning-intensive planning and fast, precision-demanding execution. HybridVLA exemplifies this approach by unifying diffusion and autoregressive generators within a single large language model, enabling adaptive action ensembling for robust manipulation. The hybrid strategy enhances performance in long-horizon tasks by decoupling abstraction from execution, but it introduces challenges in feature space alignment and asynchronous coordination. Future directions include developing shared representation spaces and asynchronous frameworks to seamlessly integrate planning and control in embodied robots, paving the way for more versatile and reliable embodied intelligence.
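One possible shape of such a hybrid system is sketched below: a slow autoregressive planner refreshes a latent goal at a few hertz while a fast continuous controller tracks it at the control rate. The planner, controller, and robot interfaces are purely illustrative assumptions about how the two loops might be coupled.

```python
# A hybrid control loop, sketched: a slow planner updates a high-level latent goal at low
# frequency, while a fast controller generates continuous actions at high frequency.
# The `planner`, `controller`, and `robot` interfaces are illustrative placeholders.
import time

def hybrid_control_loop(planner, controller, robot, plan_hz=2, control_hz=50):
    latent_goal = None
    last_plan_time = 0.0
    while not robot.task_done():
        obs = robot.get_observation()
        now = time.monotonic()
        if latent_goal is None or now - last_plan_time > 1.0 / plan_hz:
            latent_goal = planner(obs)            # slow: semantic reasoning / subgoal selection
            last_plan_time = now
        action = controller(obs, latent_goal)     # fast: continuous low-level action generation
        robot.apply_action(action)
        time.sleep(1.0 / control_hz)
```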
Model Evaluation in Embodied Intelligence
The evaluation of VLA models relies on standardized benchmarks to assess scalability, generalization, and lifelong learning capabilities. Key datasets include LIBERO and Open X-Embodiment, which provide diverse tasks and high-quality demonstrations for rigorous testing. LIBERO focuses on knowledge transfer in manipulation, with metrics such as forward transfer, backward transfer, and the area under the success-rate curve. Evaluations on LIBERO show that continuous action models, particularly those using flow matching and diffusion strategies, achieve higher average success rates than discrete models, with OpenVLA-OFT leading at 95.4%. Open X-Embodiment emphasizes cross-robot generalization, where continuous models like π0 attain 70.1% success, outperforming discrete counterparts such as RT-2-X at 60.7%. These results underscore the importance of advanced generation strategies in enhancing the performance of embodied robots, though differences in model architecture and training data highlight the need for further optimization in embodied intelligence applications.
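For reference, the sketch below computes these three quantities from a success-rate matrix R, where R[i, j] is the success rate on task j after sequentially training through task i. The formulas follow common lifelong-learning conventions (GEM-style forward and backward transfer); LIBERO's exact definitions and normalization differ in detail, so treat this as an approximation.

```python
# Lifelong-learning metrics from a success-rate matrix R (K x K).
# R[i, j]: success rate on task j after training sequentially through task i.
import numpy as np

def lifelong_metrics(R, baseline=None):
    K = R.shape[0]
    # Backward transfer: change on earlier tasks after finishing the whole task sequence.
    bwt = float(np.mean([R[K - 1, j] - R[j, j] for j in range(K - 1)])) if K > 1 else 0.0
    # Forward transfer: zero-shot success on a task before training on it, minus a baseline.
    base = np.zeros(K) if baseline is None else baseline
    fwt = float(np.mean([R[j - 1, j] - base[j] for j in range(1, K)])) if K > 1 else 0.0
    # Area under the success curve: average success on tasks seen so far, over training.
    auc = float(np.mean([np.mean(R[i, : i + 1]) for i in range(K)]))
    return {"forward_transfer": fwt, "backward_transfer": bwt, "success_auc": auc}
```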
Challenges and Opportunities in Embodied Intelligence
The advancement of VLA models for embodied intelligence faces several challenges while presenting exciting opportunities for future research. Key issues include integration with world models to enable predictive planning, moving beyond reactive control toward long-horizon task execution. Additionally, the shift from traditional modular robotics to end-to-end VLA paradigms offers improved generalization but requires addressing real-time efficiency constraints. Emerging techniques like parallel decoding and action token compression, as seen in GR00T N1 and π0-FAST, aim to achieve 100 Hz control rates, facilitating deployment on resource-constrained embodied robots. Cross-embodiment generalization seeks to develop robot-agnostic policies through abstract action representations, allowing skill transfer across diverse platforms with minimal calibration. Safety and reliability in open-world environments remain critical, with current systems struggling with perception robustness and collision prediction accuracy. Computational and energy demands also pose barriers, particularly for edge devices, necessitating model compression and hardware co-design. By tackling these challenges, the field can unlock new potential in embodied intelligence, creating agents that learn, adapt, and collaborate safely in complex physical settings.
Conclusion
In summary, Vision-Language-Action models represent a pivotal advancement in embodied intelligence, bridging multimodal understanding with physical action through innovative representation and generation strategies. The evolution from discrete tokenization to continuous probability modeling and hybrid approaches highlights a continuous pursuit of balance between precision, diversity, and efficiency. As research progresses, the integration of world models, efficient generation techniques, and cross-embodiment frameworks will drive the development of more general, robust, and deployable embodied robots. This survey provides a comprehensive foundation for future work, emphasizing the critical role of action representation and generation in realizing the full potential of embodied intelligence in real-world applications.
