Embodied intelligence represents a pivotal pathway toward artificial general intelligence, where agents perceive, interact, and accomplish tasks in the physical world. Central to this paradigm are Vision-Language-Action (VLA) models, which seamlessly integrate multimodal perception with physical execution. These models bridge the gap between high-level understanding and low-level control, enabling embodied robots to perform complex tasks in unstructured environments. However, the core challenge lies in action representation and generation strategies, which act as the critical nexus connecting perception to execution. This survey systematically explores the evolution, methodologies, and future directions of action representation and generation in VLA models for embodied robots, addressing trade-offs in precision, diversity, and efficiency.

The development of VLA models has revolutionized how embodied robots operate, moving from fragmented pipelines to end-to-end learning frameworks. Early models like RT-1 demonstrated the feasibility of discretizing continuous robot actions into tokens, enabling the use of transformer architectures for control. Subsequent advancements, such as RT-2, leveraged large-scale pre-trained vision-language models to transfer internet-scale knowledge to robotic tasks. More recently, models like Octo and OpenVLA have emphasized open-source development and efficiency, while diffusion-based strategies have emerged for high-fidelity action generation. Despite this progress, embodied robots face inherent challenges in real-time control, safety, and generalization across diverse platforms.
In VLA models, action representation defines how robot commands are encoded, while generation strategies determine how these commands are produced from multimodal inputs. Discrete action representation quantizes continuous actions into a finite vocabulary, aligning with sequence models like transformers. In contrast, continuous action representation models actions as probability distributions, preserving precision and handling multimodality. Generation strategies include autoregressive methods, which generate actions sequentially; non-autoregressive approaches like diffusion models and flow matching, which enable parallel or iterative generation; and hybrid strategies that combine multiple methods for balanced performance. These elements collectively shape the capabilities of embodied robots in tasks ranging from manipulation to navigation.
This survey delves into the technical details of action representation and generation, providing a comprehensive analysis of their impact on embodied robots. We examine key datasets and benchmarks, discuss current challenges, and outline future opportunities, aiming to guide the development of more general and efficient embodied agents.
Evolution of VLA Models for Embodied Robots
The journey of VLA models began with efforts to unify perception, language, and action in embodied robots. Traditional robotics relied on modular systems where vision, planning, and control operated independently, leading to integration gaps and limited adaptability. The advent of transformer architectures and large-scale pre-training enabled a shift toward end-to-end models. For instance, RT-1 pioneered the use of transformer-based policies trained on large-scale robot data, demonstrating robust performance across hundreds of tasks. This model discretized robot actions into tokens, treating control as a sequence generation problem. Similarly, Gato extended this idea by creating a generalist agent that processed diverse data types—images, text, and actions—into a unified token sequence.
The integration of vision-language models marked a significant leap. RT-2 utilized a pre-trained VLM fine-tuned on robot data, enabling the transfer of semantic knowledge from the web to physical control. This approach allowed embodied robots to perform tasks requiring reasoning, such as identifying objects based on abstract descriptions. Open-source initiatives like OpenVLA further democratized access to VLA models, showing that smaller models could achieve competitive performance through efficient fine-tuning. Recent models, such as Octo, have embraced diffusion policies for high-quality action generation, while HybridVLA combines autoregressive and diffusion strategies for enhanced robustness. These advancements highlight a trend toward more scalable, generalizable, and efficient models for embodied robots.
The generic architecture of a VLA model comprises three components: a visual encoder, a language encoder, and an action decoder. The visual encoder processes pixel inputs using pre-trained models like Vision Transformers (ViTs), extracting features related to objects, spatial relationships, and scene context. The language encoder, often based on large language models, interprets natural language instructions and encodes them into vector representations. The action decoder then fuses these multimodal inputs to generate robot commands, which can be joint angles, end-effector poses, or velocities. This end-to-end framework eliminates the need for intermediate representations, allowing embodied robots to adapt dynamically to changing environments.
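To make this layout concrete, the sketch below outlines the three components in PyTorch. It is a minimal illustration rather than any published architecture: the encoder stand-ins, hidden sizes, vocabulary size, and the 7-dimensional action output are assumptions chosen for brevity.

```python
import torch
import torch.nn as nn

class MinimalVLA(nn.Module):
    """Illustrative three-component VLA skeleton: visual encoder,
    language encoder, and an action decoder that fuses both."""

    def __init__(self, vis_dim=768, lang_dim=768, action_dim=7):
        super().__init__()
        # Stand-ins for a pre-trained ViT and a language-model encoder.
        self.visual_encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(vis_dim))
        self.language_encoder = nn.Embedding(32000, lang_dim)  # toy token embedding
        self.action_decoder = nn.Sequential(
            nn.Linear(vis_dim + lang_dim, 512), nn.ReLU(), nn.Linear(512, action_dim)
        )

    def forward(self, image, instruction_tokens):
        v = self.visual_encoder(image)                          # (B, vis_dim)
        l = self.language_encoder(instruction_tokens).mean(1)   # (B, lang_dim), mean-pooled
        fused = torch.cat([v, l], dim=-1)
        return self.action_decoder(fused)                       # (B, action_dim) command

# Example: one 224x224 RGB frame and a 12-token instruction -> a 7-D action
model = MinimalVLA()
action = model(torch.randn(1, 3, 224, 224), torch.randint(0, 32000, (1, 12)))
```

In practice the action decoder is where the representation and generation choices discussed below diverge: it may emit discrete tokens autoregressively or parameterize a continuous distribution over action chunks.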
The progression of VLA models is evident in their expanding applications. Initially focused on table-top manipulation with robotic arms, they now encompass humanoid robots, mobile platforms, and virtual agents. For example, Humanoid-VLA adapts autoregressive control to full-body motion, while JARVIS-VLA applies VLA models to game environments. This evolution underscores the growing versatility of embodied robots powered by VLA models.
Action Representation in Embodied Robots
Action representation is a foundational aspect of VLA models, determining how embodied robots translate perceptual understanding into physical movements. The choice of representation influences precision, scalability, and compatibility with learning algorithms. We categorize action representation into discrete and continuous approaches, each with distinct advantages and limitations for embodied robots.
Discrete Action Representation
Discrete action representation involves quantizing continuous robot actions into a finite set of tokens, enabling the use of sequence models like transformers. This approach treats robot control as a classification problem, where each action dimension is divided into bins. For instance, RT-1 uniformly discretized each action dimension into 256 intervals, assigning a unique integer ID to each bin. This tokenization allows embodied robots to leverage pre-trained language models for action generation.
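As a concrete illustration, the snippet below shows RT-1-style uniform binning in NumPy. The action bounds, the 7-dimensional end-effector layout, and the bin-center decoding are illustrative assumptions rather than details of any specific implementation.

```python
import numpy as np

def discretize_action(action, low, high, num_bins=256):
    """Uniformly quantize each continuous action dimension into integer bins,
    yielding token IDs in [0, num_bins - 1]. `low`/`high` are per-dimension bounds."""
    action = np.clip(action, low, high)
    bins = np.floor((action - low) / (high - low) * num_bins).astype(int)
    return np.clip(bins, 0, num_bins - 1)

def undiscretize_action(bins, low, high, num_bins=256):
    """Map token IDs back to bin-center values; the quantization error remains."""
    return low + (bins + 0.5) / num_bins * (high - low)

# Example: a hypothetical 7-D action (xyz, rpy, gripper), each dimension in [-1, 1]
low, high = -np.ones(7), np.ones(7)
tokens = discretize_action(np.array([0.12, -0.5, 0.83, 0.0, 0.1, -0.2, 1.0]), low, high)
recovered = undiscretize_action(tokens, low, high)  # close to, but not equal to, the input
```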
The discrete approach has been widely adopted in early VLA models. Gato extended this by tokenizing all input modalities—images, text, and actions—into a flat sequence processed by a single transformer. This unified representation facilitated multi-task learning across diverse domains. RT-2 further advanced this paradigm by co-fine-tuning a vision-language model on web and robot data, enabling the model to output action tokens directly from semantic inputs. Treating actions as symbols in the language vocabulary allowed embodied robots to perform tasks requiring commonsense reasoning, such as selecting tools based on context.
Despite its success, discrete representation faces challenges. Quantization introduces precision loss, which can be critical for tasks requiring millimeter-level accuracy, such as assembly or surgery. Additionally, the sequential nature of autoregressive generation limits inference speed, making it difficult to achieve real-time control at high frequencies (e.g., 100 Hz). To mitigate these issues, Q-Transformer applied offline reinforcement learning, learning a Q-function over discretized actions so that mixed-quality data can be exploited. However, the inherent trade-off between precision and efficiency remains a key consideration for embodied robots.
The following table summarizes representative models using discrete action representation:
| Year | Model | Core Paradigm | Platform | Task Domain | Action Space Type | Action Dimensions | Discrete Bins | Key Issues |
|---|---|---|---|---|---|---|---|---|
| 2022 | RT-1 | Imitation Learning | Mobile Manipulator | Mobile Manipulation | End-effector Pose + Base | 11 | 256 | Imitation ceiling, limited generalization |
| 2022 | Gato | General Supervised Learning | Sawyer Arm and Others | Manipulation, Games, Dialogue | End-effector Velocity + Gripper | 5 | 1024 | Context length limits, slow inference |
| 2023 | RT-2 | VLM Co-fine-tuning | Mobile Manipulator | Semantic-driven Manipulation | End-effector Pose + Base | 11 | 256 | Physical skill limits, high compute cost |
| 2023 | Q-Transformer | Offline Reinforcement Learning | Mobile Manipulator | Multi-task Manipulation | End-effector Pose + Gripper | 8 | 256 | Reward function design, high-dimensional action challenges |
| 2024 | OpenVLA | VLM Fine-tuning | Various Manipulators | Cross-embodiment Manipulation | End-effector Pose + Gripper | 7 | 256 | Single-image support, inference inefficiency |
| 2025 | Humanoid-VLA | Language-Motion Alignment | Humanoid Robot | Locomotion-Manipulation | Full-body Pose | 24 | 1024 | Limited data quality, reliance on low-level RL |
| 2025 | JARVIS-VLA | ActVLP | Virtual Agent | Game Interaction | Keyboard & Mouse | — | 51 | Slow reasoning, gap to human experts |
Discrete representation has been instrumental in scaling VLA models for embodied robots, but its limitations in precision and speed have motivated the exploration of continuous alternatives.
Continuous Action Representation
Continuous action representation models robot actions as probability distributions in a continuous space, preserving precision and handling multimodality. This approach avoids quantization errors and is better suited for tasks requiring fine-grained control. However, it must contend with multimodal demonstrations: a deterministic regressor trained to minimize mean error averages over multiple valid actions and produces suboptimal in-between behavior. To capture diverse action distributions, continuous methods therefore rely on generative models such as Conditional Variational Autoencoders (CVAEs), diffusion models, or flow matching.
ACT pioneered continuous representation with a CVAE-Transformer architecture, learning a latent space of actions for bimanual manipulation. By sampling from this space, the model generated diverse and precise action sequences. Diffusion Policy introduced diffusion models to robotics, formulating action generation as an iterative denoising process. This method produces smooth, high-quality trajectories and excels at modeling complex distributions. Octo extended this to a generalist robot policy, using a transformer-based diffusion decoder for cross-embodiment generalization.
Flow matching offers an efficient alternative to diffusion, learning a vector field that transforms noise into actions through ordinary differential equations. Models like π₀ and GraspVLA adopt this approach, enabling fast and stable training. For instance, π₀ combines a vision-language model with a flow-matching action expert, achieving state-of-the-art performance on manipulation tasks. These advancements highlight the potential of continuous representation for high-precision embodied robots.
The following table summarizes key models using continuous action representation:
| Year | Model | Core Paradigm | Platform | Task Domain | Action Dimensions | Representation Type | Key Issues |
|---|---|---|---|---|---|---|---|
| 2023 | ACT | Imitation Learning | Robotic Arm | Fine Bimanual Manipulation | 14 | Conditional VAE | Hardware limits, perception challenges |
| 2024 | Octo | Imitation Learning | Robotic Arm | General Cross-embodiment Manipulation | 7/14 | Conditional Diffusion | Wrist camera issues, demo data dependence |
| 2024 | π₀ | VLM Fine-tuning | Manipulator, Mobile Robot | Dexterous Long-horizon Tasks | 18 | Conditional Flow Matching | Reliance on large-scale, proprietary data |
| 2025 | HybridVLA | Collaborative Training | Robotic Arm | General Table-top Manipulation | 7/14 | Hybrid Generation | Inference speed constraints |
| 2025 | DexVLA | Embodied Curriculum Learning | Manipulator, Dexterous Hand | Cross-embodiment Dexterous Manipulation | — | Multi-head Diffusion | Limitations in contact-rich scenes |
Continuous representation enables embodied robots to perform delicate tasks with high accuracy, but it often requires more computational resources and complex training procedures.
Action Generation Strategies for Embodied Robots
Action generation strategies determine how VLA models produce robot commands from multimodal inputs. These strategies involve trade-offs between precision, diversity, and efficiency, critical for real-world deployment of embodied robots. We categorize them into autoregressive, non-autoregressive, and hybrid approaches.
Autoregressive Generation Strategies
Autoregressive generation produces action sequences step-by-step, with each action dependent on previous outputs. This strategy aligns naturally with transformer decoders, which use masked self-attention to enforce causal dependencies. Formally, the probability of generating an action sequence $a_{1:L}$ given observations $s$ and instructions $\pi$ is decomposed as:
$$p(a_{1:L} | s, \pi) = \prod_{t=1}^{L} p(a_t | a_{<t}, s, \pi)$$
Here, $a_t$ is the action at time $t$, and $a_{<t}$ denotes all previous actions. The model generates each action conditioned on the previous ones, allowing it to capture temporal dependencies. Models like ChatVLA adopt this approach, decoding actions from language prompts.
Autoregressive strategies are effective for tasks requiring sequential decision-making, such as multi-step manipulation. However, their sequential nature limits inference speed, making them unsuitable for high-frequency control. Additionally, they rely on discrete action representations, which can compromise precision. Despite these drawbacks, autoregressive methods remain popular due to their compatibility with large language models and ease of training.
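The loop below sketches this step-by-step decoding for a single action made of several tokens. The `policy(seq) -> logits` interface, the greedy token selection, and the 7-token horizon are assumptions made for illustration.

```python
import torch

@torch.no_grad()
def autoregressive_decode(policy, obs_tokens, horizon=7, bos_id=0):
    """Decode one action token per dimension, each conditioned on the
    observation/instruction prefix and previously emitted action tokens.
    `policy` is assumed to be a causal transformer returning next-token logits."""
    seq = torch.cat([obs_tokens, torch.tensor([[bos_id]])], dim=1)
    action_tokens = []
    for _ in range(horizon):
        logits = policy(seq)                                 # (1, seq_len, vocab)
        next_tok = logits[:, -1].argmax(-1, keepdim=True)    # greedy choice per step
        action_tokens.append(next_tok)
        seq = torch.cat([seq, next_tok], dim=1)              # feed back for the next step
    return torch.cat(action_tokens, dim=1)                   # (1, horizon) token IDs
```

The inherent sequentiality is visible here: each iteration requires a full forward pass before the next token can be produced, which is what bounds control frequency.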
Non-Autoregressive Generation Strategies
Non-autoregressive strategies generate actions in parallel or through iterative refinement, addressing the speed limitations of autoregressive methods. These include probability-based approaches like CVAEs, diffusion models, and flow matching.
Probability-Based Generation with CVAEs
CVAEs learn a latent variable model of actions, enabling sampling from a diverse distribution. The encoder $q_\phi(z | a, o)$ maps actions $a$ and observations $o$ to a latent variable $z$, while the decoder $p_\theta(a | z, o)$ reconstructs actions from $z$. The training objective maximizes the evidence lower bound (ELBO):
$$\mathcal{L}_{\text{ELBO}} = \mathbb{E}_{q_\phi(z|a,o)}[\log p_\theta(a|z,o)] - D_{\text{KL}}(q_\phi(z|a,o) \| p(z))$$
where $p(z)$ is a prior distribution, often standard normal. The reconstruction loss ensures action accuracy, and the KL divergence regularizes the latent space. ACT uses this approach with action chunking, generating sequences of actions in one pass for bimanual manipulation.
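A minimal training step for this objective might look as follows. The encoder and decoder signatures, the Gaussian (mean-squared-error) reconstruction term, and the `beta` weight on the KL term are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def cvae_elbo_loss(encoder, decoder, actions, obs, beta=1.0):
    """One ELBO step for a conditional VAE over action chunks.
    Assumed interfaces: encoder(actions, obs) -> (mu, logvar); decoder(z, obs) -> actions."""
    mu, logvar = encoder(actions, obs)
    std = torch.exp(0.5 * logvar)
    z = mu + std * torch.randn_like(std)                     # reparameterization trick
    recon = decoder(z, obs)
    recon_loss = F.mse_loss(recon, actions)                  # reconstruction term
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())  # KL to N(0, I) prior
    return recon_loss + beta * kl
```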
Iterative Generation with Diffusion Models
Diffusion models generate actions by iteratively denoising from noise. Starting from noisy actions $A_K$, the model applies a noise prediction network $\epsilon_\theta$ over $K$ steps:
$$A_{k-1} = \alpha_k (A_k - \gamma_k \epsilon_\theta(A_k, o, k)) + \sigma_k z$$
where $\alpha_k, \gamma_k, \sigma_k$ are noise schedule parameters, $o$ is the observation, and $z$ is random noise. The model is trained to minimize the mean squared error between predicted and actual noise:
$$\mathcal{L} = \mathbb{E}[\| \epsilon - \epsilon_\theta(A_k, o, k) \|^2]$$
Diffusion Policy and Octo employ this strategy for high-quality trajectory generation, though it requires multiple iterations, increasing computational cost.
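The sketch below pairs the training objective with the sampling recursion above. The cosine-style noise schedule, the flattened (B, D) action shape, and the `eps_model(noisy, obs, k)` interface are simplified assumptions, not the exact Diffusion Policy implementation.

```python
import torch

def diffusion_loss(eps_model, actions, obs, K=100):
    """Noise-prediction training: corrupt a clean action chunk (B, D) at a random
    step k and regress the injected noise (simplified DDPM-style objective)."""
    k = torch.randint(1, K + 1, (actions.shape[0],))
    alpha_bar = torch.cos(k.float() / K * torch.pi / 2).pow(2).view(-1, 1)  # toy schedule
    noise = torch.randn_like(actions)
    noisy = alpha_bar.sqrt() * actions + (1 - alpha_bar).sqrt() * noise
    return torch.mean((noise - eps_model(noisy, obs, k)) ** 2)

@torch.no_grad()
def diffusion_sample(eps_model, obs, shape, alphas, gammas, sigmas, K=100):
    """Iterative denoising with the update from the text:
    A_{k-1} = alpha_k * (A_k - gamma_k * eps_theta(A_k, o, k)) + sigma_k * z."""
    A = torch.randn(shape)
    for k in range(K, 0, -1):
        z = torch.randn_like(A) if k > 1 else torch.zeros_like(A)  # no noise at the last step
        eps = eps_model(A, obs, torch.full((shape[0],), k))
        A = alphas[k] * (A - gammas[k] * eps) + sigmas[k] * z
    return A
```

The K inner iterations of `diffusion_sample` are the source of the computational cost noted above; reducing them without degrading trajectory quality motivates the flow-matching methods that follow.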
Flow Matching for Efficient Generation
Flow matching learns a vector field $v_\theta(A_t, o, t)$ that transforms noise $A_0$ into actions $A_1$ by solving an ODE:
$$\frac{dA_t}{dt} = v_\theta(A_t, o, t)$$
The training loss is a simple regression objective:
$$\mathcal{L} = \mathbb{E}[\| v_\theta(A_t, o, t) - (A_1 - A_0) \|^2]$$
Models like π₀ and GraspVLA use flow matching for fast and stable action generation, often achieving comparable performance to diffusion with fewer steps.
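A minimal sketch of both the loss and the integration step is given below. The linear interpolation path, the 10 forward-Euler steps, and the `v_model(A, obs, t)` interface are illustrative assumptions rather than the training recipe of any particular model.

```python
import torch

def flow_matching_loss(v_model, actions, obs):
    """Conditional flow matching with a linear path: interpolate between noise A0
    and data A1 and regress the constant target velocity (A1 - A0)."""
    A0 = torch.randn_like(actions)                 # noise sample A_0
    t = torch.rand(actions.shape[0], 1)            # random time in [0, 1]
    At = (1 - t) * A0 + t * actions                # point on the straight-line path
    target = actions - A0                          # dA_t/dt for the linear path
    return torch.mean((v_model(At, obs, t) - target) ** 2)

@torch.no_grad()
def flow_matching_sample(v_model, obs, shape, steps=10):
    """Integrate dA/dt = v_theta(A, o, t) with forward Euler from noise to an action chunk."""
    A = torch.randn(shape)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0], 1), i * dt)
        A = A + dt * v_model(A, obs, t)
    return A
```

Because the regression target is a simple velocity rather than per-step noise, sampling typically needs far fewer integration steps than diffusion, which is the efficiency advantage noted above.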
Hybrid Generation Strategies
Hybrid strategies combine multiple generation methods to leverage their strengths. For example, HybridVLA integrates autoregressive and diffusion models within a unified LLM. The autoregressive component handles high-level planning and language reasoning, while the diffusion component generates precise low-level actions. This collaboration enhances robustness and adaptability for embodied robots in complex tasks.
Hybrid approaches address the trade-offs between speed and quality but introduce challenges in aligning different representation spaces and managing computational overhead. Future work may focus on asynchronous execution and shared latent spaces to improve efficiency.
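The snippet below is a purely conceptual sketch of such a division of labor, not a reconstruction of HybridVLA: an autoregressive backbone summarizes the token stream into a plan embedding, and an iterative refiner turns noise into a continuous action chunk conditioned on that plan. All module choices and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class HybridPolicySketch(nn.Module):
    """Conceptual hybrid head: autoregressive planning plus iterative action refinement."""

    def __init__(self, hidden=512, action_dim=7, chunk=8):
        super().__init__()
        self.backbone = nn.GRU(hidden, hidden, batch_first=True)  # stand-in for an LLM
        self.refiner = nn.Sequential(
            nn.Linear(hidden + action_dim, hidden), nn.ReLU(), nn.Linear(hidden, action_dim)
        )
        self.chunk, self.action_dim = chunk, action_dim

    def forward(self, token_embeds, refine_steps=5):
        _, h = self.backbone(token_embeds)                 # sequential pass over fused tokens
        plan = h[-1]                                       # (B, hidden) plan embedding
        B = plan.shape[0]
        A = torch.randn(B, self.chunk, self.action_dim)    # start the action chunk from noise
        cond = plan.unsqueeze(1).expand(-1, self.chunk, -1)
        for _ in range(refine_steps):                      # diffusion-like iterative refinement
            A = A + self.refiner(torch.cat([cond, A], dim=-1))
        return A                                           # (B, chunk, action_dim)
```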
Evaluation of VLA Models for Embodied Robots
Benchmark datasets like LIBERO and Open X-Embodiment provide standardized environments to evaluate VLA models. LIBERO focuses on lifelong learning, assessing knowledge transfer across tasks. It includes 130 procedurally generated tasks with human demonstrations, measuring metrics like forward transfer, backward transfer, and area under the success curve. Open X-Embodiment aggregates data from diverse robots, enabling cross-embodiment evaluation. Success rate is the primary metric, reflecting the model’s ability to generalize.
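Lifelong-learning metrics of this kind can be computed from a matrix of per-task success rates recorded after each training stage. The sketch below uses generic forward/backward-transfer definitions that approximate, but do not exactly reproduce, LIBERO's formulas.

```python
import numpy as np

def lifelong_metrics(S):
    """Generic continual-learning metrics from a matrix S where S[i, j] is the
    success rate on task j after training on tasks 0..i. Illustrative only."""
    T = S.shape[0]
    fwt = np.mean([S[i - 1, i] for i in range(1, T)])            # success on not-yet-trained tasks
    bwt = np.mean([S[-1, j] - S[j, j] for j in range(T - 1)])    # change on earlier tasks (forgetting if negative)
    auc = np.mean(S[np.tril_indices(T)])                         # average success on seen tasks over time
    return {"forward_transfer": fwt, "backward_transfer": bwt, "auc": auc, "final": S[-1].mean()}
```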
The following table compares VLA models on the LIBERO dataset:
| Action Type | VLA Model | Average Success Rate (%) |
|---|---|---|
| Continuous | Diffusion Policy | 72.4 |
| Continuous | Octo | 75.1 |
| Continuous | DiT Policy | 82.4 |
| Continuous | OpenVLA-OFT | 95.4 |
| Continuous | π₀ | 94.2 |
| Discrete | OpenVLA | 76.5 |
| Discrete | WorldVLA | 79.1 |
Continuous models generally achieve higher success rates, with OpenVLA-OFT leading at 95.4%. This highlights the advantage of continuous action outputs combined with refined generation strategies, such as diffusion and flow matching, for embodied robots.
On Open X-Embodiment, continuous models like π₀ achieve 70.1% success, outperforming discrete models such as RT-2-X (60.7%). However, model architecture and training data quality play crucial roles, as seen with Octo-Base’s lower performance (16.8%). These results underscore the importance of scalable training and efficient generation for embodied robots.
Challenges and Future Directions for Embodied Robots
Despite progress, VLA models face several challenges in real-world deployment. Integrating world models could enable predictive planning, allowing embodied robots to simulate action outcomes and avoid failures. However, this requires learning environment dynamics, which is computationally intensive.
Real-time control demands efficient generation. Techniques like parallel decoding (e.g., in Groot N1) and action token compression (e.g., FAST in π₀-Fast) reduce latency, enabling high-frequency control. For instance, FAST compresses action sequences in the frequency domain, achieving 15× speedup.
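As a rough illustration of frequency-domain compression, the sketch below truncates the discrete cosine transform of an action chunk. FAST additionally quantizes and byte-pair-encodes the coefficients, so this shows only the underlying idea; the chunk shape and `keep` cutoff are assumptions.

```python
import numpy as np
from scipy.fft import dct, idct

def compress_action_chunk(chunk, keep=10):
    """Frequency-domain compression of an action chunk of shape (T, D):
    keep only the lowest `keep` DCT coefficients per dimension."""
    coeffs = dct(chunk, axis=0, norm="ortho")   # (T, D) frequency coefficients
    return coeffs[:keep]                        # drop high-frequency components

def decompress_action_chunk(coeffs, horizon):
    """Invert the truncated DCT back to a smooth (horizon, D) action chunk."""
    full = np.zeros((horizon, coeffs.shape[1]))
    full[:coeffs.shape[0]] = coeffs
    return idct(full, axis=0, norm="ortho")

# Example: a hypothetical 50-step, 7-D chunk compressed to 10 coefficients per dimension
chunk = np.cumsum(np.random.randn(50, 7) * 0.01, axis=0)
recovered = decompress_action_chunk(compress_action_chunk(chunk), horizon=50)
```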
Cross-embodiment generalization aims to create robot-agnostic policies. Abstract action representations, such as task-space commands instead of joint angles, could allow skills to transfer across platforms. This would reduce the need for robot-specific training data.
Safety and reliability are critical in open-world environments. Current models suffer from perception vulnerabilities (e.g., accuracy drops of 20–30% in poor lighting) and delayed emergency stops (200–500 ms). Improving robustness through adversarial training and real-time monitoring is essential for embodied robots in sensitive applications like healthcare.
Computational efficiency remains a bottleneck. Large models like π₀ require over 28 GB of memory, exceeding edge device capacities. Knowledge distillation and hardware acceleration can help, but energy consumption must be optimized for mobile embodied robots.
Finally, data dependence limits scalability. Collecting high-quality demonstrations is costly, and models struggle with beyond-demonstration performance. Combining imitation learning with reinforcement learning could enable self-improvement, closing the gap between human and robot capabilities.
Conclusion
VLA models have transformed embodied robots by unifying perception, language, and action. Action representation and generation strategies are central to this progress, with discrete methods enabling scalability and continuous methods ensuring precision. Autoregressive, non-autoregressive, and hybrid generation strategies offer trade-offs in speed, quality, and diversity. As embodied robots advance, integrating world models, improving efficiency, and ensuring safety will be key. This survey provides a comprehensive overview to guide future research toward more general and reliable embodied agents.