Embodied robot intelligence represents a paradigm shift in artificial intelligence, where agents perceive, reason, and act in physical environments to accomplish complex tasks. Central to this evolution are Vision-Language-Action (VLA) models, which integrate multimodal inputs—visual data and language instructions—to generate actionable outputs for embodied robots. These models bridge the gap between abstract understanding and physical execution, enabling robots to perform tasks in unstructured, dynamic environments. The core of VLA models lies in their ability to translate perceptual and linguistic cues into precise, real-world actions, a process governed by two critical components: action representation and generation strategies. Action representation defines how robot commands are encoded—either as discrete tokens or continuous values—while generation strategies determine the methodology for mapping inputs to these representations. This review systematically explores the advancements, trade-offs, and future directions in these areas, emphasizing their impact on the efficiency, precision, and adaptability of embodied robots.
The development of VLA models has accelerated rapidly, driven by the success of large language models (LLMs) and vision-language models (VLMs). Early models like RT-1 demonstrated the feasibility of discretizing robot actions into tokens, enabling the use of transformer architectures for control tasks. Subsequent innovations, such as RT-2, leveraged pre-trained VLMs to transfer web-scale knowledge to robotic control, while recent approaches like Octo and OpenVLA have focused on scalability and open-source accessibility. A key challenge in this domain is balancing action precision, diversity, and real-time performance, which directly influences the deployment of embodied robots in practical scenarios. This review delves into the technical nuances of action representation and generation, providing insights into how these elements shape the capabilities of modern embodied robots.

In embodied robot systems, action representation serves as the foundational layer that translates high-level intentions into executable commands. The choice of representation—discrete or continuous—affects the robot’s ability to handle high-dimensional spaces and multimodal task solutions. Discrete representations quantize actions into finite sets of tokens, aligning with sequence models like transformers, but often at the cost of precision. Continuous representations, on the other hand, model actions as probability distributions, preserving granularity but requiring complex generative techniques. Similarly, generation strategies—whether autoregressive, non-autoregressive, or hybrid—dictate how actions are sequentially or concurrently produced, impacting inference speed and robustness. This review examines these aspects in detail, highlighting their implications for embodied robot performance in tasks ranging from industrial manipulation to human-robot collaboration.
Action Representation in Embodied Robots
Action representation is a cornerstone of VLA models, determining how embodied robots encode and execute physical commands. The representation must accommodate the inherent complexity of robot motion, which often involves high-dimensional, continuous spaces with multiple valid solutions for a single task. Two predominant approaches have emerged: discrete and continuous action representations. Discrete methods convert continuous actions—such as joint angles or end-effector poses—into a finite vocabulary of tokens, enabling compatibility with language models. Continuous methods, in contrast, directly model actions as real-valued vectors or distributions, capturing subtle variations and enhancing precision. The choice between these paradigms involves trade-offs in accuracy, computational efficiency, and adaptability, which are critical for the real-world performance of embodied robots.
Discrete Action Representation
Discrete action representation revolutionized embodied robot control by framing it as a sequence generation problem. This approach quantizes each action dimension—e.g., Cartesian coordinates or gripper states—into a fixed number of bins, each assigned a unique token ID. For instance, RT-1 discretized 11 action dimensions into 256 bins per dimension, allowing a transformer model to predict action sequences autoregressively. This method leverages the power of pre-trained language models, facilitating knowledge transfer from large-scale datasets to robotic tasks. However, discretization introduces quantization errors, which can compromise precision in tasks requiring fine motor skills, such as assembly or delicate manipulation. Despite this, discrete representations remain popular due to their simplicity and alignment with existing AI infrastructure.
The evolution of discrete representation is marked by several key models. Gato extended this paradigm by unifying diverse data types—images, text, and actions—into a single token sequence, enabling a generalist embodied robot agent. RT-2 further advanced this by fine-tuning VLMs on robotic data, allowing the model to “verbalize” actions as tokens, thus transferring semantic knowledge to physical control. More recently, models like Humanoid-VLA and JARVIS-VLA have applied discrete representations to humanoid robots and virtual environments, respectively, demonstrating their versatility. The following table summarizes representative models and their characteristics:
| Year | Model | Core Paradigm | Platform | Task Domain | Action Space Type | Action Dimensions | Discrete Bins | Key Challenges |
|---|---|---|---|---|---|---|---|---|
| 2022 | RT-1 | Imitation Learning | Mobile Manipulator | Mobile Manipulation | End-effector Pose + Base | 11 | 256 | Imitation Learning Limit, Generalization |
| 2022 | Gato | General Supervised Learning | Sawyer Arm | Multi-domain Tasks | Velocity Control + Gripper | 5 | 1024 | Context Length, Inference Speed |
| 2023 | RT-2 | VLM Co-finetuning | Mobile Manipulator | Semantic Manipulation | End-effector Pose + Base | 11 | 256 | Physical Skill Limits, Compute Cost |
| 2023 | Q-Transformer | Offline Reinforcement Learning | Mobile Manipulator | Multi-task Manipulation | End-effector Pose + Gripper | 8 | 256 | Reward Function Design, High-dimensional Actions |
| 2024 | OpenVLA | VLM Finetuning | Various Manipulators | Cross-embodiment Manipulation | End-effector Pose + Gripper | 7 | 256 | Single-image Input, Inference Efficiency |
| 2025 | Humanoid-VLA | Language-Motion Alignment | Humanoid Robot | Locomotion-Manipulation | Full-body Pose | 24 | 1024 | Data Scarcity, Reliance on Low-level RL |
| 2025 | JARVIS-VLA | ActVLP | Virtual Agent | Game Interaction | Keyboard & Mouse | — | 51 | Inference Latency, Human Performance Gap |
Mathematically, discrete action representation can be formalized using tokenization. Let $a_t$ denote a continuous action at time $t$, which is mapped to a discrete token $s_t$ via binning: $$s_t = \arg\min_i |a_t - c_i|,$$ where $c_i$ represents the center of the $i$-th bin. The probability of an action sequence $s_{1:L}$ given parameters $\theta$ is modeled autoregressively: $$\log p(s_{1:L} | \theta) = \sum_{l=1}^L \log p(s_l | s_{1:l-1}, \theta).$$ This formulation enables embodied robots to leverage sequence models for control, but the discretization error $\epsilon = a_t - c_{s_t}$ can accumulate over time, affecting task performance.
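To make the binning concrete, the following is a minimal sketch in Python (NumPy). The normalized action range and the 256 uniform bins are illustrative assumptions in the style of RT-1-like discretization, not any specific model's implementation:

```python
import numpy as np

# Per-dimension uniform binning of continuous actions into token IDs.
NUM_BINS = 256
LOW, HIGH = -1.0, 1.0                      # assumed normalized action range
centers = LOW + (np.arange(NUM_BINS) + 0.5) * (HIGH - LOW) / NUM_BINS

def tokenize(action: np.ndarray) -> np.ndarray:
    """Map each continuous action dimension to the index of the nearest bin center."""
    # s_t = argmin_i |a_t - c_i|, applied independently per dimension
    return np.abs(action[:, None] - centers[None, :]).argmin(axis=-1)

def detokenize(tokens: np.ndarray) -> np.ndarray:
    """Recover the (quantized) continuous action from token IDs."""
    return centers[tokens]

a = np.array([0.137, -0.502, 0.999])       # e.g. a 3-D end-effector delta
s = tokenize(a)
a_hat = detokenize(s)
quantization_error = a - a_hat             # epsilon = a_t - c_{s_t}
print(s, quantization_error)
```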
Continuous Action Representation
Continuous action representation addresses the limitations of discretization by modeling actions as real-valued vectors or distributions. This approach is essential for embodied robots performing tasks that require high precision, such as surgical operations or fine assembly. The key challenge here is multimodality—multiple valid action trajectories for a single task—which deterministic regression models fail to capture, instead averaging across modes. To overcome this, continuous representations often employ probabilistic models, such as Conditional Variational Autoencoders (CVAEs) or diffusion models, to learn the distribution of feasible actions. This allows sampling of diverse, smooth trajectories that adapt to environmental uncertainties.
Notable models in this category include ACT, which uses a CVAE to encode action sequences into a latent space, enabling precise bimanual manipulation. Diffusion Policy introduced diffusion models to robotics, iteratively denoising random noise into action trajectories conditioned on visual inputs. More recently, Octo and $\pi_0$ have scaled continuous representations to diverse robot platforms using diffusion and flow matching techniques, respectively. The table below outlines these models and their attributes:
| Year | Model | Core Paradigm | Platform | Task Domain | Action Dimensions | Representation Type | Key Challenges |
|---|---|---|---|---|---|---|---|
| 2023 | ACT | Imitation Learning | Robotic Arm | Bimanual Manipulation | 14 | Conditional VAE | Hardware Limits, Perception Issues |
| 2024 | Octo | Imitation Learning | Various Manipulators | General Manipulation | 7/14 | Conditional Diffusion | Wrist Camera Processing, Data Dependency |
| 2024 | $\pi_0$ | VLM Finetuning | Manipulator, Mobile Robot | Dexterous Long-horizon Tasks | 18 | Conditional Flow Matching | Data Scarcity, Proprietary Datasets |
| 2025 | HybridVLA | Collaborative Training | Robotic Arm | General Tabletop Tasks | 7/14 | Hybrid Generation | Inference Speed |
| 2025 | DexVLA | Embodied Curriculum Learning | Manipulator, Dexterous Hand | Cross-embodiment Dexterous Tasks | — | Multi-head Diffusion | Contact-rich Scenario Limits |
In continuous representation, actions are modeled as probability distributions. For example, in a CVAE, the latent variable $z$ is sampled from a prior $p(z)$, and the action sequence $a_{1:T}$ is decoded conditioned on observations $o$: $$p(a_{1:T} | o) = \int p(a_{1:T} | z, o) p(z) dz.$$ The training objective maximizes the evidence lower bound (ELBO): $$\mathcal{L}_{\text{ELBO}} = \mathbb{E}_{q(z|a,o)}[\log p(a|z,o)] - D_{\text{KL}}(q(z|a,o) || p(z)),$$ where $q(z|a,o)$ is the encoder network. This enables embodied robots to generate diverse actions while maintaining precision, though it requires significant computational resources.
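As an illustration of this objective, here is a hedged PyTorch sketch of a CVAE over short action chunks conditioned on an observation embedding. The dimensions, network sizes, chunk length, and the choice of a squared-error reconstruction term are assumptions for the example, not a specific published architecture:

```python
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM, CHUNK, LATENT = 64, 7, 16, 32   # illustrative sizes

class ActionCVAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(OBS_DIM + ACT_DIM * CHUNK, 256), nn.ReLU(),
            nn.Linear(256, 2 * LATENT))               # outputs (mu, log_var)
        self.decoder = nn.Sequential(
            nn.Linear(OBS_DIM + LATENT, 256), nn.ReLU(),
            nn.Linear(256, ACT_DIM * CHUNK))

    def forward(self, obs, actions):
        flat = actions.flatten(1)                      # (B, ACT_DIM*CHUNK)
        mu, log_var = self.encoder(torch.cat([obs, flat], -1)).chunk(2, -1)
        z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()   # reparameterization
        recon = self.decoder(torch.cat([obs, z], -1)).view_as(actions)
        # ELBO = reconstruction term - KL(q(z|a,o) || N(0, I))
        recon_loss = ((recon - actions) ** 2).mean()
        kl = (-0.5 * (1 + log_var - mu ** 2 - log_var.exp())).sum(-1).mean()
        return recon_loss + 1e-3 * kl                  # KL weight is a tunable choice

obs = torch.randn(8, OBS_DIM)
actions = torch.randn(8, CHUNK, ACT_DIM)
loss = ActionCVAE()(obs, actions)
loss.backward()
```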
Action Generation Strategies for Embodied Robots
Action generation strategies in VLA models determine how embodied robots translate multimodal inputs into action sequences. These strategies are critical for real-time performance, as they influence inference speed, action quality, and adaptability. Three primary categories exist: autoregressive, non-autoregressive, and hybrid strategies. Autoregressive methods generate actions sequentially, leveraging transformer decoders but suffering from latency. Non-autoregressive strategies, including CVAE-based, diffusion-based, and flow matching approaches, produce actions in parallel or iteratively, improving speed and diversity. Hybrid strategies combine multiple methods to balance planning and control. Each approach involves trade-offs between precision, efficiency, and robustness, which are pivotal for deploying embodied robots in dynamic environments.
Autoregressive Generation Strategies
Autoregressive generation is a foundational strategy for embodied robots, where actions are produced step-by-step in a causal manner. This approach models the probability of an action sequence $a_{1:T}$ given observations $o$ and language instructions $p$ as: $$p(a_{1:T} | o, p) = \prod_{t=1}^T p(a_t | a_{1:t-1}, o, p).$$ Implemented via transformer decoders with masked self-attention, autoregressive strategies ensure temporal consistency but introduce sequential dependencies that limit inference frequency to 3–5 Hz, insufficient for real-time control in embodied robots. Models like VIMA and ChatVLA employ this strategy for tasks requiring complex reasoning, but the discretization of actions often results in jerky motions and reduced precision.
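The sequential bottleneck is easy to see in code. Below is an illustrative sketch of token-by-token decoding, where `policy_logits` is a placeholder for a transformer decoder with masked self-attention; the vocabulary size and action dimensionality are assumptions:

```python
import torch

VOCAB, ACTION_DIMS = 256, 7                            # illustrative sizes

def policy_logits(obs_tokens, prefix):
    # Placeholder: a real VLA would run a transformer over [observations; prefix].
    return torch.randn(VOCAB)

def decode_action(obs_tokens):
    prefix = []
    # One forward pass per action dimension: latency grows linearly with ACTION_DIMS.
    for _ in range(ACTION_DIMS):
        probs = torch.softmax(policy_logits(obs_tokens, prefix), dim=-1)
        prefix.append(torch.multinomial(probs, 1).item())
    return prefix                                      # token IDs, later de-binned to a_t

action_tokens = decode_action(obs_tokens=torch.zeros(10, dtype=torch.long))
```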
The training loss for autoregressive models typically involves cross-entropy over discretized action tokens. For a batch of sequences, the loss is: $$\mathcal{L} = -\sum_{b=1}^{|B|} \sum_{l=1}^L m(b, l) \log p(s_l^{(b)} | s_{1:l-1}^{(b)}),$$ where $m(b, l)$ is a mask indicating valid tokens. While this approach benefits from pre-trained language models, it struggles with long-horizon tasks due to error accumulation. For embodied robots, this can lead to task failures in precision-critical applications, necessitating alternative strategies.
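A minimal sketch of this masked cross-entropy objective, with random tensors standing in for decoder logits and ground-truth action tokens (shapes are illustrative):

```python
import torch
import torch.nn.functional as F

B, L, VOCAB = 4, 11, 256                                 # batch, sequence length, vocabulary
logits = torch.randn(B, L, VOCAB, requires_grad=True)    # stand-in for p(s_l | s_{<l}, o, p)
targets = torch.randint(0, VOCAB, (B, L))                # ground-truth action tokens
mask = torch.ones(B, L)                                  # m(b, l): 1 for valid tokens, 0 for padding

token_nll = F.cross_entropy(
    logits.reshape(-1, VOCAB), targets.reshape(-1), reduction="none"
).reshape(B, L)
loss = (token_nll * mask).sum() / mask.sum()             # average negative log-likelihood over valid tokens
loss.backward()
```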
Non-Autoregressive Generation Strategies
Non-autoregressive strategies overcome the latency of autoregressive methods by generating actions in parallel or through iterative refinement. These include CVAE-based probabilistic generation, diffusion-based iterative generation, and flow matching, each offering unique advantages for embodied robots.
CVAE-based Probabilistic Generation
CVAE-based strategies model action distributions to handle multimodality. The model encodes expert demonstrations into a latent variable $z$, and decodes actions conditioned on observations. The loss function combines reconstruction loss and KL divergence: $$\mathcal{L} = \mathcal{L}_{\text{reconst}} + \beta \mathcal{L}_{\text{reg}},$$ where $\mathcal{L}_{\text{reconst}} = \text{L1}(a, \hat{a})$ ensures action accuracy, and $\mathcal{L}_{\text{reg}} = D_{\text{KL}}(q(z|a,o) || \mathcal{N}(0, I))$ regularizes the latent space. ACT exemplifies this approach, generating smooth trajectories for bimanual manipulation. However, CVAEs require careful tuning and can be computationally intensive for complex embodied robot tasks.
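In code, the combined objective can be sketched as follows; the tensors stand in for encoder and decoder outputs, and the value of $\beta$ is a tunable weight rather than a prescribed constant:

```python
import torch
import torch.nn.functional as F

def cvae_loss(pred_actions, actions, mu, log_var, beta=10.0):
    """L = L_reconst + beta * L_reg, with an L1 reconstruction term and a KL regularizer toward N(0, I)."""
    reconst = F.l1_loss(pred_actions, actions)
    kl = (-0.5 * (1 + log_var - mu.pow(2) - log_var.exp())).sum(-1).mean()
    return reconst + beta * kl

# Example with dummy tensors: an 8-sample batch of 16-step, 7-DoF action chunks and a 32-D latent.
loss = cvae_loss(torch.randn(8, 16, 7), torch.randn(8, 16, 7),
                 torch.randn(8, 32), torch.randn(8, 32))
```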
Diffusion-based Iterative Generation
Diffusion models frame action generation as a denoising process. Starting from noise $A_K$, actions are refined over $K$ steps: $$A_{k-1} = \alpha (A_k - \gamma \epsilon_\theta(A_k, o, k)) + \sigma \zeta,$$ where $\epsilon_\theta$ is a noise predictor trained to minimize: $$\mathcal{L} = \mathbb{E}[\| \epsilon - \epsilon_\theta(A_k, o, k) \|^2].$$ Models like Diffusion Policy and MDT use this for high-quality trajectory generation, but the iterative process slows inference. For embodied robots, this can hinder real-time responsiveness, though techniques like distillation aim to mitigate this.
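The denoising loop above can be sketched as follows, with a placeholder noise predictor standing in for a trained network conditioned on observations; the schedule constants and step count are illustrative:

```python
import torch

K, CHUNK, ACT_DIM = 20, 16, 7                         # illustrative step count and chunk shape

def eps_theta(actions, obs, k):
    # Placeholder for the learned noise predictor epsilon_theta(A_k, o, k).
    return torch.zeros_like(actions)

def denoise(obs, alpha=0.99, gamma=0.1, sigma=0.01):
    A = torch.randn(CHUNK, ACT_DIM)                   # A_K ~ N(0, I)
    for k in range(K, 0, -1):
        noise = sigma * torch.randn_like(A) if k > 1 else 0.0
        # A_{k-1} = alpha * (A_k - gamma * eps_theta(A_k, o, k)) + sigma * zeta
        A = alpha * (A - gamma * eps_theta(A, obs, k)) + noise
    return A                                          # A_0: the denoised action trajectory

trajectory = denoise(obs=torch.randn(64))
```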
Flow Matching Generation
Flow matching learns a vector field $v_\theta$ that transitions noise to actions along a probability path. The loss is: $$\mathcal{L} = \mathbb{E}_{t, A_t}[\| v_\theta(A_t, o, t) - (A_1 - A_0) \|^2],$$ and inference involves solving an ODE: $$A_{t+\delta} = A_t + \delta \cdot v_\theta(A_t, o, t).$$ $\pi_0$ and GraspVLA employ this for efficient action generation, reducing steps to ~10 compared to diffusion's 100+. This benefits embodied robots by enabling faster inference while maintaining trajectory quality.
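A hedged sketch of conditional flow matching for action chunks: the vector field is trained on linear noise-to-action paths and then integrated with a few Euler steps at inference. The network, chunk size, and step count are assumptions for illustration:

```python
import torch
import torch.nn as nn

OBS_DIM, CHUNK, ACT_DIM, STEPS = 64, 16, 7, 10        # illustrative sizes

# Vector field v_theta(A_t, o, t), here a small MLP over [obs; flattened A_t; t].
field = nn.Sequential(nn.Linear(OBS_DIM + CHUNK * ACT_DIM + 1, 256),
                      nn.ReLU(), nn.Linear(256, CHUNK * ACT_DIM))

def fm_loss(obs, actions):
    A0 = torch.randn_like(actions)                     # noise endpoint
    t = torch.rand(actions.shape[0], 1, 1)
    At = (1 - t) * A0 + t * actions                    # point on the linear path
    inp = torch.cat([obs, At.flatten(1), t[:, :, 0]], -1)
    v = field(inp).view_as(actions)
    return ((v - (actions - A0)) ** 2).mean()          # regress the target velocity A_1 - A_0

def generate(obs):
    A = torch.randn(1, CHUNK, ACT_DIM)
    for i in range(STEPS):                             # Euler ODE solve: A <- A + delta * v_theta(A, o, t)
        t = torch.full((1, 1), i / STEPS)
        A = A + (1 / STEPS) * field(torch.cat([obs, A.flatten(1), t], -1)).view_as(A)
    return A

loss = fm_loss(torch.randn(8, OBS_DIM), torch.randn(8, CHUNK, ACT_DIM))
chunk = generate(torch.randn(1, OBS_DIM))
```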
Hybrid Generation Strategies
Hybrid strategies integrate multiple generation methods to leverage their strengths. For example, HybridVLA combines autoregressive planning with diffusion-based control, using a unified LLM to coordinate both. The autoregressive component handles high-level task decomposition, while the diffusion component generates precise low-level actions. This approach enhances robustness for embodied robots in complex tasks but requires sophisticated architecture design to align the temporal and semantic scales of different strategies. The adaptive action ensemble in HybridVLA, for instance, dynamically weights predictions from both generators, improving performance in simulated and real-world environments.
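As a purely hypothetical illustration of the ensembling idea (not HybridVLA's published mechanism), predictions from two action heads can be blended with confidence-derived weights:

```python
import torch

def ensemble(a_autoregressive, a_diffusion, conf_ar, conf_diff):
    """Blend two action predictions; weights are softmax-normalized confidences."""
    w = torch.softmax(torch.stack([conf_ar, conf_diff]), dim=0)
    return w[0] * a_autoregressive + w[1] * a_diffusion

a = ensemble(torch.tensor([0.10, -0.20, 0.05]),        # hypothetical AR head output
             torch.tensor([0.12, -0.18, 0.04]),        # hypothetical diffusion head output
             conf_ar=torch.tensor(0.6), conf_diff=torch.tensor(1.4))
```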
Model Evaluation for Embodied Robots
Evaluating VLA models for embodied robots involves standardized benchmarks to assess scalability, generalization, and lifelong learning. Two prominent datasets are LIBERO and Open X-Embodiment. LIBERO focuses on knowledge transfer across tasks, measuring metrics like forward transfer, backward transfer, and area under the success curve. Open X-Embodiment aggregates diverse robotic data to test cross-embodiment generalization, primarily using task success rates. The tables below compare model performances on these benchmarks, highlighting the impact of action representation and generation strategies on embodied robot capabilities.
| Action Type | VLA Model | Average Success Rate on LIBERO (%) |
|---|---|---|
| Continuous | Diffusion Policy | 72.4 |
| Continuous | Octo | 75.1 |
| Continuous | DiT Policy | 82.4 |
| Continuous | OpenVLA-OFT | 95.4 |
| Continuous | $\pi_0$ | 94.2 |
| Discrete | OpenVLA | 76.5 |
| Discrete | WorldVLA | 79.1 |
| Action Type | VLA Model | Average Success Rate on Open X-Embodiment (%) |
|---|---|---|
| Continuous | Octo-Base | 16.8 |
| Continuous | $\pi_0$ | 70.1 |
| Discrete | RT-1 | 6.8 |
| Discrete | TraceVLA | 42.0 |
| Discrete | RT-1-X | 53.4 |
| Discrete | RT-2-X | 60.7 |
| Discrete | OpenVLA | 27.7 |
The results indicate that continuous action models, particularly those using flow matching or diffusion, achieve higher success rates, underscoring their superiority for precision tasks in embodied robots. However, discrete models remain competitive in scenarios requiring rapid inference or leveraging pre-trained knowledge.
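For reference, the lifelong-learning metrics mentioned above for LIBERO can be computed from a matrix of per-task success rates recorded over the training sequence. The sketch below uses common continual-learning formulations, which may differ in detail from the benchmark's exact definitions:

```python
import numpy as np

def lifelong_metrics(S: np.ndarray):
    """S[i, j]: success rate on task j after training through task i (tasks indexed 0..T-1)."""
    T = S.shape[0]
    fwt = np.mean([S[i - 1, i] for i in range(1, T)])             # success on not-yet-trained tasks
    bwt = np.mean([S[T - 1, j] - S[j, j] for j in range(T - 1)])  # retention of earlier tasks
    auc = S[np.tril_indices(T)].mean()                            # area under the success curve
    return fwt, bwt, auc

S = np.array([[0.90, 0.10, 0.00],
              [0.80, 0.85, 0.20],
              [0.75, 0.80, 0.90]])
print(lifelong_metrics(S))
```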
Challenges and Opportunities for Embodied Robots
Despite progress, embodied robots face significant challenges in action representation and generation. Key issues include real-time inference, safety, and generalization. Conversely, emerging opportunities—such as integration with world models and cross-embodiment representations—promise to advance the field.
Integration with World Models
World models enable embodied robots to predict environment dynamics, facilitating long-horizon planning and proactive decision-making. By simulating action consequences, VLA models can avoid failures and optimize task performance. For example, predicting tool interactions before execution enhances efficiency in assembly tasks. However, training world models requires vast datasets and complex architectures, posing scalability challenges for embodied robots.
Breaking Traditional Paradigms
Traditional “perception-planning-control” pipelines in embodied robots suffer from integration gaps and limited adaptability. VLA models offer end-to-end learning, directly mapping inputs to actions, which improves robustness. Yet, this approach demands extensive data and compute resources, highlighting the need for efficient training techniques.
Efficient Generation for Real-Time Control
Real-time control in embodied robots requires high-frequency action generation (e.g., 100 Hz). Strategies like parallel decoding (e.g., in GR00T N1) and token compression (e.g., FAST in $\pi_0$-FAST) cut latency, with reported reductions of up to 60% and speedups of roughly 15×, respectively. These innovations enable deployment on resource-constrained platforms, expanding the applicability of embodied robots.
Cross-Embodiment General Representations
Developing embodiment-agnostic action representations allows policies to transfer across robot morphologies. Abstract representations, such as task-space coordinates, facilitate quick adaptation with minimal calibration. This could revolutionize embodied robots by enabling skill sharing between manipulators, humanoids, and mobile platforms.
Safety and Reliability in Open Worlds
Embodied robots must operate safely in unpredictable environments. Current VLA models lack robust safety mechanisms, often relying on hard-coded thresholds. Improving perception robustness—e.g., handling occlusions or lighting changes—and incorporating predictive collision avoidance are critical. Additionally, reducing emergency stop delays (currently 200–500 ms) is vital for human-robot collaboration.
Compute and Energy Efficiency
Large VLA models (e.g., 7B parameters) require substantial memory and energy, limiting edge deployment. Model compression, quantization, and hardware acceleration can mitigate these issues. For instance, optimizing diffusion steps or using specialized chips enhances efficiency for embodied robots in field applications.
Conclusion
Action representation and generation strategies are pivotal to the advancement of embodied robots through VLA models. Discrete representations offer simplicity and compatibility with language models, while continuous representations provide precision and diversity. Autoregressive generation ensures sequential coherence but lags in speed, whereas non-autoregressive and hybrid strategies improve efficiency and robustness. Benchmarks like LIBERO and Open X-Embodiment highlight the superiority of continuous approaches in complex tasks. Future directions include world model integration, cross-embodiment generalization, and real-time optimization, which will empower embodied robots to perform reliably in diverse, dynamic environments. As research progresses, these innovations will drive the realization of general-purpose embodied intelligence, transforming industries from manufacturing to healthcare.
