In the pursuit of Artificial General Intelligence (AGI), the field of embodied AI has emerged as a critical pathway. It focuses on developing intelligent agents that can perceive, reason, and interact within the physical world. The advent of powerful Large Language Models (LLMs) and Vision-Language Models (VLMs) has catalyzed the development of a new class of multimodal architectures: Vision-Language-Action (VLA) models. These models aim to seamlessly integrate visual perception, language understanding, and physical action generation to solve instruction-conditioned tasks for embodied AI robots. By bridging internet-scale semantic knowledge with real-world physical interaction, VLA models demonstrate unprecedented generalization and flexibility in unstructured environments, moving beyond the limitations of traditional, fragmented robotic pipelines.
While existing surveys provide broad overviews of robotic foundation models, a systematic and in-depth analysis of the core bridge between multimodal “understanding” and physical “execution”—namely, action representation and generation strategies—is notably absent. This gap is crucial because an embodied AI robot’s ultimate value is determined not just by its perception but by its ability to translate that perception into precise, effective, and reliable physical actions. Therefore, this article provides a comprehensive review focused specifically on the evolution, methodologies, trade-offs, and future directions of action representation and generation within VLA models for embodied intelligence.
The Evolution and Architecture of VLA Models
The development of VLA models represents a paradigm shift from traditional modular robotics. A canonical VLA architecture typically consists of three core components working in concert: a visual encoder, a language encoder, and an action decoder. The visual encoder, often a pre-trained Vision Transformer (ViT), processes raw pixel input from cameras to extract structured features representing objects, spatial relationships, and scene geometry. The language encoder, based on an LLM, interprets the natural language instruction and encodes it into a contextual vector. The action decoder, the focal point of this survey, is responsible for the critical transformation of the fused visual-language understanding into executable motor commands for the embodied AI robot. This end-to-end learning framework overcomes the integration gaps inherent in traditional “sense-plan-act” pipelines, enabling more adaptive and fluent task execution.
The trajectory of VLA models has been rapid. Early works like RT-1 pioneered the use of Transformer architectures and discretized action tokens for large-scale real-world control. A significant milestone was reached with RT-2, which demonstrated that knowledge from web-scale visual-language data could be directly transferred to robotic control through co-fine-tuning, leading to emergent capabilities like visual chain-of-thought reasoning. The field has since moved towards greater openness and efficiency, exemplified by models like Octo, which leveraged large open datasets and diffusion policies, and OpenVLA, which showed that smaller, open-source models could rival larger counterparts through efficient fine-tuning. The latest frontier involves applying these models to increasingly complex systems, such as humanoid robots, indicating a move towards more general, safe, and collaborative embodied AI.

Action Representation: Bridging Symbols and Motion
Action representation is the fundamental scheme that defines the physical output of an embodied AI robot. It directly tackles the challenge of mapping high-dimensional, continuous native robot actuation spaces (e.g., joint angles, end-effector poses) into a form learnable by neural models. The design of this representation is a pivotal choice, balancing precision, diversity, and compatibility with different generation strategies. Two primary paradigms have emerged: discrete and continuous action representation.
Discrete Action Representation
Discrete representation was a foundational innovation, transforming robot control into a sequence modeling problem akin to language generation. The core idea is to quantize or “bin” each dimension of a continuous action (e.g., X, Y, Z, rotation) into a fixed set of intervals. Each interval is assigned a unique token ID, creating a finite “action vocabulary.” For instance, RT-1 uniformly divided each of its 11 action dimensions into 256 bins. This allows a standard Transformer decoder to autoregressively predict a sequence of discrete action tokens, effectively treating control as a next-token prediction task.
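As an illustration, this binning scheme can be sketched in a few lines of Python. The bin count matches the RT-1 example above, but the action range and function names are illustrative assumptions, not taken from any released codebase:

```python
import numpy as np

# Uniform binning per action dimension; range is an assumed
# normalized action space, not RT-1's actual calibration.
N_BINS = 256
LOW, HIGH = -1.0, 1.0

def tokenize(action: np.ndarray) -> np.ndarray:
    """Map continuous action values to integer token ids in [0, N_BINS-1]."""
    clipped = np.clip(action, LOW, HIGH)
    ids = np.floor((clipped - LOW) / (HIGH - LOW) * N_BINS).astype(int)
    return np.minimum(ids, N_BINS - 1)  # value == HIGH falls in the last bin

def detokenize(ids: np.ndarray) -> np.ndarray:
    """Map token ids back to the center of each bin."""
    width = (HIGH - LOW) / N_BINS
    return LOW + (ids + 0.5) * width

a = np.array([0.0, 0.73, -1.0])
ids = tokenize(a)
recon = detokenize(ids)
# Round-trip error is bounded by half a bin width: the "precision loss"
# discussed below.
assert np.all(np.abs(recon - a) <= (HIGH - LOW) / N_BINS / 2 + 1e-9)
```

The half-bin-width bound makes the precision trade-off concrete: finer bins shrink the error but grow the action vocabulary the sequence model must predict over.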
The discrete paradigm has enabled the powerful transfer of knowledge from pre-trained VLMs to the robotics domain. The key advantage is the unification of control with highly scalable sequence models. However, this comes with an inherent trade-off: quantization inevitably introduces precision loss, which can be detrimental for tasks requiring sub-millimeter accuracy. This limitation spurred the development of continuous representations.
Table 1 summarizes key models employing discrete action representation, illustrating the evolution in platforms, task complexity, and associated challenges.
| Year | Model | Core Paradigm | Platform | Task Domain | Action Space | Dim. | Bins | Key Challenges |
|---|---|---|---|---|---|---|---|---|
| 2022 | RT-1 | Imitation Learning | Mobile Manipulator | Kitchen Manipulation | EE Pose + Base | 11 | 256 | Imitation ceiling, generalization limits |
| 2022 | Gato | Generalist Supervised Learning | Multiple (Arm, etc.) | Multitask | EE Velocity + Gripper | 5 | 1024 | Context length, slow inference |
| 2023 | RT-2 | VLM Co-fine-tuning | Mobile Manipulator | Semantic Manipulation | EE Pose + Base | 11 | 256 | Physical skill limits, compute cost |
| 2023 | Q-Transformer | Offline Reinforcement Learning | Mobile Manipulator | Multitask Manipulation | EE Pose + Gripper | 8 | 256 | Reward design, high-dim. action |
| 2024 | OpenVLA | VLM Fine-tuning | Multiple Arms | Cross-embodiment Manipulation | EE Pose + Gripper | 7 | 256 | Single-image, inference latency |
| 2025 | Humanoid-VLA | Language-Motion Alignment | Humanoid Robot | Locomotion & Manipulation | Full-Body Pose | 24 | 1024 | Data scarcity, reliance on low-level RL |
Continuous Action Representation
Continuous representation addresses the precision limitation of discretization by directly modeling actions in their native space. A central challenge here is multimodality: for a given task (e.g., “place the cup on the table”), there exist many equally valid but subtly different action trajectories. A simple regression model trained with Mean Squared Error (MSE) loss would average these modes, producing a blurry, ineffective “mean action”, a failure commonly referred to as mode averaging.
To overcome this, continuous representations model the full probability distribution over valid actions. Instead of predicting a single value, the model learns to capture the diversity of possible solutions. During inference, a specific, coherent trajectory is sampled from this learned distribution. Prominent techniques for learning such distributions include Conditional Variational Autoencoders (CVAEs), diffusion models, and flow matching.
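A toy numerical example makes the contrast concrete; the bimodal “demonstrations” below are invented purely for illustration:

```python
import numpy as np

# Two equally valid demonstrated actions for the same observation,
# e.g. reaching around an obstacle on the left (-1) or the right (+1).
demos = np.array([-1.0, +1.0])

# The MSE-optimal point prediction is the mean of the modes:
# it drives straight into the obstacle.
mse_prediction = demos.mean()   # 0.0

# Sampling from a learned distribution instead picks one coherent mode.
rng = np.random.default_rng(0)
sampled = rng.choice(demos)     # either -1.0 or +1.0, never 0.0

assert mse_prediction == 0.0
assert sampled in (-1.0, 1.0)
```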
For example, ACT employs a CVAE to learn a latent space capturing action variability for high-precision bimanual manipulation. The training objective corresponds to maximizing the Evidence Lower Bound (ELBO), i.e., minimizing:
$$ \mathcal{L} = \mathcal{L}_{\text{reconst}} + \beta \cdot \mathcal{L}_{\text{reg}} $$
where the reconstruction loss $\mathcal{L}_{\text{reconst}} = \text{L1}(\hat{a}_{t:t+k}, a_{t:t+k})$ ensures accuracy, and the regularization loss $\mathcal{L}_{\text{reg}} = D_{KL}(q_{\phi}(z | a_{t:t+k}, o_t) || \mathcal{N}(0, I))$ encourages a structured latent space.
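The two terms can be sketched numerically as follows; the shapes, the closed-form diagonal-Gaussian KL, and the default $\beta$ are illustrative assumptions rather than ACT's exact configuration:

```python
import numpy as np

# Sketch of the ACT-style CVAE objective: the encoder q_phi outputs a
# diagonal Gaussian (mu, logvar) over latent z, and the decoder
# reconstructs an action chunk of k timesteps x d dimensions.
def act_loss(pred_actions, true_actions, mu, logvar, beta=10.0):
    # L1 reconstruction over the predicted action chunk.
    l_reconst = np.abs(pred_actions - true_actions).mean()
    # KL(q_phi(z | a, o) || N(0, I)) for a diagonal Gaussian, closed form.
    l_reg = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)
    return l_reconst + beta * l_reg

k, d, z_dim = 8, 7, 32
rng = np.random.default_rng(0)
pred = rng.normal(size=(k, d))

# Perfect reconstruction and a standard-normal posterior give zero loss.
loss = act_loss(pred, pred, mu=np.zeros(z_dim), logvar=np.zeros(z_dim))
assert loss == 0.0
```

The $\beta$ weight trades reconstruction accuracy against latent-space regularity, exactly the balance the ELBO expresses.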
Diffusion models, as used in Octo and Diffusion Policy, treat action generation as an iterative denoising process. Starting from noise $A^K_t \sim \mathcal{N}(0, I)$, the model refines it over K steps:
$$ A^{k-1}_t = \alpha (A^k_t - \gamma \epsilon_{\theta}(O_t, A^k_t, k)) + \sigma \mathcal{N}(0, I) $$
where $\epsilon_{\theta}$ is a noise prediction network trained to minimize $\mathcal{L} = \text{MSE}(\epsilon, \epsilon_{\theta}(O_t, A^0_t + \epsilon, k))$.
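The update rule above translates into a short denoising loop. The sketch below uses a toy stand-in for the trained noise-prediction network $\epsilon_{\theta}$ and placeholder schedule constants:

```python
import numpy as np

# Iterative denoising following the update rule above. eps_model stands
# in for eps_theta(O_t, A^k_t, k); alpha, gamma, sigma are placeholder
# schedule constants, not a tuned noise schedule.
def denoise(eps_model, obs, shape, K=10, alpha=1.0, gamma=0.1, sigma=0.0, seed=0):
    rng = np.random.default_rng(seed)
    a = rng.normal(size=shape)                  # A^K_t ~ N(0, I)
    for k in range(K, 0, -1):
        noise = sigma * rng.normal(size=shape)  # zero when sigma == 0
        a = alpha * (a - gamma * eps_model(obs, a, k)) + noise
    return a                                    # A^0_t: the action chunk

# Toy "predictor" that returns the current sample as its noise estimate,
# so each update contracts the trajectory toward zero.
toy_eps = lambda obs, a, k: a
out = denoise(toy_eps, obs=None, shape=(4, 7), K=20, gamma=0.5)
assert np.all(np.abs(out) < 1e-3)
```

The K sequential evaluations of $\epsilon_{\theta}$ are the computational cost referred to throughout this survey.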
Table 2 contrasts models using continuous representations, highlighting the trend towards more powerful generative methods.
| Year | Model | Core Paradigm | Platform | Key Technique | Primary Challenge |
|---|---|---|---|---|---|
| 2023 | ACT | Imitation Learning | Robotic Arm | CVAE | Hardware limits, perception |
| 2024 | Octo | Imitation Learning | Multiple Arms | Conditional Diffusion | Wrist-cam processing, demo data reliance |
| 2024 | π0 | VLM Fine-tuning | Arm, Mobile Robot | Flow Matching | Dependence on massive, proprietary data |
| 2025 | DexVLA | Embodied Curriculum Learning | Arm, Dexterous Hand | Multi-head Diffusion | Limitations in complex contact-rich scenes |
Action Generation Strategies: From Understanding to Execution
The action generation strategy is the decision engine of a VLA model. It defines the algorithmic process that maps fused multimodal inputs onto the chosen action representation. The choice of strategy involves critical trade-offs between precision and efficiency, and between diversity and stability, fundamentally shaping the capabilities of the embodied AI robot.
Autoregressive Generation
Autoregressive generation is a sequential method where tokens are produced one at a time, each conditioned on all previously generated tokens. It is the natural strategy for discrete token sequences and is implemented using Transformer decoder blocks with causal masking. The probability of an action sequence $a_{1:L}$ is decomposed via the chain rule:
$$ p(a_{1:L} | o, \text{inst}) = \prod_{t=1}^{L} p(a_t | a_{<t}, o, \text{inst}) $$
where $o$ is the observation and $\text{inst}$ is the instruction. This formulation lets VLA models directly leverage the powerful sequence modeling of LLMs, but its inherently sequential nature limits inference speed, making high-frequency real-time control challenging for an embodied AI robot.
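The chain-rule decomposition can be sketched as a sequential decoding loop; the uniform-logit “model” below is a stand-in for a causally-masked Transformer decoder:

```python
import numpy as np

# Autoregressive action decoding: token t is sampled from
# p(a_t | a_<t, o, inst) and fed back as input for step t+1.
def decode(model, obs, inst, seq_len, vocab=256, seed=0):
    rng = np.random.default_rng(seed)
    tokens = []
    for t in range(seq_len):               # L sequential forward passes
        logits = model(obs, inst, tokens)  # conditioned on the prefix
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        tokens.append(int(rng.choice(vocab, p=probs)))
    return tokens

# Stand-in model returning uniform logits over a 256-token action vocabulary.
uniform_model = lambda obs, inst, prefix: np.zeros(256)
actions = decode(uniform_model, obs=None, inst="pick up the cup", seq_len=7)
assert len(actions) == 7
```

The loop makes the latency issue visible: one full forward pass per token, which is the bottleneck the non-autoregressive strategies below attack.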
Non-Autoregressive Generation
Non-autoregressive strategies aim to generate entire action sequences in parallel or with fewer sequential steps, targeting the speed bottleneck of autoregressive methods.
1. Probabilistic Generation with CVAEs: As described earlier, CVAEs such as the one in ACT sample actions from a continuous latent distribution, enabling diverse and precise trajectory generation in a single forward pass (or a few), effectively addressing multimodality.
2. Iterative Generation with Diffusion Models: Diffusion strategies have become dominant for high-quality continuous action generation. They excel at producing smooth, diverse trajectories but at a high computational cost due to iterative denoising. The core training objective is noise prediction, as formalized in the previous section.
3. Generation with Flow Matching: An emerging alternative, flow matching (used in π0) models a vector field that deterministically transports samples from a simple noise distribution to the complex data distribution. It is trained with a simpler regression loss and can generate trajectories efficiently with fewer steps. The training minimizes:
$$ \mathcal{L}_{FM} = \mathbb{E}_{t, A_1, A_0} \left[ || v_{\theta}(A_t, O, t) - (A_1 - A_0) ||^2 \right] $$
where $A_t$ is a point on the path between noise $A_0$ and data $A_1$.
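Both the training objective and few-step inference can be sketched as follows. The linear probability path and Euler integrator are one common instantiation of flow matching, not necessarily π0's exact recipe:

```python
import numpy as np

# Flow-matching regression loss for the linear path
# A_t = (1 - t) A_0 + t A_1, whose target velocity is A_1 - A_0.
def fm_loss(v_model, a0, a1, obs, t):
    a_t = (1.0 - t) * a0 + t * a1
    target = a1 - a0
    return np.mean((v_model(a_t, obs, t) - target) ** 2)

# Inference: integrate the learned vector field with a few Euler steps.
def generate(v_model, obs, shape, steps=10, seed=0):
    rng = np.random.default_rng(seed)
    a = rng.normal(size=shape)        # start at noise A_0
    for i in range(steps):
        t = i / steps
        a = a + v_model(a, obs, t) / steps
    return a

# An oracle field pointing at a fixed target shows the integrator working.
target = np.full((4, 7), 0.5)
oracle = lambda a, obs, t: (target - a) / max(1.0 - t, 1e-6)
out = generate(oracle, obs=None, shape=(4, 7), steps=100)
assert np.allclose(out, target, atol=1e-2)
```

Because the loss is a plain regression and inference needs only a handful of integration steps, flow matching avoids both adversarial training and the long denoising chains of diffusion.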
Hybrid Generation Strategies
Hybrid strategies represent a meta-approach that combines different generation paradigms within a single system to leverage their complementary strengths. A common pattern is coupling an autoregressive or LLM-based high-level planner with a diffusion-based low-level controller. The planner breaks down long-horizon instructions into sub-goals, while the controller generates the precise, smooth continuous actions to achieve each sub-goal. Models like HybridVLA integrate collaborative diffusion and autoregressive heads within a unified LLM, demonstrating enhanced robustness. While promising for complex tasks, hybrid strategies introduce challenges in temporal alignment and seamless integration between the different modules.
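The planner-controller pattern can be sketched schematically; both components below are toy stand-ins for an LLM-based planner and a diffusion-based controller:

```python
# Hybrid pipeline sketch: a high-level planner decomposes the instruction
# into sub-goals, and a low-level controller expands each sub-goal into a
# chunk of continuous actions. All names here are illustrative.
def hybrid_execute(planner, controller, obs, instruction):
    trajectory = []
    for subgoal in planner(obs, instruction):        # e.g. autoregressive LLM
        trajectory.extend(controller(obs, subgoal))  # e.g. diffusion policy
    return trajectory

toy_planner = lambda obs, inst: ["reach(cup)", "grasp(cup)", "place(table)"]
toy_controller = lambda obs, g: [f"{g}:step{i}" for i in range(2)]
traj = hybrid_execute(toy_planner, toy_controller,
                      obs=None, instruction="put the cup on the table")
assert len(traj) == 6
```

Even this skeleton exposes the temporal-alignment question raised above: the controller must finish (or be interrupted) before the next sub-goal takes effect.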
Evaluation and Benchmarking
Systematic evaluation is crucial for advancing VLA models. Key benchmarks include:
LIBERO: A benchmark for lifelong learning in manipulation, testing knowledge transfer across tasks. Performance is measured by average success rate across its suites. Recent results show continuous-action models (e.g., OpenVLA-OFT at 95.4%) often outperforming discrete counterparts, highlighting the precision advantage of advanced generative strategies.
Open X-Embodiment: A large-scale, cross-robotic dataset for evaluating generalist policies. Success rates here test a model’s ability to generalize across unseen robot embodiments and tasks. The top-performing model on this benchmark, π0 (70.1%), uses continuous flow matching, but strong discrete models like RT-2-X (60.7%) also show competitive generalizability, indicating that both representation types can be effective with appropriate scale and architecture.
Challenges and Future Directions
The journey towards robust and general embodied AI robots via VLA models is fraught with open challenges that present rich research opportunities.
1. Integration with World Models: A key frontier is moving from reactive policies to predictive agents. Integrating VLA models with learned world models would enable mental simulation and long-horizon planning, allowing an embodied AI robot to reason about consequences before acting.
2. Efficient Generation for Real-Time Control: The computational cost of high-quality generators (especially diffusion) remains a bottleneck for real-time, high-frequency control. Research into fast sampling (e.g., distillation, few-step flows), parallel decoding, and action token compression (e.g., FAST in π0-Fast) is critical for deployment.
3. Generalizable Representations Across Morphologies: Current models are often tied to specific robot kinematics. A grand challenge is developing action representations abstracted from embodiment specifics (e.g., abstract spatial goals or motor primitives), enabling a single policy to be rapidly adapted to diverse embodied AI robot platforms.
4. Safety and Reliability in Open Worlds: Ensuring safe operation in unpredictable environments is paramount. This requires advancements in robust perception under ambiguity, predictive collision avoidance, and the development of verifiable safety layers within the action generation loop.
5. Beyond Imitation: Exploration and Self-Improvement: Current VLA models are fundamentally limited by their demonstration data. Incorporating principles from reinforcement and offline RL to enable exploration and policy improvement beyond the provided data is a vital direction for achieving super-human performance in embodied AI robots.
Conclusion
The field of Vision-Language-Action modeling represents a transformative approach to embodied intelligence. This survey has provided a detailed examination of its core technical engine: how actions are represented and generated. We traced the evolution from discrete tokenization paired with autoregressive generation towards sophisticated continuous representations powered by diffusion models and flow matching. The emerging trend of hybrid strategies underscores the ongoing quest to balance the critical trade-offs between speed, precision, and diversity. While significant challenges remain in efficiency, generalization, safety, and moving beyond imitation, the rapid progress in this area is unmistakable. By systematically addressing these challenges in action representation and generation, we move closer to the vision of truly versatile, reliable, and intelligent embodied AI robots capable of assisting in our complex physical world.
