A Survey on Action Representation and Generation Strategies in Vision-Language-Action Models for Embodied Intelligence

In recent years, embodied intelligence has gained significant traction as a promising pathway toward artificial general intelligence, focusing on agents that perceive, interact, and accomplish tasks in the physical world. Within this domain, Vision-Language-Action (VLA) models have emerged as a transformative approach that integrates multimodal perception with physical execution. These models leverage the success of large language models (LLMs) and vision-language models (VLMs) to enable robots to understand natural language instructions and generate corresponding actions. In this survey, we delve into the core aspects of action representation and generation strategies in VLA models, which serve as the critical bridge between abstract understanding and concrete execution in embodied robots. We systematically analyze their evolution, key methodologies, and future directions, emphasizing how these elements address the challenges of high-dimensional continuous spaces, action diversity, and real-time control demands. Through tables and mathematical formulations, we provide a comprehensive overview of the trade-offs in precision, diversity, and efficiency, ultimately contributing to the development of more general and efficient embodied agents.

The development of VLA models marks a departure from traditional fragmented pipelines in robotics, where vision, language, and control modules operated independently. Early models like RT-1 pioneered the use of transformer architectures to discretize robot actions into tokens, enabling sequence-based prediction and laying the foundation for scalable embodied robot control. This was further advanced by models such as RT-2, which demonstrated the transfer of internet-scale knowledge to physical tasks through co-finetuning with VLMs. More recently, open-source initiatives like OpenVLA and diffusion-based approaches like Octo have expanded the capabilities of VLA models, making them more accessible and efficient. As we explore this landscape, it becomes evident that action representation and generation strategies are pivotal in determining the performance of embodied robots in complex, unstructured environments. For instance, discrete action representations simplify control by quantizing continuous spaces, but they often sacrifice precision, whereas continuous representations using probabilistic models capture action multimodality at the cost of computational complexity. Similarly, generation strategies range from autoregressive methods, which excel in sequential decision-making, to non-autoregressive approaches like diffusion models, which produce high-quality trajectories but require iterative processes. In this survey, we critically examine these aspects, drawing on empirical evaluations from benchmarks like LIBERO and Open X-Embodiment to underscore the progress and limitations in the field.

To structure our discussion, we begin by outlining the development and current state of VLA models, highlighting key milestones and architectural components. We then delve into action representation, categorizing it into discrete and continuous paradigms, and analyze their implications for embodied robot tasks. Following this, we explore various generation strategies, including autoregressive, non-autoregressive (e.g., CVAE-based, diffusion-based, and flow matching), and hybrid approaches, each with distinct advantages and challenges. We also present a comparative analysis using tables and mathematical formulations to illustrate performance metrics and trade-offs. Furthermore, we address the challenges and opportunities in this rapidly evolving field, such as integration with world models, efficient generation for real-time control, and safety concerns in open-world scenarios. Throughout this survey, we emphasize the role of embodied robots as testbeds for advancing AI, and we conclude with reflections on future research directions that could unlock the full potential of VLA models in creating versatile and reliable intelligent agents.

Development and Current State of VLA Models

The evolution of VLA models has been driven by the need to unify perception, language understanding, and action generation in embodied robots. Initially, models like CLIPort combined pre-trained vision-language representations with robotic manipulation, enabling semantic-driven control. However, it was RT-1 that revolutionized the field by discretizing robot actions into tokens and using a transformer architecture for large-scale real-world control. This approach allowed embodied robots to handle hundreds of tasks, such as kitchen operations, by treating action generation as a sequence prediction problem. The success of RT-1 inspired subsequent models like Gato, which proposed a generalist agent framework by tokenizing all input modalities—images, text, and actions—into a flat sequence processed by a massive transformer. This unified representation demonstrated remarkable task generality, paving the way for more integrated systems.

In 2023, RT-2 emerged as a milestone by leveraging VLMs pre-trained on internet-scale data and co-fine-tuning them on robot trajectory data with actions expressed as text tokens. This enabled the transfer of abstract knowledge to physical control, endowing embodied robots with capabilities like visual chain-of-thought reasoning. The trend continued with models like Q-Transformer, which integrated discrete action representations with offline reinforcement learning to learn from mixed-quality data, surpassing the limitations of pure imitation learning. By 2024, the focus shifted toward openness and efficiency, with Octo utilizing diffusion policies trained on diverse datasets like Open X-Embodiment to achieve generalist robot policies. OpenVLA further demonstrated that smaller models could match or exceed the performance of larger counterparts through efficient parameter tuning. Recently, models like Groot N1 have adopted dual-system architectures, combining LLMs for high-level planning with fast diffusion policies for low-level control, highlighting the move toward more general-purpose and collaborative embodied robots.

The typical architecture of a VLA model consists of three core components: a visual encoder, a language encoder, and an action decoder. The visual encoder, often based on pre-trained models like ViT, processes pixel inputs to extract structured features representing objects, positions, and geometries. The language encoder, built on LLMs, interprets natural language instructions into vector representations. The action decoder, frequently implemented as a transformer-based autoregressive or diffusion-based module, generates robot control commands by fusing visual and language inputs. This end-to-end framework overcomes the inefficiencies of traditional modular systems, enabling embodied robots to adapt flexibly to dynamic environments. However, the choice of action representation and generation strategy profoundly impacts the model’s ability to handle real-world complexities, as we will explore in the following sections.
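
As a rough illustration of this three-component layout, the sketch below wires a vision encoder, a language encoder, and an action decoder into a single forward pass that outputs an action chunk. The module interfaces, fusion by concatenation, and the MLP action head are illustrative assumptions rather than the design of any particular model.

```python
import torch
import torch.nn as nn

class MinimalVLA(nn.Module):
    """Toy end-to-end VLA skeleton: encode image + instruction, decode an action chunk."""

    def __init__(self, vision_encoder, language_encoder,
                 hidden_dim=512, act_dim=7, chunk_len=8):
        super().__init__()
        self.vision_encoder = vision_encoder      # e.g. a frozen ViT returning (B, hidden_dim) (assumed)
        self.language_encoder = language_encoder  # e.g. an LLM text encoder returning (B, hidden_dim) (assumed)
        self.fuse = nn.Linear(2 * hidden_dim, hidden_dim)
        # Stand-in for an autoregressive or diffusion action decoder.
        self.action_decoder = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, act_dim * chunk_len),
        )
        self.act_dim, self.chunk_len = act_dim, chunk_len

    def forward(self, image, instruction_tokens):
        vis = self.vision_encoder(image)                      # visual features
        lang = self.language_encoder(instruction_tokens)      # instruction features
        fused = torch.relu(self.fuse(torch.cat([vis, lang], dim=-1)))
        out = self.action_decoder(fused)
        return out.view(-1, self.chunk_len, self.act_dim)     # (B, k, act_dim) action chunk
```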

Action Representation in VLA Models

Action representation is a cornerstone of VLA models, defining how abstract perceptions and instructions are translated into physical actions for embodied robots. It addresses the challenges of high-dimensional continuous spaces and the multimodality of valid solutions for a given task. Over time, two primary paradigms have emerged: discrete action representation and continuous action representation. Each offers distinct trade-offs in terms of precision, diversity, and computational efficiency, influencing the overall performance of embodied robots in tasks ranging from precise manipulation to long-horizon planning.

Discrete Action Representation

Discrete action representation involves quantizing continuous robot actions into a finite set of tokens, enabling the use of powerful sequence models like transformers. This approach transforms robot control into a language-like generation task, where actions are predicted step-by-step as discrete classifications. For example, in RT-1, each action dimension—such as end-effector position or gripper status—is divided into 256 bins, and a continuous value is mapped to a unique integer ID. This discretization allows embodied robots to leverage pre-trained VLMs for knowledge transfer but introduces quantization errors that can limit precision in fine-grained tasks.
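
To make the binning concrete, here is a minimal sketch of RT-1-style per-dimension discretization and its inverse. The normalization bounds, 256-bin default, and helper names are assumptions for illustration; real systems derive the bounds from dataset statistics.

```python
import numpy as np

def discretize_action(action, low, high, num_bins=256):
    """Map each continuous action dimension to an integer token ID in [0, num_bins - 1]."""
    action = np.clip(action, low, high)
    normalized = (action - low) / (high - low)            # scale to [0, 1] per dimension
    tokens = np.floor(normalized * num_bins).astype(np.int64)
    return np.minimum(tokens, num_bins - 1)

def undiscretize_action(tokens, low, high, num_bins=256):
    """Recover an approximate continuous action from token IDs using bin centers."""
    centers = (tokens.astype(np.float64) + 0.5) / num_bins
    return low + centers * (high - low)

# Example: a 7-D end-effector action quantized and recovered (with quantization error).
low, high = np.full(7, -1.0), np.full(7, 1.0)
a = np.array([0.12, -0.40, 0.83, 0.0, 0.05, -0.91, 1.0])
print(undiscretize_action(discretize_action(a, low, high), low, high))
```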

We summarize key models employing discrete action representation in Table 1, highlighting their core paradigms, platforms, task domains, action space details, and inherent challenges. For instance, Gato uses a unified tokenization scheme for various inputs, including robot actions, with 5 dimensions and 1024 discrete intervals, but it faces limitations in context length and inference speed. In contrast, RT-2 achieves semantic-driven manipulation by co-finetuning VLMs on robot data, using 11 action dimensions with 256 bins, though it struggles with physical skill limitations and high computational costs. More recent models like Humanoid-VLA extend this to humanoid robots with 24 dimensions for full-body motion, but they rely on underlying reinforcement learning policies and face data scarcity issues.

Table 1: Discrete Action Representation in Typical VLA Models
| Year | Model | Core Paradigm | Platform | Task Domain | Action Space Type | Action Dimensions | Discrete Intervals | Key Challenges |
|---|---|---|---|---|---|---|---|---|
| 2022 | RT-1 | Imitation Learning | Mobile Manipulator | Mobile Manipulation | End-effector Pose + Base | 11 | 256 | Imitation learning ceiling, limited generalization |
| 2022 | Gato | General Supervised Learning | Sawyer Arm | Robot Operation, Games, Dialogue | End-effector Velocity + Gripper | 5 | 1024 | Context length limits, slow inference |
| 2023 | RT-2 | VLM Co-finetuning | Mobile Manipulator | Semantic-driven Manipulation | End-effector Pose + Base | 11 | 256 | Physical skill limits, high compute cost |
| 2023 | Q-Transformer | Offline Reinforcement Learning | Mobile Manipulator | Multi-task Operation | End-effector Pose + Gripper | 8 | 256 | Reward function limitations, high-dimensional action issues |
| 2024 | OpenVLA | VLM Fine-tuning | Various Manipulators | Cross-embodiment Manipulation | End-effector Pose + Gripper | 7 | 256 | Single-image support only, low inference efficiency |
| 2025 | Humanoid-VLA | Language-Motion Alignment | Humanoid Robot | Mobile Manipulation | Full-body Pose | 24 | 1024 | Limited data quality and quantity, reliance on RL policies |
| 2025 | JARVIS-VLA | ActVLP | Virtual Agent | Game Operation | Keyboard and Mouse | — | 51 | Slow inference, gap with top human players |

The primary advantage of discrete representation is its compatibility with large sequence models, facilitating knowledge transfer from web-scale data to embodied robot control. However, the quantization process inherently sacrifices precision, making it unsuitable for tasks requiring sub-millimeter accuracy, such as assembly. This limitation has spurred the development of continuous action representations, which we discuss next.

Continuous Action Representation

Continuous action representation models robot actions as probability distributions in continuous spaces, addressing the multimodality of valid trajectories and avoiding mode collapse in regression tasks. Instead of predicting a single action, these methods learn a distribution that encompasses all possible effective actions, allowing embodied robots to sample diverse and precise trajectories. Early approaches like ACT used conditional variational autoencoders (CVAEs) to model action sequences, while later models adopted diffusion-based or flow-matching strategies for higher-quality generation.

In CVAE-based methods, an encoder maps expert demonstrations and observations to a latent variable $z$ following a Gaussian distribution $z \sim \mathcal{N}(\mu, \sigma^2)$, and a decoder generates action sequences $\hat{a}_{t:t+k}$ conditioned on observations $o_t$ and sampled $z$. The training objective maximizes the evidence lower bound (ELBO), combining a reconstruction loss and a regularization loss. The reconstruction loss, often an L1 norm, ensures accurate action imitation:

$$ \mathcal{L}_{\text{reconst}} = \text{L1}(\hat{a}_{t:t+k}, a_{t:t+k}) $$

The regularization loss uses KL divergence to align the latent distribution with a prior:

$$ \mathcal{L}_{\text{reg}} = D_{\text{KL}}(q_\phi(z | a_{t:t+k}, o_t) \| \mathcal{N}(0, I)) $$

The total loss is a weighted sum: $\mathcal{L} = \mathcal{L}_{\text{reconst}} + \beta \mathcal{L}_{\text{reg}}$, where $\beta$ balances the terms. This approach enables embodied robots to perform fine-grained manipulation but requires complex training and may suffer from hardware limitations.
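
The combined objective can be sketched in PyTorch as follows; the `encoder` and `decoder` interfaces and the weighting `beta` are assumed placeholders, not the ACT implementation.

```python
import torch
import torch.nn.functional as F

def cvae_loss(encoder, decoder, obs, action_chunk, beta=10.0):
    """ELBO-style objective: L1 reconstruction plus beta-weighted KL regularization.

    obs: observation features o_t, shape (B, obs_dim).
    action_chunk: expert action sequence a_{t:t+k}, shape (B, k, act_dim).
    encoder(obs, action_chunk) -> (mu, logvar); decoder(obs, z) -> predicted chunk (assumed).
    """
    mu, logvar = encoder(obs, action_chunk)              # q_phi(z | a_{t:t+k}, o_t)
    std = torch.exp(0.5 * logvar)
    z = mu + std * torch.randn_like(std)                 # reparameterization trick
    pred_chunk = decoder(obs, z)                         # predicted action chunk

    reconst = F.l1_loss(pred_chunk, action_chunk)        # L_reconst
    # Closed-form KL divergence between N(mu, sigma^2) and N(0, I), averaged over the batch.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1).mean()
    return reconst + beta * kl                           # L = L_reconst + beta * L_reg
```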

Diffusion-based strategies, as seen in Diffusion Policy, reformulate action generation as an iterative denoising process. Starting from noise $A_K$, the model refines it over $K$ steps using a noise prediction network $\epsilon_\theta$:

$$ A_{k-1} = \alpha (A_k - \gamma \epsilon_\theta(O_t, A_k, k)) + \sigma \mathcal{N}(0, I) $$

where $O_t$ is the observation, and $\alpha, \gamma, \sigma$ are scheduler parameters. The training loss is a mean squared error (MSE) between predicted and actual noise:

$$ \mathcal{L} = \text{MSE}(\epsilon, \epsilon_\theta(O_t, A_0 + \epsilon, k)) $$
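
A compact training-step sketch of this objective is shown below, written with the standard DDPM corruption of the clean chunk rather than the simplified $A_0 + \epsilon$ shorthand above; the noise-prediction network and schedule are assumed inputs.

```python
import torch
import torch.nn.functional as F

def diffusion_policy_loss(noise_pred_net, obs, action_chunk, alphas_cumprod):
    """Single DDPM-style training step: corrupt the clean chunk, predict the added noise.

    obs: observation features O_t, shape (B, obs_dim).
    action_chunk: clean action sequence A_0, shape (B, k, act_dim).
    alphas_cumprod: (K,) cumulative noise-schedule products (assumed precomputed).
    """
    B, num_steps = action_chunk.shape[0], alphas_cumprod.shape[0]
    k = torch.randint(0, num_steps, (B,), device=action_chunk.device)  # random diffusion step
    eps = torch.randn_like(action_chunk)

    # Forward diffusion: A_k = sqrt(abar_k) * A_0 + sqrt(1 - abar_k) * eps
    abar = alphas_cumprod[k].view(B, 1, 1)
    noisy = abar.sqrt() * action_chunk + (1 - abar).sqrt() * eps

    pred = noise_pred_net(noisy, obs, k)                  # eps_theta(O_t, A_k, k)
    return F.mse_loss(pred, eps)
```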

Models like Octo and MDT have advanced this by using transformer-based diffusion decoders, improving scalability and generalization for embodied robots. Flow matching, employed in $\pi_0$, offers an alternative: it learns a vector field $v_\theta(A_t, O_t, t)$ that is regressed directly onto the target displacement between noise and data, with the loss:

$$ \mathcal{L} = \mathbb{E}_{t, A_0, A_1} \left[ \| v_\theta(A_t, O_t, t) - (A_1 - A_0) \|^2 \right] $$

This method enhances training stability and efficiency, as demonstrated in GraspVLA for grasping tasks.
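
A compact sketch of this regression objective follows, assuming a linear interpolation path between a Gaussian noise sample $A_0$ and the expert chunk $A_1$ with uniformly sampled time $t$; the `vector_field` interface is a placeholder.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(vector_field, obs, action_chunk):
    """Regress v_theta(A_t, O_t, t) onto the target velocity (A_1 - A_0).

    action_chunk: clean expert actions A_1, shape (B, k, act_dim).
    A straight-line path A_t = (1 - t) * A_0 + t * A_1 between noise and data is assumed.
    """
    A1 = action_chunk
    A0 = torch.randn_like(A1)                             # noise endpoint A_0 ~ N(0, I)
    t = torch.rand(A1.shape[0], 1, 1, device=A1.device)   # uniform interpolation time
    At = (1 - t) * A0 + t * A1
    target = A1 - A0                                      # constant velocity along the path
    pred = vector_field(At, obs, t.view(-1))              # v_theta(A_t, O_t, t)
    return F.mse_loss(pred, target)
```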

Table 2 summarizes representative models using continuous action representation, showcasing their evolution toward more powerful generative methods. For example, ACT focuses on bimanual manipulation with 14 dimensions but faces perception challenges, while $\pi_0$ employs flow matching for 18-dimensional tasks but relies on large-scale, partly proprietary data. Hybrid approaches like HybridVLA combine diffusion and autoregressive strategies for robust control, though they incur inference latency. These advancements highlight the trade-offs between action quality and computational demands in embodied robots.

Table 2: Continuous Action Representation in Typical VLA Models
| Year | Model | Core Paradigm | Platform | Task Domain | Action Dimensions | Representation Type | Key Challenges |
|---|---|---|---|---|---|---|---|
| 2023 | ACT | Imitation Learning | Manipulator | Fine-grained Bimanual Manipulation | 14 | Conditional VAE | Hardware limits, perception issues |
| 2024 | Octo | Imitation Learning | Manipulator | Cross-embodiment General Manipulation | 7/14 | Conditional Diffusion | Poor wrist camera handling, data dependence |
| 2024 | $\pi_0$ | VLM Fine-tuning | Manipulator, Mobile Robot | High-dexterity Long-horizon Tasks | 18 | Conditional Flow Matching | Reliance on large-scale data, partial closed-source |
| 2025 | HybridVLA | Collaborative Training | Manipulator | General Tabletop Manipulation | 7/14 | Hybrid Generation | Inference speed limitations |
| 2025 | DexVLA | Embodied Curriculum Learning | Manipulator, Dexterous Hand | Cross-embodiment Dexterous Manipulation | — | Multi-head Diffusion | Limitations in contact-rich scenes |

In summary, continuous action representation excels in capturing the richness of robot behaviors but demands significant computational resources. As embodied robots tackle more complex tasks, the choice between discrete and continuous representations will depend on the specific requirements of precision, diversity, and real-time performance.

Action Generation Strategies in VLA Models

Action generation strategies are the decision-making engines of VLA models, responsible for mapping multimodal inputs to action sequences that enable embodied robots to perform tasks effectively. These strategies determine the quality, efficiency, and adaptability of robot behaviors, and they involve fundamental trade-offs between precision, speed, and diversity. We categorize them into autoregressive, non-autoregressive, and hybrid strategies, each with unique mechanisms and implications for embodied robot control.

Autoregressive Generation Strategies

Autoregressive generation produces action sequences step-by-step, where each action token depends on previous outputs. This strategy leverages the causal nature of transformer decoders, making it well suited to tasks requiring sequential reasoning in embodied robots. Formally, given a language instruction $\ell$, historical states $s_{\leq t}$, and past actions $a_{<t}$, the current action $a_t$ is generated as:

$$ a_t \sim p(a_t \mid \ell, s_{\leq t}, a_{<t}) $$

The joint probability of the full token sequence $s_{1:L}$, which interleaves text and action tokens, is decomposed using the chain rule:

$$ \log p(s_{1:L} | \theta) = \sum_{l=1}^{L} \log p(s_l | s_{1:l-1}, \theta) $$

For training, a mask function $m(b, l)$ indicates whether token $l$ in batch $b$ is from text or actions, and the loss is computed as:

$$ \mathcal{L}(\theta) = -\frac{1}{|\mathcal{B}|} \sum_{b=1}^{|\mathcal{B}|} \sum_{l=1}^{L} m(b, l) \log p(s_l | s_{1:l-1}, \theta) $$
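
A minimal sketch of this masked next-token objective is given below, assuming a causal transformer that returns per-token logits and a mask that is non-zero only where the target token is an action token.

```python
import torch
import torch.nn.functional as F

def masked_action_token_loss(model, token_ids, action_mask):
    """Next-token cross-entropy computed only on action tokens, mirroring m(b, l) above.

    token_ids: (B, L) interleaved text/observation/action token IDs.
    action_mask: (B, L) float mask, 1.0 where the token at that position is an action token.
    model(token_ids) -> logits of shape (B, L, vocab_size) from a causal transformer (assumed).
    """
    logits = model(token_ids)
    # Shift so that position l-1 predicts token l.
    logits = logits[:, :-1].reshape(-1, logits.size(-1))
    targets = token_ids[:, 1:].reshape(-1)
    mask = action_mask[:, 1:].reshape(-1)

    nll = F.cross_entropy(logits, targets, reduction="none")   # -log p(s_l | s_{1:l-1})
    return (mask * nll).sum() / mask.sum().clamp(min=1.0)
```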

Models like VIMA and ChatVLA employ autoregressive decoders to generate actions based on multimodal prompts, enabling embodied robots to handle complex instructions. However, the sequential nature limits inference speed, often to 3-5 Hz, which is insufficient for real-time control in dynamic environments. Despite this, autoregressive strategies remain popular due to their stability and compatibility with discrete action representations.

Non-Autoregressive Generation Strategies

Non-autoregressive strategies generate entire action sequences in parallel or through iterative processes, addressing the speed limitations of autoregressive methods. They are particularly suited for continuous action representations and can be implemented using CVAEs, diffusion models, or flow matching.

CVAE-Based Probabilistic Generation

CVAE-based strategies model action distributions to handle multimodality. As described earlier, they use an encoder to map demonstrations to a latent space and a decoder to generate actions. This approach allows embodied robots to sample diverse trajectories but involves complex training and may not scale well to high-dimensional spaces.

Diffusion-Based Iterative Generation

Diffusion models generate actions by iteratively denoising from noise, producing high-quality, smooth trajectories. The process involves predicting noise $\epsilon_\theta$ at each step $k$ and updating the action sequence:

$$ A_{k-1} = \alpha (A_k - \gamma \epsilon_\theta(O_t, A_k, k)) + \sigma \mathcal{N}(0, I) $$

Training minimizes the MSE between predicted and actual noise. Models like MDT and RDT-1B use transformer-based diffusion decoders (DiT) for improved performance, enabling embodied robots to perform complex bimanual tasks with zero-shot generalization. However, the iterative process requires substantial computation, leading to slow inference—a critical issue for real-time embodied robot applications.
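
For completeness, a bare-bones sampling loop that iterates the update above from pure Gaussian noise is sketched below; the per-step coefficients are passed in as assumed scheduler outputs rather than derived from a specific noise schedule.

```python
import torch

@torch.no_grad()
def sample_action_chunk(noise_pred_net, obs, chunk_shape, alpha, gamma, sigma):
    """Iteratively denoise A_K ~ N(0, I) into an executable action chunk.

    alpha, gamma, sigma: (K,) per-step scheduler coefficients (assumed precomputed).
    chunk_shape: (B, k, act_dim) shape of the action sequence to generate.
    """
    num_steps = alpha.shape[0]
    A = torch.randn(chunk_shape)                          # A_K: start from Gaussian noise
    for k in reversed(range(num_steps)):
        step = torch.full((chunk_shape[0],), k)
        eps = noise_pred_net(A, obs, step)                # eps_theta(O_t, A_k, k)
        A = alpha[k] * (A - gamma[k] * eps)
        if k > 0:                                         # no noise injection at the final step
            A = A + sigma[k] * torch.randn_like(A)
    return A                                              # denoised action sequence
```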

Flow Matching Generation

Flow matching strategies learn a vector field that transforms noise into actions through ordinary differential equation (ODE) integration. The loss function regresses the predicted vector field to the target:

$$ \mathcal{L} = \mathbb{E}_{t, A_0, A_1} \left[ \| v_\theta(A_t, O_t, t) - (A_1 - A_0) \|^2 \right] $$

Inference involves numerical integration, such as Euler’s method:

$$ A_{t+\delta} = A_t + \delta \cdot v_\theta(A_t, O_t, t) $$
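
A minimal sketch of this integration is shown below; the `vector_field` interface and the default of roughly ten Euler steps are illustrative assumptions.

```python
import torch

@torch.no_grad()
def flow_matching_sample(vector_field, obs, chunk_shape, num_steps=10):
    """Integrate the learned vector field from noise (t = 0) to actions (t = 1) with Euler steps.

    vector_field(A_t, O_t, t) -> velocity, same shape as A_t (assumed interface).
    num_steps: flow-matching policies typically need far fewer steps than diffusion.
    """
    A = torch.randn(chunk_shape)                          # A_0 ~ N(0, I)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((chunk_shape[0],), i * dt)
        A = A + dt * vector_field(A, obs, t)              # A_{t+delta} = A_t + delta * v_theta
    return A                                              # approximate sample of the action chunk
```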

Models like $\pi_0$ and GraspVLA use flow matching for efficient action generation, often achieving comparable quality to diffusion models with fewer steps. This makes them promising for embodied robots requiring fast and stable control.

Hybrid Generation Strategies

Hybrid strategies combine multiple generation approaches to leverage their strengths. For example, HybridVLA integrates autoregressive planning with diffusion-based control within a unified LLM, allowing embodied robots to handle long-horizon tasks with robust low-level actions. The autoregressive component generates discrete sub-task sequences, while the diffusion component produces continuous trajectories, adapting dynamically to environmental changes. This synergy enhances robustness and generalization but introduces challenges in aligning symbolic and continuous representations and managing computational overhead.

In practice, hybrid strategies often employ asynchronous execution frameworks, where high-level planners generate plans at a low frequency while low-level controllers run at high frequencies. Future research may focus on shared representation spaces to seamlessly integrate these components, further advancing the capabilities of embodied robots in complex scenarios.
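
One way to picture such an asynchronous setup is a slow planner loop that refreshes a shared plan while a fast control loop repeatedly executes against the latest one; the sketch below is purely illustrative, uses placeholder planning and policy calls, and omits real robot I/O.

```python
import threading
import time

latest_plan = {"subtask": "idle"}        # written by the slow planner, read by the fast controller
plan_lock = threading.Lock()

def high_level_planner(num_subtasks=5):
    """Slow loop (~1 Hz): stands in for an autoregressive VLM proposing the next sub-task."""
    for step in range(num_subtasks):
        new_subtask = f"subtask-{step}"                  # placeholder for an LLM/VLM planning call
        with plan_lock:
            latest_plan["subtask"] = new_subtask
        time.sleep(1.0)

def low_level_controller(duration_s=5.0, hz=50.0):
    """Fast loop (~50 Hz): stands in for a diffusion/flow policy tracking the current sub-task."""
    t_end = time.time() + duration_s
    while time.time() < t_end:
        with plan_lock:
            subtask = latest_plan["subtask"]
        # action = policy(observation, subtask)          # placeholder for continuous action generation
        time.sleep(1.0 / hz)

planner = threading.Thread(target=high_level_planner)
controller = threading.Thread(target=low_level_controller)
planner.start()
controller.start()
planner.join()
controller.join()
```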

Model Evaluation

To assess the performance of VLA models, researchers rely on benchmarks like LIBERO and Open X-Embodiment, which provide standardized tasks and metrics for embodied robots. These evaluations highlight the impact of action representation and generation strategies on success rates, generalization, and efficiency.

LIBERO focuses on lifelong learning, with tasks designed to test knowledge transfer in manipulation. Key metrics include forward transfer, backward transfer, and area under the success curve, though success rate is commonly used for comparison. Table 3 presents results for typical VLA models, showing that continuous action models with flow matching or diffusion strategies outperform discrete ones, with OpenVLA-OFT achieving the highest success rate. This underscores the advantage of advanced continuous representations for precise control in embodied robots.

Table 3: Evaluation of Typical VLA Models on LIBERO Dataset
| Action Type | VLA Model | Average Success Rate (%) |
|---|---|---|
| Continuous | Diffusion Policy | 72.4 |
| Continuous | Octo | 75.1 |
| Continuous | DiT Policy | 82.4 |
| Continuous | OpenVLA-OFT | 95.4 |
| Continuous | $\pi_0$ | 94.2 |
| Discrete | OpenVLA | 76.5 |
| Discrete | WorldVLA | 79.1 |

Open X-Embodiment emphasizes cross-robot generalization, with success rate as the primary metric. As shown in Table 4, continuous models like $\pi_0$ achieve higher success rates than discrete models, indicating their superior ability to handle diverse embodied robot platforms. However, performance varies widely based on architecture and training data, highlighting that action space type alone does not determine efficacy.

Table 4: Evaluation of Typical VLA Models on Open X-Embodiment Dataset
| Action Type | VLA Model | Average Success Rate (%) |
|---|---|---|
| Continuous | Octo-Base | 16.8 |
| Continuous | $\pi_0$ | 70.1 |
| Discrete | RT-1 | 6.8 |
| Discrete | TraceVLA | 42.0 |
| Discrete | RT-1-X | 53.4 |
| Discrete | RT-2-X | 60.7 |
| Discrete | OpenVLA | 27.7 |

These evaluations reveal that while continuous action representations and advanced generation strategies enhance performance, they also demand more resources. As embodied robots evolve, balancing these factors will be crucial for real-world deployment.

Challenges and Opportunities

The advancement of VLA models for embodied robots faces several challenges, but also presents exciting opportunities for future research. In this section, we reflect on key issues and potential directions that could shape the next generation of intelligent agents.

One major challenge is the integration of world models with VLA systems. World models learn environment dynamics, enabling predictive planning and long-horizon reasoning. By incorporating them, embodied robots could simulate future outcomes before acting, enhancing decision-making in complex tasks like tool use or risk avoidance. However, this requires scalable architectures that combine perceptual inputs with physical simulations, which remains an open problem.

Another challenge lies in breaking away from traditional modular robotics paradigms. While VLA models offer end-to-end learning, they must overcome the inherent limitations of imitation learning, such as data dependence and inability to surpass demonstration quality. Opportunities exist in combining VLA with reinforcement learning for self-improvement, as seen in Q-Transformer, but this introduces stability and exploration issues in embodied robots.

Efficient generation for real-time control is a critical concern. Current strategies like diffusion models incur high latency, hindering deployment in dynamic environments. Techniques like parallel decoding in Groot N1 and action token compression in $\pi_0$-FAST reduce inference times, enabling higher control frequencies. Future work could focus on hardware-software co-design and knowledge distillation to further optimize efficiency for embodied robots.

Cross-embodiment generalization aims to develop universal action representations that transcend specific robot morphologies. This would allow skills learned on one platform, such as a manipulator, to transfer seamlessly to humanoids or quadrupeds with minimal calibration. Abstract representations, like task-space coordinates, could facilitate this, but they require robust alignment between symbolic plans and continuous actions. As embodied robots diversify, this direction promises greater flexibility and scalability.

Safety and reliability in open-world settings are paramount. VLA models often lack robust failure modes and collision avoidance, with perception errors rising by 20-30% in poor lighting or cluttered scenes. Solutions include incorporating safety constraints through constrained learning, as in SafeVLA, and improving real-time monitoring. However, emergency stops still have latencies of 200-500 ms, posing risks in human-robot collaboration. Addressing these issues is essential for deploying embodied robots in safety-critical domains like healthcare and autonomous driving.

Lastly, computational and energy costs limit the deployment of large VLA models on resource-constrained platforms. Models with billions of parameters require significant memory and power, exceeding the capabilities of edge devices like NVIDIA Jetson. Research into model compression, efficient attention mechanisms, and low-power hardware could make these models more accessible for embodied robots in everyday applications.

Conclusion

In this survey, we have explored the intricate landscape of action representation and generation strategies in VLA models for embodied intelligence. From discrete tokenization to continuous probabilistic modeling, and from autoregressive sequencing to diffusion-based iteration, each approach offers unique benefits and drawbacks for embodied robots. The evolution toward hybrid strategies and efficient generation methods underscores the field’s commitment to balancing precision, diversity, and real-time performance. As we look ahead, integration with world models, cross-embodiment generalization, and enhanced safety measures will be pivotal in creating versatile and reliable embodied agents. By addressing these challenges, we can unlock the full potential of VLA models to transform how robots perceive, reason, and act in the physical world, ultimately advancing the frontier of artificial intelligence.
