In the pursuit of Artificial General Intelligence (AGI), the field of embodied AI has emerged as a critical pathway. It focuses on developing intelligent agents that can perceive, reason, and interact within the physical world. The advent of powerful Large Language Models (LLMs) and Vision-Language Models (VLMs) has catalyzed the development of a new class of multimodal architectures: Vision-Language-Action (VLA) models. These models aim to seamlessly integrate visual perception, language understanding, and physical action generation to solve instruction-conditioned tasks for embodied AI robots. By bridging internet-scale semantic knowledge with real-world physical interaction, VLA models demonstrate unprecedented generalization and flexibility in unstructured environments, moving beyond the limitations of traditional, fragmented robotic pipelines.
While existing surveys provide broad overviews of robotic foundation models, a systematic and in-depth analysis of the core bridge between multimodal “understanding” and physical “execution”—namely, action representation and generation strategies—is notably absent. This gap is crucial because an embodied AI robot’s ultimate value is determined not just by its perception but by its ability to translate that perception into precise, effective, and reliable physical actions. Therefore, this article provides a comprehensive review focused specifically on the evolution, methodologies, trade-offs, and future directions of action representation and generation within VLA models for embodied intelligence.
The Evolution and Architecture of VLA Models
The development of VLA models represents a paradigm shift from traditional modular robotics. A canonical VLA architecture typically consists of three core components working in concert: a visual encoder, a language encoder, and an action decoder. The visual encoder, often a pre-trained Vision Transformer (ViT), processes raw pixel input from cameras to extract structured features representing objects, spatial relationships, and scene geometry. The language encoder, based on an LLM, interprets the natural language instruction and encodes it into a contextual vector. The action decoder, the focal point of this survey, is responsible for the critical transformation of the fused visual-language understanding into executable motor commands for the embodied AI robot. This end-to-end learning framework overcomes the integration gaps inherent in traditional “sense-plan-act” pipelines, enabling more adaptive and fluent task execution.
The trajectory of VLA models has been rapid. Early works like RT-1 pioneered the use of Transformer architectures and discretized action tokens for large-scale real-world control. A significant milestone was reached with RT-2, which demonstrated that knowledge from web-scale visual-language data could be directly transferred to robotic control through co-fine-tuning, leading to emergent capabilities like visual chain-of-thought reasoning. The field has since moved towards greater openness and efficiency, exemplified by models like Octo, which leveraged large open datasets and diffusion policies, and OpenVLA, which showed that smaller, open-source models could rival larger counterparts through efficient fine-tuning. The latest frontier involves applying these models to increasingly complex systems, such as humanoid robots, indicating a move towards more general, safe, and collaborative embodied AI.

Action Representation: Bridging Symbols and Motion
Action representation is the fundamental scheme that defines the physical output of an embodied AI robot. It directly tackles the challenge of mapping high-dimensional, continuous native robot actuation spaces (e.g., joint angles, end-effector poses) into a form learnable by neural models. The design of this representation is a pivotal choice, balancing precision, diversity, and compatibility with different generation strategies. Two primary paradigms have emerged: discrete and continuous action representation.
Discrete Action Representation
Discrete representation was a foundational innovation, transforming robot control into a sequence modeling problem akin to language generation. The core idea is to quantize or “bin” each dimension of a continuous action (e.g., X, Y, Z, rotation) into a fixed set of intervals. Each interval is assigned a unique token ID, creating a finite “action vocabulary.” For instance, RT-1 uniformly divided each of its 11 action dimensions into 256 bins. This allows a standard Transformer decoder to autoregressively predict a sequence of discrete action tokens, effectively treating control as a next-token prediction task.
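As an illustration, this binning scheme can be sketched in a few lines of Python. The bin count matches the RT-1 example above, but the action range and function names are illustrative assumptions, not taken from any released codebase:

```python
import numpy as np

# Uniform binning per action dimension; range is an assumed
# normalized action space, not RT-1's actual calibration.
N_BINS = 256
LOW, HIGH = -1.0, 1.0

def tokenize(action: np.ndarray) -> np.ndarray:
    """Map continuous action values to integer token ids in [0, N_BINS-1]."""
    clipped = np.clip(action, LOW, HIGH)
    ids = np.floor((clipped - LOW) / (HIGH - LOW) * N_BINS).astype(int)
    return np.minimum(ids, N_BINS - 1)  # value == HIGH falls in the last bin

def detokenize(ids: np.ndarray) -> np.ndarray:
    """Map token ids back to the center of each bin."""
    width = (HIGH - LOW) / N_BINS
    return LOW + (ids + 0.5) * width

a = np.array([0.0, 0.73, -1.0])
ids = tokenize(a)
recon = detokenize(ids)
# Round-trip error is bounded by half a bin width: the "precision loss"
# discussed below.
assert np.all(np.abs(recon - a) <= (HIGH - LOW) / N_BINS / 2 + 1e-9)
```

The half-bin-width bound makes the precision trade-off concrete: finer bins shrink the error but grow the action vocabulary the sequence model must predict over.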
The discrete paradigm has enabled the powerful transfer of knowledge from pre-trained VLMs to the robotics domain. The key advantage is the unification of control with highly scalable sequence models. However, this comes with an inherent trade-off: quantization inevitably introduces precision loss, which can be detrimental for tasks requiring sub-millimeter accuracy. This limitation spurred the development of continuous representations.
Table 1 summarizes key models employing discrete action representation, illustrating the evolution in platforms, task complexity, and associated challenges.
| Year | Model | Core Paradigm | Platform | Task Domain | Action Space | Dim. | Bins | Key Challenges |
|---|---|---|---|---|---|---|---|---|
| 2022 | RT-1 | Imitation Learning | Mobile Manipulator | Kitchen Manipulation | EE Pose + Base | 11 | 256 | Imitation ceiling, generalization limits |
| 2022 | Gato | Generalist Supervised Learning | Multiple (Arm, etc.) | Multitask | EE Velocity + Gripper | 5 | 1024 | Context length, slow inference |
| 2023 | RT-2 | VLM Co-fine-tuning | Mobile Manipulator | Semantic Manipulation | EE Pose + Base | 11 | 256 | Physical skill limits, compute cost |
| 2023 | Q-Transformer | Offline Reinforcement Learning | Mobile Manipulator | Multitask Manipulation | EE Pose + Gripper | 8 | 256 | Reward design, high-dim. action |
| 2024 | OpenVLA | VLM Fine-tuning | Multiple Arms | Cross-embodiment Manipulation | EE Pose + Gripper | 7 | 256 | Single-image, inference latency |
| 2025 | Humanoid-VLA | Language-Motion Alignment | Humanoid Robot | Locomotion & Manipulation | Full-Body Pose | 24 | 1024 | Data scarcity, reliance on low-level RL |
Continuous Action Representation
Continuous representation addresses the precision limitation of discretization by directly modeling actions in their native space. A central challenge here is multimodality: for a given task (e.g., “place the cup on the table”), there exist many equally valid but subtly different action trajectories. A simple regression model trained with Mean Squared Error (MSE) loss would average these modes, producing a blurry, ineffective “mean action”, a failure commonly referred to as mode averaging.
To overcome this, continuous representations model the full probability distribution over valid actions. Instead of predicting a single value, the model learns to capture the diversity of possible solutions. During inference, a specific, coherent trajectory is sampled from this learned distribution. Prominent techniques for learning such distributions include Conditional Variational Autoencoders (CVAEs), diffusion models, and flow matching.
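A toy numerical example makes the contrast concrete; the bimodal “demonstrations” below are invented purely for illustration:

```python
import numpy as np

# Two equally valid demonstrated actions for the same observation,
# e.g. reaching around an obstacle on the left (-1) or the right (+1).
demos = np.array([-1.0, +1.0])

# The MSE-optimal point prediction is the mean of the modes:
# it drives straight into the obstacle.
mse_prediction = demos.mean()   # 0.0

# Sampling from a learned distribution instead picks one coherent mode.
rng = np.random.default_rng(0)
sampled = rng.choice(demos)     # either -1.0 or +1.0, never 0.0

assert mse_prediction == 0.0
assert sampled in (-1.0, 1.0)
```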
For example, ACT employs a CVAE to learn a latent space capturing action variability for high-precision bimanual manipulation. The training objective corresponds to maximizing the Evidence Lower Bound (ELBO), i.e., minimizing:
$$ \mathcal{L} = \mathcal{L}_{\text{reconst}} + \beta \cdot \mathcal{L}_{\text{reg}} $$
where the reconstruction loss $\mathcal{L}_{\text{reconst}} = \text{L1}(\hat{a}_{t:t+k}, a_{t:t+k})$ ensures accuracy, and the regularization loss $\mathcal{L}_{\text{reg}} = D_{KL}(q_{\phi}(z | a_{t:t+k}, o_t) || \mathcal{N}(0, I))$ encourages a structured latent space.
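The two terms can be sketched numerically as follows; the shapes, the closed-form diagonal-Gaussian KL, and the default $\beta$ are illustrative assumptions rather than ACT's exact configuration:

```python
import numpy as np

# Sketch of the ACT-style CVAE objective: the encoder q_phi outputs a
# diagonal Gaussian (mu, logvar) over latent z, and the decoder
# reconstructs an action chunk of k timesteps x d dimensions.
def act_loss(pred_actions, true_actions, mu, logvar, beta=10.0):
    # L1 reconstruction over the predicted action chunk.
    l_reconst = np.abs(pred_actions - true_actions).mean()
    # KL(q_phi(z | a, o) || N(0, I)) for a diagonal Gaussian, closed form.
    l_reg = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)
    return l_reconst + beta * l_reg

k, d, z_dim = 8, 7, 32
rng = np.random.default_rng(0)
pred = rng.normal(size=(k, d))

# Perfect reconstruction and a standard-normal posterior give zero loss.
loss = act_loss(pred, pred, mu=np.zeros(z_dim), logvar=np.zeros(z_dim))
assert loss == 0.0
```

The $\beta$ weight trades reconstruction accuracy against latent-space regularity, exactly the balance the ELBO expresses.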
Diffusion models, as used in Octo and Diffusion Policy, treat action generation as an iterative denoising process. Starting from noise $A^K_t \sim \mathcal{N}(0, I)$, the model refines it over K steps:
$$ A^{k-1}_t = \alpha (A^k_t - \gamma \epsilon_{\theta}(O_t, A^k_t, k)) + \sigma \mathcal{N}(0, I) $$
where $\epsilon_{\theta}$ is a noise prediction network trained to minimize $\mathcal{L} = \text{MSE}(\epsilon, \epsilon_{\theta}(O_t, A^0_t + \epsilon, k))$.
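The update rule above translates into a short denoising loop. The sketch below uses a toy stand-in for the trained noise-prediction network $\epsilon_{\theta}$ and placeholder schedule constants:

```python
import numpy as np

# Iterative denoising following the update rule above. eps_model stands
# in for eps_theta(O_t, A^k_t, k); alpha, gamma, sigma are placeholder
# schedule constants, not a tuned noise schedule.
def denoise(eps_model, obs, shape, K=10, alpha=1.0, gamma=0.1, sigma=0.0, seed=0):
    rng = np.random.default_rng(seed)
    a = rng.normal(size=shape)                  # A^K_t ~ N(0, I)
    for k in range(K, 0, -1):
        noise = sigma * rng.normal(size=shape)  # zero when sigma == 0
        a = alpha * (a - gamma * eps_model(obs, a, k)) + noise
    return a                                    # A^0_t: the action chunk

# Toy "predictor" that returns the current sample as its noise estimate,
# so each update contracts the trajectory toward zero.
toy_eps = lambda obs, a, k: a
out = denoise(toy_eps, obs=None, shape=(4, 7), K=20, gamma=0.5)
assert np.all(np.abs(out) < 1e-3)
```

The K sequential evaluations of $\epsilon_{\theta}$ are the computational cost referred to throughout this survey.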
Table 2 contrasts models using continuous representations, highlighting the trend towards more powerful generative methods.
| Year | Model | Core Paradigm | Platform | Key Technique | Primary Challenge |
|---|---|---|---|---|---|
| 2023 | ACT | Imitation Learning | Robotic Arm | CVAE | Hardware limits, perception |
| 2024 | Octo | Imitation Learning | Multiple Arms | Conditional Diffusion | Wrist-cam processing, demo data reliance |
| 2024 | π0 | VLM Fine-tuning | Arm, Mobile Robot | Flow Matching | Dependence on massive, proprietary data |
| 2025 | DexVLA | Embodied Curriculum Learning | Arm, Dexterous Hand | Multi-head Diffusion | Limitations in complex contact-rich scenes |
Action Generation Strategies: From Understanding to Execution
The action generation strategy is the decision engine of a VLA model. It defines the algorithmic process that maps fused multimodal inputs onto the chosen action representation. The choice of strategy involves critical trade-offs between precision and efficiency, and between diversity and stability, fundamentally shaping the capabilities of the embodied AI robot.
Autoregressive Generation
Autoregressive generation is a sequential method where tokens are produced one at a time, each conditioned on all previously generated tokens. It is the natural strategy for discrete token sequences and is implemented using Transformer decoder blocks with causal masking. The probability of an action sequence $a_{1:L}$ is decomposed via the chain rule:
$$ p(a_{1:L} | o, \text{inst}) = \prod_{t=1}^{L} p(a_t | a_{<t}, o, \text{inst}) $$
where $o$ is the observation and $\text{inst}$ is the instruction. This formulation lets VLA models directly leverage the powerful sequence modeling of LLMs, but its inherently sequential nature limits inference speed, making high-frequency real-time control challenging for an embodied AI robot.
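The chain-rule decomposition can be sketched as a sequential decoding loop; the uniform-logit “model” below is a stand-in for a causally-masked Transformer decoder:

```python
import numpy as np

# Autoregressive action decoding: token t is sampled from
# p(a_t | a_<t, o, inst) and fed back as input for step t+1.
def decode(model, obs, inst, seq_len, vocab=256, seed=0):
    rng = np.random.default_rng(seed)
    tokens = []
    for t in range(seq_len):               # L sequential forward passes
        logits = model(obs, inst, tokens)  # conditioned on the prefix
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        tokens.append(int(rng.choice(vocab, p=probs)))
    return tokens

# Stand-in model returning uniform logits over a 256-token action vocabulary.
uniform_model = lambda obs, inst, prefix: np.zeros(256)
actions = decode(uniform_model, obs=None, inst="pick up the cup", seq_len=7)
assert len(actions) == 7
```

The loop makes the latency issue visible: one full forward pass per token, which is the bottleneck the non-autoregressive strategies below attack.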
Non-Autoregressive Generation
Non-autoregressive strategies aim to generate entire action sequences in parallel or with fewer sequential steps, targeting the speed bottleneck of autoregressive methods.
1. Probabilistic Generation with CVAEs: As described earlier, CVAEs such as the one in ACT sample actions from a continuous latent distribution, enabling diverse and precise trajectory generation in a single forward pass (or a few), effectively addressing multimodality.
2. Iterative Generation with Diffusion Models: Diffusion strategies have become dominant for high-quality continuous action generation. They excel at producing smooth, diverse trajectories but at a high computational cost due to iterative denoising. The core training objective is noise prediction, as formalized in the previous section.
3. Generation with Flow Matching: An emerging alternative, flow matching (used in π0) models a vector field that deterministically transports samples from a simple noise distribution to the complex data distribution. It is trained with a simpler regression loss and can generate trajectories efficiently with fewer steps. The training minimizes:
$$ \mathcal{L}_{FM} = \mathbb{E}_{t, A_1, A_0} \left[ || v_{\theta}(A_t, O, t) - (A_1 - A_0) ||^2 \right] $$
where $A_t$ is a point on the path between noise $A_0$ and data $A_1$.
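Both the training objective and few-step inference can be sketched as follows. The linear probability path and Euler integrator are one common instantiation of flow matching, not necessarily π0's exact recipe:

```python
import numpy as np

# Flow-matching regression loss for the linear path
# A_t = (1 - t) A_0 + t A_1, whose target velocity is A_1 - A_0.
def fm_loss(v_model, a0, a1, obs, t):
    a_t = (1.0 - t) * a0 + t * a1
    target = a1 - a0
    return np.mean((v_model(a_t, obs, t) - target) ** 2)

# Inference: integrate the learned vector field with a few Euler steps.
def generate(v_model, obs, shape, steps=10, seed=0):
    rng = np.random.default_rng(seed)
    a = rng.normal(size=shape)        # start at noise A_0
    for i in range(steps):
        t = i / steps
        a = a + v_model(a, obs, t) / steps
    return a

# An oracle field pointing at a fixed target shows the integrator working.
target = np.full((4, 7), 0.5)
oracle = lambda a, obs, t: (target - a) / max(1.0 - t, 1e-6)
out = generate(oracle, obs=None, shape=(4, 7), steps=100)
assert np.allclose(out, target, atol=1e-2)
```

Because the loss is a plain regression and inference needs only a handful of integration steps, flow matching avoids both adversarial training and the long denoising chains of diffusion.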
Hybrid Generation Strategies
Hybrid strategies represent a meta-approach that combines different generation paradigms within a single system to leverage their complementary strengths. A common pattern is coupling an autoregressive or LLM-based high-level planner with a diffusion-based low-level controller. The planner breaks down long-horizon instructions into sub-goals, while the controller generates the precise, smooth continuous actions to achieve each sub-goal. Models like HybridVLA integrate collaborative diffusion and autoregressive heads within a unified LLM, demonstrating enhanced robustness. While promising for complex tasks, hybrid strategies introduce challenges in temporal alignment and seamless integration between the different modules.
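The planner-controller pattern can be sketched schematically; both components below are toy stand-ins for an LLM-based planner and a diffusion-based controller:

```python
# Hybrid pipeline sketch: a high-level planner decomposes the instruction
# into sub-goals, and a low-level controller expands each sub-goal into a
# chunk of continuous actions. All names here are illustrative.
def hybrid_execute(planner, controller, obs, instruction):
    trajectory = []
    for subgoal in planner(obs, instruction):        # e.g. autoregressive LLM
        trajectory.extend(controller(obs, subgoal))  # e.g. diffusion policy
    return trajectory

toy_planner = lambda obs, inst: ["reach(cup)", "grasp(cup)", "place(table)"]
toy_controller = lambda obs, g: [f"{g}:step{i}" for i in range(2)]
traj = hybrid_execute(toy_planner, toy_controller,
                      obs=None, instruction="put the cup on the table")
assert len(traj) == 6
```

Even this skeleton exposes the temporal-alignment question raised above: the controller must finish (or be interrupted) before the next sub-goal takes effect.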
Evaluation and Benchmarking
Systematic evaluation is crucial for advancing VLA models. Key benchmarks include:
LIBERO: A benchmark for lifelong learning in manipulation, testing knowledge transfer across tasks. Performance is measured by average success rate across its suites. Recent results show continuous-action models (e.g., OpenVLA-OFT at 95.4%) often outperforming discrete counterparts, highlighting the precision advantage of advanced generative strategies.
Open X-Embodiment: A large-scale, cross-robotic dataset for evaluating generalist policies. Success rates here test a model’s ability to generalize across unseen robot embodiments and tasks. The top-performing model on this benchmark, π0 (70.1%), uses continuous flow matching, but strong discrete models like RT-2-X (60.7%) also show competitive generalizability, indicating that both representation types can be effective with appropriate scale and architecture.
Challenges and Future Directions
The journey towards robust and general embodied AI robots via VLA models is fraught with open challenges that present rich research opportunities.
1. Integration with World Models: A key frontier is moving from reactive policies to predictive agents. Integrating VLA models with learned world models would enable mental simulation and long-horizon planning, allowing an embodied AI robot to reason about consequences before acting.
2. Efficient Generation for Real-Time Control: The computational cost of high-quality generators (especially diffusion) remains a bottleneck for real-time, high-frequency control. Research into fast sampling (e.g., distillation, few-step flows), parallel decoding, and action token compression (e.g., FAST in π0-Fast) is critical for deployment.
3. Generalizable Representations Across Morphologies: Current models are often tied to specific robot kinematics. A grand challenge is developing action representations abstracted from embodiment specifics (e.g., abstract spatial goals or motor primitives), enabling a single policy to be rapidly adapted to diverse embodied AI robot platforms.
4. Safety and Reliability in Open Worlds: Ensuring safe operation in unpredictable environments is paramount. This requires advancements in robust perception under ambiguity, predictive collision avoidance, and the development of verifiable safety layers within the action generation loop.
5. Beyond Imitation: Exploration and Self-Improvement: Current VLA models are fundamentally limited by their demonstration data. Incorporating principles from reinforcement and offline RL to enable exploration and policy improvement beyond the provided data is a vital direction for achieving super-human performance in embodied AI robots.
Conclusion
The field of Vision-Language-Action modeling represents a transformative approach to embodied intelligence. This survey has provided a detailed examination of its core technical engine: how actions are represented and generated. We traced the evolution from discrete tokenization paired with autoregressive generation towards sophisticated continuous representations powered by diffusion models and flow matching. The emerging trend of hybrid strategies underscores the ongoing quest to balance the critical trade-offs between speed, precision, and diversity. While significant challenges remain in efficiency, generalization, safety, and moving beyond imitation, the rapid progress in this area is unmistakable. By systematically addressing these challenges in action representation and generation, we move closer to the vision of truly versatile, reliable, and intelligent embodied AI robots capable of assisting in our complex physical world.
