Research on Multimodal Simulation Training System for Embodied AI Robots in Oil and Gas Stations

In high-risk industries such as petrochemicals, the training and validation of embodied AI robots rely heavily on multimodal sensory data, including vision, LiDAR, and force sensing. However, collecting such data in real-world oil and gas stations poses significant challenges: hazards such as flammable substances and extreme operating conditions raise safety risks; extensive data acquisition is prohibitively expensive; and coverage of diverse scenarios is often limited, hindering the development of robust algorithms. To address these issues, we have developed a high-fidelity multimodal simulation training system based on Unreal Engine, which digitally replicates complex industrial environments and generates synthetic multimodal datasets. This system enables embodied AI robots to train in virtual settings, performing tasks such as inspection, obstacle avoidance, and emergency operations, thereby reducing reliance on real-world testing and accelerating deployment.

Our work focuses on creating a virtual oil and gas station that mimics real-world facilities with high precision. Through advanced 3D modeling, physics engines, and sensor simulation, we have built an immersive environment where embodied AI robots can interact dynamically. The core innovation lies in the integration of multimodal sensor simulations—such as RGB-D cameras, LiDAR, IMU, and GNSS—alongside realistic physical and weather systems. This allows for comprehensive algorithm training and validation, bridging the gap between simulation and reality. In this article, we detail the construction of the virtual environment, the technical implementations, and the benefits of this approach, emphasizing how it supports the advancement of embodied AI robots in industrial applications.

The virtual oil and gas station is constructed using Unreal Engine 5, leveraging its rendering and simulation capabilities. We start by importing CAD/BIM models of real facilities via the Datasmith tool, ensuring geometric accuracy. To maintain high performance, we employ Nanite virtualized geometry for handling high-polygon models and implement Level of Detail (LOD) techniques to simplify distant objects. Instanced rendering is used for repetitive elements like pipes and valves, reducing draw calls and memory usage.

The scene includes key components such as road networks, pipeline corridors, valve assemblies, instrument panels, and tank farms, each with detailed textures and interactive features. For example, roads feature asphalt materials with normal maps to simulate cracks and oil stains, while pipelines are color-coded based on media type (e.g., crude oil, steam, cooling water) and annotated with diameter labels. Valves are modeled as gate or ball types with interactive blueprints that control rotation animations, mimicking real operations. Instrument panels display dynamic readings for pressure, temperature, and flow, linked to blueprint variables for real-time updates. Tank areas include cylindrical and spherical storage units with safety platforms and railings, enhanced by particle systems to simulate leaks or spills.

To summarize the key elements of the virtual environment, we present the following table, which outlines the components and their simulated properties:

Component	Simulated Properties	Techniques Used
Roads	Surface texture, cracks, oil stains	High-resolution textures, normal maps
Pipelines	Media type, diameter, temperature	Color coding, blueprint animations
Valves	Rotation control, open/close states	Interactive blueprints, physics constraints
Instruments	Pressure, temperature, flow readings	Dynamic UI, variable linking
Tanks	Leak simulation, safety features	Particle systems, collision detection
Buildings	Control rooms, offices, fences	LOD optimization, instance rendering

The technical implementation relies on Unreal Engine’s blueprint system for interactivity. We place triggers at strategic locations to detect events, such as robot proximity, and use blueprint scripts to animate components like valve handles. The User Interface (UI) is built with UMG, creating control panels that display real-time data. This setup not only enhances visual fidelity but also ensures that the virtual environment operates with logic consistent with real industrial processes. For embodied AI robots, this means they can train in a context that closely mirrors actual field conditions, improving their adaptability and performance.

Virtual sensor simulation is a cornerstone of our system, enabling embodied AI robots to perceive the environment through synthetic data. We simulate multiple sensor modalities to support algorithms for navigation, perception, and decision-making. Each sensor type is implemented using Unreal Engine components and ROS (Robot Operating System) bridging for data streaming. Below, we describe the key sensors and their simulation methods.

First, RGB cameras are simulated using the SceneCapture2D component. By attaching this component to the embodied AI robot’s viewpoint, we capture color images at configurable resolutions and field-of-view angles. The output is stored as a Render Target texture and converted to ROS image messages. For multi-camera setups, such as stereo or panoramic views, multiple SceneCapture2D components are combined. This allows embodied AI robots to process visual data for tasks like object recognition and scene understanding.

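Depth cameras are simulated similarly, but in SceneDepth mode. This generates depth maps that store, for each pixel, the distance from the camera to the visible surface. Because the renderer outputs depth directly, each pixel $(u, v)$ with depth $z$ can be back-projected to a 3D point using the pinhole camera model:

$$x = \frac{(u - c_x)\, z}{f}, \quad y = \frac{(v - c_y)\, z}{f}$$

where $f$ is the focal length in pixels and $(c_x, c_y)$ is the principal point. For stereo setups, where depth is instead recovered from disparity, the relation is:

$$z = \frac{f \cdot b}{d}$$

where $z$ is the depth, $f$ is the focal length, $b$ is the stereo baseline, and $d$ is the disparity. The point cloud data is published via ROS for obstacle detection and 3D reconstruction.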
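The conversion from a rendered depth map to a point cloud can be sketched in a few lines. The following is a minimal illustration rather than the system's actual implementation: it assumes a pinhole camera with square pixels, a principal point at the image center, and depth values measured along the optical axis. The function names `focal_length_px` and `depth_to_points` are hypothetical.

```python
import math


def focal_length_px(width_px, hfov_deg):
    """Focal length in pixels for a pinhole camera with the given horizontal FOV."""
    return (width_px / 2.0) / math.tan(math.radians(hfov_deg) / 2.0)


def depth_to_points(depth, hfov_deg, max_range=float("inf")):
    """Back-project a row-major depth map to 3D points in the camera frame.

    depth is a list of rows; each value is the distance along the optical
    axis (assumed convention). Returned tuples are (x, y, z) with x right,
    y down, z forward. Pixels with non-positive or out-of-range depth are
    skipped.
    """
    height = len(depth)
    width = len(depth[0])
    f = focal_length_px(width, hfov_deg)  # square pixels assumed: fx == fy
    cx = (width - 1) / 2.0                # principal point assumed at center
    cy = (height - 1) / 2.0
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0.0 or z > max_range:
                continue
            points.append(((u - cx) * z / f, (v - cy) * z / f, z))
    return points


if __name__ == "__main__":
    flat_wall = [[2.0] * 3 for _ in range(3)]  # a wall 2 m ahead
    for p in depth_to_points(flat_wall, hfov_deg=90.0):
        print(p)
```

Deriving the focal length from the configured field of view keeps the synthetic intrinsics consistent with the capture settings, so the same value can be reused when filling in camera metadata for downstream consumers.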

Infrared cameras require thermal simulation. We assign temperature properties to objects using custom materials and stencil buffers in Unreal Engine. The SceneCapture component then renders pseudo-color heat maps, with warmer regions highlighted via emissive materials. Noise and blur effects are added to mimic the low resolution of real IR sensors. This data helps embodied AI robots in thermal anomaly detection, such as overheated equipment.
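The mapping from assigned temperatures to a noisy pseudo-color image can be illustrated outside the engine. This is a simplified sketch under stated assumptions: a linear blue-to-red ramp stands in for the material-based heat map, the noise is zero-mean Gaussian on the temperature value, and blur is omitted. The names `thermal_to_rgb` and `noisy_thermal_image` are hypothetical.

```python
import random


def thermal_to_rgb(temp_c, t_min, t_max):
    """Map a temperature to an (r, g, b) tuple on a linear blue-to-red ramp."""
    span = t_max - t_min
    if span <= 0:
        raise ValueError("t_max must exceed t_min")
    # Normalize and clamp so out-of-range temperatures saturate.
    t = min(1.0, max(0.0, (temp_c - t_min) / span))
    return (round(255 * t), 0, round(255 * (1.0 - t)))


def noisy_thermal_image(temps, t_min, t_max, sigma_c=0.5, seed=None):
    """Convert a grid of temperatures to pseudo-color, adding sensor noise.

    sigma_c is the assumed standard deviation of the temperature noise in
    degrees Celsius; seed makes the output reproducible.
    """
    rng = random.Random(seed)
    return [
        [thermal_to_rgb(t + rng.gauss(0.0, sigma_c), t_min, t_max) for t in row]
        for row in temps
    ]


if __name__ == "__main__":
    scene = [[25.0, 25.0, 95.0], [25.0, 60.0, 95.0]]  # hypothetical temperatures
    for row in noisy_thermal_image(scene, t_min=20.0, t_max=120.0, seed=0):
        print(row)
```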

LiDAR simulation involves ray-casting techniques. We emit multiple rays from the embodied AI robot’s position, scanning horizontally and vertically to simulate multi-line LiDAR (e.g., 16 or 32 lines). The collision points are recorded to generate point clouds. The number of rays $N$ and the angular resolution $\theta$ determine the point density:

$$N = \frac{360^\circ}{\theta_h} \times \frac{V_{\mathrm{fov}}}{\theta_v}$$

where $\theta_h$ and $\theta_v$ are the horizontal and vertical angular resolutions, and $V_{\mathrm{fov}}$ is the vertical field of view. The point cloud is published as ROS PointCloud2 messages for SLAM and path planning.
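The ray pattern implied by this formula can be generated as follows. This is a sketch under assumptions, not the system's code: elevation rows are placed at the centers of the vertical intervals so that the ray count matches the formula exactly, and the sensor frame is taken as x forward, y left, z up. In the engine, each direction would be handed to a line trace to find the hit point. The function names are hypothetical.

```python
import math


def lidar_ray_count(h_res_deg, v_fov_deg, v_res_deg):
    """Number of rays per sweep: (360 / theta_h) * (V_fov / theta_v)."""
    return round(360.0 / h_res_deg) * round(v_fov_deg / v_res_deg)


def lidar_ray_directions(h_res_deg, v_fov_deg, v_res_deg):
    """Unit direction vectors for one full 360-degree sweep.

    The vertical field of view is assumed to be centered on the horizon.
    """
    n_az = round(360.0 / h_res_deg)
    n_el = round(v_fov_deg / v_res_deg)
    rays = []
    for i in range(n_el):
        # Elevation at the center of the i-th vertical interval.
        el = math.radians(-v_fov_deg / 2.0 + (i + 0.5) * v_res_deg)
        for j in range(n_az):
            az = math.radians(j * h_res_deg)
            rays.append((
                math.cos(el) * math.cos(az),
                math.cos(el) * math.sin(az),
                math.sin(el),
            ))
    return rays


if __name__ == "__main__":
    rays = lidar_ray_directions(h_res_deg=1.0, v_fov_deg=30.0, v_res_deg=2.0)
    print(len(rays), lidar_ray_count(1.0, 30.0, 2.0))
```

Note that the formula counts vertical intervals. A physical sensor whose beams sit on both edges of the field of view has one more scan line than $V_{\mathrm{fov}} / \theta_v$, so the line count should be checked against the datasheet of the LiDAR being emulated.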

IMU and GNSS sensors are simulated by reading the robot actor’s motion state. For IMU, we extract linear acceleration $a$ and angular velocity $\omega$ from the physics engine, adding noise models to simulate real-world errors:

$$a_{measured} = a_{true} + \epsilon_a, \quad \epsilon_a \sim \mathcal{N}(0, \sigma_a^2)$$
$$\omega_{measured} = \omega_{true} + \epsilon_\omega, \quad \epsilon_\omega \sim \mathcal{N}(0, \sigma_\omega^2)$$

where $\epsilon_a$ and $\epsilon_\omega$ are zero-mean Gaussian noise terms with variances $\sigma_a^2$ and $\sigma_\omega^2$. GNSS data is generated using Unreal Engine’s georeferencing plugins, converting virtual coordinates to latitude and longitude with simulated drift errors. Together these sensors support localization of embodied AI robots; the IMU is particularly important where GNSS signals are degraded or unavailable.
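The additive noise model above can be sketched as a small wrapper around the ground-truth motion state. This is an illustrative example only: it applies independent zero-mean Gaussian noise per axis and does not model bias or drift. The name `make_imu_noise` and the sample values are hypothetical.

```python
import random


def make_imu_noise(sigma_a, sigma_w, seed=None):
    """Return a function that corrupts true IMU readings with Gaussian noise.

    sigma_a and sigma_w are the standard deviations of the acceleration
    (m/s^2) and angular velocity (rad/s) noise. A seed makes runs repeatable.
    """
    rng = random.Random(seed)

    def measure(a_true, w_true):
        # Independent noise on each axis: measured = true + N(0, sigma^2).
        a_meas = tuple(a + rng.gauss(0.0, sigma_a) for a in a_true)
        w_meas = tuple(w + rng.gauss(0.0, sigma_w) for w in w_true)
        return a_meas, w_meas

    return measure


if __name__ == "__main__":
    measure = make_imu_noise(sigma_a=0.05, sigma_w=0.002, seed=42)
    # Hypothetical ground truth: robot at rest, gravity on the z axis.
    print(measure((0.0, 0.0, 9.81), (0.0, 0.0, 0.0)))
```

Seeding the generator lets a training run be replayed with identical sensor noise, which helps when comparing algorithm versions on the same simulated trajectory.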

The following table summarizes the virtual sensors and their key parameters, which are essential for training embodied AI robots:

Sensor Type	Simulated Data	Key Parameters	ROS Message Type
RGB Camera	Color images	Resolution, FOV, frame rate	sensor_msgs/Image
Depth Camera	Depth maps, point clouds	Depth range, noise model	sensor_msgs/Image, sensor_msgs/PointCloud2
Infrared Camera	Thermal images	Temperature range, emissivity	sensor_msgs/Image
LiDAR	3D point clouds	Scan lines, angular resolution	sensor_msgs/PointCloud2
IMU	Acceleration, angular velocity	Noise variance, drift rate	sensor_msgs/Imu
GNSS	Latitude, longitude, altitude	Error margin, update frequency	sensor_msgs/NavSatFix

Physical and climate simulations add another layer of realism to the virtual environment, allowing embodied AI robots to train under diverse conditions. We use Unreal Engine’s Chaos physics engine and Niagara particle systems to simulate dynamics and weather effects.

For diurnal cycles, we employ the Sky Atmosphere component and a Directional Light to mimic the sun. The light’s altitude angle changes over time, affecting sky color and illumination intensity. At night, artificial lights like floodlights are simulated with point and rectangle lights, coupled with Lumen global illumination for realistic shadows. The Sky Light captures ambient light, ensuring smooth transitions. This variability helps embodied AI robots adapt to lighting changes during tasks like night patrols.

Weather systems are simulated with Niagara particles. Rainfall involves particle emitters that generate raindrops with collision detection for splash effects, while post-processing volumes enhance wet surface reflections. Snowfall uses slow-falling snowflakes and material switching to accumulate snow on surfaces, with adjusted ambient light for overcast conditions. Fog and dust are created using exponential height fog and dynamic particle systems, respectively. These conditions test the robustness of embodied AI robots’ sensors; for example, LiDAR performance in fog can be evaluated using the attenuation model:

$$I = I_0 e^{-\beta d}$$

where $I$ is the received intensity, $I_0$ is the emitted intensity, $\beta$ is the attenuation coefficient, and $d$ is the distance.
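As a worked illustration of this model, the snippet below computes the received intensity at a given distance and the distance at which the signal falls below a detection threshold, obtained by solving the equation for $d$: $d_{max} = \ln(I_0 / I_{min}) / \beta$. The threshold $I_{min}$, the function names, and the numeric values are assumptions for the example and do not come from the system.

```python
import math


def received_intensity(i0, beta, d):
    """Attenuated intensity I = I0 * exp(-beta * d)."""
    return i0 * math.exp(-beta * d)


def max_detectable_range(i0, beta, i_min):
    """Distance at which the intensity drops to the threshold i_min."""
    if beta <= 0.0:
        return math.inf  # no attenuation: range is not limited by fog
    if i_min >= i0:
        return 0.0       # threshold already above the emitted intensity
    return math.log(i0 / i_min) / beta


if __name__ == "__main__":
    # Hypothetical coefficients for clear air, light fog, and dense fog (1/m).
    for beta in (0.001, 0.02, 0.1):
        d_max = max_detectable_range(i0=1.0, beta=beta, i_min=0.01)
        print(f"beta={beta}: usable range {d_max:.1f} m")
```

The model is written one-way, as in the equation above. A LiDAR return travels to the target and back, so a two-way treatment would double the exponent; which form to use depends on how $\beta$ was calibrated.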

Physics simulation involves rigid body dynamics and constraints. We enable Simulate Physics on mesh objects to simulate gravity, collisions, and rebounds—useful for scenarios like tool drops. Joints are modeled with Physics Constraint components, allowing rotational or linear movements for mechanical arms or valves. For instance, a valve’s rotation can be described by the torque equation:

$$\tau = I \alpha$$

where $\tau$ is the applied torque, $I$ is the moment of inertia, and $\alpha$ is the angular acceleration. Fluid and cloth simulations are approximated using rigid body joints and Chaos Cloth, respectively, providing realistic interactions for embodied AI robots during operations like hose handling.
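The torque relation can be turned into a simple time-stepping example. The sketch below integrates $\alpha = \tau / I$ with a semi-implicit Euler scheme for a constant torque and no friction. It is meant to illustrate the equation, not the Chaos solver, and the function name and values are hypothetical.

```python
def simulate_valve(torque, inertia, duration, dt=0.001):
    """Integrate tau = I * alpha for a valve handle starting at rest.

    Returns (angle_rad, omega_rad_s) after `duration` seconds. Uses
    semi-implicit Euler: velocity is updated first, then position.
    """
    if inertia <= 0.0:
        raise ValueError("inertia must be positive")
    angle = 0.0
    omega = 0.0
    alpha = torque / inertia  # constant because the torque is constant
    for _ in range(int(round(duration / dt))):
        omega += alpha * dt
        angle += omega * dt
    return angle, omega


if __name__ == "__main__":
    # Hypothetical handle: 2 N*m applied to 0.5 kg*m^2 for one second.
    angle, omega = simulate_valve(torque=2.0, inertia=0.5, duration=1.0)
    print(f"angle={angle:.3f} rad, omega={omega:.3f} rad/s")
```

For constant torque the closed-form result is $\omega = \alpha t$ and $\theta = \tfrac{1}{2} \alpha t^2$, which gives a quick check on the step size: the example above should approach 4 rad/s and 2 rad as `dt` shrinks.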

The integration of these simulations creates a comprehensive training platform for embodied AI robots. By exposing them to varied physical and weather conditions, we enhance their generalization capabilities. The system generates diverse multimodal datasets that compensate for the scarcity of real-world data. For example, we can simulate rare events like sandstorms or equipment failures, allowing embodied AI robots to learn robust responses without real risks.

Our system offers significant advantages for the development and deployment of embodied AI robots in oil and gas stations. Firstly, it reduces safety risks by enabling virtual testing; embodied AI robots can practice hazardous tasks like leak inspection or emergency shutdowns without endangering personnel or infrastructure. Secondly, it lowers costs associated with physical prototypes and field trials. Thirdly, it accelerates algorithm iteration by providing abundant synthetic data for training machine learning models. We have conducted experiments where embodied AI robots trained in our virtual environment showed improved performance in real-world tests, particularly in navigation accuracy and object recognition.

To quantify the benefits, we can analyze the data generation efficiency. Suppose a real-world data collection campaign yields $D_{real}$ datasets per month at cost $C_{real}$, while our simulation generates $D_{sim}$ datasets over the same period at cost $C_{sim}$. The cost-effectiveness ratio $R$ is:

$$R = \frac{D_{sim} / C_{sim}}{D_{real} / C_{real}}$$

In our case, $R \gg 1$, indicating superior efficiency. Additionally, the diversity of simulated scenarios enhances algorithm robustness, as measured by the generalization error $E_g$ on unseen real data:

$$E_g = \frac{1}{N} \sum_{i=1}^{N} L(f(x_i), y_i)$$

where $L$ is the loss function, $f$ is the model trained on simulation data, $(x_i, y_i)$ are real-world samples, and $N$ is the number of such samples. Our results show that $E_g$ decreases as simulation fidelity increases, validating the approach.
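Both metrics are straightforward to compute. The sketch below shows one way to do so; all numbers, the toy model, and the squared-error loss are placeholders chosen for illustration and are not results from our experiments.

```python
def cost_effectiveness(d_sim, c_sim, d_real, c_real):
    """R = (D_sim / C_sim) / (D_real / C_real), over the same time period."""
    return (d_sim / c_sim) / (d_real / c_real)


def generalization_error(model, samples, loss):
    """Mean loss of a simulation-trained model on real-world samples."""
    if not samples:
        raise ValueError("at least one real-world sample is required")
    return sum(loss(model(x), y) for x, y in samples) / len(samples)


if __name__ == "__main__":
    # Placeholder figures: datasets per month and cost in arbitrary units.
    print(cost_effectiveness(d_sim=1000, c_sim=10, d_real=20, c_real=40))

    # Toy model and squared-error loss standing in for a trained network.
    def model(x):
        return 2 * x

    def squared_error(pred, target):
        return (pred - target) ** 2

    real_samples = [(1, 2), (2, 5)]
    print(generalization_error(model, real_samples, squared_error))
```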

Looking ahead, we plan to enhance the system in several directions. One focus is improving physical realism, especially in fluid dynamics and thermal simulations, to better support tasks like spill response. We aim to integrate computational fluid dynamics (CFD) models, described by Navier-Stokes equations:

$$\frac{\partial \mathbf{u}}{\partial t} + (\mathbf{u} \cdot \nabla) \mathbf{u} = -\frac{1}{\rho} \nabla p + \nu \nabla^2 \mathbf{u} + \mathbf{g}$$

where $\mathbf{u}$ is the velocity field, $p$ is pressure, $\rho$ is density, $\nu$ is kinematic viscosity, and $\mathbf{g}$ is gravitational acceleration. Another direction is leveraging digital twin technology, where virtual and physical systems are synchronized in real time, enabling predictive maintenance and remote operation for embodied AI robots. We are also exploring AI-driven scene generation, using generative adversarial networks (GANs) to create novel training scenarios and further expand dataset diversity.

In conclusion, our multimodal simulation training system represents a pivotal advancement for embodied AI robots in high-risk industrial settings. By combining high-fidelity virtual environments with comprehensive sensor and physics simulations, we provide a safe, cost-effective, and scalable platform for training and validation. This approach not only addresses the challenges of real-world data collection but also fosters innovation in robot autonomy. As we continue to refine the system, we believe it will become a standard tool in the development pipeline for embodied AI robots, driving their adoption in oil and gas and beyond, ultimately enhancing safety and efficiency in critical operations.