In recent years, inspection robots have been applied increasingly widely in coal mines, significantly improving efficiency and reducing the risks associated with manual inspection. Traditional configurations, such as wheeled and tracked (crawler) robots, perform well on flat terrain but face limitations in complex underground environments with steps, rails, and rugged gravel surfaces. In contrast, quadruped robots, often referred to as robot dogs, exhibit superior terrain adaptability owing to their bio-inspired locomotion, making them well suited to inspection tasks in coal mines. However, the harsh conditions of underground mines, including irregular terrain, slippery surfaces, and limited visibility, pose severe challenges to the motion control and state estimation of these robots. To address these issues, I propose an end-to-end neural network controller that integrates state estimation, motor adaptation, and control modules, enabling accurate estimation of the robot's velocity, pose, and motor drive parameters and adaptation to its dynamics in real time. The approach relies on deep reinforcement learning trained in a physics-based simulation environment, with domain randomization used to improve generalization. By employing an asymmetric actor-critic algorithm, privileged observations, and a recurrent neural network structure, the controller achieves robust performance in complex environments and significantly improves the motion control and state estimation accuracy of the quadruped robot.
The control of a quadruped robot can be modeled as a partially observable Markov decision process (POMDP), where the robot interacts with its environment over discrete time steps. At each control cycle time \( t \), the system state is represented by a vector \( s_t \), and the agent receives an observation vector \( o_t \) based on the probability distribution \( O(o_t | s_t) \). The agent then computes an action \( a_t \) according to its policy \( \pi(a_t | o_t) \), influencing the environment to transition to the next state \( s_{t+1} \) with probability \( P(s_{t+1} | s_t, a_t) \), and receives a reward \( r_{t+1} \). The state vector must satisfy the Markov property, meaning \( P(s_{t+1} | s_t, a_t) = P(s_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, \ldots) \). The goal of reinforcement learning is to find the optimal policy \( \pi^* \) that maximizes the expected discounted reward:
$$ \pi^* = \arg\max_{\pi} \mathbb{E} \left[ \sum_{t=0}^{\infty} \gamma^t r_t \right] $$
where \( \gamma \in (0, 1) \) is the discount factor, weighting immediate against future rewards. In the context of quadruped robot control, the controller acts as the agent, issuing control signals as actions based on sensor data, while the robot and the physical environment constitute the POMDP environment, whose state transitions follow physical laws.
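As a concrete illustration of this objective, the short Python sketch below (illustrative only, not part of the controller code) computes the discounted return of a single recorded episode:

```python
def discounted_return(rewards, gamma=0.99):
    """Compute sum_t gamma^t * r_t for one episode, accumulating backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Example: three control cycles with rewards 1.0, 0.5, 2.0
print(discounted_return([1.0, 0.5, 2.0]))  # 1.0 + 0.99*0.5 + 0.99**2*2.0 ≈ 3.455
```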

For this work, I used the Unitree Aliengo quadruped robot, which has 12 joint degrees of freedom (3 per leg × 4 legs), an onboard IMU, and joint motor encoders. The robot supports position control of the joint motors and provides a software library for reading sensor data and sending motor commands. The neural network controller computes target joint positions and sends them to the motors for execution. Traditional, non-end-to-end controllers rely on separate state estimation algorithms, such as Kalman filters, to estimate the body velocity and pose; however, these methods struggle with slip, terrain deformation, and aggressive motions that fall outside the assumed dynamics model, limiting performance on complex terrain and in agile motion tasks. In contrast, my end-to-end approach maps sensor data directly to motor target positions without explicitly modeling the robot dynamics or contact relationships through the classical rigid-body equation of motion:
$$ M(q)\ddot{q} + C(q, \dot{q})\dot{q} + G(q) = \tau $$
where \( M(q) \) is the mass matrix, \( C(q, \dot{q})\dot{q} \) collects the Coriolis and centrifugal terms, \( G(q) \) is the gravity vector, and \( \tau \) denotes the joint torques. Instead, reinforcement learning lets the policy network adaptively adjust foot contact forces through training with randomized friction coefficients and terrains, eliminating the need for an explicit model coupling contact forces to the motor drives.
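To make the end-to-end mapping concrete, the following minimal sketch (with hypothetical sensor-access helpers; the full observation layout is detailed later) shows a single control step: raw sensor data are assembled into the observation vector, and the policy output is interpreted directly as the 12 target joint positions:

```python
import numpy as np

def control_step(policy, imu, encoders, velocity_command, prev_action):
    """One end-to-end control cycle: sensors -> policy network -> joint position targets."""
    obs = np.concatenate([
        imu.angular_velocity,       # body angular velocity (3)
        imu.orientation,            # body orientation, e.g., as a 3D vector
        encoders.joint_positions,   # 12 joint angles
        encoders.joint_velocities,  # 12 joint velocities
        velocity_command,           # commanded body velocity
        prev_action,                # previous 12D action
    ])
    q_target = policy(obs)          # 12 target joint positions, no explicit dynamics model
    return q_target                 # tracked by the motors' onboard position controllers
```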
The neural network controller models the policy \( \pi(a_t | o_t) \): it takes an observation vector \( o_t \) composed of robot sensor data as input and outputs a 12-dimensional action vector \( a_t \), defined as the target positions of the 12 joints. During training, to encourage exploration, the network output is used as the mean \( \mu_t \) of a multivariate Gaussian distribution \( \mathcal{N}(a_t \mid \mu_t, \Sigma_t) \), from which the action \( a_t \) is sampled; during testing and real-world deployment, the output vector is used directly as the action. The reward is computed at each control cycle as a weighted sum of multiple components, defined as follows:
| Reward Component | Mathematical Expression | Description |
|---|---|---|
| Velocity Tracking Reward | \( \exp\left(-4\,\lVert v_{\text{goal},xy} - v_{\text{current},xy} \rVert^2\right) \) | Encourages tracking of the commanded linear velocity in the XY plane. |
| Turning Velocity Reward | \( \exp\left(-4\,(\omega_{\text{goal},z} - \omega_{\text{current},z})^2\right) \) | Encourages tracking of the commanded angular velocity about the Z axis. |
| Vertical Vibration Penalty | \( -(v_{\text{current},z})^2 \) | Discourages vertical oscillation of the robot body. |
| XY Rotation Penalty | \( -(\omega_{\text{current},x})^2 - (\omega_{\text{current},y})^2 \) | Penalizes rotation about the X and Y axes to maintain stability. |
| Joint Torque Penalty | \( -\lVert \tau \rVert^2 \) | Discourages excessive joint torques to prevent damage. |
| Joint Acceleration Penalty | \( -\lVert \ddot{\theta} \rVert^2 \) | Minimizes joint accelerations for smoother motion. |
| Action Smoothness Penalty | \( -\lVert a_t - a_{t-1} \rVert^2 \) | Encourages gradual changes in the control signal. |
| Foot Air Time Reward | \( t_{\text{air}} - 0.5 \) | Rewards foot air time (per foot) to discourage sliding, foot-dragging gaits. |
| Collision Penalty | \( -\,\mathbb{1}(f_{\text{contact}} > 1) \) | Penalizes contacts in which the robot body or thighs touch the ground (contact force above a threshold). |
The velocity tracking and turning velocity rewards are core components that guide the policy to follow target body velocity commands. The vertical vibration and XY rotation penalties help maintain body stability, while the joint torque, acceleration, and action smoothness penalties ensure stable and smooth joint movements. The foot air time reward prevents the controller from adopting sliding motions, and the collision penalty encourages proper body posture. This comprehensive reward structure is crucial for training an effective control policy for the quadruped robot.
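A minimal sketch of how this weighted-sum reward could be assembled is shown below; the weights dictionary and the fields of `state` are illustrative assumptions rather than the exact quantities used in training:

```python
import numpy as np

def compute_reward(state, action, prev_action, weights):
    """Weighted sum of the reward terms listed above (illustrative fields and weights)."""
    r = {}
    r["track_lin"] = np.exp(-4.0 * np.sum((state.v_goal_xy - state.v_xy) ** 2))
    r["track_ang"] = np.exp(-4.0 * (state.w_goal_z - state.w_z) ** 2)
    r["lin_z"]     = -state.v_z ** 2
    r["ang_xy"]    = -(state.w_x ** 2 + state.w_y ** 2)
    r["torque"]    = -np.sum(state.joint_torques ** 2)
    r["joint_acc"] = -np.sum(state.joint_accelerations ** 2)
    r["smooth"]    = -np.sum((action - prev_action) ** 2)
    r["air_time"]  = np.sum(state.feet_air_time - 0.5)              # per-foot air time reward
    r["collision"] = -float(np.any(state.body_contact_forces > 1.0))
    return sum(weights[k] * v for k, v in r.items())
```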
Training the reinforcement learning agent requires extensive interaction data, which is difficult to collect on real robots due to safety risks, low sampling efficiency, and the difficulty of measuring certain quantities. I therefore used the IsaacLab framework to simulate the Aliengo robot and its environment, leveraging Isaac Sim and the PhysX physics engine for dynamics and contact simulation. The framework simulates thousands of robots in parallel on GPUs, so one hour of simulation yields data equivalent to approximately two months of continuous data collection on a single quadruped robot. The robot model is based on the official Unitree Aliengo URDF (Unified Robot Description Format), and terrains are produced with IsaacLab's random terrain generator. Each simulation episode lasts at most 20 seconds, with a physics step of 0.005 s and a control period of 0.02 s (one control step every four physics steps). An episode terminates when the time limit is reached or the robot body touches the ground.
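The key simulation settings can be collected into a small configuration object; this is a plain-Python sketch of the values above, not IsaacLab's actual configuration classes:

```python
from dataclasses import dataclass

@dataclass
class SimConfig:
    num_envs: int = 4096           # parallel Aliengo instances on the GPU
    physics_dt: float = 0.005      # physics step (s)
    control_dt: float = 0.02       # control period (s)
    episode_length_s: float = 20.0 # episode also ends early if the body touches the ground

cfg = SimConfig()
assert round(cfg.control_dt / cfg.physics_dt) == 4  # one control step per four physics steps
```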
Despite high-fidelity simulations, a simulation-to-real gap exists due to modeling inaccuracies in robot dimensions, mass, inertia, center of mass positions, motor output characteristics, joint dynamics, terrain material properties, and contact mechanics. To address this, I applied domain randomization techniques, varying key parameters during training to enhance the controller’s generalization. The randomization includes three aspects:
| Domain | Randomized Parameters | Range or Method |
|---|---|---|
| Robot Model | Body mass, initial joint angles and velocities, initial body pose and velocity | Uniform distribution within ±10% of nominal values |
| Terrain | Terrain morphology parameters, friction coefficients | Randomized using IsaacLab terrain generator; difficulty levels adjusted based on agent performance |
| Motor Dynamics | Spring-damper model parameters \( k_p \) and \( k_d \) in \( \tau = k_p \cdot \Delta\theta + k_d \cdot \Delta\dot{\theta} \) | Multiplied by random scaling factors from 0.8 to 1.2 |
Terrain randomization involves generating terrains with varying roughness and friction to simulate wet or slippery surfaces. Initially, lower-difficulty terrains are used, and the difficulty is adjusted based on the agent’s performance to avoid overly conservative policies. Motor randomization applies to the simplified spring-damper model used in simulation, where \( k_p \) and \( k_d \) parameters are scaled randomly to mimic real-world motor variations.
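A minimal sketch of this per-episode randomization, assuming simple uniform sampling (the friction range below is an assumed placeholder, and the actual IsaacLab randomization code differs), is:

```python
import numpy as np

def randomize_episode(nominal_mass, kp_nominal, kd_nominal, rng=None):
    """Draw randomized physical parameters for one training episode."""
    rng = rng or np.random.default_rng()
    mass = nominal_mass * rng.uniform(0.9, 1.1)   # body mass within ±10% of nominal
    kp = kp_nominal * rng.uniform(0.8, 1.2)       # motor stiffness scale factor
    kd = kd_nominal * rng.uniform(0.8, 1.2)       # motor damping scale factor
    friction = rng.uniform(0.3, 1.2)              # friction range is an assumed placeholder
    return mass, kp, kd, friction

def motor_torque(kp, kd, q_target, q, qd_target, qd):
    """Simplified spring-damper motor model: tau = kp*dtheta + kd*dtheta_dot."""
    return kp * (q_target - q) + kd * (qd_target - qd)
```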
The end-to-end design requires the actor network to rely solely on sensor-derived data: linear velocity, angular velocity, body orientation, joint velocities, joint positions, the target velocity command, and the previous action. Because a single time step of these observations does not fully capture the robot's state (the observation alone does not satisfy the Markov property), I incorporated a recurrent neural network (RNN), specifically a long short-term memory (LSTM) network, to process the history of observations. The LSTM has a single layer with 256 hidden units, no dropout, and a sequence window of 50 steps (1 second of history at the 50 Hz control frequency). The LSTM output feeds into a multilayer perceptron (MLP); the actor network outputs the action vector \( a_t \), and the critic network outputs the state value \( v(s_t) \).
Domain randomization and RNNs increase task complexity and training difficulty. To mitigate this, I employed the asymmetric actor-critic (AAC) algorithm, which provides the critic network with additional privileged observations \( o^p_t \) available only in simulation. The actor network uses only sensor-based observations \( o_t \), ensuring deployability on real robots without intermediate state estimation algorithms. The critic network leverages privileged data, such as robot center-of-mass velocity \( v_t \in \mathbb{R}^3 \), angular velocity \( \omega_t \in \mathbb{R}^3 \), and orientation \( R_t \in \mathbb{R}^3 \), for more accurate value function estimation. This asymmetric approach enhances training efficiency without requiring privileged data for the actor, resulting in a fully end-to-end controller. The network architecture is summarized as:
$$ \text{Actor: } o_t \rightarrow \text{LSTM} \rightarrow \text{MLP} \rightarrow a_t $$
$$ \text{Critic: } o_t, o^p_t \rightarrow \text{LSTM} \rightarrow \text{MLP} \rightarrow v(s_t) $$
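A PyTorch sketch of this asymmetric actor-critic structure is shown below; only the single-layer, 256-unit LSTM is taken from the text, while the MLP sizes, activation, and observation dimensions are assumptions for illustration:

```python
import torch
import torch.nn as nn

class RecurrentNet(nn.Module):
    """LSTM followed by an MLP head; the critic additionally receives privileged observations."""
    def __init__(self, obs_dim, out_dim, priv_dim=0, hidden=256, mlp_sizes=(256, 128)):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim + priv_dim, hidden, num_layers=1, batch_first=True)
        layers, last = [], hidden
        for size in mlp_sizes:                      # MLP sizes/activation are assumed, not reported
            layers += [nn.Linear(last, size), nn.ELU()]
            last = size
        layers.append(nn.Linear(last, out_dim))
        self.mlp = nn.Sequential(*layers)

    def forward(self, obs_seq, hidden_state=None):
        # obs_seq: (batch, time, features); the critic's input concatenates o_t and o^p_t.
        out, hidden_state = self.lstm(obs_seq, hidden_state)
        return self.mlp(out), hidden_state

actor  = RecurrentNet(obs_dim=45, out_dim=12)             # sensor obs -> 12 joint targets (obs_dim assumed)
critic = RecurrentNet(obs_dim=45, out_dim=1, priv_dim=9)  # adds v_t, omega_t, R_t (3 + 3 + 3)
```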
Training used the proximal policy optimization (PPO) algorithm with 4,096 simulated Aliengo robots running in parallel. Each episode again had a maximum length of 20 seconds, a simulation step of 0.005 s, and a control period of 0.02 s. Over 2,500 training iterations, approximately 240 million agent-environment interaction steps were collected. I compared training with and without privileged data: privileged data accelerated learning, enabling the agent to acquire basic control strategies quickly and to reach higher rewards and terrain difficulty levels. Specifically, the controller trained with privileged data reached an average reward of 18.9 and an average terrain difficulty of 4.4, whereas the one without privileged data achieved only 13.1 and 3.7, respectively. This demonstrates that privileged learning significantly improves both training efficiency and final policy performance for the quadruped robot.
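These figures are mutually consistent if each iteration collects roughly 24 control steps per robot, a rollout length inferred from the reported totals rather than stated explicitly:
$$ 4096 \ \text{robots} \times 24 \ \text{steps/iteration} \times 2500 \ \text{iterations} \approx 2.46 \times 10^{8} \ \text{steps}. $$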
For deployment, I tested the trained neural network controller on a real Unitree Aliengo robot in several environments: outdoor grass, underground gravel roads, underground slopes with steps, and main inclined shaft ramps. The deployment stack was built on ROS (Robot Operating System) and the Unitree Legged SDK, running on an onboard computer with Ubuntu 16.04 and ROS Kinetic. The neural network was executed via LibTorch at 50 Hz, with the latest IMU and encoder data concatenated into a vector as input; the computed joint target positions were sent to the motor controllers for execution. Test commands included forward, backward, and lateral movement as well as left and right rotation, with the translation velocity set to 0.5 m/s and the turning angular velocity to 0.5 rad/s.
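For illustration, the following Python sketch approximates that 50 Hz loop; the actual controller runs through LibTorch in C++, and the node name, file name, and sensor-access helpers here are hypothetical stubs:

```python
import numpy as np
import rospy
import torch

# Placeholder accessors; on the real robot these wrap the Unitree Legged SDK and ROS callbacks.
def read_imu():            return np.zeros(6, dtype=np.float32)   # angular velocity + orientation
def read_encoders():       return np.zeros(24, dtype=np.float32)  # joint positions + velocities
def read_command():        return np.zeros(3, dtype=np.float32)   # commanded vx, vy, yaw rate
def send_joint_targets(a): pass                                   # forward targets to the motors

rospy.init_node("aliengo_policy_runner")        # hypothetical node name
policy = torch.jit.load("aliengo_policy.pt")    # exported TorchScript policy (assumed file name)
rate = rospy.Rate(50)                           # 50 Hz control frequency

hidden = None                                   # LSTM hidden state carried across control steps
prev_action = np.zeros(12, dtype=np.float32)

while not rospy.is_shutdown():
    obs = np.concatenate([read_imu(), read_encoders(), read_command(), prev_action])
    with torch.no_grad():
        obs_t = torch.from_numpy(obs).view(1, 1, -1)   # (batch=1, time=1, features)
        action_t, hidden = policy(obs_t, hidden)       # assumes the export returns its hidden state
    action = action_t.view(-1).numpy()
    send_joint_targets(action)                         # joint position targets to the motor controllers
    prev_action = action
    rate.sleep()
```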
To evaluate the zero-shot generalization capability of the neural network controller, I quantified its adaptability to load-inertia parameter changes by rigidly attaching removable counterweights to the Aliengo robot’s back, with a longitudinal offset of +10 cm from the center of mass. Loads of 1 kg, 3 kg, and 6 kg were applied (representing 5%, 15%, and 30% of the total robot mass, respectively), and standard velocity tracking tasks were performed on gravel and grass terrains (forward command 0.5 m/s, turning command 0.5 rad/s). Additionally, step terrain passing tasks (step height 5 cm) were conducted 10 times to calculate success rates. Data were collected via LiDAR for analysis.
The velocity tracking performance results indicated that with 1 kg and 3 kg loads, the forward velocity tracking error was \( 0.05 \pm 0.02 \) m/s (relative error 10%), showing no significant difference from the no-load condition (\( 0.04 \pm 0.01 \) m/s). However, with a 6 kg load, the error increased to \( 0.20 \pm 0.05 \) m/s (relative error 40%). In step passing tasks, the controller achieved a 100% success rate across all load conditions. Although gait deviations occurred under the 6 kg load, the controller adjusted subsequent gait planning and control to restore stability and successfully navigate the steps. These results highlight the robustness of the quadruped robot controller in handling payload variations and complex terrains.
In conclusion, this work addresses the control challenges of quadruped robots in coal mines by proposing a fully end-to-end neural network controller based on reinforcement learning. By integrating state estimation, motor adaptation, and control modules, and leveraging domain randomization, asymmetric actor-critic algorithms, privileged observations, and recurrent neural networks, the controller achieves robust performance in complex underground environments. The simulation and real-world tests demonstrate significant improvements in motion control and state estimation accuracy for the robot dog, enabling reliable operation in challenging conditions such as slippery surfaces, uneven terrain, and dynamic loads. Future work could focus on extending this approach to multi-robot coordination and long-term autonomy in mining applications, further enhancing the capabilities of quadruped robots in industrial settings.
