Supervised Learning-Based Ground Contact Event Detection for Bipedal Robots

Accurate state estimation is crucial for bipedal robots that perform dynamic movements and navigate unstructured environments. Traditional ground contact detection relies on foot force sensors, which are susceptible to impact noise, add cost and integration burden, and struggle with complex dynamics, limiting their robustness and applicability. To address these challenges, we developed a supervised learning-based algorithm for detecting ground contact events in bipedal robots. The algorithm supplies the invariant extended Kalman filter (InEKF) with accurate, low-cost contact signals, improving state estimation without the need for expensive force sensors. Our method processes sensor data, including inertial measurement unit (IMU) readings, encoder outputs, motor current measurements, and kinematics-derived foot height and velocity. By employing mutual information for feature selection and a one-dimensional convolutional neural network (1DCNN) with graph encoding for temporal feature extraction, we achieve reliable regression of foot contact forces and classification of contact events, providing a cost-effective and efficient solution for bipedal locomotion.

State estimation forms the backbone of bipedal control and perception. We use the InEKF, which formulates the system equations on Lie groups to reduce estimation error. The state is defined as follows: $$x_t = (R_t, v_t, p_t, d_t, b_t^a, b_t^\omega)$$ where $R_t$ is the IMU orientation (a rotation matrix), $v_t$ is the IMU velocity, $p_t$ is the IMU position, $d_t$ is the contact point position, and $b_t^a$ and $b_t^\omega$ are the accelerometer and gyroscope biases, respectively. The system dynamics are governed by differential equations that account for sensor inputs and kinematic constraints. For instance, the orientation evolves as $$\frac{dR_t}{dt} = R_t (\omega_t - b_t^\omega - n_t^\omega)^\times$$ where $(\cdot)^\times$ denotes the skew-symmetric matrix of a vector, $\omega_t$ is the measured angular velocity, and $n_t^\omega$ is gyroscope noise. The forward kinematic observation used for contact detection is $$h_t^R(\alpha_t) = R_t^T (d_t - p_t) + J(\alpha_t) \dot{\alpha}_t$$ where $\alpha_t$ are the joint angles and $J(\alpha_t)$ is the Jacobian mapping joint velocities to end-effector velocities. This formulation lets us fold contact information into the state estimation process, improving reliability in dynamic scenarios.
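To illustrate the propagation step, the following is a minimal NumPy sketch of how the orientation dynamics above can be discretized with the SO(3) exponential map. The function names and the simple Euler step are our own assumptions for exposition, not the filter implementation described here.

```python
import numpy as np

def skew(w):
    """Map a 3-vector to its skew-symmetric matrix (w)^x."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def so3_exp(phi):
    """Rodrigues' formula: exponential map from so(3) to SO(3)."""
    angle = np.linalg.norm(phi)
    if angle < 1e-8:
        return np.eye(3) + skew(phi)  # first-order approximation near zero
    axis = phi / angle
    K = skew(axis)
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def propagate_orientation(R, omega_meas, bias_gyro, dt):
    """One Euler step of dR/dt = R (omega - b)^x; noise drops out of the mean."""
    return R @ so3_exp((omega_meas - bias_gyro) * dt)
```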

To model contact events, we designed a finite state machine (FSM) that captures the transitions between leg states during locomotion. The FSM includes four primary states: swing, contact, support, and abnormal (e.g., sliding or bouncing). Transitions are triggered by foot contact force and timing, ensuring that the dynamics model switches at the right moment. For example, when a foot transitions from swing to contact it may experience impact forces, and switching prematurely can introduce estimation errors. The FSM determines the optimal moments for updating the contact point $d_t$ in the InEKF, which is critical for accurate kinematic observations. This state-based structure improves robustness on complex terrain and against unexpected obstacles.
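The following is a minimal Python sketch of such a contact FSM. The force thresholds, dwell time, and exact transition rules are illustrative assumptions, not the tuned values used on the robot.

```python
from enum import Enum, auto

class LegState(Enum):
    SWING = auto()
    CONTACT = auto()   # touchdown transient
    SUPPORT = auto()   # stable stance
    ABNORMAL = auto()  # e.g., slip or bounce

class ContactFSM:
    """Illustrative contact FSM; thresholds and dwell times are placeholders."""
    def __init__(self, touch_force=30.0, lift_force=10.0, settle_time=0.04):
        self.state = LegState.SWING
        self.touch_force = touch_force  # N, above which touchdown is declared
        self.lift_force = lift_force    # N, below which lift-off is declared
        self.settle_time = settle_time  # s, dwell before contact becomes support
        self.timer = 0.0

    def update(self, force, dt):
        """Advance the FSM given the (predicted) contact force on this foot."""
        self.timer += dt
        if self.state == LegState.SWING and force > self.touch_force:
            self.state, self.timer = LegState.CONTACT, 0.0
        elif self.state == LegState.CONTACT:
            if force < self.lift_force:           # bounced off the ground
                self.state, self.timer = LegState.ABNORMAL, 0.0
            elif self.timer > self.settle_time:   # impact transient has settled
                self.state, self.timer = LegState.SUPPORT, 0.0
        elif self.state == LegState.SUPPORT and force < self.lift_force:
            self.state, self.timer = LegState.SWING, 0.0
        elif self.state == LegState.ABNORMAL and force > self.touch_force:
            self.state, self.timer = LegState.CONTACT, 0.0
        return self.state
```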

Our ground contact detection method begins with data collection from the IMU, joint encoders, current sensors, and (for training labels) foot force sensors. We also compute foot height and vertical velocity via forward kinematics: $$p_{\text{foot}}^{\text{IMU}} = F(m, \alpha)$$ $$\dot{p}_{\text{foot}}^{\text{IMU}} = J(\alpha) \dot{\alpha}$$ where $F$ is the forward kinematics function, $m$ is the robot model, and $\alpha$ denotes the joint positions. To reduce dimensionality and improve efficiency, we apply mutual information for feature selection. Mutual information measures the dependency between a feature $X$ and the target variable $Y$ (foot contact force): $$M(X,Y) = H(X) + H(Y) - H(X,Y)$$ where $H(X)$ and $H(Y)$ are the marginal entropies and $H(X,Y)$ is the joint entropy. We selected 16 features with mutual information above 0.4, including foot height, z-axis velocity, and joint angles, velocities, and currents for the hip, knee, and ankle joints. This selection improves interpretability and reduces computational load, which is vital for real-time operation.

| Feature Category | Selected Features | Mutual Information Value |
| --- | --- | --- |
| Foot kinematics | Foot height, z-axis velocity | > 0.5 |
| Joint states | Hip angle, knee angle, ankle angle | > 0.4 |
| Joint dynamics | Hip velocity, knee velocity, knee current | > 0.45 |
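As a sketch of this selection step, the snippet below uses scikit-learn's `mutual_info_regression`, a nearest-neighbor estimator that approximates the entropy-based definition above. The feature names and synthetic data are illustrative placeholders for the logged sensor channels.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

def select_features(X, y, names, threshold=0.4):
    """Keep features whose estimated MI with the contact force exceeds threshold.

    X: (n_samples, n_features) sensor features; y: (n_samples,) measured force.
    """
    mi = mutual_info_regression(X, y)  # kNN-based MI estimate per feature
    keep = mi > threshold
    return [n for n, k in zip(names, keep) if k], mi

# Example with synthetic data standing in for logged sensor channels.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 4))
y = 3.0 * X[:, 0] + np.abs(X[:, 1]) + 0.1 * rng.normal(size=5000)
selected, scores = select_features(X, y, ["foot_height", "z_vel", "hip_q", "hip_dq"])
print(selected, np.round(scores, 2))
```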

Next, we implemented a graph encoding technique to preprocess temporal features and mitigate sensor noise. We sample historical data with a sliding window of fixed length and step size, then cluster the windowed sequences using dynamic time warping (DTW). The probability that a window $X_t$ belongs to cluster mode $M_i$ is computed via a softmax over DTW distances: $$P(X_t, M_i) = \frac{e^{-D(X_t, M_i)}}{\sum_j e^{-D(X_t, M_j)}}$$ where $D(X_t, M_i)$ is the DTW distance between the window and the mode. The edge information between consecutive windows is defined, for each mode $M_i$, as $$E_{X_{t-1} \to X_t} = P(X_{t-1}, M_i) P(X_t, M_i)$$ The result is a graph representation that captures state transitions and is fed into the neural network. This encoding improves noise resilience, a key advantage when dealing with unreliable sensor data.
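The following is a minimal NumPy sketch of this encoding, assuming the cluster modes are representative windows already obtained from the DTW clustering step (the clustering and sliding-window extraction are omitted here).

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D sequences, O(len(a)*len(b))."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def mode_probabilities(window, modes):
    """Softmax over negated DTW distances to each cluster mode, as in the text."""
    d = np.array([dtw_distance(window, m) for m in modes])
    e = np.exp(-(d - d.min()))  # shift for numerical stability; cancels in softmax
    return e / e.sum()

def edge_weights(p_prev, p_curr):
    """Edge information between consecutive windows, one weight per mode."""
    return p_prev * p_curr
```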

For the regression and classification tasks, we designed a 1DCNN with Inception blocks to handle temporal data efficiently. The network takes the 16-dimensional input features and processes them through parallel convolutional layers with varying kernel sizes, followed by batch normalization, activation, dropout, and max-pooling layers; fully connected layers then predict the foot contact force. The loss function is the mean squared error (MSE): $$L = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2$$ where $y_i$ is the measured force and $\hat{y}_i$ is the prediction. We trained the network with the Adam optimizer at a learning rate of 0.001 for 1,000 epochs, converging to a loss of 6.05. Deployed on an NVIDIA Orin AGX, the model runs inference in under 1.7 ms, meeting real-time requirements.
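A minimal PyTorch sketch consistent with this description is shown below. The exact kernel sizes, channel widths, window length, and dropout rate are our assumptions; the 16-channel input, MSE loss, and Adam optimizer with learning rate 0.001 come from the text.

```python
import torch
import torch.nn as nn

class Inception1D(nn.Module):
    """Parallel 1-D convolutions with different kernel sizes, concatenated."""
    def __init__(self, in_ch, branch_ch):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv1d(in_ch, branch_ch, kernel_size=k, padding=k // 2)
            for k in (1, 3, 5)
        ])
        self.bn = nn.BatchNorm1d(3 * branch_ch)
        self.act = nn.ReLU()

    def forward(self, x):
        x = torch.cat([b(x) for b in self.branches], dim=1)
        return self.act(self.bn(x))

class ContactForceNet(nn.Module):
    """16 input channels (selected features) over a T-step window -> force."""
    def __init__(self, window=32):
        super().__init__()
        self.features = nn.Sequential(
            Inception1D(16, 16), nn.Dropout(0.2), nn.MaxPool1d(2),
            Inception1D(48, 32), nn.Dropout(0.2), nn.MaxPool1d(2),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(96 * (window // 4), 64), nn.ReLU(),
            nn.Linear(64, 1),  # predicted contact force
        )

    def forward(self, x):  # x: (batch, 16, window)
        return self.head(self.features(x))

model = ContactForceNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
```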

We conducted experiments starting with in-place stepping tests on the X02 bipedal robot, comparing our approach against current-based detection and a classifier network. The current-based method estimates the contact force from inverse dynamics and motor currents: $$F = (J(\alpha)^T)^{-1} (\tau - K_t I)$$ where $\tau$ is the torque computed by inverse dynamics, $K_t$ is the motor torque constant, and $I$ is the measured current. This method is noise-prone and requires careful threshold tuning. Our regression network accurately predicted contact forces and detected events such as slips, as confirmed by the force sensor data: during one slip event, the contact force dropped suddenly and then recovered, and our method captured the transient reliably. In contrast, the classifier network was unstable near transition points because it lacks force magnitude information, demonstrating the advantage of our regression approach for precise contact detection.
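A sketch of this baseline computation, assuming a square and invertible foot Jacobian (e.g., three actuated joints mapping to a 3-D contact force); the function signature is our own.

```python
import numpy as np

def current_based_force(J, tau_dyn, K_t, currents):
    """Baseline contact-force estimate F = (J^T)^{-1} (tau - K_t * I).

    J: (3, 3) foot Jacobian at the current joint angles (assumed invertible)
    tau_dyn: (3,) torques predicted by inverse dynamics
    K_t: (3,) motor torque constants; currents: (3,) measured motor currents
    """
    residual = tau_dyn - K_t * currents
    return np.linalg.solve(J.T, residual)  # solves J^T F = residual for F
```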

To assess generalization, we tested on the X02Lite robot in complex terrain with obstacles such as planks and weights. The regression network maintained accurate force predictions despite terrain variations, while the current-based method exhibited erratic fluctuations and misjudgments; for example, in unstable conditions the estimated right-foot force showed sharp peaks, leading to incorrect state transitions. Our method produced stable outputs with a misjudgment rate of 0% in these tests, versus 12.1% for the classifier network. The following table summarizes the key performance metrics from our experiments.

| Method | Inference Time (ms) | Misjudgment Rate (%) | Loss |
| --- | --- | --- | --- |
| Our regression network | 1.24 | 0 | 6.05 |
| Classifier network | 1.23 | 12.1 | 0.64 |
| Current-based detection | N/A | High (qualitative) | N/A |

We also evaluated inference speed and latency against LSTM and Transformer baselines. Our 1DCNN achieved an average inference time of 1.24 ms, with 99% of inferences under 2 ms, fitting comfortably within the real-time control cycle. Its latency distribution was approximately normal with a mean of 0.017 ms and a variance of 0.0019, indicating minimal delay. In contrast, the LSTM's higher latency (mean 11.87 ms) makes it unsuitable for real-time control. This efficiency is crucial for bipedal robots, where timely responses are essential for stability and performance.
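For reference, the sketch below measures per-inference latency in the same spirit as these experiments (warm-up runs, then wall-clock timing on single samples). The run counts and the reuse of `ContactForceNet` from the earlier sketch are our assumptions.

```python
import time
import numpy as np
import torch

@torch.no_grad()
def benchmark(model, n_runs=1000, window=32):
    """Time single-sample forward passes and report mean and 99th percentile."""
    model.eval()
    x = torch.randn(1, 16, window)
    for _ in range(50):  # warm-up so lazy initialization is excluded
        model(x)
    times = []
    for _ in range(n_runs):
        t0 = time.perf_counter()
        model(x)
        times.append((time.perf_counter() - t0) * 1e3)  # ms
    times = np.array(times)
    print(f"mean {times.mean():.2f} ms, p99 {np.percentile(times, 99):.2f} ms")

benchmark(ContactForceNet())
```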

In summary, our supervised learning-based ground contact detection method offers a robust and efficient solution for bipedal robots, eliminating the need for dedicated foot force sensors at deployment time. Through feature selection, graph encoding, and 1DCNN-based regression, we achieve accurate force prediction and event classification with low latency. Experiments confirm its advantages over traditional methods on both flat and complex terrain, with strong generalization across robot platforms. Future work will extend the approach to more diverse environments and integrate it with adaptive control strategies to further improve the autonomy and reliability of bipedal locomotion systems.
