Mapless Navigation for Wheeled Robots Using IL-TD3

In the field of autonomous systems, robot technology has made significant strides, yet challenges persist in enabling robots to navigate efficiently in dynamic and unknown environments without relying on pre-existing maps. Traditional navigation methods often depend on high-precision maps, which can be cumbersome to create and maintain, especially in unstructured or rapidly changing settings. This limitation becomes particularly evident in complex scenarios where obstacles move unpredictably, rendering static maps ineffective. To address these issues, we propose a novel approach based on Imitation Learning Enhanced Twin Delayed Deep Deterministic Policy Gradient (IL-TD3) for mapless navigation in wheeled robots. Our method leverages the strengths of both imitation learning and deep reinforcement learning to enable robots to learn navigation policies directly from sensory inputs, without the need for explicit mapping. By modeling the navigation task as a Partially Observable Markov Decision Process (POMDP) and incorporating Long Short-Term Memory (LSTM) networks to handle historical information, we enhance the robot’s ability to perceive and react to dynamic environments. This integration not only improves the robustness of navigation but also allows the robot to adapt to new situations through continuous learning. In this paper, we detail the design and implementation of our IL-TD3 framework, including the abstraction of perceptual data, the structure of the neural networks, and the reward function that guides learning. We present extensive simulation results demonstrating the method’s effectiveness in various environments, followed by real-world tests that validate its practical applicability. The advancements in robot technology showcased here highlight the potential for more autonomous and intelligent systems in real-world applications.

Robot technology has evolved to encompass a wide range of applications, from industrial automation to service robotics, where navigation is a core competency. However, many existing navigation systems rely on predefined maps, which can be a bottleneck in unknown or dynamic settings. Our work focuses on overcoming this by developing a mapless navigation strategy that allows robots to learn from experience. We frame the problem as a POMDP, which accounts for the partial observability of the environment—a common issue in real-world scenarios where sensors provide limited information. The state space in our POMDP model includes laser scan data and the relative position to the goal, while the action space consists of linear and angular velocities. The transition probabilities and reward function are designed to encourage safe and efficient navigation. To handle the sequential nature of the task, we use LSTM networks to process historical observations, enabling the robot to make informed decisions based on past experiences. This approach is crucial for dealing with dynamic obstacles and changing conditions, as it allows the robot to maintain context over time. The following sections elaborate on the key components of our method, including perceptual abstraction, the IL-TD3 system design, and the experimental validation.
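To make this formulation concrete, the sketch below shows one way the observation and action could be represented as data structures. The 38-dimensional layout follows the abstraction described in the next section, while the class and field names are illustrative assumptions rather than the exact implementation.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Observation:
    """One POMDP observation: abstracted laser scan plus goal-relative pose."""
    laser: np.ndarray   # 36 minimum distances, one per angular sector
    d_goal: float       # Euclidean distance to the goal (m)
    theta_goal: float   # bearing to the goal relative to the robot heading (rad)

    def to_vector(self) -> np.ndarray:
        """Flatten into the 38-dimensional state vector fed to the networks."""
        return np.concatenate([self.laser, [self.d_goal, self.theta_goal]])

@dataclass
class Action:
    """Continuous action: linear and angular velocity commands."""
    v: float  # linear velocity (m/s)
    w: float  # angular velocity (rad/s)
```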

Perceptual abstraction is a critical step in reducing the computational load and improving the generalization of our navigation model. In robot technology, sensors like laser rangefinders often generate high-dimensional data that can be overwhelming for real-time processing. We abstract the laser data from 1,800 points to 36 directional values, capturing the essential information about obstacles in the environment. This abstraction is performed using the following equation: $$ P[i] = \begin{cases} \min(P_i) & \text{if } \min(P_i) < P_{\text{max}} \\ P_{\text{max}} & \text{otherwise} \end{cases} $$ where \( P[i] \) represents the minimum distance in the i-th direction, \( P_i \) is the set of raw laser points in that direction, and \( P_{\text{max}} \) is the maximum range of the sensor. This simplification retains the critical obstacle information while reducing input dimensionality. Additionally, we abstract the robot’s goal-related information into distance and angle components: $$ d_{\text{goal}} = \sqrt{(x_{\text{goal}} - x_{\text{robot}})^2 + (y_{\text{goal}} - y_{\text{robot}})^2} $$ $$ \theta_{\text{goal}} = \text{atan2}(y_{\text{goal}} - y_{\text{robot}}, x_{\text{goal}} - x_{\text{robot}}) - \psi_{\text{robot}} $$ where \( d_{\text{goal}} \) is the Euclidean distance to the goal, \( \theta_{\text{goal}} \) is the angle to the goal relative to the robot’s heading, and \( \psi_{\text{robot}} \) is the robot’s yaw angle. This abstraction enables the robot to focus on relevant features, enhancing learning efficiency in complex robot technology applications.
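A minimal NumPy sketch of this abstraction is given below. It assumes the 1,800-point scan divides evenly into 36 contiguous sectors and adds a standard angle-wrapping step that is not spelled out in the text; the function names are ours.

```python
import numpy as np

def abstract_scan(scan: np.ndarray, p_max: float, n_sectors: int = 36) -> np.ndarray:
    """Reduce a 1,800-point laser scan to n_sectors minimum distances, capped at p_max."""
    sectors = scan.reshape(n_sectors, -1)          # 36 sectors x 50 raw points each
    return np.minimum(sectors.min(axis=1), p_max)  # P[i] = min(P_i), capped at P_max

def goal_features(robot_xy, robot_yaw, goal_xy):
    """Compute (d_goal, theta_goal) from the robot pose and goal position."""
    dx, dy = goal_xy[0] - robot_xy[0], goal_xy[1] - robot_xy[1]
    d_goal = np.hypot(dx, dy)
    theta_goal = np.arctan2(dy, dx) - robot_yaw
    # wrap the bearing into (-pi, pi] so left and right deviations are symmetric
    theta_goal = (theta_goal + np.pi) % (2 * np.pi) - np.pi
    return d_goal, theta_goal
```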

The core of our approach lies in the IL-TD3 system, which combines imitation learning (IL) and deep reinforcement learning (DRL) to accelerate policy learning. In the initial phase, we use IL to bootstrap the robot’s navigation capabilities by leveraging expert demonstrations. We collect a dataset of state-action pairs from human-controlled navigation in simulation, where the robot moves from random start points to goals. The IL model learns to map abstracted sensor inputs to actions using a neural network with multiple fully connected layers. The loss function for IL is the mean squared error (MSE) between the predicted actions and expert actions: $$ L(\theta) = \frac{1}{T} \sum_{t=1}^{T} \left(\pi_{\theta}(s_t) - A_t\right)^2 $$ where \( \theta \) represents the network parameters, \( \pi_{\theta}(s_t) \) is the action predicted by the policy network for state \( s_t \), and \( A_t \) is the corresponding expert action. We optimize this loss using gradient descent: $$ \theta_{\text{new}} = \theta_{\text{old}} - \alpha \nabla_{\theta} L(\theta) $$ with \( \alpha \) as the learning rate. This IL phase provides a strong initial policy, reducing the exploration time required in DRL.
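The behavioral-cloning step described above might be implemented roughly as follows in PyTorch; the hidden-layer width, the Tanh output scaling, and the variable names are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ILPolicy(nn.Module):
    """Fully connected policy: abstracted state (38-dim) -> action (v, w)."""
    def __init__(self, state_dim: int = 38, action_dim: int = 2, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),  # actions scaled to [-1, 1]
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

def imitation_update(policy, optimizer, states, expert_actions):
    """One gradient step on the MSE loss between predicted and expert actions."""
    loss = F.mse_loss(policy(states), expert_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# usage: optimizer = torch.optim.Adam(policy.parameters(), lr=5e-4)
```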

Following IL, we employ the TD3 algorithm, a state-of-the-art DRL method that addresses overestimation bias in value functions. TD3 uses an actor-critic architecture with twin Q-networks and delayed policy updates. The actor network (policy) takes the abstracted state as input and outputs continuous actions, while the critic networks (value functions) evaluate the state-action pairs. We enhance TD3 by integrating LSTM layers to process sequences of states, allowing the robot to capture temporal dependencies. The target Q-value is computed as: $$ Q_{\text{target}} = r + \gamma (1 - f_{\text{done}}) \min\left(Q_1'(s', a'), Q_2'(s', a')\right) $$ where \( r \) is the immediate reward, \( \gamma \) is the discount factor, \( f_{\text{done}} \) is a termination flag, and \( Q_1' \) and \( Q_2' \) are the two target critic networks. The critic loss is a smooth L1 loss: $$ L_{\text{critic}} = \text{smooth\_l1\_loss}(Q_1, Q_{\text{target}}) + \text{smooth\_l1\_loss}(Q_2, Q_{\text{target}}) $$ and the actor loss is: $$ L_{\text{actor}} = -\frac{1}{N} \sum_{i=1}^{N} Q_1(s_i, \pi(s_i)) $$ where \( N \) is the batch size. This combination ensures stable and efficient learning in dynamic environments, a key advancement in robot technology.
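A condensed sketch of one TD3 update following these equations is shown below. It omits the LSTM sequence handling for brevity, assumes actor and critic modules with the usual `actor(s)` and `critic(s, a)` interfaces, and includes the standard target-policy smoothing noise even though it is not stated explicitly above; the hyperparameter defaults mirror Table 1.

```python
import torch
import torch.nn.functional as F

def td3_update(actor, critic1, critic2, actor_t, critic1_t, critic2_t,
               actor_opt, critic_opt, batch, step,
               gamma=0.99, tau=3e-3, policy_delay=2, noise_std=0.2, noise_clip=0.5):
    """One TD3 step with twin critics, delayed actor updates, and soft target updates.

    `critic_opt` is assumed to optimize the parameters of both critics jointly.
    """
    s, a, r, s_next, done = batch

    with torch.no_grad():
        # target policy smoothing: add clipped noise to the target action
        noise = (torch.randn_like(a) * noise_std).clamp(-noise_clip, noise_clip)
        a_next = (actor_t(s_next) + noise).clamp(-1.0, 1.0)
        # Q_target = r + gamma * (1 - f_done) * min(Q1', Q2')
        q_next = torch.min(critic1_t(s_next, a_next), critic2_t(s_next, a_next))
        q_target = r + gamma * (1.0 - done) * q_next

    # critic loss: smooth L1 between each critic and the shared target
    critic_loss = (F.smooth_l1_loss(critic1(s, a), q_target)
                   + F.smooth_l1_loss(critic2(s, a), q_target))
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    if step % policy_delay == 0:
        # actor loss: maximize Q1 of the actor's own actions
        actor_loss = -critic1(s, actor(s)).mean()
        actor_opt.zero_grad()
        actor_loss.backward()
        actor_opt.step()

        # soft (Polyak) update of all target networks
        for net, net_t in ((actor, actor_t), (critic1, critic1_t), (critic2, critic2_t)):
            for p, p_t in zip(net.parameters(), net_t.parameters()):
                p_t.data.mul_(1.0 - tau).add_(tau * p.data)
```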

To guide the learning process, we design a comprehensive reward function that balances multiple objectives: reaching the goal quickly, avoiding obstacles, and maintaining smooth motion. The reward function is defined as: $$ r = \begin{cases} 2500 & \text{if SUCCESS} \\ -2000 & \text{if COLLISION} \\ -0.7 |\theta_{\text{goal}}| - 0.1 d_{\text{goal}} + r_{\text{obs}} - 0.1 (|w| + |v|) & \text{otherwise} \end{cases} $$ where SUCCESS and COLLISION are terminal events with sparse rewards, \( \theta_{\text{goal}} \) and \( d_{\text{goal}} \) are the goal angle and distance, \( v \) and \( w \) are the linear and angular velocities, and \( r_{\text{obs}} \) is an obstacle avoidance term calculated as: $$ r_{\text{obs}} = \sum_{i=1}^{36} w_i (1 - P[i]) $$ with \( w_i \) as direction-specific weights. This reward structure encourages the robot to navigate efficiently while prioritizing safety, which is essential for real-world robot technology deployments.
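Translated directly into code, this reward might look as follows; the sparse-reward constants come from the equation above, while the sector weights \( w_i \) are left as a parameter since their values are not given.

```python
import numpy as np

def compute_reward(success: bool, collision: bool,
                   theta_goal: float, d_goal: float,
                   p: np.ndarray, weights: np.ndarray,
                   v: float, w: float) -> float:
    """Per-step reward balancing goal progress, obstacle clearance, and motion smoothness."""
    if success:
        return 2500.0
    if collision:
        return -2000.0
    r_obs = float(np.sum(weights * (1.0 - p)))  # direction-weighted obstacle term
    return -0.7 * abs(theta_goal) - 0.1 * d_goal + r_obs - 0.1 * (abs(w) + abs(v))
```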

In our system design, the IL and TD3 models are integrated through a Q-value-based selection mechanism. The robot’s state is fed into both models, producing candidate actions. A critic network evaluates these actions, and the one with the higher Q-value is executed. This allows the robot to leverage the quick learning of IL while benefiting from the exploratory nature of TD3. Experiences are stored in a replay buffer and used to train the TD3 networks iteratively. This hybrid approach enhances the robot’s adaptability and performance in unknown environments, pushing the boundaries of autonomous robot technology.
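The selection mechanism can be sketched as shown below: both policies propose an action for the current state, and the critic executes whichever it scores higher. Function names and tensor shapes are assumptions.

```python
import torch

def select_action(state, il_policy, td3_actor, critic):
    """Propose an action from each policy and execute the one the critic values more.

    `state` is a single abstracted state as a (1, state_dim) tensor.
    """
    with torch.no_grad():
        a_il = il_policy(state)
        a_td3 = td3_actor(state)
        q_il = critic(state, a_il)
        q_td3 = critic(state, a_td3)
    return a_il if q_il.item() >= q_td3.item() else a_td3
```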

We conducted extensive experiments in simulation to evaluate our IL-TD3 method. The simulation environment was built using Webots 2022b, featuring a wheeled robot equipped with a 2D lidar and odometry sensors. We designed three training environments with varying complexity, including static and dynamic obstacles, to test the robot’s navigation skills. The robot was trained for 1,000 episodes, with environments switched every 100 episodes to promote generalization. Key hyperparameters are summarized in Table 1.

Table 1: Hyperparameters for the IL-TD3 Model

| Parameter | Value |
| --- | --- |
| Discount factor γ | 0.99 |
| Learning rate α | 5.0e-4 |
| Replay buffer size | 50,000 |
| Batch size | 80 |
| Exploration noise | 2.0e-1 |
| Target network update rate | 3.0e-3 |
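For reference, these settings could be gathered into a single configuration object along the following lines; the field names are our own.

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    gamma: float = 0.99        # discount factor
    lr: float = 5.0e-4         # learning rate
    buffer_size: int = 50_000  # replay buffer capacity
    batch_size: int = 80
    noise_std: float = 0.2     # exploration noise
    tau: float = 3.0e-3        # target network update rate
```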

During training, we monitored the cumulative reward per episode to assess learning progress. As shown in Figure 1, IL-TD3 achieved higher and more stable rewards compared to baseline methods like DDPG, TD3, and TD3-LSTM. The inclusion of IL accelerated initial learning, while TD3’s robustness ensured consistent performance across environment changes. This demonstrates the effectiveness of our approach in advancing robot technology for dynamic navigation.

For testing, we evaluated the models in an unseen dynamic environment with moving obstacles. Each model was tested 50 times from fixed start points to goals, and we measured success rate, average time, and average reward. The results, summarized in Table 2, indicate that IL-TD3 outperformed other methods in success rate and efficiency, highlighting its superiority in handling unknown scenarios.

Table 2: Navigation Performance in Test Environment

| Model | Success Rate (%) | Average Time (s) | Average Reward |
| --- | --- | --- | --- |
| DDPG | 64 | 27.05 | 1105.19 |
| TD3 | 80 | 23.24 | 1306.55 |
| TD3-LSTM | 90 | 25.37 | 1558.74 |
| IL | 72 | 38.72 | 1152.65 |
| IL-TD3 | 96 | 21.63 | 1627.06 |

To validate real-world applicability, we performed Sim2Real tests by deploying the trained IL-TD3 model on a physical wheeled robot without any fine-tuning. The robot navigated a lab environment containing static and dynamic obstacles. Using odometry data, we plotted the robot’s trajectory from start to goal and observed that it successfully avoided obstacles and adapted to dynamic changes. This confirms the robustness and generalization of our method, a significant step forward in practical robot technology.

In conclusion, our IL-TD3 framework provides an effective solution for mapless navigation in dynamic environments, combining the rapid learning of imitation with the adaptive capabilities of deep reinforcement learning. The use of POMDP modeling and LSTM networks enhances the robot’s perception and decision-making, while the reward function promotes safe and efficient behavior. Experimental results in simulation and real-world tests demonstrate the method’s superiority over existing approaches. Future work will focus on refining the reward mechanism and incorporating global mapping elements to further optimize navigation paths. This research contributes to the ongoing evolution of robot technology, enabling more autonomous and intelligent systems for diverse applications.

Robot technology continues to benefit from innovations in machine learning, and our IL-TD3 method is a testament to this progress. By enabling robots to learn navigation policies directly from sensory inputs, we reduce reliance on predefined maps and enhance adaptability. The integration of historical data processing through LSTMs allows for better context awareness, which is crucial in real-world environments where conditions change rapidly. As robot technology advances, methods like IL-TD3 will play a pivotal role in deploying autonomous systems in complex scenarios, from search and rescue to personal assistance. We believe that our work lays a foundation for future research in lifelong learning and cross-domain adaptation, further pushing the boundaries of what robots can achieve.
