In recent years, the rapid advancement of robot technology has led to the widespread deployment of mobile robots equipped with various sensors in diverse fields such as industrial automation, military operations, disaster rescue, space exploration, and domestic services. These applications rely heavily on a robot’s ability to navigate autonomously and interact intelligently with its environment. A critical component enabling this capability is simultaneous localization and mapping (SLAM), which involves accurately determining the robot’s position and orientation while constructing a map of the surrounding environment. However, relying on a single sensor often proves insufficient: illumination changes, external disturbances, reflective surfaces, and cumulative errors can all degrade perception and pose estimation accuracy. To address these issues, we propose a nonlinear optimization-based, tightly coupled multi-sensor fusion system, termed IIVL-LM, which integrates an inertial measurement unit (IMU), an infrared camera, an RGB camera, and a LiDAR at the data level. The system enhances robustness and precision in unknown and dynamic environments, particularly in scenarios with varying illumination, such as indoor rescue operations.
The core of our method is real-time illuminance conversion from RGB image information: the system dynamically adjusts the visual SLAM model through nonlinear interpolation according to the measured illuminance. We introduce a dynamic weighting mechanism to fuse feature extraction from keyframes of both the infrared and RGB cameras, ensuring effective coupling even in low-light conditions. By leveraging the depth-sensing capability of the infrared camera, we optimize the visual-inertial odometry (VIO) module within the R3LIVE++ framework, improving performance under harsh lighting. Experiments in a simulated indoor rescue scenario show that IIVL-LM outperforms several mainstream fusion localization methods under illumination variations, reducing the average root mean square error (RMSE) of the absolute trajectory error (ATE) by 23% to 39% (0.006 to 0.013), with the largest gains in low-light environments. The system also ensures that at least three sensors remain operational at all times, balancing accuracy and robustness for open-world applications.
Robot technology has evolved to adopt multi-sensor fusion as a standard approach for overcoming the limitations of individual sensors. In SLAM, integrating data from an IMU, cameras, and LiDAR helps mitigate the uncertainties caused by noise and environmental factors. For instance, visual SLAM methods such as MonoSLAM and PTAM laid the groundwork for feature-based tracking and bundle adjustment, while systems such as ORB-SLAM3 and R3LIVE++ advanced the field by incorporating inertial and depth data. However, these methods often struggle with rapid illumination changes or complete darkness, where visual features become unreliable. Our work builds on these foundations by introducing an illuminance-aware fusion strategy that adapts to lighting conditions in real time, a crucial capability for robots operating in unpredictable environments such as disaster zones, where lighting can vary dramatically.
In the IIVL-LM system, we employ a nonlinear optimization framework to tightly couple the sensor data. The LiDAR fusion component transforms each point into the global frame using the current state estimate (denoted by a check), as described by: $${}^{G}\mathbf{P}_s = {}^{G}\check{\mathbf{R}}_{I_k} \left( {}^{I}\mathbf{R}_{L}\, {}^{L}\mathbf{P}_s + {}^{I}\mathbf{P}_{L} \right) + {}^{G}\check{\mathbf{P}}_{I_k}$$ where \({}^{G}\mathbf{P}_s\) is the position of point \(s\) in the global frame, \({}^{G}\check{\mathbf{R}}_{I_k}\) and \({}^{G}\check{\mathbf{P}}_{I_k}\) are the current estimates of the rotation and position of the IMU (body) frame at time \(k\) with respect to the global frame, \({}^{I}\mathbf{R}_{L}\) is the LiDAR-to-IMU extrinsic rotation, \({}^{L}\mathbf{P}_s\) is the point position in the LiDAR frame, and \({}^{I}\mathbf{P}_{L}\) is the origin of the LiDAR frame expressed in the IMU frame (the extrinsic translation). To achieve precise point cloud registration, we fit a plane (normal and centroid) to the point’s nearest neighbors and compute the LiDAR measurement residual: $$r_l(\check{\mathbf{x}}_k, {}^{L}\mathbf{P}_s) = \mathbf{u}_s^{T} \left( {}^{G}\mathbf{P}_s - \mathbf{Q}_s \right)$$ where \(\mathbf{u}_s\) is the plane’s unit normal and \(\mathbf{Q}_s\) its centroid. For visual fusion, feature correspondences are tracked with Lucas-Kanade optical flow and the reprojection error is minimized: $$r_c(\check{\mathbf{x}}_k, \rho_{s_k}, {}^{G}\mathbf{P}_s) = \rho_{s_k} - \pi({}^{G}\mathbf{P}_s, \check{\mathbf{x}}_k)$$ with the projection function defined as: $$\pi({}^{G}\mathbf{P}_s, \check{\mathbf{x}}_k) = \pi_{\text{ph}}({}^{G}\mathbf{P}_s, \check{\mathbf{x}}_k) + \frac{{}^{I}\check{\mathbf{t}}_{C_k}}{\Delta t_{k-1,k}} \left( \rho_{s_k} - \rho_{s_{k-1}} \right)$$ Here, \(\pi_{\text{ph}}\) maps a 3D point onto the 2D image plane, \(\rho_{s_k}\) is the tracked pixel location of point \(s\) in frame \(k\), and \({}^{I}\check{\mathbf{t}}_{C_k}\) is the estimated IMU-to-camera translation term, whose contribution is scaled by the feature’s pixel motion over the frame interval \(\Delta t_{k-1,k}\).
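To make the registration step concrete, the following minimal NumPy sketch transforms a LiDAR point into the global frame and evaluates the point-to-plane residual. The function names and the example extrinsics (`R_GI`, `t_GI`, `R_IL`, `t_IL`) are illustrative assumptions, not identifiers from the IIVL-LM implementation.

```python
# Minimal sketch of the LiDAR point transform and point-to-plane residual described above.
import numpy as np

def lidar_point_to_global(p_L, R_GI, t_GI, R_IL, t_IL):
    """Transform a LiDAR point from the LiDAR frame into the global frame
    via the LiDAR-to-IMU extrinsics and the current IMU pose estimate."""
    p_I = R_IL @ p_L + t_IL        # LiDAR frame -> IMU (body) frame
    return R_GI @ p_I + t_GI       # IMU frame -> global frame

def point_to_plane_residual(p_G, plane_normal, plane_centroid):
    """Signed distance from the globally registered point to the plane fitted
    from its nearest neighbours (the LiDAR measurement residual r_l)."""
    u = plane_normal / np.linalg.norm(plane_normal)
    return float(u @ (p_G - plane_centroid))

# Example with arbitrary values:
R_GI = R_IL = np.eye(3)                              # identity rotations for illustration
t_GI, t_IL = np.array([1.0, 0.0, 0.0]), np.zeros(3)
p_G = lidar_point_to_global(np.array([2.0, 0.0, 0.5]), R_GI, t_GI, R_IL, t_IL)
r_l = point_to_plane_residual(p_G, np.array([0.0, 0.0, 1.0]), np.array([0.0, 0.0, 0.4]))
print(p_G, r_l)   # point sits 0.1 m above the fitted plane
```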
The illuminance conversion model is pivotal for adapting to lighting changes. We compute the illuminance \(E\) from RGB images using: $$E = \frac{k_r R + k_g G + k_b B}{Y_{\text{max}}}$$ where \(R\), \(G\), and \(B\) are the RGB channel values, \(k_r = 0.299\), \(k_g = 0.587\), and \(k_b = 0.114\) are the channel coefficients, and \(Y_{\text{max}} = 255\) for 8-bit images. To ensure real-time performance without abrupt changes, we sample the illuminance at intervals (e.g., every 60 frames) and average each five-sample window after removing its extremes: $$E_i = \frac{\left( \sum_{j=i-2}^{i+2} E_j \right) - \max_{j} E_j - \min_{j} E_j}{3}$$ where the maximum and minimum are taken over the same window. The result is normalized using pre-determined maximum and minimum illuminance values for the environment: $$\hat{E}_i = \frac{E_i - E_{\text{min}}}{E_{\text{max}} - E_{\text{min}}}$$ In our experiments, \(E_{\text{max}} = 50287\) lux and \(E_{\text{min}} = 21\) lux were used, based on sample data from indoor rescue scenarios.
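As a rough illustration, the illuminance pipeline can be sketched as follows. It assumes the per-frame value is computed from the mean RGB channel values and that the mapping between image-derived values and the lux extremes has been calibrated separately; neither assumption is specified by the text above, and the function names are illustrative.

```python
import numpy as np

K_R, K_G, K_B, Y_MAX = 0.299, 0.587, 0.114, 255.0   # coefficients from the text

def frame_illuminance(rgb):
    """Illuminance proxy for an 8-bit RGB frame (H x W x 3 array): weighted sum
    of the mean channel values, scaled by the maximum pixel value."""
    r, g, b = rgb[..., 0].mean(), rgb[..., 1].mean(), rgb[..., 2].mean()
    return (K_R * r + K_G * g + K_B * b) / Y_MAX

def trimmed_mean(window):
    """Average of a five-sample illuminance window after discarding its max and min,
    i.e. the E_i smoothing applied at each sampling interval (e.g. every 60 frames)."""
    w = sorted(window)
    return sum(w[1:-1]) / 3.0

def normalize(E_i, E_min, E_max):
    """Min-max normalization against the scene's pre-measured illuminance extremes;
    E_i and the extremes must be expressed in the same (calibrated) units."""
    return (E_i - E_min) / (E_max - E_min)

# Example: normalized value for a smoothed reading of 8,000 lux in the test scene.
E_hat = normalize(trimmed_mean([7900, 8100, 8000, 8300, 7600]), E_min=21.0, E_max=50287.0)
```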
Dynamic weighting for feature fusion between the infrared and RGB cameras is determined from \(\hat{E}_i\) by the piecewise (nonlinear) mapping: $$(\alpha, \beta) = \begin{cases} (0,\ 1), & \hat{E}_i \geq 0.85 \\ (1 - \hat{E}_i,\ \hat{E}_i), & 0.15 < \hat{E}_i < 0.85 \\ (1,\ 0), & \hat{E}_i \leq 0.15 \end{cases}$$ Here, \(\alpha\) and \(\beta\) are the weights assigned to infrared and RGB features, respectively. In well-lit conditions (\(\hat{E}_i \geq 0.85\)) the system relies solely on RGB data, in darkness (\(\hat{E}_i \leq 0.15\)) it uses only infrared, and in the intermediate band the weights interpolate linearly between the two. The fused feature \(F_f\) combines the average brightness deviation (AvgDev) and texture contrast (Contrast) of both image types: $$\text{AvgDev} = \frac{1}{W \times H} \sum_{i=1}^{W} \sum_{j=1}^{H} \left( I(i,j) - 128 \right)$$ $$\text{Contrast} = \sum_{i,j=0}^{G-1} (i - j)^2 P(i,j)$$ where \(W\) and \(H\) are the image width and height, \(I(i,j)\) is the pixel intensity, \(G\) is the number of gray levels, and \(P(i,j)\) is the normalized gray-level co-occurrence matrix. The fused feature is: $$F_f = \alpha \left( \text{AvgDev}_{\text{IR}} + \text{Contrast}_{\text{IR}} \right) + \beta \left( \text{AvgDev}_{\text{VL}} + \text{Contrast}_{\text{VL}} \right)$$ This allows the system to maintain reliable feature extraction across varying illuminance, a key advancement in robot technology for environmental adaptation.
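A compact sketch of the weighting and fusion rule is given below. The co-occurrence matrices are assumed to be precomputed and normalized (e.g., with `skimage.feature.graycomatrix`), and the helper names are illustrative rather than taken from the IIVL-LM code.

```python
import numpy as np

def fusion_weights(E_hat):
    """Infrared/RGB feature weights (alpha, beta) from the normalized illuminance."""
    if E_hat >= 0.85:
        return 0.0, 1.0             # bright scene: rely on RGB only
    if E_hat <= 0.15:
        return 1.0, 0.0             # dark scene: rely on infrared only
    return 1.0 - E_hat, E_hat       # intermediate band: linear interpolation

def avg_dev(img):
    """Average deviation of 8-bit pixel intensities from mid-gray (128)."""
    return float(np.mean(img.astype(np.float64) - 128.0))

def glcm_contrast(P):
    """Texture contrast of a normalized G x G gray-level co-occurrence matrix."""
    G = P.shape[0]
    i, j = np.meshgrid(np.arange(G), np.arange(G), indexing="ij")
    return float(np.sum((i - j) ** 2 * P))

def fused_feature(E_hat, ir_img, ir_glcm, vl_img, vl_glcm):
    """Weighted combination of AvgDev and Contrast from infrared and RGB keyframes."""
    alpha, beta = fusion_weights(E_hat)
    return (alpha * (avg_dev(ir_img) + glcm_contrast(ir_glcm))
            + beta * (avg_dev(vl_img) + glcm_contrast(vl_glcm)))
```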
For the VIO module, we optimize the R3LIVE++ framework by incorporating depth information from the infrared camera. The process involves state augmentation with new IMU data, feature tracking in new image frames, and keyframe selection. When a new keyframe is identified, features are extracted and non-uniformly distributed, followed by depth filtering and 3D coordinate calculation using the Gauss-Newton method. The measurement equation is constructed as \(\mathbf{r} = \mathbf{H} \mathbf{x} + \mathbf{n}\), and the filter iteratively updates the state vector and covariance matrix. This tight coupling of visual and inertial data leverages the IMU’s responsiveness to rapid motion and the visual system’s resistance to long-term drift, with velocity constraints on the y and z axes modeled as zero-mean noise to account for potential slips or jumps during mobile robot movement.
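The measurement update can be illustrated with a generic linearized Kalman step for a residual of the form \(\mathbf{r} = \mathbf{H}\mathbf{x} + \mathbf{n}\). This is a simplified sketch that treats \(\mathbf{x}\) as a correction term, not the exact filter used in R3LIVE++ or IIVL-LM.

```python
import numpy as np

def iterated_update_step(P, r, H, R):
    """One linearized update for a residual modeled as r = H * dx + n.
    Returns the state correction dx and the updated covariance; an iterated
    filter repeats this, relinearizing r and H about the corrected estimate
    until dx becomes negligible."""
    S = H @ P @ H.T + R                        # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)             # Kalman gain
    dx = K @ r                                 # correction to apply to the state
    P_new = (np.eye(P.shape[0]) - K @ H) @ P   # posterior covariance
    return dx, P_new
```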
Our experimental platform consists of a multifunctional mobile robot equipped with an AGVroute chassis, a Duco collaborative robotic arm, and the IIVL-LM system. The core sensors include a Livox LiDAR, an RGB camera with embedded IMU, and an infrared camera, all integrated for 3D map reconstruction and pose estimation. The software environment runs on Ubuntu 20.04 with ROS for module coordination, and the hardware includes an NVIDIA GeForce GTX 1650 Ti GPU with 8 GB VRAM and an Intel Core i5-10300H CPU at 2.50 GHz, demonstrating the system’s feasibility on general-purpose hardware. Key sensor parameters are summarized in the table below:
| Number | Sensor | Parameter | Parameter Value |
|---|---|---|---|
| 1 | LiDAR | Wavelength | 905 nm |
| 1 | LiDAR | FOV | Horizontal 360°, Vertical -7° to 52° |
| 1 | LiDAR | Point Cloud Output and Frame Rate | 200,000 points/sec, 10 Hz |
| 2 | RGB Camera | Resolution | 752 × 480 |
| 2 | RGB Camera | Maximum Frame Rate | 60 FPS |
| 3 | IMU | Frequency | 100–500 Hz |
| 4 | Infrared Camera | IR Effective Distance | 3 m |
| 4 | Infrared Camera | Frame Rate | 60 FPS |
Experiments were conducted in a simulated indoor rescue environment covering an area of 484.3 m², with natural light ingress and controlled indoor lighting creating illuminance variations from 10 to 50,000 lux. The mobile robot traveled a cumulative distance of 4,792 m during the tests, with data collected at different times of day to cover a wide range of lighting conditions. Performance was evaluated using the root mean square error (RMSE) of the absolute trajectory error (ATE), defined for the \(i\)-th frame as \(F_i = Q_i^{-1} S P_i\), where \(Q_i\) is the ground-truth pose, \(P_i\) is the estimated pose, and \(S\) is the rigid-body transformation aligning the estimated trajectory to the ground truth. The RMSE over \(n\) frames is: $$\text{RMSE}(F_{1:n}) = \left( \frac{1}{n} \sum_{i=1}^{n} \| \text{trans}(F_i) \|^2 \right)^{1/2}$$ where \(\text{trans}(F_i)\) denotes the translational component of \(F_i\).
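For reference, the ATE RMSE can be computed from matched pose sequences as sketched below; the function and argument names are illustrative, and `S` defaults to the identity when the trajectories are already aligned.

```python
import numpy as np

def ate_rmse(gt_poses, est_poses, S=np.eye(4)):
    """RMSE of the absolute trajectory error for matched 4x4 homogeneous poses:
    F_i = Q_i^{-1} S P_i, with the error taken on the translational part."""
    sq_norms = []
    for Q, P_est in zip(gt_poses, est_poses):
        F = np.linalg.inv(Q) @ S @ P_est        # per-frame error matrix
        sq_norms.append(np.sum(F[:3, 3] ** 2))  # squared norm of trans(F_i)
    return float(np.sqrt(np.mean(sq_norms)))
```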

Comparative results against ORB-SLAM, VINS-Mono, R3LIVE, DSO, and SVO show that IIVL-LM achieves superior accuracy under low-light conditions. For example, in periods with average illuminance at or below 8,000 lux, the RMSE ATE for IIVL-LM ranged from 0.016 to 0.020, matching or outperforming every other method in those periods. In contrast, methods that do not fuse LiDAR with camera and IMU data exhibited higher errors, up to 0.044 over short distances, due to drift and cumulative errors. Under stable, well-lit conditions, all methods performed similarly, highlighting that the advantage of our approach lies in challenging lighting scenarios. The table below summarizes the RMSE ATE results across the different illuminance ranges:
| Time Period | Average Illuminance (lux) | Distance (m) | ORB-SLAM | VINS-Mono | R3LIVE | DSO | SVO | IIVL-LM |
|---|---|---|---|---|---|---|---|---|
| 00:00–04:00 | 20 | 910 | 0.044 | 0.042 | 0.037 | 0.028 | 0.033 | 0.018 |
| 04:00–08:00 | 1,200 | 800 | 0.036 | 0.040 | 0.016 | 0.021 | 0.018 | 0.016 |
| 08:00–12:00 | 17,000 | 657 | 0.019 | 0.015 | 0.011 | 0.020 | 0.018 | 0.012 |
| 12:00–16:00 | 30,000 | 745 | 0.026 | 0.021 | 0.023 | 0.026 | 0.032 | 0.021 |
| 16:00–20:00 | 20,000 | 800 | 0.031 | 0.027 | 0.030 | 0.041 | 0.039 | 0.032 |
| 20:00–24:00 | 8,000 | 880 | 0.042 | 0.044 | 0.038 | 0.036 | 0.037 | 0.020 |
| Average RMSE ATE | – | – | 0.033 | 0.032 | 0.026 | 0.029 | 0.030 | 0.020 |
| Comparison to Overall Average (0.028) | – | – | -18% | -14% | +7% | -4% | -7% | +40% |
Analysis of the results indicates that IIVL-LM’s illuminance-aware fusion effectively mitigates the impact of lighting variations, a common challenge in robot technology. The dynamic weighting and nonlinear interpolation ensure that features from infrared and RGB cameras are optimally combined, reducing errors in low-light conditions where other methods fail. Additionally, the tight coupling of LiDAR, IMU, and visual data minimizes cumulative drift, as evidenced by the lower ATE values. This is particularly valuable for applications like indoor rescue, where robots must operate in dark or smoke-filled environments. However, the system’s computational demand due to high-resolution data fusion poses a challenge for real-time performance, an area for future optimization.
In conclusion, the IIVL-LM system represents a significant advancement in robot technology by enabling robust navigation and localization under varying illumination through tightly coupled multi-sensor fusion. Our contributions include an illuminance conversion model, dynamic feature weighting, and an optimized VIO module, all validated in simulated rescue scenarios. Future work will focus on enhancing real-time processing efficiency, extending the system to 3D environments with aerial robots, and incorporating advanced machine learning techniques for adaptive behavior control. By improving environmental understanding and autonomy, this approach paves the way for more reliable robot technology in complex, unknown settings.
