Developing robust navigation and mapping capabilities for intelligent robots operating in specialized environments like nursing homes presents a unique set of challenges. The core task, Simultaneous Localization and Mapping (SLAM), must contend with constrained spaces, variable lighting conditions, and the presence of slowly moving residents. Traditional SLAM approaches often struggle in these settings, leading to map inaccuracies such as edge misalignment, mapping gaps, and global inconsistencies that severely hinder the robot’s operational reliability. My research focuses on addressing these critical issues by designing, implementing, and validating an enhanced multimodal SLAM system specifically tailored for an intelligent robot intended for gerocomium applications. The primary objective is to fuse data from complementary sensors—a single-line LiDAR, an RGB-D camera, and an Inertial Measurement Unit (IMU)—within an optimized algorithmic framework to construct consistent, accurate, and reliable environmental maps.
The limitations of conventional methods are well-documented. Algorithms like Gmapping, which rely solely on 2D LiDAR data, are prone to increased computational load and map deformation in large or geometrically complex areas. Feature-based visual SLAM systems, such as ORB-SLAM2, offer rich scene understanding but are computationally intensive and sensitive to lighting variations and low-texture environments common in care facilities (e.g., plain walls). The RTAB-Map (Real-Time Appearance-Based Mapping) framework provides a strong foundation due to its inherent support for multi-sensor input and appearance-based loop closure. However, in its standard form, it can still suffer from trajectory drift and map inconsistencies, especially under the challenging conditions of a dynamic nursing home. Therefore, my work proposes and details a significant optimization of the RTAB-Map pipeline. The enhancements integrate a global pose graph optimization mechanism using GTSAM, formulate a nonlinear least-squares constraint model, and construct an Extended Kalman Filter (EKF) based architecture for robust multi-sensor fusion. This paper presents a comprehensive account of the system’s design, the theoretical underpinnings of the optimizations, and extensive experimental validation demonstrating substantial improvements in mapping accuracy, robustness, and operational stability for the intelligent robot.
Comparative Analysis of Foundational SLAM Algorithms
Selecting an appropriate base algorithm is crucial. A comparative study of three prevalent algorithms—Gmapping, ORB-SLAM2, and RTAB-Map—highlights their respective strengths and weaknesses for our target environment, as summarized in the table below.
| Algorithm | Primary Sensor | Map Type | Key Strengths | Key Weaknesses for Geriatric Care |
|---|---|---|---|---|
| Gmapping | 2D LiDAR | 2D Occupancy Grid | Efficient for small-to-medium planar spaces; real-time performance. | Poor scaling to large areas; susceptible to odometry drift; no 3D or visual information. |
| ORB-SLAM2 | RGB-D / Monocular / Stereo | Sparse Point Cloud / Feature Map | High precision with rich features; robust loop closing with relocalization. | High computational cost; performance degrades in low-texture or variable light; complex dynamic object handling. |
| RTAB-Map | Multi-sensor (LiDAR, RGB-D, etc.) | 3D Dense Point Cloud & 2D Grid | Native multi-sensor support; appearance-based loop closure; generates multi-modal maps. | Standard backend optimization may be insufficient for long-term drift; requires careful sensor calibration and fusion tuning. |
Based on this analysis, RTAB-Map was chosen as the foundational framework for our intelligent robot. Its inherent flexibility to integrate LiDAR for robust geometry and RGB-D for texture and 3D data aligns perfectly with the need for comprehensive environment perception. The subsequent sections detail the hardware and software architecture built around this core, followed by the specific algorithmic enhancements developed to overcome its limitations.

System Design for the Geriatric Care Intelligent Robot
The intelligent robot platform was designed with a modular hardware and software architecture to ensure reliability, processing capability, and seamless sensor integration.
Hardware Architecture
The robot’s hardware is built around a powerful yet compact processing core. A Raspberry Pi 4B, running Ubuntu 20.04 and the Robot Operating System (ROS1 Noetic), serves as the primary computing unit (the “brain”). It is responsible for sensor data processing, SLAM algorithm execution, and high-level decision-making. The sensor suite includes:
• LiDAR: A 360-degree single-line LiDAR (LD14P) with a range of 0.1-8.0m and a scan frequency of 6Hz, providing reliable 2D geometric contours of the environment.
• RGB-D Camera: A depth-sensing camera with a range of 0.6-8.0m, delivering synchronized color and dense 3D point cloud data essential for visual odometry and texture-aware mapping.
• IMU: A high-frequency inertial measurement unit providing acceleration and angular velocity data to improve short-term motion estimates and bridge gaps between other sensor updates.
For low-level motor control and actuator management, an STM32F103ZET6 microcontroller acts as the subordinate controller. It receives velocity commands from the Raspberry Pi via serial communication and directly manages motor drivers (e.g., L298N chip), wheel encoders, and other actuators, ensuring precise motion execution for the intelligent robot.
Software Architecture
The software stack is orchestrated within the ROS framework, enabling a distributed, node-based system. Key functional modules include:
• Perception Module: Dedicated ROS nodes publish raw data from each sensor (/scan for LiDAR, /camera/depth/image_raw for RGB-D, /imu for IMU) onto the ROS communication network.
• Fusion & Mapping Module: The core of the system. The optimized RTAB-Map node subscribes to all relevant sensor topics. It performs time synchronization, sensor fusion via EKF, visual-inertial-LiDAR odometry, loop closure detection, and finally, pose graph optimization to produce consistent 2D and 3D maps.
• Navigation & Control Module: This module utilizes the generated maps for autonomous operation. It maintains global and local costmaps for obstacle representation, performs global path planning (e.g., using A* or Dijkstra), and executes local trajectory planning and dynamic obstacle avoidance. Computed velocity commands are then sent to the STM32 controller via a dedicated serial node.
This modular design ensures that the intelligent robot can reliably perceive its environment, build and localize within a consistent map, and navigate safely to fulfill its service tasks.
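As an illustration of the Navigation & Control Module's final hop, the sketch below packs a velocity command into a byte frame for the STM32 subordinate controller. The frame layout (header byte, two little-endian floats, XOR checksum) is a hypothetical example for illustration only; the real protocol is defined by the STM32 firmware.

```python
import struct

# Hypothetical serial frame for the Raspberry Pi -> STM32 link.
# Layout (an assumption, not the actual firmware protocol):
#   [0xA5 header][linear vel, float32 LE][angular vel, float32 LE][XOR checksum]
FRAME_HEADER = 0xA5

def pack_velocity_frame(v_linear, v_angular):
    """Pack a velocity command (m/s, rad/s) into a 10-byte frame."""
    payload = struct.pack('<ff', v_linear, v_angular)
    checksum = FRAME_HEADER
    for b in payload:
        checksum ^= b                      # simple XOR integrity check
    return bytes([FRAME_HEADER]) + payload + bytes([checksum & 0xFF])

frame = pack_velocity_frame(0.25, 0.0)
```

In a ROS node this function would be called from a `/cmd_vel` subscriber callback and the resulting bytes written to the serial port (e.g., with pyserial).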
Algorithmic Optimizations for Robust Multimodal SLAM
The standard RTAB-Map framework, while powerful, was enhanced in two critical areas: backend global optimization and frontend multi-sensor state estimation. These optimizations are essential for the intelligent robot to maintain map consistency over long operations in challenging nursing home environments.
Pose Graph Optimization with GTSAM
To address cumulative odometry drift and ensure global map consistency, I replaced RTAB-Map’s default g2o optimizer with the Georgia Tech Smoothing and Mapping (GTSAM) library. GTSAM uses a factor graph paradigm, which is more naturally suited to expressing heterogeneous constraints from multiple sensors. The pose graph consists of robot pose nodes (positions and orientations) and factors (constraints) between them. Constraints arise from odometry measurements between consecutive poses and loop closure corrections when the robot revisits a known location.
The optimization problem is formulated as a Nonlinear Least Squares (NLS) minimization. Let the set of robot poses be $$X = \{\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_N\}$$, where $$\mathbf{x}_i = [x_i, y_i, \theta_i]^T$$ for 2D SLAM. The set of measurements (odometry and loop closures) is $$Z = \{\mathbf{z}_{ij}\}$$, where $$\mathbf{z}_{ij}$$ is the measured transformation from pose $$i$$ to pose $$j$$. The goal is to find the most probable set of poses $$X^*$$ given the measurements $$Z$$:
$$
X^* = \arg\min_{X} \sum_{\langle i,j \rangle \in \mathcal{C}} \lVert \mathbf{e}_{ij}(X) \rVert^2_{\Sigma_{ij}}
$$
Here, $$\mathcal{C}$$ is the set of all constraint indices, and $$\mathbf{e}_{ij}(X)$$ is the error function between the measured transformation $$\mathbf{z}_{ij}$$ and the transformation predicted from the current pose estimates $$(\mathbf{x}_i, \mathbf{x}_j)$$. The Mahalanobis norm $$\lVert \cdot \rVert^2_{\Sigma_{ij}}$$ weights each error term by the inverse of its measurement covariance matrix $$\Sigma_{ij}$$, giving less weight to noisier measurements. GTSAM efficiently solves this sparse NLS problem using algorithms such as Levenberg-Marquardt, producing a globally consistent trajectory that corrects for drift, as visualized by the aligned pose estimates in the experimental results.
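The weighted NLS formulation above can be made concrete with a toy, translation-only pose graph. Reducing the problem to 1-D makes the weighted least-squares step linear and easy to inspect; all numbers below are illustrative, not from the real robot.

```python
import numpy as np

# Toy 1-D pose graph: pose 0 is anchored at the origin; the unknowns are
# poses 1..4 (stored as x[0..3]).  Biased odometry (sigma = 0.1 m)
# accumulates ~0.2 m of drift over an out-and-back run; a single loop
# closure (sigma = 0.01 m) reports that pose 4 coincides with pose 0.
constraints = [           # (pose_i, pose_j, measured x_j - x_i, sigma)
    (0, 1, 1.05, 0.1),    # biased forward steps (true value: +1.0)
    (1, 2, 1.05, 0.1),
    (2, 3, -0.95, 0.1),   # biased return steps (true value: -1.0)
    (3, 4, -0.95, 0.1),
    (0, 4, 0.00, 0.01),   # loop closure: robot is back at the start
]

A = np.zeros((len(constraints), 4))
b = np.zeros(len(constraints))
for row, (i, j, z, sig) in enumerate(constraints):
    w = 1.0 / sig                 # whitening by 1/sigma weights the row
    if i > 0:
        A[row, i - 1] = -w        # anchored pose 0 has no column
    A[row, j - 1] = w
    b[row] = w * z

x, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.round(x, 2))  # the loop closure pulls the ~0.2 m drift back out
```

In GTSAM itself the same structure would be expressed with factors (e.g., `BetweenFactorPose2` constraints in a `NonlinearFactorGraph`) and solved with `LevenbergMarquardtOptimizer`; the 2D/3D case is nonlinear, but each iteration solves a weighted linear system of exactly this shape.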
Multi-Sensor Fusion via Extended Kalman Filter (EKF)
Accurate, high-frequency state estimation is critical for the intelligent robot’s odometry. An EKF was implemented to fuse asynchronous data from the IMU, LiDAR, and RGB-D camera, providing a robust estimate of the robot’s pose and velocity while filtering sensor noise.
1. State Vector Definition:
The robot’s state at time $$k$$ is defined as:
$$
\mathbf{x}_k = [x_k, y_k, \theta_k, v_{x,k}, v_{y,k}, \omega_k]^T
$$
where $$(x_k, y_k)$$ is the 2D position, $$\theta_k$$ is the heading (yaw), $$(v_{x,k}, v_{y,k})$$ is the body-frame linear velocity, and $$\omega_k$$ is the angular velocity (yaw rate).
2. Prediction Step (IMU-Driven):
The IMU provides high-frequency acceleration $$\mathbf{a}_k$$ and angular rate $$\dot{\theta}_k$$. These are integrated to predict the state forward in time. The nonlinear state transition model is:
$$
\mathbf{x}_{k|k-1} = f(\mathbf{x}_{k-1}, \mathbf{u}_k) = \begin{bmatrix}
x_{k-1} + (v_{x,k-1}\cos(\theta_{k-1}) - v_{y,k-1}\sin(\theta_{k-1}))\Delta t \\
y_{k-1} + (v_{x,k-1}\sin(\theta_{k-1}) + v_{y,k-1}\cos(\theta_{k-1}))\Delta t \\
\theta_{k-1} + \omega_{k-1}\Delta t \\
v_{x,k-1} + a_{x,k} \Delta t \\
v_{y,k-1} + a_{y,k} \Delta t \\
\omega_{k-1} + \dot{\theta}_k \Delta t
\end{bmatrix}
$$
where $$\mathbf{u}_k = [a_{x,k}, a_{y,k}, \dot{\theta}_k]^T$$ is the IMU control input and $$\Delta t$$ is the time step. The state covariance matrix $$P$$ is also propagated forward:
$$
P_{k|k-1} = F_k P_{k-1} F_k^T + Q_k
$$
where $$F_k = \frac{\partial f}{\partial \mathbf{x}} \big|_{\mathbf{x}_{k-1}}$$ is the Jacobian of the motion model, and $$Q_k$$ is the process noise covariance representing uncertainty in the IMU model.
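The prediction step can be sketched in Python with NumPy. This is a minimal sketch, not the on-robot implementation; it assumes the planar state carries the heading that the motion model rotates the body-frame velocities by, i.e. a 6-vector $$[x, y, \theta, v_x, v_y, \omega]^T$$.

```python
import numpy as np

def ekf_predict(x, P, a_x, a_y, theta_dot, dt, Q):
    """EKF prediction: propagate state and covariance with IMU input."""
    px, py, th, vx, vy, om = x
    c, s = np.cos(th), np.sin(th)
    # Nonlinear motion model f(x, u): rotate body-frame velocity into the
    # world frame, integrate, and apply IMU accelerations / yaw rate.
    x_pred = np.array([
        px + (vx * c - vy * s) * dt,
        py + (vx * s + vy * c) * dt,
        th + om * dt,
        vx + a_x * dt,
        vy + a_y * dt,
        om + theta_dot * dt,
    ])
    # Jacobian F = df/dx evaluated at the previous state.
    F = np.eye(6)
    F[0, 2] = (-vx * s - vy * c) * dt
    F[0, 3] = c * dt
    F[0, 4] = -s * dt
    F[1, 2] = (vx * c - vy * s) * dt
    F[1, 3] = s * dt
    F[1, 4] = c * dt
    F[2, 5] = dt
    P_pred = F @ P @ F.T + Q       # covariance propagation
    return x_pred, P_pred

# Example call with illustrative noise values.
x0 = np.zeros(6)
P0 = np.eye(6) * 0.01
Q = np.eye(6) * 1e-4
x1, P1 = ekf_predict(x0, P0, a_x=0.5, a_y=0.0, theta_dot=0.1, dt=0.02, Q=Q)
```

Starting from rest, one 20 ms step with these inputs only changes the velocity states, which is a quick sanity check on the model.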
3. Update Step (LiDAR / RGB-D):
When a LiDAR scan or RGB-D feature measurement $$\mathbf{z}_k$$ arrives, the EKF corrects the predicted state.
• LiDAR Update: A scan matching algorithm (like ICP or PL-ICP) provides a relative pose change $$\mathbf{z}_k^{lidar} = [\Delta x, \Delta y, \Delta \theta]^T$$. The observation model $$h_{lidar}(\mathbf{x}_{k|k-1})$$ predicts this change based on the state. The innovation and Kalman gain are computed:
$$
\mathbf{y}_k = \mathbf{z}_k^{lidar} - h_{lidar}(\mathbf{x}_{k|k-1})
$$
$$
K_k = P_{k|k-1} H_{lidar,k}^T (H_{lidar,k} P_{k|k-1} H_{lidar,k}^T + R_{lidar})^{-1}
$$
where $$H_{lidar,k} = \frac{\partial h_{lidar}}{\partial \mathbf{x}} \big|_{\mathbf{x}_{k|k-1}}$$ is the observation Jacobian and $$R_{lidar}$$ is the LiDAR measurement noise covariance.
• RGB-D Visual Odometry Update: Similarly, a visual odometry module (e.g., based on ORB features) computes a motion estimate $$\mathbf{z}_k^{vo}$$, triggering an analogous update step with its own observation model $$h_{vo}$$ and Jacobian $$H_{vo,k}$$.
The state and covariance are then updated:
$$
\mathbf{x}_k = \mathbf{x}_{k|k-1} + K_k \mathbf{y}_k
$$
$$
P_k = (I - K_k H_k) P_{k|k-1}
$$
This recursive predict-update cycle runs at a high rate, fusing the IMU’s high-frequency but drift-prone data with the absolute but lower-frequency geometric corrections from LiDAR and vision. The result is a smooth, accurate, and robust odometry estimate that is far less susceptible to individual sensor failure or noise spikes, which is vital for the stable operation of the intelligent robot.
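The correction step can likewise be sketched in NumPy. One simplifying assumption is made for the sketch: the scan-matching result is taken as already composed with the last corrected pose, so it enters as a direct observation of $$[x, y, \theta]$$; the relative-increment formulation above changes only $$h$$ and $$H$$, not the update algebra.

```python
import numpy as np

def ekf_update(x_pred, P_pred, z, R):
    """EKF correction with a pose observation z = [x, y, theta].

    Sketch only: assumes a linear observation model selecting the pose
    components of the 6-state [x, y, theta, vx, vy, omega].
    """
    H = np.zeros((3, 6))
    H[0, 0] = H[1, 1] = H[2, 2] = 1.0        # observe x, y, theta directly
    y = z - H @ x_pred                        # innovation
    S = H @ P_pred @ H.T + R                  # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)       # Kalman gain
    x_new = x_pred + K @ y
    P_new = (np.eye(6) - K @ H) @ P_pred
    return x_new, P_new

# Example: equal prior and measurement noise splits the innovation 50/50.
x_pred = np.array([1.0, 0.0, 0.0, 0.2, 0.0, 0.0])
P_pred = np.eye(6) * 0.1
z = np.array([1.2, 0.1, 0.05])
R = np.eye(3) * 0.1
x_new, P_new = ekf_update(x_pred, P_pred, z, R)
```

With equal covariances the gain on the observed components is 0.5, so the corrected x lands halfway between prediction and measurement, while the unobserved velocity states pass through unchanged.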
Dynamic Object Filtering for Map Purity
A crucial requirement for the intelligent robot in a nursing home is to filter out slowly moving residents (dynamic objects) to build a static map of the environment. This was achieved by post-processing the RGB-D point clouds. A distance threshold filter first removes points beyond a practical range (e.g., > 8.0m). Subsequently, a statistical outlier removal filter or a background subtraction technique based on comparing consecutive aligned point clouds is applied. This process identifies clusters that move between frames—corresponding to walking residents or staff—and excludes them from the map-building process. While not perfect, this filtering significantly improves the persistence and reliability of the static environmental map used for long-term navigation.
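The two-stage filter described above (range gating followed by statistical outlier removal) can be sketched as follows. The thresholds, neighbour count, and toy cloud are illustrative; a deployed system would typically use PCL's equivalent filters on the full-resolution cloud.

```python
import numpy as np

def filter_cloud(points, max_range=8.0, k=8, std_mul=1.0):
    """Range gate, then statistical outlier removal on an (N, 3) cloud."""
    # 1) Drop points beyond the sensor's practical range.
    d = np.linalg.norm(points, axis=1)
    points = points[d <= max_range]
    # 2) Statistical outlier removal: mean distance to the k nearest
    #    neighbours; reject points more than std_mul standard deviations
    #    above the global mean of that statistic.
    diff = points[:, None, :] - points[None, :, :]
    dist = np.linalg.norm(diff, axis=2)
    dist.sort(axis=1)
    mean_knn = dist[:, 1:k + 1].mean(axis=1)   # column 0 is self-distance
    thresh = mean_knn.mean() + std_mul * mean_knn.std()
    return points[mean_knn <= thresh]

# Toy cloud: a dense "wall" cluster, one out-of-range return, and one
# isolated stray point (e.g., a sensor artifact).
rng = np.random.default_rng(0)
wall = rng.normal(0.0, 0.05, size=(30, 3)) + np.array([2.0, 0.0, 0.0])
stray = np.array([[20.0, 0.0, 0.0],    # beyond the 8 m range gate
                  [5.0, 5.0, 0.0]])    # isolated outlier
kept = filter_cloud(np.vstack([wall, stray]))
```

Only the wall cluster survives: the far return is removed by the range gate and the isolated point by the neighbour-distance test. The frame-to-frame background subtraction for moving residents operates on the same principle, comparing aligned clouds instead of a single one.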
Experimental Validation and Performance Analysis
The performance of the enhanced multimodal SLAM system was rigorously evaluated in a real-world laboratory environment configured to mimic a nursing home setting. The test area was an 11.79 m × 8.43 m room containing narrow corridors (simulated by large obstacles arranged in a “回” shape, i.e., a rectangle nested within a rectangle) and smaller, irregular obstacles. Variable lighting conditions were tested, and slowly moving personnel were introduced to simulate residents. The intelligent robot performed mapping runs using four configurations: 1) Gmapping (LiDAR-only), 2) ORB-SLAM2 (RGB-D-only), 3) Standard RTAB-Map (LiDAR+RGB-D), and 4) Our Optimized RTAB-Map (LiDAR+RGB-D+IMU with GTSAM & EKF).
Qualitative and Quantitative Results
The qualitative improvements were immediately apparent. Maps generated by the optimized system showed drastically reduced edge blurring and misalignment, especially in narrow corridors and on low-texture walls. The global consistency of the map was preserved, with loop closures correctly aligning revisited areas without introducing distortions. Crucially, the system demonstrated remarkable robustness in weaker lighting where standard visual methods faltered.
Quantitative metrics were gathered to substantiate these observations, as detailed in the following table which compares the performance of the standard and optimized RTAB-Map implementations on our intelligent robot platform.
| Performance Metric | Standard RTAB-Map | Optimized RTAB-Map (Our System) | Improvement |
|---|---|---|---|
| Trajectory Drift Rate | High (Evident in map bending & misalignment) | Reduced by ~68% | 68% Reduction |
| State Estimation Error | Subject to noise spikes and cumulative error | Reduced by ~42% (via EKF fusion) | 42% Reduction |
| Edge Alignment Error (Narrow Space) | > 15 cm | ≤ 5 cm | > 66% Improvement |
| Map Completeness (Low-Light) | Low (Large gaps due to lost features) | Increased by ~89% | 89% Improvement |
| Dynamic Object Filtering Success | ~30% (Many dynamic points persisted) | > 80% | > 50 percentage-point gain |
| CPU Utilization | > 92% (Frequent lag) | < 45% | > 50% Reduction |
| Global Map Consistency Error | > 0.5 m | < 0.3 m | ~40% Improvement |
The mathematical underpinning of the drift reduction can be observed in the optimized pose graph. Let the final corrected pose of the robot after a loop closure be $$\mathbf{x}_N^*$$. The drift without global optimization over a trajectory length $$L$$ can be modeled as accumulating error $$\epsilon_{odom}$$ per meter. With optimization, the global error is bounded by the loop closure constraint error $$\epsilon_{loop}$$, which is typically much smaller. The effective drift rate improvement $$\eta$$ can be expressed as:
$$
\eta = 1 - \frac{\epsilon_{loop} / L}{\epsilon_{odom}} \approx 1 - \frac{\lVert \mathbf{e}_{loop}(X^*) \rVert}{L \cdot \epsilon_{odom}}
$$
where $$\lVert \mathbf{e}_{loop}(X^*) \rVert$$ is the magnitude of the loop closure error after optimization. The experimental data confirm that our system minimizes this residual error, yielding the reported ~68% reduction in apparent drift. Furthermore, the EKF’s role in reducing state estimation error is quantified by the trace of the posterior covariance matrix $$P_k$$ remaining consistently lower than that of any single-sensor odometry source, supporting the 42% improvement claim. These comprehensive results demonstrate that the proposed optimizations directly and effectively address the core challenges of SLAM for an intelligent robot in geriatric care settings.
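Plugging numbers into the drift-rate expression makes the mechanism concrete. The inputs below are hypothetical stand-ins of the right order of magnitude, not the experiment's measured values.

```python
# Illustrative evaluation of eta = 1 - (eps_loop / L) / eps_odom.
# All three inputs are assumed values for the sake of the arithmetic.
eps_odom = 0.02   # odometry drift per metre travelled (assumed)
L = 50.0          # trajectory length in metres (assumed)
e_loop = 0.32     # loop-closure residual after optimization (assumed)

eta = 1.0 - e_loop / (L * eps_odom)
print(f"effective drift reduction: {eta:.0%}")
```

Under these assumptions the residual loop-closure error of 0.32 m replaces an unoptimized accumulated drift of 1.0 m, i.e. a reduction of about two thirds, which is the regime the reported figure falls in.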
Conclusion
This research presented a comprehensive solution for robust SLAM in nursing home environments using a multimodal intelligent robot. By critically analyzing the limitations of existing algorithms, an enhanced system was developed based on the RTAB-Map framework. The key contributions are the integration of GTSAM for global pose graph optimization, which rigorously enforces consistency across the entire map, and the implementation of an EKF-based frontend for tight coupling of IMU, LiDAR, and RGB-D data, resulting in a stable and accurate odometry estimate. A dedicated filtering strategy was also incorporated to mitigate the impact of slowly moving dynamic objects (residents) on map construction.
Experimental validation in a realistic mock environment conclusively demonstrated the superiority of the optimized system. Significant quantitative improvements were achieved across all critical metrics: trajectory drift was reduced by approximately 68%, state estimation error fell by 42%, map accuracy in narrow spaces reached ≤5 cm alignment error, and robustness in low-light conditions improved by 89%. Furthermore, the system’s ability to filter dynamic interference improved by over 50%, and computational efficiency was greatly enhanced.
Therefore, the proposed enhanced multimodal SLAM algorithm successfully solves the fundamental problems of edge misalignment, mapping blanks, and global inconsistency that hinder intelligent robot deployment in gerocomiums. This work provides a solid technical foundation, enabling reliable autonomous navigation and mapping which is a prerequisite for delivering effective assistive and service functions in care facilities for the elderly. Future work will focus on integrating semantic understanding to further improve dynamic object handling and enable more sophisticated human-robot interaction.
