In the field of robotics, simultaneous localization and mapping (SLAM) has emerged as a critical capability that lets autonomous systems navigate and understand their environment in real time. Visual odometry (VO), a key component of SLAM, enables robots to estimate their position and orientation by analyzing sequential images captured by cameras. Compared with traditional sensing such as GPS, inertial navigation, or LiDAR, cameras are inexpensive, adaptable, and capture rich information about the surrounding environment. However, challenges such as viewpoint changes, illumination variations, and computational efficiency persist, necessitating advanced solutions. In this article, I present a monocular visual odometry framework that integrates the LightGlue matching method to address these issues, enhancing accuracy and robustness in robotics applications.

Traditional visual odometry pipelines in robotics often rely on handcrafted features such as ORB or SIFT. While these approaches are computationally efficient, they struggle with environmental changes, leading to increased errors in pose estimation. With the advent of deep learning, feature extraction and matching have seen substantial improvements. For instance, SuperPoint provides self-supervised keypoint detection and descriptor computation, outperforming older methods in low-texture and dynamic scenes. Similarly, matching algorithms such as SuperGlue and LightGlue leverage neural networks to improve correspondence accuracy. LightGlue, in particular, introduces adaptive depth and width mechanisms that dynamically adjust the computational load to the difficulty of each image pair, which is crucial for real-time robotics applications.
The proposed LG-VO algorithm consists of three main stages: feature extraction, feature matching, and pose estimation with optimization. First, SuperPoint is employed to detect keypoints and compute descriptors from the input images. The network uses a shared encoder followed by separate decoders for keypoints and descriptors, producing a dense set of features. Mathematically, the keypoint decoder outputs a probability map, while the descriptor decoder generates normalized vectors. For an input image of dimensions $W \times H$, the process can be summarized as follows: the shared encoder reduces the spatial dimensions to $W/8 \times H/8$ with 128 channels, after which the keypoint decoder applies a Softmax function to identify feature points, and the descriptor decoder uses bicubic interpolation and L2 normalization to produce descriptors. This step ensures a robust feature representation even under the varying conditions encountered in robotics environments.
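To make this concrete, the following NumPy sketch shows the kind of post-processing the two SuperPoint heads require, assuming the raw head outputs (a 65-channel keypoint map and a 256-dimensional coarse descriptor map) are already available. The detection threshold and the nearest-cell descriptor lookup are simplifications for illustration; the actual decoder interpolates descriptors bicubically.

```python
import numpy as np

def softmax(x, axis=0):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def decode_superpoint_heads(kpt_logits, desc_raw, conf_thresh=0.015):
    """Decode keypoints and descriptors from SuperPoint-style head outputs.

    kpt_logits : (65, Hc, Wc) raw keypoint scores (64 pixel cells + 1 dustbin).
    desc_raw   : (256, Hc, Wc) coarse descriptor map.
    conf_thresh is an illustrative detection threshold, not a tuned value.
    """
    _, Hc, Wc = kpt_logits.shape
    # Softmax over the 65 channels, drop the dustbin, and unfold each
    # remaining 64-vector into its 8x8 pixel cell -> full-resolution heatmap.
    prob = softmax(kpt_logits, axis=0)[:-1]                      # (64, Hc, Wc)
    heat = prob.reshape(8, 8, Hc, Wc).transpose(2, 0, 3, 1).reshape(Hc * 8, Wc * 8)

    ys, xs = np.nonzero(heat > conf_thresh)                      # keypoint pixels
    # Descriptor lookup: take the coarse cell each keypoint falls into and
    # L2-normalize; the real decoder uses bicubic interpolation instead.
    desc = desc_raw[:, ys // 8, xs // 8].T                       # (N, 256)
    desc /= np.linalg.norm(desc, axis=1, keepdims=True) + 1e-8
    return np.stack([xs, ys], axis=1), desc

# Example on random stand-in head outputs for a 480 x 640 image.
kpts, descs = decode_superpoint_heads(np.random.randn(65, 60, 80),
                                      np.random.randn(256, 60, 80))
```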
Next, LightGlue is utilized for feature matching between consecutive frames. This deep learning-based matcher employs self-attention and cross-attention mechanisms to establish correspondences. The self-attention mechanism allows each feature point to attend to others in the same image, enhancing contextual understanding. For a feature point $i$ with state $\mathbf{x}_i$, the attention score $a_{ij}$ with another point $j$ is computed as:
$$a_{ij} = \mathbf{q}_i^T \mathbf{H}(\mathbf{p}_j - \mathbf{p}_i) \mathbf{k}_j$$
where $\mathbf{q}_i$ and $\mathbf{k}_j$ are query and key vectors, and $\mathbf{H}(\cdot)$ is a rotational encoding matrix for relative positions. The cross-attention mechanism enables points in one image to attend to those in another, with scores given by:
$$a_{ij}^{IS} = \left(\mathbf{k}_i^{I}\right)^T \mathbf{k}_j^{S}$$
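As an illustration of these two scoring rules, the sketch below evaluates them for a single pair of points. The rotary encoding $\mathbf{H}(\cdot)$ is written out explicitly with random stand-in frequencies, so it mirrors the structure, not the trained weights, of LightGlue's positional encoding.

```python
import numpy as np

def rotary_encode(vec, rel_pos, freqs):
    """Rotate each consecutive pair of dimensions of `vec` by an angle obtained
    by projecting the 2-D relative position `rel_pos` onto `freqs` (d/2 x 2)."""
    angles = freqs @ rel_pos                      # one angle per 2-D subspace
    cos, sin = np.cos(angles), np.sin(angles)
    x, y = vec[0::2], vec[1::2]
    out = np.empty_like(vec)
    out[0::2] = cos * x - sin * y
    out[1::2] = sin * x + cos * y
    return out

d = 64
rng = np.random.default_rng(0)
q_i, k_j = rng.normal(size=d), rng.normal(size=d)          # query of i, key of j (same image)
p_i, p_j = np.array([10.0, 20.0]), np.array([40.0, 25.0])  # keypoint positions
freqs = rng.normal(scale=0.1, size=(d // 2, 2))            # stand-in rotary frequencies

# Self-attention score a_ij = q_i^T H(p_j - p_i) k_j, realised by rotating k_j.
a_self = q_i @ rotary_encode(k_j, p_j - p_i, freqs)

# Cross-attention score between images I and S: a plain dot product of keys.
k_i_I, k_j_S = rng.normal(size=d), rng.normal(size=d)
a_cross = k_i_I @ k_j_S
print(a_self, a_cross)
```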
LightGlue also incorporates a confidence classifier that predicts whether to stop inference early, reducing unnecessary computations. The confidence $c_i$ for a point $i$ is calculated as:
$$c_i = \text{Sigmoid}(\text{MLP}(\mathbf{x}_i)) \in [0,1]$$
If the exit criterion is met, that is, if a sufficient proportion of points have high confidence, inference halts; otherwise, low-confidence points are pruned. This adaptive approach significantly improves efficiency, which is vital for real-time robotics systems. Finally, the matcher outputs a similarity matrix $\mathbf{F}$ and matchability scores, which are combined into a probabilistic assignment matrix $\mathbf{G}$ to determine valid correspondences.
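The following sketch captures the adaptive-stopping and assignment logic described above; the thresholds, the confidence inputs, and the normalization of the assignment matrix are illustrative stand-ins rather than the matcher's trained components.

```python
import numpy as np

def exit_or_prune(conf, exit_ratio=0.95, conf_thresh=0.99, prune_thresh=0.1):
    """Adaptive-depth logic after one attention layer.

    conf: per-point confidences c_i = Sigmoid(MLP(x_i)), values in [0, 1].
    Stops early if enough points are confident; otherwise returns a mask
    that prunes low-confidence points. All thresholds are illustrative.
    """
    if np.mean(conf > conf_thresh) >= exit_ratio:
        return True, None                      # exit criterion met: halt inference
    return False, conf > prune_thresh          # keep only points worth refining

def matches_from_scores(F, sig_a, sig_b, min_score=0.2):
    """Fuse the similarity matrix F (M x N) and matchability scores into a
    soft assignment G, then keep mutual nearest neighbours above `min_score`.

    The normalisation here is a simplified stand-in for the matcher's rule.
    """
    row = np.exp(F - F.max(axis=1, keepdims=True))
    row /= row.sum(axis=1, keepdims=True)      # softmax across image-B points
    col = np.exp(F - F.max(axis=0, keepdims=True))
    col /= col.sum(axis=0, keepdims=True)      # softmax across image-A points
    G = sig_a[:, None] * sig_b[None, :] * row * col

    best_j, best_i = G.argmax(axis=1), G.argmax(axis=0)
    return [(i, j) for i, j in enumerate(best_j)
            if best_i[j] == i and G[i, j] > min_score]
```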
After matching, the pose estimation stage begins. First, the Random Sample Consensus (RANSAC) algorithm is applied to remove outliers and optimize the match set. Then, the essential matrix $\mathbf{E}$ is computed from the 2D-2D correspondences using the epipolar constraint:
$$\mathbf{y}_2^T \mathbf{E} \mathbf{y}_1 = 0$$
where $\mathbf{y}_1$ and $\mathbf{y}_2$ are normalized coordinates of matched points. The essential matrix is decomposed into rotation matrix $\mathbf{R}$ and translation vector $\mathbf{t}$ via singular value decomposition. This provides an initial pose estimate. To refine this, a minimization problem is formulated based on reprojection error. For a pose transformation matrix $\mathbf{T} = [\mathbf{R} | \mathbf{t}]$, the error $\boldsymbol{\xi}_i$ for a point $\mathbf{f}_i$ is:
$$\boldsymbol{\xi}_i = \mathbf{f}_{i+1} - \mathbf{T}_i \mathbf{f}_i$$
The optimal pose is found by minimizing the sum of squared errors:
$$\boldsymbol{\xi}^* = \arg \min_{\boldsymbol{\xi}} \frac{1}{2} \sum_{i=1}^k \| \mathbf{f}_{i+1} - \mathbf{T}_i \mathbf{f}_i \|_2^2$$
This optimization yields accurate and stable pose estimates, which are essential for reliable navigation in robotics.
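A minimal OpenCV sketch of this geometric stage is given below: RANSAC-based essential-matrix estimation followed by decomposition into $\mathbf{R}$ and $\mathbf{t}$. The intrinsic matrix is a placeholder roughly matching KITTI's grayscale camera, the RANSAC parameters are illustrative, and the nonlinear refinement step above is not shown.

```python
import numpy as np
import cv2

def estimate_pose(pts_prev, pts_curr, K):
    """Estimate the relative camera pose from matched 2-D points.

    pts_prev, pts_curr : (N, 2) arrays of matched pixel coordinates
                         (outputs of the matching stage).
    K                  : (3, 3) camera intrinsic matrix.
    Returns (R, t, inlier_mask); for monocular VO, t is known only up to scale.
    """
    pts_prev = np.asarray(pts_prev, dtype=np.float64)
    pts_curr = np.asarray(pts_curr, dtype=np.float64)
    # RANSAC rejects outlier correspondences while fitting the essential matrix.
    E, mask = cv2.findEssentialMat(pts_prev, pts_curr, K,
                                   method=cv2.RANSAC, prob=0.999, threshold=1.0)
    # recoverPose decomposes E (via SVD internally) and selects the physically
    # valid (R, t) by a cheirality check on the inlier correspondences.
    _, R, t, mask = cv2.recoverPose(E, pts_prev, pts_curr, K, mask=mask)
    return R, t, mask

# Illustrative intrinsics roughly matching KITTI's grayscale camera.
K = np.array([[718.856, 0.0, 607.193],
              [0.0, 718.856, 185.216],
              [0.0, 0.0, 1.0]])
```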
To evaluate the LG-VO algorithm, experiments were conducted on the KITTI dataset, specifically sequences 00 to 10, which include various outdoor scenarios with ground truth trajectories. The performance was compared against ORB, SIFT, and SG-VO (SuperGlue-based VO) methods in terms of absolute trajectory error (ATE), relative pose error (RPE), and runtime. All tests were performed on identical hardware with an Intel Xeon CPU and NVIDIA T4 GPU to ensure fairness.
The absolute trajectory error measures the deviation between estimated and ground truth paths. The root mean square error (RMSE) of ATE is computed as:
$$\text{RMSE}(\mathbf{g}) = \sqrt{\frac{1}{N} \sum_{i=1}^N \| \mathbf{g}_i^{\text{est}} - \mathbf{g}_i^{\text{gt}} \|^2}$$
where $\mathbf{g}_i^{\text{est}}$ and $\mathbf{g}_i^{\text{gt}}$ are the estimated and ground truth positions at frame $i$. The results for sequence 08, as an example, show that LG-VO achieves the lowest ATE, indicating superior accuracy. The following table summarizes ATE across all sequences:
| Dataset Sequence | ORB Algorithm (m) | SIFT Algorithm (m) | SG-VO Algorithm (m) | Proposed LG-VO Algorithm (m) |
|---|---|---|---|---|
| 00 | 289.0833 | 28.6731 | 16.1029 | 12.2559 |
| 01 | 233.0952 | 35.8639 | 67.6806 | 30.3371 |
| 02 | 406.8774 | 39.6483 | 23.8610 | 22.9526 |
| 03 | 22.8572 | 3.1899 | 2.7207 | 2.5023 |
| 04 | 23.2418 | 1.3790 | 1.1189 | 1.1380 |
| 05 | 125.3367 | 24.4199 | 18.0627 | 13.4972 |
| 06 | 316.2606 | 4.0165 | 4.1318 | 3.9881 |
| 07 | 77.3247 | 11.9053 | 9.1095 | 6.1607 |
| 08 | 260.4064 | 20.3777 | 14.4751 | 8.1008 |
| 09 | 196.1379 | 26.2063 | 15.8226 | 13.6527 |
| 10 | 212.3166 | 11.7520 | 3.6586 | 2.6368 |
As seen, LG-VO consistently outperforms the other methods on most sequences, demonstrating its robustness under the varying conditions typical of robotics applications. The only exception is sequence 04, where dynamic objects and the short path length lead to slightly increased error, but the difference is minimal.
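For reference, the ATE RMSE defined above reduces to a few lines of NumPy once the estimated and ground-truth trajectories have been time-associated and aligned (the alignment step is assumed and not shown here).

```python
import numpy as np

def ate_rmse(traj_est, traj_gt):
    """Root-mean-square absolute trajectory error.

    traj_est, traj_gt : (N, 3) arrays of estimated and ground-truth positions
    at corresponding frames, already associated and aligned.
    """
    diff = traj_est - traj_gt
    return float(np.sqrt(np.mean(np.sum(diff ** 2, axis=1))))
```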
Relative pose error evaluates the accuracy of pose changes over fixed intervals. It comprises relative translation error (RTE) and relative rotation error (RRE). For frames $i$ and $i+\Delta$, the relative pose error $\mathbf{Z}_i$ is:
$$\mathbf{Z}_i = (\mathbf{U}_i^{-1} \mathbf{U}_{i+\Delta})^{-1} (\mathbf{W}_i^{-1} \mathbf{W}_{i+\Delta})$$
where $\mathbf{U}_i$ and $\mathbf{W}_i$ are the ground truth and estimated poses at frame $i$. The RMSE of RPE over the $m$ such pairs is calculated as:
$$\text{RMSE}(\mathbf{Z}_{1:n}, \Delta) = \left( \frac{1}{m} \sum_{i=1}^m \| \mathbf{Z}_i \|^2 \right)^{\frac{1}{2}}$$
where $\|\mathbf{Z}_i\|$ is taken over the translational component of $\mathbf{Z}_i$ for RTE and over its rotation angle for RRE.
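A small NumPy sketch of this computation, operating on $4 \times 4$ homogeneous pose matrices, is given below; the interval $\Delta$ and the input arrays are assumptions about the evaluation setup.

```python
import numpy as np

def rpe(poses_gt, poses_est, delta=1):
    """Relative translation (m) and rotation (deg) errors at interval `delta`.

    poses_gt, poses_est : sequences of 4x4 homogeneous poses U_i and W_i.
    Returns the RMSE of the translational and rotational parts of Z_i.
    """
    t_err, r_err = [], []
    for i in range(len(poses_gt) - delta):
        dU = np.linalg.inv(poses_gt[i]) @ poses_gt[i + delta]
        dW = np.linalg.inv(poses_est[i]) @ poses_est[i + delta]
        Z = np.linalg.inv(dU) @ dW                      # relative pose error
        t_err.append(np.linalg.norm(Z[:3, 3]))          # translational part
        # Rotation angle of Z from the trace of its rotation block.
        cos_a = np.clip((np.trace(Z[:3, :3]) - 1.0) / 2.0, -1.0, 1.0)
        r_err.append(np.degrees(np.arccos(cos_a)))
    rmse = lambda e: float(np.sqrt(np.mean(np.square(e))))
    return rmse(t_err), rmse(r_err)
```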
The results for sequence 08 show that LG-VO maintains low errors throughout, with significant improvements over ORB, SIFT, and SG-VO. The following tables report the RTE (in meters) and the RRE (in degrees) across all sequences:
| Dataset Sequence | ORB Algorithm (m) | SIFT Algorithm (m) | SG-VO Algorithm (m) | Proposed LG-VO Algorithm (m) |
|---|---|---|---|---|
| 00 | 0.7484 | 0.1769 | 0.1025 | 0.0770 |
| 01 | 0.6096 | 0.1659 | 0.2193 | 0.1539 |
| 02 | 1.4206 | 0.2079 | 0.1367 | 0.1253 |
| 03 | 0.1461 | 0.1082 | 0.0852 | 0.0753 |
| 04 | 0.5311 | 0.0210 | 0.0340 | 0.0371 |
| 05 | 0.6263 | 0.1453 | 0.0522 | 0.0417 |
| 06 | 1.1838 | 0.1290 | 0.1105 | 0.0920 |
| 07 | 0.3556 | 0.0815 | 0.0756 | 0.0685 |
| 08 | 0.6687 | 0.1411 | 0.1369 | 0.0813 |
| 09 | 0.6167 | 0.1585 | 0.0816 | 0.0702 |
| 10 | 0.7666 | 0.1395 | 0.1024 | 0.0986 |
| Dataset Sequence | ORB Algorithm (°) | SIFT Algorithm (°) | SG-VO Algorithm (°) | Proposed LG-VO Algorithm (°) |
|---|---|---|---|---|
| 00 | 0.4079 | 0.0783 | 0.0690 | 0.0596 |
| 01 | 0.3997 | 0.0623 | 0.0890 | 0.0536 |
| 02 | 0.4803 | 0.1053 | 0.1050 | 0.0936 |
| 03 | 0.0787 | 0.0528 | 0.0542 | 0.0515 |
| 04 | 0.0895 | 0.0419 | 0.0406 | 0.0428 |
| 05 | 0.4132 | 0.0890 | 0.0780 | 0.0605 |
| 06 | 0.4642 | 0.0663 | 0.0678 | 0.0659 |
| 07 | 0.4860 | 0.0649 | 0.0585 | 0.0523 |
| 08 | 0.6181 | 0.0755 | 0.0667 | 0.0591 |
| 09 | 0.5173 | 0.0789 | 0.0573 | 0.0547 |
| 10 | 0.6689 | 0.0868 | 0.0610 | 0.0557 |
These results highlight the consistency of LG-VO in reducing both translation and rotation errors, which is crucial for precise motion in robotics.
Runtime efficiency is another critical aspect, especially for real-time robotics applications. The following table compares the total runtime in seconds for processing sequences 00 to 10:
| Dataset Sequence | ORB Algorithm (s) | SIFT Algorithm (s) | SG-VO Algorithm (s) | Proposed LG-VO Algorithm (s) |
|---|---|---|---|---|
| 00 | 225.37 | 413.12 | 851.22 | 459.70 |
| 01 | 35.26 | 84.32 | 180.96 | 95.78 |
| 02 | 260.77 | 450.13 | 994.22 | 462.87 |
| 03 | 40.93 | 76.15 | 134.85 | 67.62 |
| 04 | 13.13 | 22.25 | 55.58 | 25.07 |
| 05 | 111.02 | 248.81 | 486.63 | 276.14 |
| 06 | 41.51 | 96.95 | 203.42 | 105.36 |
| 07 | 39.97 | 92.50 | 191.98 | 106.74 |
| 08 | 184.17 | 466.41 | 892.32 | 488.30 |
| 09 | 67.53 | 131.39 | 321.66 | 159.56 |
| 10 | 48.68 | 101.12 | 254.98 | 130.06 |
As observed, LG-VO significantly reduces runtime compared to SG-VO, approaching the efficiency of SIFT while delivering higher accuracy. This balance is essential for deploying advanced visual odometry on resource-constrained robotics platforms.
In conclusion, integrating LightGlue into monocular visual odometry provides a robust solution for robotics, addressing challenges such as illumination changes and viewpoint variations. The LG-VO algorithm demonstrates superior accuracy and efficiency, as validated through extensive experiments on the KITTI dataset. By leveraging deep learning for feature extraction and matching, along with geometric constraints for pose estimation, the approach enhances the reliability of autonomous navigation. Future work will focus on incorporating inertial measurement unit (IMU) data to further improve performance in complex environments, such as those with low texture or dynamic obstacles, thereby advancing the capabilities of robotics in real-world applications.