In recent years, advances in deep learning and robotics have made embodied intelligence a critical step toward general artificial intelligence. Embodied intelligence integrates a robot's physical body with its intelligent system to perform tasks in complex environments, requiring the system to perceive its surroundings in real time and make optimal decisions. A fundamental challenge in this domain is robust identification of terrain types, which is particularly difficult in diverse and dynamic settings due to factors such as weather, lighting, humidity, and terrain variation. Terrains with similar geometric features may exhibit vastly different physical characteristics, so reliable terrain recognition requires richer sensory information and more intelligent algorithms.
Current robotic navigation relies primarily on perceiving geometric features of the environment, such as slope, roughness, and undulation. However, geometric features alone cannot ensure robust terrain identification, especially for terrains that appear similar but have distinct physical properties. Visual systems provide detailed texture and pixel information, and semantic cues extracted from images can distinguish terrains by color and texture. Given the complementary nature of geometric and visual features, some approaches fuse multiple features to identify complex terrains. Semantic segmentation networks built on deep learning further offer high accuracy and efficiency in feature extraction. For instance, semantic cues from visual information can guide robots in avoiding non-geometric obstacles, underscoring the importance of visual perception for adaptive navigation.
To address these challenges, we propose a novel algorithm framework for terrain perception based on multimodal text-visual large model information fusion. This framework enhances the intelligent perception capabilities of robots in dynamic and complex environments without requiring labeled data. Our approach integrates SLIC (Simple Linear Iterative Clustering) for image data preprocessing, CLIP (Contrastive Language-Image Pre-Training) and SAM (Segment Anything Model) for mask generation, and Dice coefficient for post-processing. The core architecture consists of two main stages: generating labeled data using large pre-trained models and validating the effectiveness of the terrain mask data. In the mask generation stage, the network leverages CLIP to associate input visual images with relevant terrain text information, utilizing its interpretability and zero-shot learning capabilities to generate terrain prompt points. SAM then receives these points to produce mask data with semantic labels. In the validation stage, a lightweight segmentation network is constructed using the generated masks as training labels and deployed on a quadruped robot’s edge computing device. The robot uses this model to predict terrain segmentation semantic maps, combining prior knowledge of physical terrain characteristics to avoid hazardous terrains and non-geometric obstacles while optimizing motion strategies.
The key contributions of our work are as follows: First, we introduce a data generation-validation network framework suitable for terrain perception tasks, which operates solely on visual texture information without any mask labels. Second, we propose preprocessing and postprocessing methods for the mask generation network to enhance mask accuracy, validated through ablation experiments. Third, we present an innovative method for obtaining terrain segmentation mask labels by leveraging CLIP’s zero-shot learning and interpretability to generate prompt points, combined with SAM’s strong generalization to produce terrain-labeled data. Finally, we extensively evaluate the algorithm framework on the Cityscapes dataset and conduct practical experiments in outdoor environments, demonstrating the framework’s reliability in real robotic applications.

The overall network architecture comprises two independent phases: mask generation and quantitative validation. The mask generation network connects the CLIP and SAM models in series to process terrain images and related text vocabulary, generating terrain semantic segmentation masks for training the validation network. By applying SLIC to segment images into terrain blocks, each block is dominated by a single terrain type, improving image-text matching accuracy. CLIP matches input images with text and, through interpretability algorithms, generates relevance heatmaps to pinpoint pixel locations for each terrain. Combining text labels and location information, SAM produces predicted masks for each terrain. The quantitative validation network uses the terrain images and generated masks to build a lightweight terrain segmentation model, akin to knowledge distillation, to verify the validity of the upper network’s mask data. In real-world experiments, this segmentation network is integrated into a quadruped robot system, assisting navigation by predicting terrain types and adjusting motion strategies based on physical parameters.
In the mask generation network, SAM operates as a prompt-based model that processes sparse prompts, such as points, to generate corresponding image masks. CLIP is employed to generate labeled prompt information. Specifically, SLIC first segments the original image into multiple patches. CLIP then classifies these patches, identifying the terrain type for each. Using CLIP’s interpretability, the pixel indices with the highest relevance to each terrain are located as prompt point positions. These positions are combined with category information to form the prompt input for SAM. Mathematically, the response feature $\mathbf{F}_m$ is obtained by element-wise multiplication of image features and text features in CLIP, and then multiplied by classification weights $\mathbf{w}$ to compute a common redundant feature $\mathbf{F}_r$:
$$\mathbf{F}_r = \text{mean}(\mathbf{F}_m \odot \text{expand}(\mathbf{w}))$$
The pre-selected similarity map is derived by subtracting the expanded redundant feature from the response feature, and normalized to yield the final relevance heatmap $\mathbf{M}$:
$$\mathbf{M} = \text{norm}(\text{sum}(\mathbf{F}_m - \text{expand}(\mathbf{F}_r)))$$
For each text input, interpretability computation produces a heatmap for each terrain. The position indices $(x_i, y_i)$ of the maximum response value in the relevance heatmap serve as prompt points for that category:
$$(x_i, y_i) = \arg\max(\mathbf{M}_i), \quad i \in N_{\text{patch}}$$
where $N_{\text{patch}}$ is the number of image patches produced by SLIC. CLIP outputs the probability of the image matching each terrain, with the sum of probabilities equal to 1. If the probability for a terrain exceeds a set threshold, that terrain is taken as the correct label; otherwise, the patch is labeled as background:
$$c_i = \begin{cases} j & \text{if } \text{conf}_j > \text{thr} \\ 0 & \text{otherwise} \end{cases}, \quad j \in N_{\text{label}}$$
where $\text{thr}$ is a prior probability threshold, $\text{conf}_j$ is the probability for each terrain, and $N_{\text{label}}$ is the total number of terrain classes. Thus, CLIP processes terrain images and text inputs to compute matching probabilities and generate relevance heatmaps $\mathbf{M}$, determining position indices $(x_i, y_i)$ for each terrain category. These are merged with text labels $c_i$ to form the prompt point set $\mathbf{P}$:
$$\mathbf{P} = \{(x_i, y_i, c_i), \quad i \in N_{\text{patch}}\}$$
This prompt set, along with the terrain image, is input to SAM to generate terrain mask label data.
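As an illustration of this prompt construction, the following NumPy sketch follows the formulas above; the tensor shapes, the per-heatmap bookkeeping, and the use of 0 as the background label are assumptions, and the real pipeline would obtain $\mathbf{F}_m$ and the confidences from CLIP and pass the resulting prompts to SAM.

```python
# Minimal sketch of prompt-point generation from CLIP relevance features.
# Shapes are assumed: F_m is (N_label, C, H, W), w is (C,), conf is (N_label,).
import numpy as np

def relevance_heatmaps(F_m, w):
    # F_r = mean(F_m ⊙ expand(w)): class-shared redundant feature, shape (C, H, W)
    F_r = np.mean(F_m * w[None, :, None, None], axis=0)
    # M = norm(sum(F_m - expand(F_r))): subtract, sum over channels, min-max normalize
    M = (F_m - F_r[None]).sum(axis=1)                      # (N_label, H, W)
    return (M - M.min()) / (M.max() - M.min() + 1e-8)

def prompt_points(M, conf, thr=0.5):
    """Build P = {(x_i, y_i, c_i)}: peak location of each heatmap plus its label,
    with 0 denoting background when CLIP's confidence falls below the threshold."""
    P = []
    for j in range(M.shape[0]):
        y, x = np.unravel_index(np.argmax(M[j]), M[j].shape)   # (row, col) of peak
        c = (j + 1) if conf[j] > thr else 0                    # classes assumed 1-indexed
        P.append((int(x), int(y), c))
    return P
```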
The quantitative validation network aims to verify the effectiveness of the generated masks. Because the mask generation network is built on large foundation models with heavy parameter and compute requirements, real-time detection on edge devices is infeasible. The validation network must therefore be simplified while maintaining performance for efficient deployment on resource-constrained edge computing devices. We adopt a U-Net architecture with an encoder-decoder design. The encoder downsamples to extract image features and reduce spatial resolution, while the decoder upsamples to restore the original image size and construct the segmentation result. Skip connections fuse corresponding encoder and decoder features to enhance accuracy and detail preservation. Training employs a combined loss of cross-entropy loss $\text{Loss}_{\text{CE}}$ and Dice loss $\text{Loss}_{\text{Dice}}$:
$$\text{Loss} = \text{Loss}_{\text{CE}} + \text{Loss}_{\text{Dice}}$$
where the cross-entropy term optimizes per-pixel classification and the Dice term improves region-level segmentation performance.
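A minimal PyTorch sketch of this combined loss, assuming integer-labeled masks and a soft Dice term computed from softmax probabilities (the smoothing constant and reduction are our own choices):

```python
import torch
import torch.nn.functional as F

def combined_loss(logits, target, eps=1e-6):
    """logits: (B, C, H, W) raw U-Net outputs; target: (B, H, W) integer mask labels."""
    loss_ce = F.cross_entropy(logits, target)                     # classification term
    probs = torch.softmax(logits, dim=1)
    one_hot = F.one_hot(target, num_classes=logits.shape[1])      # (B, H, W, C)
    one_hot = one_hot.permute(0, 3, 1, 2).float()                 # (B, C, H, W)
    inter = (probs * one_hot).sum(dim=(2, 3))
    union = probs.sum(dim=(2, 3)) + one_hot.sum(dim=(2, 3))
    loss_dice = 1.0 - ((2.0 * inter + eps) / (union + eps)).mean()  # segmentation term
    return loss_ce + loss_dice
```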
During experiments, we observed that CLIP’s robustness in recognizing certain terrains was suboptimal, with notable false positives and misses. This may stem from CLIP’s design for classification tasks, where output volatility increases when multiple objects appear in an image. To address this, we introduced SLIC preprocessing to segment input images into sub-blocks, each containing primarily one terrain type. This reduces the impact of noise and increases the number of prompt points, significantly improving mask accuracy. The SLIC algorithm initializes by selecting pixels as superpixel centers, then computes the distance between each pixel $p$ and center $k$ as a weighted combination of CIELAB color distance and spatial (Euclidean) distance:
$$d(p, k) = \sqrt{\alpha \cdot \Delta C^2 + \beta \cdot \Delta S^2}$$
where $\Delta C^2$ and $\Delta S^2$ are the squared CIELAB color distance and squared spatial (Euclidean) distance between pixel $p$ and center $k$, and $\alpha, \beta$ are weight parameters balancing color and spatial influence. A weight function $T(p, k)$ is defined for updating superpixel centers:
$$T(p, k) = e^{-\frac{d(p, k)^2}{2\sigma^2}}$$
where $\sigma$ is a regularization parameter controlling the function’s width. Centers are updated iteratively using this weight function until convergence:
$$C_k^{\text{new}} = \frac{\sum_{p \in R_k} T(p, k) \cdot I_p}{\sum_{p \in R_k} T(p, k)}$$
where $C_k^{\text{new}}$ is the new center for superpixel $k$, and $R_k$ is the set of pixels in that superpixel. Iteration stops when center changes fall below a threshold or maximum iterations are reached, outputting the image segmentation result.
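The following sketch illustrates the weighted distance and center update above for one superpixel region; the array layouts, the choice of $(L, a, b, x, y)$ as the pixel feature $I_p$, and the parameter values are assumptions, and a library routine such as `skimage.segmentation.slic` could equally serve as the preprocessing step.

```python
import numpy as np

def slic_distance(lab_p, xy_p, lab_k, xy_k, alpha=1.0, beta=0.5):
    """d(p,k): weighted combination of CIELAB color and spatial (Euclidean) distance."""
    dC2 = np.sum((lab_p - lab_k) ** 2, axis=-1)
    dS2 = np.sum((xy_p - xy_k) ** 2, axis=-1)
    return np.sqrt(alpha * dC2 + beta * dS2)

def update_center(lab_region, xy_region, center_lab, center_xy, sigma=10.0):
    """C_k_new = sum(T(p,k) * I_p) / sum(T(p,k)) over pixels p in region R_k."""
    d = slic_distance(lab_region, xy_region, center_lab, center_xy)
    T = np.exp(-(d ** 2) / (2.0 * sigma ** 2))                 # Gaussian weight T(p,k)
    feats = np.concatenate([lab_region, xy_region], axis=-1)   # I_p as (L, a, b, x, y)
    return (T[:, None] * feats).sum(axis=0) / (T.sum() + 1e-8)
```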
Additionally, since the algorithm chains two large models in series, cumulative errors may produce incorrect mask labels, hindering convergence of the validation network. We propose a postprocessing mechanism that uses the Dice coefficient to judge mask correctness and filter out errors. The video stream is decomposed into frames so that temporal relationships are preserved. Assuming the spatial position and proportion of each terrain change little over short periods, we compute the Dice coefficient between consecutive frames for the same terrain mask:
$$\text{Dice} = \frac{2|A \cap B|}{|A| + |B|}$$
where $A$ and $B$ represent segmentation masks for the same terrain in adjacent frames. After generating all masks, except for the first frame, each subsequent mask is compared with its predecessor. If the Dice coefficient is below a preset threshold, the mask is discarded; otherwise, it is retained.
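A small sketch of this screening step, assuming masks are stored per frame as a dictionary from terrain label to boolean mask (the threshold of 0.6 matches the setting reported below):

```python
import numpy as np

def dice(A, B, eps=1e-8):
    """Dice = 2|A ∩ B| / (|A| + |B|) for two boolean masks."""
    return 2.0 * np.logical_and(A, B).sum() / (A.sum() + B.sum() + eps)

def filter_masks(frames, thr=0.6):
    """Keep the first frame's masks; drop a later mask if it disagrees too much
    with the previous frame's mask for the same terrain."""
    kept = [frames[0]]
    for prev, curr in zip(frames[:-1], frames[1:]):
        kept.append({t: m for t, m in curr.items()
                     if t in prev and dice(prev[t], m) >= thr})
    return kept
```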
For the experimental setup, we used a quadruped robot platform (1 m × 0.42 m × 0.6 m, 50 kg) with six-degree-of-freedom leg mechanisms, capable of climbing stairs, traversing 30-degree slopes, and walking on grass and gravel. The robot is equipped with multiple RealSense D435i depth cameras for environmental visual data, plus Wi-Fi and 5G modules for communication. To support edge computing, two NVIDIA AGX Orin modules are integrated, each providing 275 TOPS of performance for deploying terrain segmentation models. Data collection involved a DJI MINI2 drone capturing images of the surrounding environment at a flight height of 1 m, a pitch angle of 60°, and a recording frequency of 60 Hz. The drone circled the experimental area, and the images were packaged as a terrain dataset. A deep learning server automatically generated masks and built lightweight segmentation models, which were then integrated into the quadruped robot for navigation tasks.
The mask generation network performs terrain segmentation without labeled data, so it does not rely on a training dataset. To validate its effectiveness, we used the Cityscapes dataset, which contains 50 urban street scene video sequences with 5000 pixel-level annotated images. Although not perfectly aligned with our terrain focus, it remains relevant for algorithm evaluation. For quantitative validation, we collected images via drone and used the mask generation network’s outputs to create a dataset covering terrains such as gravel, grass, and concrete under different times of day and lighting conditions. This dataset was used to build lightweight models for the ablation experiments and outdoor terrain perception tests.
In the network parameter settings, CLIP uses ViT-B/32 pre-trained weights and SAM uses ViT-H pre-trained weights. The CLIP input image size is 224×224 pixels, while SAM uses the original resolutions: 1024×2048 for Cityscapes and 360×360 for the terrain segmentation experiments. Mask generation employs Dice postprocessing with a threshold of 0.6. For quantitative validation training, the network uses the Adam optimizer with an initial learning rate of 1e-5, linear decay, a batch size of 8, and 10 epochs on a Tesla V100 GPU. Input and mask sizes are 360×360 pixels, with data augmentation including color space transformations, random cropping, and affine transformations. The encoder downsamples to 1/2, 1/4, and 1/8 scales, and the decoder uses deconvolution to restore the original size.
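For reference, a brief training-loop sketch under these settings, reusing the `combined_loss` sketched earlier; the model and dataset objects are hypothetical stand-ins, and the linear-decay schedule details are assumptions:

```python
import torch
from torch.utils.data import DataLoader

def train(model, dataset, epochs=10, lr=1e-5, batch_size=8, device="cuda"):
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    # Linear decay over the 10 epochs (end factor chosen arbitrarily here)
    sched = torch.optim.lr_scheduler.LinearLR(opt, start_factor=1.0,
                                              end_factor=0.1, total_iters=epochs)
    model.to(device).train()
    for _ in range(epochs):
        for img, mask in loader:                  # augmented 360x360 images and masks
            opt.zero_grad()
            loss = combined_loss(model(img.to(device)), mask.to(device))
            loss.backward()
            opt.step()
        sched.step()
```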
Evaluation metrics include Intersection over Union (IoU) and mean IoU (mIoU). IoU measures similarity between predicted and true regions:
$$\text{IoU} = \frac{|A \cap B|}{|A \cup B|}$$
where $A$ is the predicted mask and $B$ is the ground truth mask. mIoU is the average IoU across all classes.
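In code, these metrics reduce to a few lines (a NumPy sketch; the per-class iteration over label maps is our own convention):

```python
import numpy as np

def iou(pred, gt):
    """IoU = |A ∩ B| / |A ∪ B| for two boolean masks."""
    union = np.logical_or(pred, gt).sum()
    return np.logical_and(pred, gt).sum() / union if union > 0 else 0.0

def miou(pred_labels, gt_labels, num_classes):
    """Mean IoU over the classes present in the ground truth."""
    scores = [iou(pred_labels == c, gt_labels == c)
              for c in range(num_classes) if (gt_labels == c).any()]
    return float(np.mean(scores)) if scores else 0.0
```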
Experiments on prompt point generation show that CLIP’s interpretability produces relevance heatmaps where highly relevant regions form peaks, guiding SAM to generate semantic masks. For example, in input images, matching probabilities for terrains like “grass” may be low (e.g., 0.41) with dispersed heatmaps, while “gravel” probabilities are high (e.g., 0.96) with concentrated responses. By comparing confidence scores to thresholds, the network extracts semantic information and locations. In tests along a path from point A to B through mixed terrains like “concrete-gravel” and “concrete-grass”, probability visualizations show that without SLIC, robustness is poor in green (grass) areas with high fluctuation and misjudgment rates. With SLIC preprocessing, images are split into sub-blocks (e.g., 16 blocks), each independently classified, providing multiple prompt points and probabilities. This reduces probability fluctuations and enhances discrimination accuracy across regions.
For terrain mask generation and screening, CLIP-derived prompts enable SAM to produce mask data. In initial versions, CLIP processed raw images, yielding at most two prompts per frame, since a single image rarely contains more than three terrain types. With SLIC, more prompts are generated by controlling the number of image blocks. Comparisons show that increasing the number of robust prompt points reduces noise and improves mask quality. For instance, in concrete terrain masks, green points represent concrete prompts and red points other terrains; more points yield better outputs. Along the A-to-B path, the generated masks accurately segment the specified terrains and identify unlisted types, such as labeling manhole covers as “background”, demonstrating the generalization of the foundation models. Fine details such as concrete road seams are also segmented.
Validation experiments on Cityscapes assess mask generation network performance. We compare our algorithm with state-of-the-art segmentation models under self-supervised and supervised learning frameworks. Results are summarized in Table 1.
| Network | Supervision | IoU (%) | Usable Mask Count | Usable Mask Ratio (%) |
|---|---|---|---|---|
| SERNet-Former | Supervised | 98.82 | / | / |
| Panoptic DeepLab | Supervised | 98.88 | / | / |
| SAC | Self-supervised | 90.41 | / | / |
| RPT | Self-supervised | 89.2 | / | / |
| Our Algorithm | No training | 90.14 | 3446 | 76.58 |
The results demonstrate our method’s effectiveness, achieving 76.58% usable mask data without specific labels and a road mask IoU of 90.14%. While precision lags behind supervised models, it matches self-supervised approaches, confirming the framework’s ability to generate high-quality mask labels without training data. Further analysis of Cityscapes test data across urban environments shows usable mask ratios generally exceeding 70%, with most IoUs above 0.8, verifying the generalization inherited from the large pre-trained models and accurate segmentation of open-text terrain classes.
Ablation experiments validate the proposed mask optimization strategies. Using drone-collected outdoor images, we create mask label sets with different generation strategies and build segmentation models for evaluation. All hyperparameters follow earlier settings, and models are assessed on manually annotated masks. Table 2 details results, where SLIC and Dice denote our preprocessing and postprocessing techniques.
| SLIC | Dice | mIoU (%) | Training Loss |
|---|---|---|---|
| × | × | 91.42 | 0.9304 |
| ✓ | × | 93.78 | 0.8201 |
| ✓ | ✓ | 96.34 | 0.6665 |
The ablation shows that SLIC and Dice strategies improve mIoU by 2.36% and 2.56%, respectively, reducing training loss and accelerating convergence. Figure 5 illustrates Dice postprocessing’s positive impact on training, confirming our strategies enhance mask precision and facilitate convergence on terrain semantics and spatial information.
In the quadruped robot prototype experiments, we first feed drone-collected images and terrain-related text into the algorithm framework to automatically generate terrain masks. These serve as supervision for building a lightweight segmentation model deployed on the robot for real-time image segmentation. Following the parameter settings above, the model accurately segments terrains such as grass, concrete, and gravel; both the original camera images of the experimental environment and the model’s predictions are recorded. Based on the segmentation results, the robot avoids hazardous terrains and selects suitable motion strategies: grass is marked dangerous, prompting detours; concrete allows low-step-frequency, long-stride fast movement; and gravel requires a higher step frequency and reduced speed for safe passage. This enables safe navigation from start to destination while avoiding non-geometric obstacles such as grass. Practical tests show that the lightweight model delivers reliable segmentation, improving travel efficiency and safety.
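As an illustration, the terrain-to-strategy mapping described above can be encoded as a simple lookup; the parameter names and values here are hypothetical placeholders rather than the controller’s actual gait settings:

```python
# Hypothetical mapping from predicted terrain class to a motion strategy.
MOTION_STRATEGY = {
    "grass":    {"traversable": False},                                           # dangerous: detour
    "concrete": {"traversable": True, "step_freq": "low",  "stride": "long",  "speed": "fast"},
    "gravel":   {"traversable": True, "step_freq": "high", "stride": "short", "speed": "slow"},
}

def plan_step(terrain_class):
    """Return the motion strategy for the dominant terrain ahead of the robot."""
    return MOTION_STRATEGY.get(terrain_class, {"traversable": False})
```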
In conclusion, we propose a robotic terrain perception algorithm based on multimodal text-visual large models. Leveraging the generalization and interpretability of CLIP and SAM, it achieves terrain segmentation without labeled data, supporting autonomous navigation in complex environments. Evaluations on the Cityscapes dataset show over 75% usable masks with a road terrain IoU of 90.14%. For quadruped robot terrain perception, real prototype experiments using outdoor visual texture information yield a lightweight model with an mIoU of 96.34% on test data, demonstrating high segmentation accuracy and computational efficiency. This work enhances the environmental adaptability and safety of robots and offers a new way to bring complex computation to edge devices. Future work will focus on fusion algorithms that better exploit multimodal data, improving segmentation precision and robustness in extremely complex terrain environments.