Robotic Grasp Detection in Low-Light Environments with Spatial-Fourier Domain Fusion

In recent years, the rapid advancement of automation and intelligent robot technology has led to the widespread application of robotic grasping in various fields such as industry, agriculture, healthcare, and underwater operations. However, despite the maturity of existing grasp detection methods, robots often struggle to perform stably under varying environmental lighting conditions. Particularly in low-light environments, insufficient illumination limits the information captured by cameras, resulting in sparse and weak features that degrade grasping accuracy and reliability. Traditional approaches, such as improving on-site lighting or enhancing camera capabilities, offer limited solutions due to practical constraints. In contrast, data-driven methods leveraging deep learning have emerged as promising alternatives. These include CNN-based and GAN-based techniques that enable models to adaptively process images, enhancing robustness in diverse scenarios. However, most current methods focus solely on spatial domain features (e.g., grayscale, edges, textures), neglecting the potential of Fourier domain information (e.g., amplitude, phase), which can complement feature extraction in low-light conditions. To address this gap, we propose a novel robotic grasp detection method that integrates spatial and Fourier domain information, combined with an attention mechanism, to improve performance in low-light environments. Our approach enhances feature extraction by capturing global context in the spatial domain and restoring details in the Fourier domain, while a row-column attention module prioritizes position-related information critical for grasping tasks. Through extensive experiments on low-light datasets, including our proposed C-Cornell dataset, we demonstrate that our method achieves state-of-the-art accuracy and robustness, paving the way for reliable robot technology in challenging lighting conditions.

The core of our method lies in a hybrid feature extraction framework that processes both spatial and Fourier domains. The backbone network employs an encoder-decoder structure, where deep and shallow features are fused through parallel branches. In the spatial domain, we utilize strip convolutions in horizontal and vertical directions to capture global contextual information, which is essential for identifying graspable regions. This is expressed mathematically as follows: for an input feature map \( D_i \), we compute the gathered features \( C_{\text{gather}} \) as:

$$ C_{\text{gather}} = \text{Conv}_{k \times 1}(\text{Pool}_h(D_i)) + \text{Conv}_{1 \times k}(\text{Pool}_v(D_i)) $$

where \( \text{Pool}_h \) and \( \text{Pool}_v \) denote horizontal and vertical pooling operations, respectively, and \( \text{Conv}_{k \times 1} \) and \( \text{Conv}_{1 \times k} \) represent large-kernel strip convolutions. The output is then refined through a combination of depthwise convolution, batch normalization, and a multilayer perceptron, enhanced by residual connections to promote feature reuse:

$$ C_{\text{out}} = \text{MLP}(\text{BN}(\text{Conv}_{3 \times 3}(D_i \odot C_{\text{gather}}))) + D_i $$
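The following PyTorch sketch illustrates this spatial branch. The kernel size \( k = 11 \), the use of average pooling, and the MLP expansion ratio are illustrative assumptions rather than the exact settings of our implementation:

```python
import torch
import torch.nn as nn

class SpatialGather(nn.Module):
    """Illustrative sketch of the spatial branch: directional pooling, large-kernel
    strip convolutions, gating, and residual refinement, mirroring the C_gather
    and C_out expressions above."""
    def __init__(self, channels: int, k: int = 11):
        super().__init__()
        # Large-kernel strip convolutions applied to the pooled row/column descriptors.
        self.conv_h = nn.Conv2d(channels, channels, kernel_size=(k, 1), padding=(k // 2, 0))
        self.conv_v = nn.Conv2d(channels, channels, kernel_size=(1, k), padding=(0, k // 2))
        # Depthwise 3x3 convolution followed by batch normalization.
        self.refine = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, groups=channels),
            nn.BatchNorm2d(channels),
        )
        # Pointwise MLP implemented as 1x1 convolutions.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels * 2, kernel_size=1),
            nn.GELU(),
            nn.Conv2d(channels * 2, channels, kernel_size=1),
        )

    def forward(self, d_i: torch.Tensor) -> torch.Tensor:
        pooled_h = d_i.mean(dim=-1, keepdim=True)   # Pool_h: (B, C, H, 1)
        pooled_v = d_i.mean(dim=-2, keepdim=True)   # Pool_v: (B, C, 1, W)
        # C_gather broadcasts to (B, C, H, W) when the two strip responses are added.
        c_gather = self.conv_h(pooled_h) + self.conv_v(pooled_v)
        # C_out = MLP(BN(Conv3x3(D_i * C_gather))) + D_i
        return self.mlp(self.refine(d_i * c_gather)) + d_i
```

For example, `SpatialGather(64)(torch.randn(1, 64, 224, 224))` returns a tensor of the same shape, so the module can be dropped into the encoder-decoder backbone without changing feature-map sizes.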

In the Fourier domain, we transform the input feature map into amplitude and phase components using the Fourier transform \( \mathcal{F} \):

$$ A_{\text{in}} = \mathcal{A}(\mathcal{F}(D_i)), \quad P_{\text{in}} = \mathcal{P}(\mathcal{F}(D_i)) $$

where \( A_{\text{in}} \) and \( P_{\text{in}} \) represent the amplitude and phase, respectively. These components are processed independently through dedicated modules to enhance brightness and texture details:

$$ A_{\text{out}} = \text{Mag}(A_{\text{in}}), \quad P_{\text{out}} = \text{Pha}(P_{\text{in}}) $$

The processed components are then combined via inverse Fourier transform \( \mathcal{F}^{-1} \) to reconstruct the feature map in the spatial domain:

$$ R_{\text{out}} = A_{\text{out}} \cos(P_{\text{out}}), \quad I_{\text{out}} = A_{\text{out}} \sin(P_{\text{out}}), \quad F_{\text{out}} = \mathcal{F}^{-1}(R_{\text{out}} + jI_{\text{out}}) $$
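The Fourier branch can be sketched directly with torch.fft. Since only the roles of the Mag and Pha modules are described above, they are stood in for here by small 1x1 convolution stacks:

```python
import torch
import torch.nn as nn

class FourierBranch(nn.Module):
    """Sketch of the Fourier-domain branch: split a feature map into amplitude and
    phase, process each independently, and reconstruct via the inverse FFT."""
    def __init__(self, channels: int):
        super().__init__()
        # Stand-ins for the Mag(.) and Pha(.) modules (assumed 1x1 conv stacks).
        self.mag = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=1),
        )
        self.pha = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=1),
        )

    def forward(self, d_i: torch.Tensor) -> torch.Tensor:
        freq = torch.fft.fft2(d_i, norm="ortho")          # F(D_i)
        a_in, p_in = torch.abs(freq), torch.angle(freq)   # amplitude A_in, phase P_in
        a_out, p_out = self.mag(a_in), self.pha(p_in)     # independent enhancement
        # Recombine real and imaginary parts, then invert back to the spatial domain.
        real = a_out * torch.cos(p_out)
        imag = a_out * torch.sin(p_out)
        return torch.fft.ifft2(torch.complex(real, imag), norm="ortho").real
```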

This dual-domain approach allows our model to leverage both local spatial details and global frequency information, which is particularly beneficial in low-light conditions where features are often obscured.

To further enhance the model’s focus on relevant positional information, we incorporate a Row-Column Attention (R-CoA) module. This module computes attention scores by pooling features horizontally and vertically, followed by query, key, and value transformations. The relative position encoding captures spatial relationships, which is crucial for accurate grasp detection. The attention mechanism is formulated as:

$$ C_{\text{con}} = \text{Concatenate}(\text{Pool}_h(X), \text{Pool}_v(X)) $$
$$ Q, K, V = \text{Split}(\text{Conv}_{QKV}(C_{\text{con}})) $$
$$ K = \text{softmax}(K, \text{dim}=-1) $$
$$ S = Q \cdot K^T \cdot V $$

Relative position encodings \( R_{\text{row}} \) and \( R_{\text{column}} \) are applied to emphasize row and column contributions, resulting in the final output:

$$ Y_r = V \cdot Q \cdot I_r \cdot R_{\text{row}}, \quad O_{\text{out}} = Y_r + Q \cdot I_c \cdot R_{\text{column}} \cdot S $$

where \( I_r \) and \( I_c \) are indices calculated based on image positions. This attention mechanism ensures that the model prioritizes features aligned with graspable regions, improving detection accuracy.
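A simplified sketch of the R-CoA flow is given below. It keeps the row/column pooling, the Conv_QKV split, the softmax on K, and the attended output S, but for brevity omits the relative position encodings \( R_{\text{row}} \), \( R_{\text{column}} \) and the index terms \( I_r \), \( I_c \); broadcasting the attended descriptors back onto the feature map is likewise an illustrative choice:

```python
import torch
import torch.nn as nn

class RowColumnAttention(nn.Module):
    """Simplified sketch of Row-Column Attention: attend over row- and column-pooled
    descriptors and add the result back to the feature map. The relative position
    encodings and index terms from the full formulation are omitted here."""
    def __init__(self, channels: int):
        super().__init__()
        self.to_qkv = nn.Conv1d(channels, channels * 3, kernel_size=1)  # Conv_QKV
        self.proj = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        rows = x.mean(dim=-1)                 # Pool_h: (B, C, H)
        cols = x.mean(dim=-2)                 # Pool_v: (B, C, W)
        c_con = torch.cat([rows, cols], dim=-1)            # C_con: (B, C, H + W)
        q, k, v = self.to_qkv(c_con).chunk(3, dim=1)       # Q, K, V
        q, k, v = (t.transpose(1, 2) for t in (q, k, v))   # (B, H + W, C)
        k = k.softmax(dim=-1)                               # softmax(K, dim=-1)
        s = q @ k.transpose(1, 2) @ v                       # S = Q K^T V
        s = self.proj(s.transpose(1, 2))                    # back to (B, C, H + W)
        # Broadcast attended row/column descriptors over the full feature map.
        row_att = s[..., :h].unsqueeze(-1)                  # (B, C, H, 1)
        col_att = s[..., h:].unsqueeze(-2)                  # (B, C, 1, W)
        return x + row_att + col_att
```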

For training, we use a smooth L1 loss function to optimize the grasp parameters, namely angle, width, and quality. For each parameter, the smooth L1 term is defined as:

$$ z(\alpha) = \begin{cases}
0.5(\alpha_g - \alpha_G)^2 & \text{if } |\alpha_g - \alpha_G| < 1 \\
|\alpha_g - \alpha_G| - 0.5 & \text{otherwise}
\end{cases} $$

where \( \alpha_g \) and \( \alpha_G \) represent predicted and ground-truth values for angle, width, and quality, respectively. The overall loss is a weighted sum:

$$ \text{Loss} = \lambda_1 z_{\text{angle}} + \lambda_2 z_{\text{width}} + \lambda_3 z_{\text{quality}} $$

with \( \lambda_1, \lambda_2, \lambda_3 \) set to 1.0 based on empirical validation.
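In PyTorch this reduces to a few lines, since F.smooth_l1_loss with its default beta of 1.0 matches the piecewise definition above; the head names below are illustrative:

```python
import torch
import torch.nn.functional as F

def grasp_loss(pred: dict, target: dict, weights=(1.0, 1.0, 1.0)) -> torch.Tensor:
    """Weighted smooth L1 loss over the angle, width, and quality heads.
    `pred` and `target` map head names to tensors of matching shape (sketch)."""
    l_angle = F.smooth_l1_loss(pred["angle"], target["angle"])
    l_width = F.smooth_l1_loss(pred["width"], target["width"])
    l_quality = F.smooth_l1_loss(pred["quality"], target["quality"])
    w1, w2, w3 = weights
    return w1 * l_angle + w2 * l_width + w3 * l_quality
```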

We evaluate our method on three low-light datasets: low-light Cornell, low-light Jacquard, and our proposed low-light C-Cornell dataset. The low-light Cornell and Jacquard datasets are synthesized by applying gamma transformations to simulate varying illumination levels (\( \gamma = 1.2, 1.5, 2.0 \)) and by injecting several noise types (Gaussian, salt-and-pepper, Poisson, local-variance, and speckle noise). The gamma transformation is applied as \( I' = I^\gamma \), where \( I \) is the input image normalized to \([0, 1]\); values of \( \gamma \) greater than 1 therefore darken the image. The C-Cornell dataset is generated using CycleGAN, which learns from unpaired normal-light and low-light images, to translate normal-light images into realistic low-light conditions. This dataset addresses the scarcity of authentic low-light grasping data and supports model generalization.
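For the gamma-and-noise synthesis, a sketch with NumPy and scikit-image is shown below; the noise strengths (and the local-variance map) are illustrative, as the exact parameters are not restated here:

```python
import numpy as np
from skimage.util import random_noise

def synthesize_low_light(image: np.ndarray, gamma: float = 1.5,
                         noise_mode: str = "gaussian") -> np.ndarray:
    """Darken an image with I' = I^gamma (image in [0, 1]) and inject noise.
    `noise_mode` can be 'gaussian', 's&p', 'poisson', 'localvar', or 'speckle'."""
    img = np.clip(image.astype(np.float64), 0.0, 1.0)
    dark = img ** gamma                      # gamma > 1 darkens the normalized image
    if noise_mode == "localvar":
        # Local-variance Gaussian noise needs a per-pixel variance map (assumed here).
        local_vars = np.full(dark.shape, 0.01)
        noisy = random_noise(dark, mode="localvar", local_vars=local_vars)
    else:
        noisy = random_noise(dark, mode=noise_mode)
    return noisy.astype(np.float32)
```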

Our experimental setup involves training on an Ubuntu 20.04 system with an Intel Xeon Silver 4316 CPU and an NVIDIA RTX 3090 GPU. We use a batch size of 8, 70 training epochs, and the Adam optimizer with a learning rate of \( 10^{-3} \) and betas of (0.9, 0.999). Evaluation metrics include accuracy (based on the rectangle metric), FLOPs, parameter count, and inference time. Accuracy is defined as the ratio of correct predictions to total instances, where a correct grasp must have an angle difference under \( 30^\circ \) and a Jaccard index over 0.25:

$$ \text{Acc} = \frac{\text{CorrectPredictions}}{\text{All}} $$
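A sketch of this rectangle-metric check is given below, using Shapely to compute the Jaccard index of two (possibly rotated) grasp rectangles; the corner-array input format is an assumption for illustration:

```python
import numpy as np
from shapely.geometry import Polygon

def is_correct_grasp(pred_corners: np.ndarray, pred_angle: float,
                     gt_corners: np.ndarray, gt_angle: float) -> bool:
    """Rectangle metric sketch: a predicted grasp counts as correct if its angle
    differs from the ground truth by less than 30 degrees and the Jaccard index
    (IoU) of the two rectangles exceeds 0.25. Corners are (4, 2) vertex arrays."""
    # Grasp orientations are symmetric modulo 180 degrees.
    angle_diff = abs(pred_angle - gt_angle) % 180.0
    angle_diff = min(angle_diff, 180.0 - angle_diff)
    if angle_diff >= 30.0:
        return False
    p, g = Polygon(pred_corners), Polygon(gt_corners)
    inter = p.intersection(g).area
    union = p.union(g).area
    return union > 0 and inter / union > 0.25
```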

Quantitative results on the low-light Jacquard dataset, as shown in Table 1, demonstrate our method’s robustness across different noise types and gamma values, with accuracy consistently around 91-92%. Specifically, under Poisson noise with \( \gamma = 1.5 \), we achieve a peak accuracy of 92.01%, highlighting the effectiveness of Fourier domain features in handling noise.

Table 1: Grasp Detection Accuracy Under Different Gamma and Noise Conditions (Low-light Jacquard Dataset)
| Gamma (γ) | Gaussian Noise | Salt-and-Pepper Noise | Local Variance Noise | Poisson Noise | Speckle Noise |
|---|---|---|---|---|---|
| 1.2 | 91.77% | 91.35% | 91.55% | 91.19% | 91.88% |
| 1.5 | 91.01% | 91.02% | 91.09% | 92.01% | 91.72% |
| 2.0 | 90.73% | 91.59% | 91.00% | 91.96% | 91.39% |

On the low-light Cornell dataset with \( \gamma = 1.5 \), our method outperforms existing approaches like GR-ConvNet, GR-ConvNetv2, GGCNN, and Se-ResUnet, as summarized in Table 2. For Gaussian noise, we achieve 96.62% accuracy, surpassing Se-ResUnet by 1.12 percentage points. Similarly, for salt-and-pepper noise, our method attains 96.62%, a 2.24-point improvement over Se-ResUnet. These gains underscore the advantage of integrating spatial and Fourier domains, which mitigates feature degradation in low-light conditions.

Table 2: Comparison of Grasp Detection Accuracy Under Gaussian and Salt-and-Pepper Noise (Low-light Cornell Dataset, γ=1.5)
| Method | Gaussian Noise | Salt-and-Pepper Noise |
|---|---|---|
| GGCNN | 84.00% | 88.76% |
| GR-ConvNet | 94.38% | 92.13% |
| GR-ConvNetv2 | 94.38% | 93.25% |
| Se-ResUnet | 95.50% | 94.38% |
| Our Method | 96.62% | 96.62% |

We further validate our approach on the low-light C-Cornell dataset, as shown in Table 3. Our method achieves 95.50% accuracy, exceeding GR-ConvNet and GGCNN by 3.37 points and GR-ConvNetv2, TFgrasp, and Se-ResUnet by 2.25 points. Although our model has higher computational complexity (40.74 GFLOPs and 8.42M parameters) and an average inference time of 16.41 ms, it remains feasible for real-world robot technology applications, where accuracy is paramount.

Table 3: Performance Comparison on Low-light C-Cornell Dataset
| Method | Accuracy | FLOPs | Params | Time (ms) |
|---|---|---|---|---|
| GR-ConvNet | 92.13% | 13.56G | 1.90M | 3.66 |
| GR-ConvNetv2 | 93.25% | 13.56G | 1.90M | 3.48 |
| GGCNN | 92.13% | 1.18G | 0.07M | 0.63 |
| TFgrasp | 93.25% | 1.50G | 6.80M | 12.17 |
| Se-ResUnet | 93.25% | 24.88G | 3.89M | 4.39 |
| Our Method | 95.50% | 40.74G | 8.42M | 16.41 |

Qualitative results on the low-light C-Cornell dataset illustrate that our method generates precise grasp rectangles centered on objects, with continuous quality, angle, and width maps. In contrast, methods like GGCNN and TFgrasp produce discontinuous maps or misaligned grasp boxes. For instance, GGCNN’s outputs include significant background interference, while TFgrasp’s maps exhibit discontinuities on object edges. Our approach, empowered by R-CoA attention and dual-domain fusion, effectively distinguishes objects from backgrounds, ensuring reliable detection even in dim lighting.

Ablation studies on the low-light C-Cornell dataset confirm the contribution of each module. As shown in Table 4, using only the Spatial Feature Extraction (SFE) module yields 93.22% accuracy, while the Fourier Domain Feature Extraction (FFE) module alone achieves 93.78%. The R-CoA module alone attains 92.65%. Combining any two modules improves accuracy, and the full model with all three modules reaches 95.50%. This demonstrates the synergistic effect of spatial and Fourier domain processing, along with attention mechanisms, in enhancing feature representation for low-light grasp detection.

Table 4: Ablation Study on Low-light C-Cornell Dataset
| Experiment | Modules | Accuracy |
|---|---|---|
| 1 | SFE only | 93.22% |
| 2 | FFE only | 93.78% |
| 3 | R-CoA only | 92.65% |
| 4 | Two-module combination | 93.78% |
| 5 | Two-module combination | 94.35% |
| 6 | Two-module combination | 94.35% |
| 7 | SFE + FFE + R-CoA | 95.50% |

In conclusion, our method addresses the challenges of low-light robotic grasp detection by integrating spatial and Fourier domain information, supplemented by an attention mechanism. This approach significantly improves feature extraction and enhancement, leading to superior accuracy and robustness across diverse low-light scenarios. The advancements in robot technology demonstrated here have practical implications for applications in dimly lit environments, such as nighttime manufacturing or underwater operations. Future work will focus on optimizing computational efficiency for embedded systems, integrating multimodal sensors (e.g., depth cameras) for extreme low-light conditions, and extending the method to dynamic grasping tasks. By continuing to innovate in robot technology, we aim to enable reliable autonomous systems in even the most challenging lighting environments.
