In recent years, with the rapid advancement of robotic perception, control, and artificial intelligence technologies, the grasping and manipulation capabilities of dexterous robotic hands have become a critical focus in intelligent robotics research. High-degree-of-freedom dexterous manipulation not only significantly enhances the upper limit of robot interaction with complex environments but also demonstrates broad application prospects in service robotics, industrial automation, warehouse logistics, and post-disaster rescue scenarios. This represents a key step in advancing robots from “perception” to “manipulation.” As a core aspect of high-degree-of-freedom manipulation tasks, generating high-quality, diverse, and feasible grasping strategies for dexterous robotic hands is of great importance for improving autonomous operation capabilities, task success rates, and cross-scene generalization.
The core challenge in dexterous robotic hand grasping pose generation lies in fully modeling the complex geometric structures of objects and their spatial relationships with grasping actions. Limited by the diversity of object shapes and common issues such as noise and incompleteness in point cloud data, traditional methods based on physical modeling are difficult to apply in real-world scenarios. Early research relied heavily on geometric and physical attributes, such as object contours, friction coefficients, and center of mass positions, using analytical tools like force closure and wrench space to evaluate grasping stability. These methods typically follow a classic “sampling-evaluation-screening” process, verifying candidate grasping strategies one by one. Representative works include GraspIt! proposed by Miller et al., the posterior contact map method by Ciocarlie et al., and task-constrained grasp planning by Berenson et al. Although these methods offer good physical interpretability, they often involve high computational costs when dealing with high-degree-of-freedom grasping spaces and heavily depend on the completeness and parameter accuracy of object models, making it difficult to meet dual requirements for real-time performance and robustness.
In recent years, with the widespread application of deep learning, data-driven methods based on neural networks have gradually become mainstream for dexterous robotic hand grasping pose generation. Such methods learn mapping relationships from perception to action from large-scale grasping data, effectively reducing reliance on precise physical modeling. For instance, Saxena et al. used images to train grasping classifiers, achieving high grasping accuracy; however, this method relies on extensive annotated data, ignores physical constraints, and has poor generalization ability, making it difficult to adapt to complex or unknown grasping tasks. Mahler et al. proposed the Dex-Net series, which generated large-scale data through simulation to train scoring networks, enabling efficient grasping pose scoring and optimization, significantly improving grasping success rates; yet, this method lacks handling of complex object shapes and dynamic environmental changes, limiting its performance in diverse real-world scenarios. Pas et al. introduced grasp pose detection (GPD) methods, predicting grasping poses based on local point cloud segments, allowing efficient and precise grasp detection in relatively simple scenes; however, performance degrades in complex or occluded scenes, and insufficient consideration of grasp optimization and global constraints leads to poor robustness in practical applications.
With the development of generative modeling, researchers have introduced probabilistic generative frameworks like diffusion models, further enhancing the diversity and physical consistency of strategy generation. For example, Weng et al. used diffusion processes to sample stable and diverse grasping poses, improving the flexibility and stability of grasping strategies. Liang et al. employed a two-stage diffusion process to jointly model state-action dynamics for target-adaptive dexterous manipulation; however, limitations remain in handling physical interactions and precise control. Xu et al. and Wan et al. proposed unified grasping strategy frameworks, improving generalization across multiple objects and scenes; yet, grasping success rates remain low for objects with complex shapes and variable poses, and adaptability to dynamic environments is insufficient. Wang et al. constructed a large-scale grasping dataset supporting end-to-end prediction from point clouds to grasping poses. Jiang et al. improved grasping stability and success rates through adaptive optimization; however, limitations persist in real-time performance and extreme environment adaptability. Additionally, Patzelt et al. introduced generative adversarial networks (GANs) to enhance the diversity and distribution alignment of generated strategies, improving diversity and realism; but training instability and potential lack of physical feasibility in generated strategies can lead to low success rates during actual execution.
Despite significant progress, current methods still face key challenges: 1) Point cloud data may suffer from occlusion, uneven density, or missing geometric details, making it difficult for models to accurately capture global structures and key local features of target objects. 2) Existing methods lack explicit joint modeling mechanisms among candidate grasping strategies, leaving potential complementarity and correlations underutilized, which limits the diversity and overall quality of strategy sets. 3) Further improving the physical rationality and execution stability of grasping poses while maintaining inference efficiency remains an important research direction.
To address the issues of insufficient point cloud feature representation and limited strategy modeling capability in dexterous robotic hand grasping pose generation, I propose an attention-enhanced dynamic optimization dexterous grasp pose generation (ADO-Grasp) method. This method integrates multiple attention mechanisms and dynamic optimization to enhance grasping quality and diversity.

The ADO-Grasp framework consists of three main components: 1) a learning-based point cloud feature extraction with adaptive object geometry network (LPA-Net), 2) a dexterous grasp decoding network in multi-dimensional feature space (DexTran), and 3) a two-stage prediction and offset adjustment optimization module. Below, I detail each component.
1. LPA-Net: Adaptive Point Cloud Feature Extraction
To enhance point cloud feature representation and fully capture geometric structures and spatial topological information, I designed LPA-Net. This network includes three core modules: high-dimensional feature projection, local-global attention fusion, and multi-scale feature aggregation via dilated convolutional pyramids.
First, to map raw point cloud coordinates from low-dimensional geometric space to a high-dimensional feature space with rich semantic expression, feature projection is performed. Given input point cloud $P \in \mathbb{R}^{B \times n \times C}$, where $B$ is the batch size, $n$ is the number of points, and $C$ is the feature dimension, the projection is computed as:
$$ F_0 = \text{ReLU}(\text{Conv1D}(P)) $$
$$ F_{\text{proj}} = \text{LayerNorm}(\text{Conv1D}(F_0)) $$
To alleviate the lack of spatial structure due to point cloud disorder, a learnable positional encoding mechanism is introduced. A trainable positional embedding matrix $\phi(M)$ is assigned to each point and fused with the original projected features via addition:
$$ F_{\text{proj}}’ = F_{\text{proj}} + \phi(M) $$
For feature extraction, a hierarchical feature aggregation strategy based on local-global attention fusion is proposed. Local attention is built using differentiable K-nearest neighbors (KNN) to construct neighborhoods $\Gamma_k$ for each point, while global attention employs multi-head attention to model dependencies between points. The process for layer $l$ is:
$$ \bar{F}_l = \text{LN}(F_{\text{proj}}’ + \text{LA}(F_{\text{proj}}’, \Gamma_k)) $$
$$ F_{l+1} = \text{LN}(\bar{F}_l + \text{GA}(\bar{F}_l)) $$
where LN denotes layer normalization, LA is local attention, and GA is global attention. The attention coefficient for local attention is computed as:
$$ \alpha_{ij} = \frac{\exp(Q_i^T K_j)}{\sum_{k \in \Gamma_i} \exp(Q_i^T K_k)} $$
To further enhance multi-scale perception, a dilated convolutional pyramid is incorporated. Features from projection and attention modules are concatenated and processed through three convolutional kernels of sizes $1 \times 1$, $3 \times 1$, and $5 \times 1$ to extract cross-scale features:
$$ F_{\text{fuse}} = \text{Conv1D}([F_{\text{proj}} \oplus F_{\text{LA}} \oplus F_{\text{GA}}]) $$
where $\oplus$ denotes channel concatenation. This multi-scale convolution captures features at different receptive fields, enhancing sensitivity to geometric details like surface curvature, boundaries, and normals, thereby improving generalization for complex 3D structures.
2. DexTran: Multi-Dimensional Feature Space Decoding
To efficiently remove noise and accurately aggregate spatial and feature information from target objects, I designed DexTran. This network combines multi-head dual-path differential attention and cross-attention through cascaded Transformer layers to predict a sequence of grasping poses for the dexterous robotic hand.
Each Transformer layer includes multi-path differential attention (MHDA), cross-attention (CA), and a feed-forward network (FFN). The input consists of $n$ points $P_{\text{enc}} \in \mathbb{R}^{B \times n \times 3}$ and features $F_{\text{enc}} \in \mathbb{R}^{B \times n \times D_{\text{enc}}}$ from the encoder, along with $M$ dexterous robotic hand grasping query embeddings $\{Q_i\}_{i=1}^M$. The process is:
$$ Z_0 = \text{MHDA}(Q_i) $$
$$ Z_1 = \text{CA}(\text{LN}(Z_0), Q_{\text{geo}}, F_{\text{enc}}) $$
$$ Z_2 = \text{FFN}(\text{LN}(Z_1) + Z_1) $$
Here, $Q_{\text{geo}}$ is the relative positional encoding generated from $P_{\text{enc}}$ via an MLP. The MHDA module extends standard self-attention to capture fine-grained dependencies while suppressing noise. Given input $X \in \mathbb{R}^{B \times n \times d}$, queries $Q$, keys $K$, and values $V$ are computed as:
$$ Q = \text{RoPE}(\text{GeLU}(\text{Linear}(X))) $$
$$ K = \text{RoPE}(\text{GeLU}(\text{Linear}(X))) $$
$$ V = \text{Linear}(X) $$
where RoPE denotes rotary positional encoding. $Q$ and $K$ are split into two parts along the last dimension: $Q = [Q_1, Q_2]$ and $K = [K_1, K_2]$. The output of MHDA is:
$$ \text{MHDA}(X) = \left( \text{SoftMax}\left( \frac{Q_1 K_1^T}{\sqrt{d}} \right) – \text{SoftMax}\left( \frac{Q_2 K_2^T}{\sqrt{d}} \right) \right) V $$
A scalar weight $\lambda$ balances the two attention paths, dynamically parameterized as:
$$ \lambda = \exp(\lambda_{q1} \cdot \lambda_{k1}) – \exp(\lambda_{q2} \cdot \lambda_{k2}) + \lambda_{\text{init}} $$
where $\lambda_{q1}, \lambda_{q2}, \lambda_{k1}, \lambda_{k2}$ are learnable vectors, and $\lambda_{\text{init}} \in (0,1)$ is an initialization constant. This allows each query point to perceive features of other grasp candidates, enhancing dependency modeling while reducing noise amplification.
3. Two-Stage Prediction and Offset Optimization
To improve prediction accuracy and physical feasibility, a two-stage prediction and offset optimization module is proposed. In the first stage, lightweight MLPs directly predict initial grasping poses from decoder embeddings, ensuring semantic alignment with the latent space. In the second stage, offsets are generated based on initial poses and original embeddings for refinement.
Let $q$ be the final embedding from the decoder. The initial grasp pose $P_s$ is predicted as:
$$ P_s = \text{MLP}(q) $$
Then, the offset $\Delta P$ is computed by concatenating $q$ and $P_s$:
$$ \Delta P = \text{MLP}_{\text{opt}}([q, P_s]) $$
A probability matrix $K$ is defined using SoftMax:
$$ K = \text{SoftMax}(\text{MLP}([q, P_s])) $$
The final optimized grasp pose $P$ is obtained as:
$$ P = P_s + \Delta P $$
This two-stage approach enables fine-tuning of initial predictions, enhancing stability and precision for the dexterous robotic hand.
4. Loss Functions
The training involves multiple loss components to ensure grasping quality. For translation $t$, rotation $r$, and joint angles $j$, losses are defined as:
$$ L_t = \text{MSE}(t_{\text{pred}}, t_{\text{true}}) $$
$$ L_r = \text{cosine\_sim\_loss}(r_{\text{pred}}, r_{\text{true}}) $$
$$ L_j = \text{MSE}(j_{\text{pred}}, j_{\text{true}}) $$
The parameter loss combines these with weights $\gamma_1, \gamma_2, \gamma_3$:
$$ L_{\text{param}} = \gamma_1 L_t + \gamma_2 L_r + \gamma_3 L_j $$
Additional losses include Chamfer distance $L_{\text{chamfer}}$ between predicted and true hand meshes, penetration loss $L_{\text{pen}}$ between hand and object, and self-penetration loss $L_{\text{spen}}$ within the hand. The total loss is:
$$ L = L_{\text{param}} + \gamma_4 L_{\text{chamfer}} + \gamma_5 L_{\text{pen}} + \gamma_6 L_{\text{spen}} $$
where $\gamma_4, \gamma_5, \gamma_6$ are weighting factors.
5. Experimental Evaluation
I evaluated ADO-Grasp on the DexGraspNet dataset, a large-scale dexterous grasping dataset with over 1.33 million annotated grasp poses for 133 3D object models. Evaluation metrics include grasp quality and diversity indicators.
5.1 Grasp Quality Metrics:
- $Q_1$: Average inscribed sphere radius of friction cones at contact points (higher is better).
- $\text{Pen}$: Maximum penetration depth between hand and object (lower is better).
- $\eta_{\text{success}}$: Grasping success rate in simulation.
5.2 Grasp Diversity Metrics:
- $\sigma_t$, $\sigma_r$, $\sigma_j$: Standard deviations of translation, rotation, and joint angles (higher indicates more diversity).
Comparative experiments were conducted against four representative methods: GraspTTA, DGTR, SceneDiffuser, and UniDexGrasp. The results are summarized in Table 1.
| Model | $Q_1 \uparrow$ | $\text{Pen} \downarrow$ (mm) | $\eta_{\text{success}} \uparrow$ (%) | $\sigma_t \uparrow$ | $\sigma_r \uparrow$ | $\sigma_j \uparrow$ |
|---|---|---|---|---|---|---|
| GraspTTA | 0.027 1 | 6.78 | 24.5 | 8.09 | 7.53 | 7.09 |
| DGTR | 0.051 5 | 4.21 | 41.0 | 47.77 | 51.66 | 27.81 |
| SceneDiffuser | 0.012 9 | 1.07 | 25.5 | 54.84 | 52.27 | 39.75 |
| UniDexGrasp | 0.046 2 | 1.21 | 37.1 | 9.64 | 7.49 | 29.29 |
| ADO-Grasp | 0.066 0 | 2.79 | 62.3 | 75.99 | 75.73 | 30.66 |
ADO-Grasp achieves superior performance across most metrics. Specifically, $Q_1$ improves by an average of 47.9% compared to other methods, penetration distance decreases by 0.5 mm on average, success rate increases by 30% on average, and diversity metrics improve by 45.23% on average. This demonstrates that ADO-Grasp effectively balances grasp quality and diversity for the dexterous robotic hand.
5.3 Ablation Study:
To validate the contributions of each module, ablation experiments were conducted. Results are shown in Table 2.
| LPA-Net | DexTran | Two Stages | $Q_1 \uparrow$ | $\text{Pen} \downarrow$ | $\eta_q \uparrow$ (%) | $\eta_t \uparrow$ (%) |
|---|---|---|---|---|---|---|
| 0.018 7 | 0.974 | 15.36 | 14.62 | |||
| ✓ | 0.031 0 | 0.435 | 40.03 | 40.03 | ||
| ✓ | 0.024 3 | 0.349 | 32.24 | 65.71 | ||
| ✓ | 0.079 1 | 0.695 | 66.61 | 29.21 | ||
| ✓ | ✓ | 0.034 4 | 0.426 | 40.02 | 57.73 | |
| ✓ | ✓ | 0.074 1 | 0.697 | 63.78 | 29.68 | |
| ✓ | ✓ | 0.054 6 | 0.525 | 52.66 | 47.27 | |
| ✓ | ✓ | ✓ | 0.066 0 | 0.279 | 62.81 | 68.73 |
The combined use of all modules yields the best results, with $Q_1 = 0.0660$, $\text{Pen} = 0.279$, $\eta_q = 62.81\%$, and $\eta_t = 68.73\%$, confirming the synergistic effectiveness of LPA-Net, DexTran, and the two-stage optimization for dexterous robotic hand grasping.
5.4 Real-World Experiments:
To further validate practicality, real-world grasping experiments were conducted using a Shadow Hand dexterous robotic hand mounted on a UR10e robotic arm, with point cloud data captured by a RealSense depth camera. Objects included a beverage bottle, plush toy, square box, cylindrical box, stapler, and plastic toy model. Each object was grasped 10 times, with success defined as stable grasping and lifting without dropping. Success rates are summarized in Table 3.
| Object | Plush Toy | Stapler | Cylindrical Box | Square Box | Beverage Bottle | Plastic Toy |
|---|---|---|---|---|---|---|
| Success Rate | 1/10 | 6/10 | 7/10 | 7/10 | 5/10 | 6/10 |
Success rates vary due to object shape, surface friction, and rigidity. For regular objects like boxes, rates reach 70%, demonstrating good force closure and contact stability. For flexible objects like the plush toy, the rate drops to 10%, indicating challenges with deformation and contact point instability. Overall, ADO-Grasp shows effective grasping for multiple object types, though improvement is needed for flexible objects.
6. Conclusion
I proposed ADO-Grasp, a method for dexterous robotic hand grasping pose generation that integrates attention mechanisms and dynamic optimization. The approach includes LPA-Net for adaptive point cloud feature extraction, DexTran for multi-dimensional decoding, and a two-stage optimization module. Experiments on the DexGraspNet dataset and real-world platform demonstrate that ADO-Grasp significantly improves grasping quality, success rates, and diversity compared to existing methods. Key achievements include a 47.9% average increase in $Q_1$, a 0.5 mm average reduction in penetration distance, a 30% average improvement in success rate, and a 45.23% average enhancement in diversity metrics. Ablation studies confirm the contributions of each module. While effective for various objects, challenges remain with flexible objects. Future work will incorporate multimodal perception and finer mechanical constraints to improve stability and intelligence for complex grasping tasks with dexterous robotic hands.
