Target Position-Guided In-Hand Reorientation for Dexterous Robotic Hands

In the field of robotics, achieving dexterous manipulation with anthropomorphic five-fingered dexterous robotic hands remains a significant and challenging frontier. Such dexterous robotic hands are designed to mimic human hand capabilities, enabling complex tasks like in-hand reorientation, where an object is rotated to a desired pose within the hand. This task is particularly demanding due to high-dimensional state and action spaces, dynamic contact interactions, and the need for coordinated finger movements. Traditional methods often struggle with these complexities, leading to inefficient or non-human-like strategies. In this work, we propose a novel approach that leverages target position guidance to enable dexterous robotic hands with varying degrees of actuation (DoA) to perform in-hand reorientation efficiently and in a human-like manner. Our method draws inspiration from human hand manipulation characteristics and integrates principles based on the DoA distribution of dexterous robotic hands. By incorporating target position guidance into a reinforcement learning framework, we demonstrate that dexterous robotic hands can learn policies that enhance operational efficiency and success rates. This approach not only addresses the challenges of high-dimensional control but also provides a pathway for adapting to different dexterous robotic hand designs, from high-DoA to low-DoA configurations. Through extensive simulations, we validate our method on three types of dexterous robotic hands, showing improved performance in terms of continuous success counts and reduced steps per task. The insights gained here contribute to the broader goal of making dexterous robotic hands more versatile and effective in real-world applications.

We begin by formalizing the in-hand reorientation task as a Markov Decision Process (MDP). The goal is to rotate an object held by a dexterous robotic hand from an arbitrary initial pose to an arbitrary target pose. The object’s pose is represented using unit quaternions, and the success criterion is defined by an angular threshold. Specifically, if the angular difference between the current object pose $\mathbf{q}_{\text{obj}} \in \mathbb{R}^4$ and the target pose $\mathbf{q}_{\text{target}} \in \mathbb{R}^4$ is less than a threshold $\bar{\theta}$, the rotation is considered successful. The angular difference $\Delta \theta$ is computed as:

$$\Delta \theta = 2 \times \arccos\left(\min\left(1, \left| \mathbf{q}_{\text{obj}} \cdot \mathbf{q}_{\text{target}} \right|\right)\right)$$

where the dot product is clamped to one to guard against numerical error, and its absolute value accounts for the quaternion double cover ($\mathbf{q}$ and $-\mathbf{q}$ encode the same rotation).
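As a minimal sketch, the angular-difference computation can be written in a few lines of NumPy (the function name is ours, not from the paper):

```python
import numpy as np

def angular_difference(q_obj, q_target):
    """Angle (rad) between two unit quaternions.

    The dot product is clamped to 1 before arccos to guard against
    floating-point values slightly outside the valid domain; the absolute
    value handles the quaternion double cover (q and -q encode the same
    rotation).
    """
    dot = abs(float(np.dot(q_obj, q_target)))
    return 2.0 * np.arccos(min(1.0, dot))

# Identical orientations give zero angular difference.
q = np.array([1.0, 0.0, 0.0, 0.0])
print(angular_difference(q, q))  # 0.0
```

A 90° rotation about any axis yields $\Delta \theta = \pi/2$, which is above the success threshold $\bar{\theta} = 0.4$ rad used later in the experiments.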

An episode terminates under three conditions: if the object falls from the hand (e.g., height below a threshold), if the maximum number of steps $l_{\text{max}}$ is exceeded, or if a predefined maximum number of consecutive successes $N_{\text{max}}$ is achieved. The objective is to maximize the expected cumulative reward over trajectories, expressed as:

$$J(\pi) = \mathbb{E}_{\tau \sim \pi} \left[ \sum_{t=0}^{T-1} \gamma^t R(\mathbf{s}_t, \mathbf{a}_t, \mathbf{g}_t) \right]$$

where $\pi$ is the policy, $\gamma$ is the discount factor, $R$ is the reward function, $\mathbf{s}_t$ is the state, $\mathbf{a}_t$ is the action, and $\mathbf{g}_t$ is the goal at time step $t$. This formulation allows us to apply reinforcement learning techniques to train policies for dexterous robotic hands.

Our method centers on a target position-guided approach, which introduces a target position $\mathbf{p}_{\text{target}}$ for the object during reorientation. This target position is designed based on principles inspired by human hand manipulation and the DoA distribution of dexterous robotic hands. The key idea is to guide the dexterous robotic hand to keep the object near this target position, thereby promoting more human-like and efficient manipulations. We propose three design principles for setting $\mathbf{p}_{\text{target}}$:

  1. Height Principle: The target position should be above the palm center to encourage lifting the object, reducing reliance on palm support and allowing fingers to operate more freely.
  2. Longitudinal Principle: The target position should be closer to the fingers (distal region) rather than the wrist (proximal region) to leverage finger dexterity and minimize drop risks.
  3. Transverse Principle: Based on the DoA distribution across the hand (from the little finger side to the thumb side), the target position should be biased toward regions with higher DoA to enhance manipulation capability. For dexterous robotic hands with balanced DoA, it is centered; for those with more DoA on the thumb side, it is shifted toward the thumb.
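The three principles can be sketched as offsets in a palm-centered frame. The frame convention and the offset magnitudes below are illustrative assumptions, not values from the paper:

```python
import numpy as np

def target_position(palm_center, height_offset=0.06,
                    longitudinal_offset=0.03, transverse_bias=0.0):
    """Compose a target position from the three design principles.

    Assumes a palm-centered frame with +Z up (height principle), +Y toward
    the fingertips (longitudinal principle), and +X toward the thumb
    (transverse principle). All offsets are hypothetical example values.
    """
    palm_center = np.asarray(palm_center, dtype=float)
    offset = np.array([transverse_bias, longitudinal_offset, height_offset])
    return palm_center + offset

# Balanced-DoA hand: centered transversely.
p_balanced = target_position(np.zeros(3))
# Thumb-heavy DoA distribution: bias the target toward the thumb side.
p_thumb_biased = target_position(np.zeros(3), transverse_bias=0.02)
```

The transverse bias is the only term that differs between hand designs; the height and longitudinal offsets apply to all three hands.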

These principles are applied to three types of dexterous robotic hands used in our study: the Shadow dexterous robotic hand (18 DoA), the BICE dexterous robotic hand (13 DoA), and the Schunk SVH dexterous robotic hand (9 DoA). Their parameters are summarized in Table 1.

Table 1: Parameter comparison of the three dexterous robotic hands.
| Dexterous Robotic Hand | Total DoF | Actuated DoA | DoA Configuration | Transmission | Opposable Thumb | Mass (kg) | Load (kg) |
|---|---|---|---|---|---|---|---|
| Shadow | 22 | 18 | Thumb: 5; other fingers: 3 each; palm: 1 | Tendon/rope | Yes | 4.3 | 5 |
| BICE | 20 | 13 | Thumb: 3; index: 3; middle: 3; ring: 2; little: 2 | Linkage, gears | Yes | 1 | 10 |
| Schunk SVH | 20 | 9 | Thumb: 2; index: 2; middle: 2; ring: 1; little: 1; palm: 1 | Linkage, gears | No | 1.3 | 0.85 |

The target position $\mathbf{p}_{\text{target}}$ is integrated into the reward function to guide the dexterous robotic hand. The immediate reward $r_{\text{step}}$ at each time step is composed of a target position reward $r_{\text{dist}}$ and a target orientation reward $r_{\text{rot}}$:

$$r_{\text{step}} = w_{\text{dist}} r_{\text{dist}} + w_{\text{rot}} r_{\text{rot}}$$

where $w_{\text{dist}} < 0$ and $w_{\text{rot}} > 0$ are weighting coefficients. The target position reward is defined as the Euclidean distance between the object’s current position $\mathbf{p}_{\text{obj}}$ and the target position $\mathbf{p}_{\text{target}}$, which the negative weight $w_{\text{dist}}$ turns into a penalty:

$$r_{\text{dist}} = \| \mathbf{p}_{\text{obj}} - \mathbf{p}_{\text{target}} \|_2$$

This encourages the dexterous robotic hand to maintain the object near $\mathbf{p}_{\text{target}}$. The orientation reward is based on the angular difference $\Delta \theta$:

$$r_{\text{rot}} = \frac{1}{|\Delta \theta| + 0.1}$$

which provides higher rewards as the object approaches the target orientation. Additionally, terminal rewards are given for success ($r_{\text{done}}$), object drop ($r_{\text{fall}}$), and exceeding maximum steps ($r_{\text{max length}}$). The values used in our simulations are: $w_{\text{dist}} = -10$, $w_{\text{rot}} = 1$, $r_{\text{done}} = 250$, $r_{\text{fall}} = -50$, and $r_{\text{max length}} = -25$.
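A compact sketch of the immediate reward, using the weights reported above and taking $r_{\text{dist}}$ as the unsigned distance so that the negative weight acts as a penalty:

```python
import numpy as np

# Weights reported in the text.
W_DIST, W_ROT = -10.0, 1.0

def step_reward(p_obj, p_target, delta_theta):
    """Immediate reward: weighted distance penalty plus orientation shaping.

    With w_dist = -10, the Euclidean distance to the target position enters
    as a penalty; the orientation term 1 / (|dtheta| + 0.1) grows as the
    object approaches the target orientation, peaking at 10 when dtheta = 0.
    """
    r_dist = np.linalg.norm(np.asarray(p_obj) - np.asarray(p_target))
    r_rot = 1.0 / (abs(delta_theta) + 0.1)
    return W_DIST * r_dist + W_ROT * r_rot
```

At the target position and orientation the step reward is 10; every centimeter of positional error subtracts 0.1 from it, so the two terms are on comparable scales.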

To further enhance the human-like nature of the policies, we design a reset state generation method for the dexterous robotic hand. Inspired by the preparatory state of human hands before manipulation, we sample joint positions during each reset to mimic slightly flexed fingers and an extended thumb. The joint position $\mathbf{q}$ is set as:

$$\mathbf{q} = \kappa \times [\mathbf{q}_{\text{min}} + \epsilon \times (\mathbf{q}_{\text{max}} - \mathbf{q}_{\text{min}})]$$

where $\mathbf{q}_{\text{min}}$ and $\mathbf{q}_{\text{max}}$ are joint limits, $\kappa = 0.2$ is a scaling factor, and $\epsilon$ is a random number uniformly sampled from $[0,1]$. This initialization promotes a ready-to-manipulate posture, reducing exploration complexity and improving training efficiency for the dexterous robotic hand.
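The reset sampling is straightforward to implement; a minimal NumPy version (function name ours) is:

```python
import numpy as np

def sample_reset_joints(q_min, q_max, kappa=0.2, rng=None):
    """Sample a slightly flexed preparatory posture at episode reset.

    Implements q = kappa * (q_min + eps * (q_max - q_min)) with
    eps ~ U[0, 1] drawn independently per joint, so each joint lands in the
    lower kappa fraction of its range.
    """
    rng = np.random.default_rng() if rng is None else rng
    q_min = np.asarray(q_min, dtype=float)
    q_max = np.asarray(q_max, dtype=float)
    eps = rng.uniform(0.0, 1.0, size=q_min.shape)
    return kappa * (q_min + eps * (q_max - q_min))
```

With $\kappa = 0.2$ and joint limits $[0, 1]$ rad, every sampled joint position falls in $[0, 0.2]$ rad, i.e., near but not at full extension.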

We train our policies using the Proximal Policy Optimization (PPO) algorithm with a Long Short-Term Memory (LSTM) network and an asymmetric actor-critic architecture. The actor (policy network) uses only partial observations that would be available in real-world deployment, such as joint positions, fingertip positions, and object pose relative to the target. The critic (value network) uses full state observations during training to improve learning efficiency. Both networks consist of an LSTM layer with 1024 neurons followed by a fully connected layer with 512 neurons and ReLU activation. The action space is continuous, representing absolute joint positions for the actuated DoA of the dexterous robotic hand, scaled to $[-1, 1]$ and smoothed using exponential moving average. The control frequency is set to 30 Hz.
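The paper states that actions are smoothed with an exponential moving average but the smoothing coefficient is not given; the sketch below uses an illustrative value of 0.3:

```python
import numpy as np

class ActionSmoother:
    """Exponential moving average over successive joint-position commands.

    smoothed_t = alpha * action_t + (1 - alpha) * smoothed_{t-1}.
    The coefficient alpha is an assumed example value, not from the paper.
    """
    def __init__(self, alpha=0.3):
        self.alpha = alpha
        self.prev = None

    def __call__(self, action):
        action = np.asarray(action, dtype=float)
        if self.prev is None:
            self.prev = action          # first command passes through
        else:
            self.prev = self.alpha * action + (1.0 - self.alpha) * self.prev
        return self.prev
```

At 30 Hz control, this filter damps step changes in the commanded joint positions, which reduces jitter without adding an explicit action-rate penalty to the reward.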

Our simulations are conducted in the NVIDIA Isaac Gym environment, which provides a high-performance GPU-based physics simulation. We model the three dexterous robotic hands and a cube object with side length 0.04 m and density 567 kg/m³. Self-collision detection is enabled for fingers and palm to prevent penetrations. The task parameters are: maximum consecutive successes $N_{\text{max}} = 50$, angular threshold $\bar{\theta} = 0.4$ rad, maximum episode duration $\tau_{\text{max}} = 8$ s, and maximum steps per episode $l_{\text{max}} = 160$. The object’s initial and target poses are randomly sampled from the full rotation space SO(3), and the initial object position is set above the palm with uniform noise in $[-0.01, 0.01]$ m.

We evaluate our target position-guided method on all three dexterous robotic hands. The training curves, showing cumulative reward and consecutive success counts over interaction steps, indicate that all hands learn effective reorientation policies. The Shadow dexterous robotic hand and BICE dexterous robotic hand achieve near-maximum performance with fewer samples compared to the SVH dexterous robotic hand, highlighting the impact of DoA on sample efficiency. For instance, after approximately $1.2 \times 10^9$ interaction steps, the SVH dexterous robotic hand reaches a reward of around 12,900 and consecutive success counts of 49.32, while the Shadow and BICE hands achieve similar performance with about one-third of the samples. This suggests that higher DoA in dexterous robotic hands can facilitate faster policy learning.

To quantify the benefits of target position guidance, we conduct ablation studies comparing our method with a baseline that excludes the target position reward (i.e., setting $w_{\text{dist}} = 0$). The results are summarized in Table 2, which shows the average consecutive success counts and average steps per episode during inference over 100 episodes with random seeds 42, 663, and 9049.

Table 2: Performance comparison with and without target position guidance for dexterous robotic hands.
| Dexterous Robotic Hand | Method | Average Consecutive Successes | Average Steps | Episodes Ended by Drop | Episodes Ended by Max Steps |
|---|---|---|---|---|---|
| Shadow | With Target Position | 48.56 | 1299.54 | 3 | 2 |
| Shadow | Without Target Position | 47.99 | 1614.28 | 4 | 2 |
| BICE | With Target Position | 48.71 | 1386.73 | 2 | 3 |
| BICE | Without Target Position | 45.94 | 1601.18 | 8 | 3 |
| SVH | With Target Position | 48.98 | 1401.23 | 4 | 0 |
| SVH | Without Target Position | 47.57 | 2121.31 | 2 | 7 |

The data clearly demonstrate that target position guidance improves performance across all dexterous robotic hands. Specifically, the average steps per episode are reduced by approximately 19.04% for the Shadow dexterous robotic hand, 13.31% for the BICE dexterous robotic hand, and 33.75% for the SVH dexterous robotic hand. This indicates that our method enhances operational efficiency, especially for low-DoA hands like the SVH dexterous robotic hand. Moreover, the consecutive success counts are higher and more stable with target position guidance, reducing the number of episodes ending due to drops or max steps. For example, the BICE dexterous robotic hand with guidance has only 2 episodes ending in drops compared to 8 without guidance, showcasing improved robustness.

We further analyze the object’s position during reorientation using Gaussian Kernel Density Estimation (KDE). The scatter density plots reveal that with target position guidance, the object is more likely to be found near the target position across height (Z-axis), longitudinal (Y-axis), and transverse (X-axis) directions. For instance, the high-probability regions (warm colors) align with the target position principles, confirming that the dexterous robotic hand actively adjusts the object’s location. Without guidance, the object tends to remain at lower heights and closer to the palm, leading to less efficient manipulations. This visual analysis underscores how target position guidance promotes human-like coordination in dexterous robotic hands.

The learned policies exhibit human-like characteristics: the dexterous robotic hand lifts the object above the palm, utilizes finger coordination for rotation, and biases manipulation toward higher-DoA regions. For the Shadow dexterous robotic hand, which has balanced DoA, the policy involves cooperation between the thumb and little finger. For the BICE dexterous robotic hand, with more DoA on the thumb side, the policy relies primarily on the thumb with other fingers assisting. For the SVH dexterous robotic hand, which has limited DoA, the policy uses thumb and little finger coordination, leveraging the palm joint for added dexterity. These behaviors align with our design principles and demonstrate the adaptability of our method to different dexterous robotic hand configurations.

In terms of mathematical formulation, our reward function can be extended to include additional terms for fine-tuning. For example, to further encourage smooth motions, we could add a penalty for large action changes. However, in this work, we focus on the core components of position and orientation rewards. The overall MDP framework for the dexterous robotic hand is defined by the state space $\mathcal{S}$, action space $\mathcal{A}$, and transition probability $\mathcal{P}$. The state includes joint angles, velocities, object pose, and target information, while actions are joint positions. The policy $\pi(\mathbf{a}_t | \mathbf{s}_t, \mathbf{g}_t)$ is optimized to maximize the expected return, with the advantage function estimated using generalized advantage estimation (GAE) in PPO.
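Generalized advantage estimation, mentioned above, is a backward recursion over the trajectory; a minimal single-trajectory sketch (function name ours) is:

```python
import numpy as np

def gae_advantages(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one trajectory.

    `values` has one extra entry: the bootstrap value of the state after
    the last step. `dones[t]` masks the bootstrap when the episode
    terminates at step t. Implements
      A_t = delta_t + gamma * lam * A_{t+1},
      delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).
    """
    T = len(rewards)
    adv = np.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        mask = 1.0 - float(dones[t])
        delta = rewards[t] + gamma * values[t + 1] * mask - values[t]
        last = delta + gamma * lam * mask * last
        adv[t] = last
    return adv
```

In practice this runs once per rollout over the 24-step segments collected from each parallel environment, with the critic supplying the value estimates.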

Our training setup involves distributed environments in Isaac Gym, with 4096 parallel environments for sample collection. The PPO hyperparameters include a clip range of 0.2, learning rate of $3 \times 10^{-4}$, entropy coefficient of 0.01, and GAE parameters $\lambda = 0.95$ and $\gamma = 0.99$. We use Adam optimizer and train for up to 5000 iterations, with each iteration comprising 24 steps per environment. The LSTM hidden states are reset at the start of each episode, allowing the policy to handle the sequential nature of the reorientation task. This configuration ensures stable learning for the dexterous robotic hand across millions of interactions.
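The clipped surrogate objective that these hyperparameters feed into can be sketched per sample as follows, using the stated clip range of 0.2:

```python
import numpy as np

def ppo_clip_loss(ratio, advantage, clip_range=0.2):
    """Per-sample PPO clipped surrogate loss (to be minimized).

    `ratio` is pi_new(a|s) / pi_old(a|s); the clipped term caps how much a
    single update can exploit a large probability ratio, stabilizing
    learning across the 4096 parallel environments.
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - clip_range, 1.0 + clip_range) * advantage
    return -np.minimum(unclipped, clipped)
```

With a positive advantage, ratios above 1.2 earn no extra objective; with a negative advantage, ratios below 0.8 incur no extra penalty reduction, which is the asymmetry that makes PPO updates conservative.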

To illustrate the reward structure in more detail, we break down the terminal rewards. When the object’s angular difference $\Delta \theta$ falls below $\bar{\theta}$, the episode receives a success reward $r_{\text{done}}$. If the object height drops below a threshold $h_{\text{min}}$, a drop penalty $r_{\text{fall}}$ is applied. If the step count exceeds $l_{\text{max}}$, a timeout penalty $r_{\text{max length}}$ is given. These terminal conditions are crucial for guiding the dexterous robotic hand toward successful outcomes and avoiding inefficient behaviors. The height threshold $h_{\text{min}}$ is set relative to the palm, e.g., 0.1 m below the initial drop height, to detect falls accurately.
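The three terminal conditions translate directly into a dispatch over the episode state; a minimal sketch with the reward values from the text (the function name and parameter defaults for $h_{\text{min}}$ are ours):

```python
def terminal_reward(delta_theta, obj_height, step_count,
                    theta_bar=0.4, h_min=0.0, l_max=160,
                    r_done=250.0, r_fall=-50.0, r_max_length=-25.0):
    """Terminal reward for the three episode-ending conditions.

    h_min is hand-specific (e.g., 0.1 m below the initial drop height)
    and is left as a parameter here.
    """
    if delta_theta < theta_bar:
        return r_done          # success: orientation within threshold
    if obj_height < h_min:
        return r_fall          # drop: object fell below the height threshold
    if step_count >= l_max:
        return r_max_length    # timeout: maximum episode length exceeded
    return 0.0                 # episode continues; only r_step applies
```

The ordering matters only at the margins; success is checked first so that a rotation completed on the final allowed step is still rewarded.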

Our method also addresses the challenge of exploration in high-dimensional spaces. By initializing the dexterous robotic hand in a human-like preparatory state and using target position guidance, we constrain the exploration to more promising regions. This is reflected in the faster convergence of reward curves compared to random exploration. Additionally, the asymmetric actor-critic architecture allows the critic to leverage full state information during training, such as contact forces and object velocities, which are not available to the actor. This accelerates learning without compromising the deployability of the policy on real dexterous robotic hands with limited sensing.

The generalization capability of our approach is noteworthy. While we train on a cube object, the principles of target position guidance could be extended to other shapes by adjusting the target position based on object geometry. For instance, for elongated objects, the target position might be shifted to balance the center of mass. However, in this study, we focus on a standard cube to validate the core method. Future work could explore multi-object training, as seen in prior research, to enhance the versatility of dexterous robotic hands.

We also analyze the impact of DoA on policy performance. The dexterous robotic hand with higher DoA, such as the Shadow hand, achieves higher reward peaks and more stable success counts earlier in training. This is quantified by the sample efficiency metrics: the Shadow dexterous robotic hand reaches a reward of 12,984.58 after about 327 million steps, while the SVH dexterous robotic hand requires over 1.1 billion steps to reach a similar reward. This underscores the importance of DoA in complex manipulation tasks and suggests that designers of dexterous robotic hands should consider sufficient actuation for efficient learning.

In conclusion, our target position-guided method enables dexterous robotic hands with varying DoA to perform in-hand reorientation efficiently and in a human-like manner. By integrating design principles based on DoA distribution, a human-inspired reset state, and a reinforcement learning framework with LSTM networks, we demonstrate significant improvements in success rates and operational efficiency. The method reduces the average steps per task by up to 33.75% for low-DoA hands and achieves near-maximum consecutive success counts across all hands. These findings highlight the potential of target position guidance to bridge the gap between high and low-DoA dexterous robotic hands, making them more practical for real-world applications. Future directions include extending the method to diverse object shapes, incorporating tactile feedback, and deploying policies on physical dexterous robotic hands to validate simulation results. As dexterous robotic hands continue to evolve, approaches like ours will be crucial for unlocking their full potential in robotics and automation.

The mathematical formulations and experimental results presented here provide a foundation for further research. For instance, the reward function can be optimized using automated reward shaping techniques, and the target position could be dynamically adjusted based on object properties. Moreover, the use of memory-based networks like LSTM is essential for handling the sequential dependencies in dexterous manipulation, as the dexterous robotic hand must remember past actions to coordinate future movements. We hope this work inspires more innovations in the control and learning of dexterous robotic hands, ultimately leading to more adaptive and capable robotic systems.

To summarize, the key equations in our method are:

1. Angular difference: $$\Delta \theta = 2 \times \arccos\left(\min\left(1, \left| \mathbf{q}_{\text{obj}} \cdot \mathbf{q}_{\text{target}} \right|\right)\right)$$

2. Immediate reward: $$r_{\text{step}} = w_{\text{dist}} r_{\text{dist}} + w_{\text{rot}} r_{\text{rot}}$$ with $$r_{\text{dist}} = \| \mathbf{p}_{\text{obj}} - \mathbf{p}_{\text{target}} \|_2$$ and $$r_{\text{rot}} = \frac{1}{|\Delta \theta| + 0.1}$$

3. Reset joint position: $$\mathbf{q} = \kappa \times [\mathbf{q}_{\text{min}} + \epsilon \times (\mathbf{q}_{\text{max}} - \mathbf{q}_{\text{min}})]$$

4. Objective function: $$J(\pi) = \mathbb{E}_{\tau \sim \pi} \left[ \sum_{t=0}^{T-1} \gamma^t R(\mathbf{s}_t, \mathbf{a}_t, \mathbf{g}_t) \right]$$

These elements collectively enable the dexterous robotic hand to learn effective reorientation policies. Through iterative training and ablation studies, we have shown that target position guidance is a powerful tool for enhancing the performance of dexterous robotic hands, regardless of their actuation limitations. As robotics advances, such methods will be integral to achieving human-level dexterity with artificial hands.
