In-hand Reorientation Control of Underactuated Dexterous Hand with MHSA-CL-PPO

The development of robotic end-effectors has progressively evolved from high integration and enhanced perception towards simplified designs and improved stability to meet market demands and increase commercialization potential. Among these, the underactuated dexterous hand represents a significant trend. By reducing the number of actuators through clever mechanical design, it achieves a favorable balance between system complexity, reliability, and cost. Compared to fully-actuated, high-degree-of-freedom (DoF) models, its underactuated principle is more conducive to miniaturization and simplifies maintenance. However, the inherent limitation in DoF often leads to insufficient precision and a constrained action space, making it challenging to perform dexterous manipulation tasks.

Dexterous in-hand manipulation, defined as the skillful change of an object’s pose through selective contact and force application, is a crucial frontier in robotics. A quintessential example of such manipulation is the in-hand reorientation or flipping of an object, a functional action ubiquitous in human daily life. This task presents several key challenges: (1) The high-dimensional state and action spaces lead to slow policy exploration. (2) The complex, non-linear motion trajectories arising from the interaction between the dexterous robotic hand and the object increase control difficulty. (3) There is a fundamental limitation in the fingertip workspace, especially pronounced for low-DoF underactuated hands. The question of how to achieve dexterous operations with limited DoF remains a significant hurdle.

Recent years have witnessed substantial progress in learning-based methods for dexterous manipulation. Pioneering work by research teams has demonstrated successful in-hand reorientation using model-free reinforcement learning (RL) on high-DoF hands, later transferred to physical systems. Other approaches have incorporated tactile sensing, monocular depth estimation, hierarchical policy architectures, and simulation-to-real adaptation to handle unknown objects and complex geometries. Nonetheless, these advanced methods primarily target dexterous robotic hands with a DoF count comparable to the human hand. Directly applying them to the underactuated dexterous hand for dexterous tasks may result in suboptimal performance due to its inherent constraints.

To address the specific challenges faced by the underactuated dexterous hand in reorientation tasks—namely sparse rewards, difficulty in learning effective policies, and low success rates—this paper proposes an enhanced policy optimization algorithm named MHSA-CL-PPO. This algorithm integrates Multi-headed Self-Attention (MHSA) and Curriculum Learning (CL) strategies into the Proximal Policy Optimization (PPO) framework. The MHSA module is incorporated into the policy network to produce a weighted, high-dimensional state representation, enabling the dexterous robotic hand to focus on state dimensions more critical to task progress. Concurrently, a CL strategy is employed to structure the reward function, progressively increasing task difficulty based on the object’s flip angle to mitigate the sparse reward problem and accelerate policy learning. The proposed method’s efficacy is validated through simulation experiments in the MuJoCo physics engine.

1. The MHSA-CL-PPO Algorithm

The core of the proposed MHSA-CL-PPO algorithm lies in its two key enhancements to the standard PPO framework, designed specifically to overcome the training difficulties associated with controlling an underactuated dexterous hand for in-hand reorientation.

1.1 Foundation: Proximal Policy Optimization (PPO)

PPO is a policy-gradient algorithm renowned for its stability and sample efficiency. Its primary objective is to maximize the expected discounted cumulative return by optimizing a parameterized policy $\pi_{\theta}(a|s)$:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}} \left[ \sum_{t=0}^{\infty} \gamma^{t} r_{t} \right]$$

where $\tau$ is a trajectory generated under policy $\pi_{\theta}$, $r_t$ is the immediate reward at time $t$, and $\gamma \in [0, 1]$ is the discount factor. PPO-Clip, a popular variant, maintains stability by limiting the size of policy updates through a clipped surrogate objective. It uses the importance sampling ratio $\rho_t(\theta) = \frac{\pi_{\theta}(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$ to reuse data collected under an older policy and employs Generalized Advantage Estimation (GAE) $\hat{A}_t$ to reduce variance:

$$\hat{A}_t = \sum_{l=0}^{\infty} (\gamma \lambda)^l \delta_{t+l}$$

with $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$. The PPO-Clip objective is then:

$$L^{CLIP}(\theta) = \mathbb{E}_t \left[ \min\left( \rho_t(\theta) \hat{A}_t, \text{clip}(\rho_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t \right) \right]$$

where $\epsilon$ is a hyperparameter that clips the probability ratio, preventing excessively large policy updates. The critic network is updated to minimize the Smooth L1 loss between its value predictions and the target returns.
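The GAE recursion and the clipped surrogate above can be sketched in a few lines of NumPy. The helper names `gae` and `ppo_clip_loss` are illustrative, not from the paper; the defaults match the hyperparameters reported later ($\gamma = 0.99$, $\lambda = 0.95$, $\epsilon = 0.15$).

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory.

    `values` carries one extra entry, V(s_T), used for bootstrapping.
    """
    T = len(rewards)
    adv = np.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        # delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # A_t = delta_t + (gamma * lambda) * A_{t+1}
        last = delta + gamma * lam * last
        adv[t] = last
    return adv

def ppo_clip_loss(ratio, adv, eps=0.15):
    """Negative PPO-Clip surrogate (a loss to be minimized)."""
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    return -np.mean(np.minimum(unclipped, clipped))
```

Note that for a positive advantage, a ratio of 2.0 is clipped to $1 + \epsilon = 1.15$, which is exactly what prevents excessively large policy updates.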

1.2 Enhancement I: Multi-headed Self-Attention (MHSA)

The in-hand reorientation task suffers from severe reward sparsity: substantial feedback arrives only upon task completion. To help the policy network discern which parts of the high-dimensional state are most relevant to task progress, we integrate a Multi-headed Self-Attention layer between the input normalization and the first hidden layer of the policy network.

The self-attention mechanism computes a weighted representation of the input sequence. For an input state vector $s$, it is first projected into Query ($Q$), Key ($K$), and Value ($V$) vectors using learnable weight matrices $W^Q$, $W^K$, and $W^V$:

$$Q = sW^Q, \quad K = sW^K, \quad V = sW^V$$

The attention scores are computed as the scaled dot-product of $Q$ and $K$, followed by a softmax normalization to obtain the attention weights $\alpha$:

$$\alpha = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)$$

where $d_k$ is the dimension of the key vectors. The output is the weighted sum of the value vectors: $z = \alpha V$.

Multi-headed attention extends this by performing the operation in parallel over $h$ different learned projection subspaces. Each head $i$ has its own matrices $W_i^Q, W_i^K, W_i^V$ and produces an output $z_i$. The final attention output is the concatenation of all heads’ outputs projected by a final weight matrix $W^O$:

$$\text{MHSA}(s) = \text{Concat}(z_1, z_2, \ldots, z_h) W^O$$

This allows the dexterous robotic hand’s policy network to jointly attend to information from different representation subspaces, effectively creating a context-aware, weighted state representation that highlights features critical for the current stage of the reorientation task.
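The computation in Equations above can be sketched in NumPy. The paper does not specify how the 27-dimensional state is arranged for attention, so the (tokens, $d_{\text{model}}$) layout below is an assumption for illustration; the head count $h = 3$ and $d_{\text{model}} = 27$ match the settings used here.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mhsa(s, Wq, Wk, Wv, Wo, h):
    """Multi-headed self-attention over a (tokens, d_model) input.

    Wq, Wk, Wv, Wo are (d_model, d_model) matrices; d_model must divide by h.
    Each head attends in its own d_k = d_model / h subspace.
    """
    n, d = s.shape
    dk = d // h
    Q, K, V = s @ Wq, s @ Wk, s @ Wv          # Q = s W^Q, etc.
    heads = []
    for i in range(h):
        q, k, v = (M[:, i * dk:(i + 1) * dk] for M in (Q, K, V))
        alpha = softmax(q @ k.T / np.sqrt(dk))  # scaled dot-product weights
        heads.append(alpha @ v)                 # z_i = alpha V
    return np.concatenate(heads, axis=-1) @ Wo  # Concat(z_1..z_h) W^O
```

A production implementation would typically use a framework primitive (e.g. an attention layer with learnable projections) rather than explicit loops, but the arithmetic is the same.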

1.3 Enhancement II: Curriculum Learning (CL) Strategy

To further combat the sparse reward problem and guide the underactuated dexterous hand through progressively more challenging stages, a Curriculum Learning strategy is implemented. The core idea is to define the task success criterion not as a single, final goal but as a series of progressively stricter sub-goals.

The task progress is quantified by the flip alignment $\Delta_{\text{align}} \in [0,1]$, which measures how well the target object’s task-facing normal vector aligns with the global upward direction $[0,0,1]$:

$$\Delta_{\text{align}} = \frac{1}{2}\left(\text{dot}(\hat{n}, [0,0,1]) + 1\right)$$

where $\hat{n}$ is the current normal vector of the task face. A value of 1 indicates perfect alignment (face up), and 0 indicates the face is down.
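The alignment measure is a one-line computation; the function name below is illustrative.

```python
import numpy as np

def flip_alignment(n_hat):
    """Map the task face's unit normal onto [0, 1].

    Returns 1.0 when the face points straight up ([0, 0, 1]) and
    0.0 when it points straight down.
    """
    return 0.5 * (np.dot(n_hat, np.array([0.0, 0.0, 1.0])) + 1.0)
```

A face lying on its side (normal orthogonal to the vertical) yields exactly 0.5, which is why the process reward later activates only above that value.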

The curriculum is divided into three phases: Initial, Intermediate, and Final. The algorithm transitions between phases based on the moving average success rate $rate_{CL}$ over recent episodes. The target alignment threshold $\Delta_{\text{align}}^{\text{target}}$ for each phase is defined progressively:

$$
\Delta_{\text{align}}^{\text{target}} =
\begin{cases}
0.7, & \text{Initial Phase} \\
\min(0.8, 0.7 + 0.1 \times (n\_epi / 200)), & \text{Intermediate Phase} \\
\min(0.9, 0.8 + 0.1 \times (n\_epi / 200)), & \text{Final Phase}
\end{cases}
$$

where $n\_epi$ is the number of episodes completed within the current phase. This design allows for a smooth increase in difficulty, enabling the dexterous robotic hand to first learn basic flipping mechanics before refining its control for precise final alignment.
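The piecewise threshold schedule above translates directly to code; the phase labels and function name are illustrative choices, not identifiers from the paper.

```python
def target_alignment(phase, n_epi):
    """Target alignment threshold for the current curriculum phase.

    n_epi is the number of episodes completed within that phase.
    """
    if phase == "initial":
        return 0.7
    if phase == "intermediate":
        return min(0.8, 0.7 + 0.1 * (n_epi / 200))
    # final phase
    return min(0.9, 0.8 + 0.1 * (n_epi / 200))
```

Each phase thus ramps its threshold by 0.1 over 200 episodes and then saturates, so the difficulty increase is smooth within a phase as well as across phase transitions.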

2. Network Architecture and Training Setup

2.1 Policy and Critic Network Design

The agent utilizes two neural networks: a policy network ($\pi_{\theta}$) and a critic network ($V_{\phi}$). The policy network maps states to actions, while the critic network estimates the state-value function.

Policy Network ($\pi_{\theta}$): The 27-dimensional state observation is first normalized. It then passes through the MHSA layer to obtain an attentive state representation. This is followed by a fully-connected (FC) layer with ReLU activation. The network branches into two output heads: one producing the mean $\mu$ (via a Tanh activation) and the other producing the standard deviation $\sigma$ (via a Softplus activation) for a diagonal Gaussian action distribution. The final action is sampled using the reparameterization trick: $a = \mu + \sigma \cdot \xi$, where $\xi \sim \mathcal{N}(0, I)$.

Critic Network ($V_{\phi}$): The state is normalized and processed through two FC layers with ReLU activations to output a scalar state-value prediction.

Table 1: Policy Network Architecture

| Network Layer | Input/Output Dimension |
| --- | --- |
| Input Normalization | (27, 27) |
| Multi-headed Self-Attention | (27, 27) |
| Fully Connected (ReLU) | (27, 256) |
| Mean Output Head (Tanh) | (256, 12) |
| Std Output Head (Softplus) | (256, 12) |

Table 2: Critic Network Architecture

| Network Layer | Input/Output Dimension |
| --- | --- |
| Input Normalization | (27, 27) |
| Fully Connected (ReLU) | (27, 128) |
| Fully Connected (ReLU) | (128, 256) |
| Value Output | (256, 1) |
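The policy head and the reparameterized sampling step can be sketched as a plain NumPy forward pass. This is a minimal sketch under stated assumptions: the MHSA layer is omitted for brevity, the parameter names (`W1`, `Wmu`, `Wsig`, ...) are hypothetical, and normalization is assumed to have already been applied to the input.

```python
import numpy as np

relu = lambda x: np.maximum(x, 0.0)
softplus = lambda x: np.logaddexp(0.0, x)  # log(1 + e^x), overflow-safe

def policy_forward(s27, params):
    """Forward pass of the Table 1 heads (MHSA layer omitted here)."""
    h = relu(s27 @ params["W1"] + params["b1"])            # (27,) -> (256,)
    mu = np.tanh(h @ params["Wmu"] + params["bmu"])        # mean head, (12,)
    sigma = softplus(h @ params["Wsig"] + params["bsig"])  # std head, (12,), > 0
    return mu, sigma

def sample_action(mu, sigma, rng):
    """Reparameterization trick: a = mu + sigma * xi, xi ~ N(0, I)."""
    return mu + sigma * rng.standard_normal(mu.shape)
```

Tanh bounds the mean in $[-1, 1]$ to match the scaled action space, while Softplus guarantees a strictly positive standard deviation for the diagonal Gaussian.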

2.2 State, Action, and Reward

State Space (27-dim): Comprises the flip alignment $\Delta_{\text{align}}$ (1), a drop flag (1), the positions, velocities, and torques of the six joints of the underactuated dexterous hand (18), and the object's pose as a 3D position plus a unit quaternion (7).

Action Space (12-dim): Consists of target position commands (scaled to $[-1, 1]$) and velocity commands for the six controllable joints of the dexterous robotic hand.

Reward Function: The reward is designed to guide learning based on the CL strategy.
$$
r_t = r_{\text{align}} + r_{\text{potential}} + r_{\text{terminal}}
$$
where the process rewards are:
$$
r_{\text{align}} =
\begin{cases}
0.05 \times \Delta_{\text{align}}, & \text{if } \Delta_{\text{align}} > 0.5 \\
0, & \text{otherwise}
\end{cases}, \quad r_{\text{potential}} = \lambda_P \times (\Delta_{\text{align}} - \Delta_{\text{align}}^{\text{pre}})
$$
and the terminal rewards are $r_{\text{success}} = +70$ for final success ($\Delta_{\text{align}} > 0.9$) and $r_{\text{dropped}} = -10$ for object drop. $\lambda_P$ is a scaling factor for the alignment improvement.
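Putting the pieces together, the per-step reward can be sketched as below. The function name is illustrative; the fixed success threshold of 0.9 corresponds to the final-phase criterion, and during training the phase-dependent CL threshold of Section 1.3 would be substituted for it.

```python
def step_reward(align, align_prev, dropped, lam_p=8.0, success_thresh=0.9):
    """Per-step reward: alignment shaping + potential term + terminal bonuses."""
    # Process reward: only active once the face is past horizontal (> 0.5).
    r_align = 0.05 * align if align > 0.5 else 0.0
    # Potential reward: scaled improvement over the previous step's alignment.
    r_pot = lam_p * (align - align_prev)
    r = r_align + r_pot
    if align > success_thresh:
        r += 70.0   # final success bonus
    if dropped:
        r -= 10.0   # object-drop penalty
    return r
```

Because the potential term is a scaled difference, it pays out the total alignment gain over an episode regardless of how it is split across steps, while the drop penalty discourages risky contacts.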

Table 3: Key Hyperparameter Settings

| Hyperparameter | Value |
| --- | --- |
| Training epochs per batch ($K$) | 8 |
| Mini-batch size ($B$) | 32 |
| Policy learning rate ($\alpha_{\pi}$) | 5e-5 |
| Critic learning rate ($\alpha_{V}$) | 1e-4 |
| Discount factor ($\gamma$) | 0.99 |
| GAE parameter ($\lambda$) | 0.95 |
| PPO clip range ($\epsilon$) | 0.15 |
| Number of attention heads ($h$) | 3 |
| Potential reward scale ($\lambda_P$) | 8 |

3. Experimental Results and Analysis

The proposed algorithm is evaluated in a MuJoCo simulation environment featuring a 6-DoF underactuated dexterous hand attached to a fixed base. The task object is a 2 cm cube with labeled faces. The objective is to perform a 90° flip, reorienting the cube from an initial state (face ‘A’ up) to a goal state (face ‘E’ up), corresponding to $\Delta_{\text{align}} > 0.9$.

3.1 Performance during Training

The learning curves, plotting the average episode reward against training iterations, demonstrate the effectiveness of the proposed enhancements. The baseline PPO algorithm and a comparable DDPG agent struggle to achieve high rewards due to the sparse reward and task complexity. The MHSA-PPO variant (adding only attention) and the CL-PPO variant (adding only curriculum learning) both show improved learning efficiency and final performance over the baselines. Notably, the full MHSA-CL-PPO algorithm, combining both enhancements, converges to the highest average reward, indicating it successfully learns a more effective policy for the dexterous robotic hand.

3.2 Evaluation of Learned Policies

The trained policies were evaluated over 10,000 test episodes. The success rate, defined as the percentage of episodes where the cube was successfully flipped to the target orientation without being dropped, serves as the primary performance metric.

Table 4: Success Rate Comparison (10,000 Episodes)

| Algorithm | Success Rate (%) |
| --- | --- |
| DDPG (Baseline) | 6.45 |
| PPO (Baseline) | 40.57 |
| MHSA-PPO | 48.39 |
| CL-PPO | 45.01 |
| MHSA-CL-PPO (Proposed) | 82.31 |

The results in Table 4 clearly show that the proposed MHSA-CL-PPO algorithm significantly outperforms all baseline and ablated versions. It achieves a success rate of 82.31%, which is more than double that of the standard PPO baseline and a drastic improvement over DDPG. This confirms that both the attention mechanism and the curriculum learning strategy contribute synergistically to solving the challenging reorientation task with an underactuated dexterous hand.

Further analysis of the efficiency of the learned policy is provided in Table 5, which shows the average and minimum number of time steps required to complete the task in successful episodes. A lower number indicates a faster, more efficient policy.

Table 5: Task Completion Efficiency

| Algorithm | Average Steps | Minimum Steps |
| --- | --- | --- |
| DDPG | 44.2 | 24 |
| PPO | 44.6 | 24 |
| MHSA-PPO | 41.4 | 21 |
| CL-PPO | 43.2 | 23 |
| MHSA-CL-PPO (Proposed) | 40.8 | 20 |

The proposed MHSA-CL-PPO not only succeeds more often but also completes the task faster on average, and it discovered a policy that achieves the goal in as few as 20 steps, the best among all tested algorithms. This demonstrates that the algorithm has learned a more direct and efficient manipulation strategy for the dexterous robotic hand.

4. Conclusion

This work addresses the significant challenge of enabling an underactuated dexterous hand to perform dexterous in-hand reorientation. The proposed MHSA-CL-PPO algorithm innovatively combines a Multi-headed Self-Attention mechanism with a Curriculum Learning strategy within the PPO framework. The MHSA module allows the policy network to dynamically focus on the most task-relevant state features, while the CL strategy provides structured, incremental learning goals to overcome sparse rewards. Extensive simulation experiments lead to the following conclusions:

  1. The MHSA-CL-PPO algorithm converges to a higher average reward during training, enabling the underactuated dexterous hand to discover more rewarding action sequences.
  2. The policies learned by MHSA-CL-PPO are more efficient, requiring fewer time steps on average to complete the reorientation task compared to baseline methods.
  3. Most importantly, MHSA-CL-PPO achieves a dramatically higher task success rate of 82.31% in rigorous testing, more than doubling the performance of the standard PPO baseline and decisively outperforming other variants.

These results validate the effectiveness of the proposed algorithmic enhancements and, crucially, demonstrate the feasibility of using a low-DoF underactuated dexterous hand for complex dexterous manipulation tasks like in-hand reorientation. This opens promising avenues for deploying more affordable, robust, and yet capable robotic hands in real-world applications.
