Vision-based Teleoperation System for Dexterous Robotic Hands via Motion Retargeting

In recent years, the rapid advancement of artificial intelligence and robotics has unveiled immense potential for dexterous robotic hands in fields such as precision assembly, medical surgery, and social services. As end-effectors with high degrees of freedom and biomimetic structures, dexterous robotic hands can mimic human hands to perform complex manipulation tasks, serving as a crucial platform for achieving fine robotic operations and human-robot interaction. The development of dexterous robotic hands not only reflects the trend toward structural biomimicry and control intelligence in robotics but also signifies the evolution of intelligent robots toward greater autonomy and multi-task adaptability.

However, controlling dexterous robotic hands with high precision remains a significant challenge, especially in unstructured environments. Traditional control methods often fail to replicate the complex behaviors of human hands, such as multi-degree-of-freedom coordination and flexible adjustments, due to inherent differences in kinematics, size, and flexibility between human hands and dexterous robotic hands. To address this, teleoperation techniques have emerged, allowing operators to remotely control robots in real time, combining human intelligence with robotic execution to enhance adaptability and operational flexibility in unstructured settings.

Current teleoperation systems primarily rely on wearable sensor devices like data gloves, motion capture systems, or virtual reality (VR) equipment for hand pose acquisition. While these devices offer high accuracy, they are often expensive, cumbersome, and restrictive, limiting the operator’s hand freedom and flexibility. Moreover, direct mapping between human hand poses and dexterous robotic hand poses suffers from inaccuracies due to kinematic disparities, reducing action naturalness and success rates. To overcome these limitations, I focus on vision-based teleoperation, which uses standard or depth cameras to achieve high action recognition accuracy at low cost. Additionally, to tackle the heterogeneity between human and robotic hands, I design an optimization objective function that combines vector angle and distance errors, enabling natural and precise pose mapping.

In this article, I present a vision-based teleoperation system for dexterous robotic hands using motion retargeting. This system integrates three main modules: hand keypoint detection, hand pose retargeting, and robotic action generation. It leverages RGB cameras for high-precision tracking of operator hand keypoints and poses, and incorporates environmental information for real-time interaction. A retargeting optimization algorithm based on task-space consistency is proposed to map human hand motions to dexterous robotic hand actions effectively. This algorithm relies solely on objective function optimization without requiring neural network training, ensuring good mapping performance. Through integration with other functional modules, the system can execute various grasping tasks with high precision, demonstrating strong control stability and task adaptability.

The core of this system lies in the hand pose retargeting algorithm, which translates human hand keypoints into joint angles for the dexterous robotic hand. Traditional kinematic retargeting methods often struggle with fine grasping due to structural differences. For instance, some approaches match joint positions between human and robotic hands using solvers like BioIK, but they are not suitable for delicate operations. Others employ unsupervised deep recurrent neural networks for motion mapping across different skeletal structures, yet their effectiveness in real-world scenarios is limited.

To address these issues, I prioritize the fingertip regions during retargeting, as they are critical for grasping and fine manipulation tasks. This is supported by metrics such as contact frequency, neural distribution density, and controllability in fine hand manipulations. The optimization focuses on minimizing errors in the task space, specifically the angle and distance between vectors representing fingertip relationships. The retargeting objective function is defined as follows:

$$ C(\mathbf{q}_h, \mathbf{q}_a) = \sum_{i=0}^{N} \left[ w(d_i) \left\| \mathbf{r}_i(\mathbf{q}_a) - f(d_i)\, \mathbf{r}_i^*(\mathbf{q}_h) \right\|^2 + \lambda\, \theta\big(\mathbf{r}_i(\mathbf{q}_a), \mathbf{r}_i^*(\mathbf{q}_h)\big) \right] + \gamma \|\mathbf{q}_a\|^2 $$

In this equation, \(\mathbf{q}_h\) represents the joint angles of the human hand, and \(\mathbf{q}_a\) represents the joint angles of the dexterous robotic hand. The term \(\mathbf{r}_i\) is a 3D task-space vector pointing from the origin of one keypoint coordinate frame to another (e.g., from a fingertip to the thumb tip), where \(\mathbf{r}_i(\mathbf{q}_a)\) is computed from the dexterous robotic hand joint angles and \(\mathbf{r}_i^*(\mathbf{q}_h)\) is the normalized direction vector derived from the human hand joint angles. The function \(f(d_i)\) is a dynamic scaling function of the reference vector magnitude \(d_i\), and \(\theta(\mathbf{r}_i(\mathbf{q}_a), \mathbf{r}_i^*(\mathbf{q}_h))\) denotes the angle between the reference vector and the dexterous robotic hand vector, with smaller angles indicating better directional alignment. The parameter \(\lambda\) weights the angle error, \(w(d_i)\) is a dynamic weight function for the distance error, and \(\gamma \|\mathbf{q}_a\|^2\) is a regularization term that reduces solution redundancy and prevents abnormal poses in the dexterous robotic hand.

This objective function jointly optimizes angle and distance errors between reference and dexterous robotic hand vectors, incorporating dynamic scaling and weighting mechanisms. By optimizing this function, appropriate joint angles for the dexterous robotic hand are derived from human hand keypoint inputs, which are then sent to the action generation module to drive joint motors, replicating human-like motions.
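To make this concrete, below is a minimal sketch of the objective in Python, under stated assumptions: `fk_vectors` is a hypothetical forward-kinematics helper that maps Allegro joint angles to the robot's task-space vectors, the per-vector weights \(w(d_i)\) and scales \(f(d_i)\) are precomputed from the human observation, and the \(\lambda\) and \(\gamma\) values are illustrative rather than taken from this work.

```python
# Minimal sketch of the retargeting objective; fk_vectors is a hypothetical
# forward-kinematics helper returning the robot's task-space vectors r_i(q_a).
import numpy as np
from scipy.optimize import minimize

def make_retarget_cost(fk_vectors, r_ref, weights, scales, lam=1e-2, gamma=1e-3):
    """Build C(.) for one frame: r_ref holds the normalized human vectors
    r_i*(q_h); weights[i] = w(d_i) and scales[i] = f(d_i) are precomputed."""
    def cost(q_a):
        r_a = fk_vectors(q_a)                          # robot vectors, shape (N, 3)
        total = gamma * float(q_a @ q_a)               # regularizer vs. abnormal poses
        for r, r_star, w_i, f_i in zip(r_a, r_ref, weights, scales):
            diff = r - f_i * r_star                    # task-space distance error
            # angle error; r_star is unit-norm, so only |r| needs normalizing
            cos = np.clip((r @ r_star) / (np.linalg.norm(r) + 1e-9), -1.0, 1.0)
            total += w_i * float(diff @ diff) + lam * float(np.arccos(cos))
        return total
    return cost

# Usage: warm-start each frame from the previous solution for temporal smoothness.
# q_a = minimize(make_retarget_cost(fk, r_ref, w, f), q_prev, method="SLSQP").x
```

Warm-starting the optimizer from the previous frame's solution is a natural choice here, since consecutive frames differ only slightly and this keeps the resulting joint trajectories smooth.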

To efficiently control the dexterous robotic hand, the vector space is divided into two subspaces, \(S_1\) and \(S_2\), focusing on fingertip control for grasping and fine manipulation. The definitions of these subspaces are summarized in Table 1.

Table 1: Definitions of Subspaces \(S_1\) and \(S_2\)

| Subspace | Definition |
| --- | --- |
| \(S_1\) | Vectors from the primary fingers (all fingers excluding the thumb) pointing to the thumb, representing spatial relationships critical for pinch grasps. |
| \(S_2\) | Vectors between primary fingers pointing to each other, restricted to fingers whose distance to the thumb is below a threshold; important for multi-finger coordination. |

Subspace \(S_1\) captures the spatial relationship between primary fingers and the thumb, which is essential for precise pinch grasps. Optimizing vectors in \(S_1\) enhances contact accuracy, improving action stability and success rates. Subspace \(S_2\) addresses multi-finger coordination, ensuring proper relative positions and directions among fingers during complex manipulations to avoid errors and enhance stability.
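To make the partitioning concrete, here is a minimal sketch of how the two subspaces can be assembled from detected keypoints. It assumes MediaPipe's 21-keypoint layout (fingertip indices 4, 8, 12, 16, 20, used later by the detection module), maps the Allegro Hand's three non-thumb fingers to the index, middle, and ring fingertips, and applies the \(S_2\) thumb-distance test to the source finger; the mapping and the threshold `eps` are illustrative assumptions.

```python
# Minimal sketch: build S1/S2 task vectors from 21 hand keypoints.
from itertools import combinations
import numpy as np

THUMB_TIP = 4
PRIMARY_TIPS = (8, 12, 16)  # index, middle, ring fingertips (assumed mapping)

def build_task_vectors(kp, eps=0.03):
    """Return (vector, magnitude, subspace) triples from keypoints kp, shape (21, 3)."""
    vecs = []
    for tip in PRIMARY_TIPS:
        v = kp[THUMB_TIP] - kp[tip]              # S1: primary finger -> thumb
        vecs.append((v, float(np.linalg.norm(v)), "S1"))
    for a, b in combinations(PRIMARY_TIPS, 2):
        # S2 membership: source fingertip close to the thumb (assumed rule)
        if np.linalg.norm(kp[THUMB_TIP] - kp[a]) <= eps:
            v = kp[b] - kp[a]                    # S2: primary finger -> primary finger
            vecs.append((v, float(np.linalg.norm(v)), "S2"))
    return vecs
```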

The joint optimization strategy for vector angle and distance errors is crucial for achieving task-space consistency between human and dexterous robotic hands. Solely minimizing distance or angle errors often fails to preserve natural motion coherence. The angle error \(\theta\) measures directional differences, ensuring that the dexterous robotic hand maintains a natural posture aligned with human intent, which is vital for fluid and intuitive movements. The distance error \(\| \mathbf{r}_i(\mathbf{q}_a) - f(d_i) \mathbf{r}_i^*(\mathbf{q}_h) \|\) focuses on positional accuracy, critical for precise fingertip placement in tasks like grasping and contact. By combining these errors, the system balances naturalness and precision, enabling the dexterous robotic hand to perform actions that are both human-like and effective.

Dynamic scaling and weight functions are introduced to adapt to varying task demands and enhance optimization performance. The dynamic scaling function \(f(d_i)\) adjusts reference vector magnitudes based on fingertip distances, defined as:

$$ f(d_i) = \begin{cases}
\beta d_i & d_i > \varepsilon \\
s(d_i) \cdot d_i & d_i \leq \varepsilon \ \text{and} \ \mathbf{r}_i(\mathbf{q}_h) \in S_1 \\
\eta & d_i \leq \varepsilon \ \text{and} \ \mathbf{r}_i(\mathbf{q}_h) \in S_2
\end{cases} $$

Here, \(\beta\) is a linear scaling factor (set to 1.6 in this work), \(\varepsilon\) is a distance threshold, and \(\eta\) is a constant for fine control. For \(d_i \leq \varepsilon\), a dynamic scaling factor \(s(d_i)\) based on a Sigmoid function ensures smooth transitions:

$$ s(d_i) = s_{\text{min}} + (s_{\text{max}} - s_{\text{min}}) \cdot \frac{1}{1 + e^{-k(d_i - c)}} $$

Here, \(s_{\text{min}}\) and \(s_{\text{max}}\) are the lower and upper bounds, \(k\) controls the steepness, and \(c\) is the center point. This design prevents abrupt scale changes, promoting natural motion generation. For instance, with \(s_{\text{min}} = 0\) and \(s_{\text{max}} = 1\), \(s(d_i)\) ramps smoothly from 0 to 1 as \(d_i\) increases through the center point \(c\).
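As a concrete sketch, both functions can be implemented directly from their definitions. The text fixes \(\beta = 1.6\); the values of \(\varepsilon\), \(\eta\), \(k\), and \(c\) below are illustrative placeholders, not values reported in this work.

```python
import numpy as np

def s(d, s_min=0.0, s_max=1.0, k=200.0, c=0.015):
    """Sigmoid scaling factor: smooth ramp from s_min to s_max centered at c
    (k and c are assumed values)."""
    return s_min + (s_max - s_min) / (1.0 + np.exp(-k * (d - c)))

def f(d, subspace, beta=1.6, eps=0.03, eta=0.01):
    """Dynamic scale f(d_i): linear when fingertips are far apart, sigmoid-damped
    for S1 vectors, and a constant eta for S2 vectors (eps, eta assumed)."""
    if d > eps:
        return beta * d
    return s(d) * d if subspace == "S1" else eta
```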

The weight function \(w(d_i)\) assigns priorities based on fingertip proximity and subspace membership, formulated as:

$$ w(d_i) = \begin{cases}
1 & d_i > \varepsilon \\
200 & d_i \leq \varepsilon \ \text{and} \ \mathbf{r}_i(\mathbf{q}_h) \in S_1 \\
400 & d_i \leq \varepsilon \ \text{and} \ \mathbf{r}_i(\mathbf{q}_h) \in S_2
\end{cases} $$

This piecewise function emphasizes critical interactions: when fingertips are far apart (\(d_i > \varepsilon\)), a low weight of 1 is used; when they are close and in \(S_1\), the weight increases to 200 to prioritize pinch grasps; and when in \(S_2\) with close proximity, the weight peaks at 400 to focus on multi-finger coordination. This weighting scheme ensures that the optimization prioritizes fingertip regions, enhancing precision for key actions while tolerating minor deviations in less critical areas like the wrist or palm.
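A direct implementation of the weight function follows; subspace membership is passed in explicitly, since it is only implicit in the notation \(w(d_i)\), and the threshold `eps` is again an assumed value.

```python
def w(d, subspace, eps=0.03):
    """Piecewise weight: distant fingertips get weight 1; close S1 vectors 200;
    close S2 vectors 400, so fingertip-region errors dominate the optimization."""
    if d > eps:
        return 1.0
    return 200.0 if subspace == "S1" else 400.0
```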

Building on this algorithm, I develop a vision-based teleoperation system for dexterous robotic hands using motion retargeting. The system adopts a modular design for generality, comprising three main modules: hand keypoint detection, hand pose retargeting, and robotic action generation. This architecture leverages the low-cost, low-restraint advantages of visual input and the high-speed, low-latency benefits of the retargeting optimization algorithm.

The hand keypoint detection module utilizes the lightweight MediaPipe framework to track 21 hand keypoints in real time from monocular RGB images. This module operates without wearable devices, relying solely on standard RGB cameras, and balances detection accuracy with computational efficiency, meeting the low-latency and high-continuity requirements of teleoperation. In experiments, it detected hand keypoints reliably from live camera input.
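As a minimal sketch of this module, the snippet below reads frames from a webcam (camera index 0 is an assumption) and extracts the 21 normalized keypoints with MediaPipe's Hands solution:

```python
# Minimal detection-module sketch using MediaPipe Hands on a standard RGB camera.
import cv2
import mediapipe as mp
import numpy as np

hands = mp.solutions.hands.Hands(max_num_hands=1, min_detection_confidence=0.7)
cap = cv2.VideoCapture(0)  # assumed camera index

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    # MediaPipe expects RGB; OpenCV delivers BGR
    result = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if result.multi_hand_landmarks:
        lm = result.multi_hand_landmarks[0].landmark
        kp = np.array([[p.x, p.y, p.z] for p in lm])  # (21, 3) normalized keypoints
        # kp is handed to the retargeting module each frame

cap.release()
```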

The hand pose retargeting module is the core, translating human hand keypoints into joint angles for the dexterous robotic hand. It addresses challenges from differences in degrees of freedom (DoF), size, and motion constraints by employing the task-space consistency optimization strategy described earlier. The module incorporates vector space partitioning and dynamic weighting to prioritize fingertip actions, enhancing control precision in fine manipulation scenarios. The dynamic scaling function further smooths reference vector adjustments, reducing motion jerk and trajectory oscillation.

The robotic action generation module converts pose information into executable trajectories for the dexterous robotic hand. To achieve high-precision, low-latency, and collision-free motion control, I use the GPU-accelerated motion optimization library CuRobo as the core engine. CuRobo employs gradient-based continuous trajectory optimization, avoiding discrete sampling or path stitching that can cause unnatural motions. By leveraging GPU parallel computing, it enhances real-time evaluation and optimization capabilities, ensuring responsive and robust teleoperation.

This loosely coupled design maintains independence between modules, making the system versatile and easy to deploy. It supports bare-hand input, real-time response, and adaptability to different hardware, offering a practical solution for controlling dexterous robotic hands in diverse applications.
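To show how the loose coupling plays out at runtime, here is a hedged sketch of the per-frame control loop. The callables `detect_keypoints`, `retarget`, and `execute` are hypothetical stand-ins for the three modules (in the real system, execution is backed by CuRobo); injecting them as parameters mirrors the modular, hardware-agnostic design.

```python
import numpy as np

def teleop_loop(frames, detect_keypoints, retarget, execute, q_init=None):
    """Per-frame pipeline: detection -> retargeting -> action generation.
    Each stage is an injected callable, mirroring the loosely coupled design."""
    q = np.zeros(16) if q_init is None else q_init   # 16-DoF Allegro joint state
    for frame in frames:
        kp = detect_keypoints(frame)                 # module 1: (21, 3) keypoints or None
        if kp is None:
            continue                                 # tolerate dropped detections
        q = retarget(kp, q)                          # module 2: warm-started optimization
        execute(q)                                   # module 3: trajectory + motor commands
```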

To comprehensively evaluate the system’s performance, I design experiments involving three categories of typical tasks based on the Allegro four-fingered dexterous robotic hand: four-finger cooperative grasping, three-finger cooperative grasping, and two-finger precision pinching. Each category varies in manipulation complexity and grasping methods, using objects of different shapes and materials to test control accuracy and adaptability. The tasks span a range from high-DoF cooperative grasping to low-DoF fine pinching, providing a thorough assessment of stability, operational fluency, and retargeting precision.

The experimental setup includes an Intel RealSense D435 RGB-D camera for input, fixed in front of the operator to capture hand movements. Lighting is controlled with uniform LED illumination to ensure stable image quality. The dexterous robotic hand used is the Allegro Hand with four fingers, each with 4 joints, totaling 16 DoF. The system runs on a host with an Intel Core i7-8700K CPU, NVIDIA RTX 2070 GPU, Ubuntu 20.04, and Python 3.8. Each task is repeated 10 times under fixed conditions, recording success rates and average completion times as metrics.

The tasks and objects are summarized in Table 2, with performance data presented in Table 3.

Table 2: Experimental Tasks and Objects

| Task Category | Number of Fingers | Target Objects | Object Characteristics |
| --- | --- | --- | --- |
| Four-finger grasping | 4 | Apple model, peach model, building block, mouse | Spherical, cubic, rigid irregular structures |
| Three-finger grasping | 3 | Plush toy, banana model | Flexible irregular, irregular cylindrical structures |
| Two-finger pinching | 2 | Plastic water bottle, tissue paper | Hollow structure, soft and lightweight material |
Table 3: Performance Data for Teleoperation Tasks

| Task ID | Number of Fingers | Object | Success Rate (%) | Average Completion Time (s) |
| --- | --- | --- | --- | --- |
| T1 | 4 | Apple model | 100 | 2.72 |
| T2 | 4 | Peach model | 100 | 2.42 |
| T3 | 4 | Building block | 90 | 2.77 |
| T4 | 4 | Mouse | 80 | 3.32 |
| T5 | 3 | Plush toy | 90 | 2.58 |
| T6 | 3 | Banana model | 100 | 2.68 |
| T7 | 2 | Plastic water bottle | 90 | 2.88 |
| T8 | 2 | Tissue paper | 60 | 5.35 |

Overall, the system demonstrates stable performance for regularly shaped objects. In the four-finger grasping tasks (T1-T4), three of the four tasks achieve success rates of 90% or higher, with 100% for spherical objects like the apple and peach models, indicating high stability for such shapes. These objects offer symmetry and ample contact surfaces, facilitating coordinated force distribution by the dexterous robotic hand. Task T4 shows a lower success rate of 80% because the mouse’s smooth, irregular surface lacks natural grooves, making initial stabilization and grip maintenance challenging.

For three-finger grasping tasks (T5-T6), success rates reach 90% or higher, showcasing the system’s adaptability to irregular structures. Although these objects lack obvious grasp points, semi-enveloping pinches and strategic positioning (e.g., near the center of mass) enable stable operations.

In two-finger pinching tasks (T7-T8), the system performs well with hollow objects like plastic bottles but requires careful control of force and posture. The tissue paper task (T8) is most challenging, with a 60% success rate and the longest average completion time. This is attributed to the paper’s thin, slippery nature, where minor keypoint detection errors hinder effective pinching, revealing limitations in fine control for extreme cases.

Average completion times range from 2.42 to 3.32 seconds for all tasks except T8, indicating efficient response across keypoint detection, retargeting, and execution, and meeting real-time requirements. Regarding finger involvement, four-finger tasks benefit from larger contact areas and balanced force distribution, performing well with regular objects. Three-finger tasks balance flexibility and stability, suitable for moderately complex targets. Two-finger tasks are effective for rigid objects but struggle with extremely lightweight or deformable items, highlighting areas for improvement in precision control.

In conclusion, I propose a vision-based teleoperation system for dexterous robotic hands using motion retargeting to address limitations of traditional systems in unstructured environments. The system integrates hand keypoint detection, pose retargeting, and action generation modules in a loosely coupled manner. Through an optimized retargeting approach, it naturally maps human hand motions to dexterous robotic hand actions using low-cost RGB-D cameras. Experimental results show that the system exhibits robustness, real-time performance, and control accuracy in most grasping tasks, demonstrating strong generality and practical value. Key contributions include:

1. Development of a complete vision-based teleoperation system for dexterous robotic hands via motion retargeting, covering the pipeline from keypoint detection to action generation, with low-cost, low-latency, and high-compatibility features.

2. Design of an optimization objective function combining vector angle and distance errors, ensuring task-space consistency between human and dexterous robotic hands at the fingertip level.

3. Introduction of a Sigmoid-based dynamic scaling function and a task-oriented weight allocation mechanism, enabling adaptive control across different operation ranges and prioritized optimization of critical fingertip regions.

4. Performance evaluation through typical grasping tasks, validating the system’s success rates and efficiency in multi-finger coordination and precision pinching, confirming its practicality and stability.

However, challenges remain with extremely lightweight, deformable, or slippery objects, indicating limitations in fine force control and pose adjustment. Future work could incorporate deep learning strategies or multi-sensor fusion mechanisms to enhance adaptability in complex tasks. This research provides an effective pathway toward low-cost, high-precision, and natural teleoperation systems for dexterous robotic hands, laying groundwork for applications in healthcare, commerce, hazardous operations, and beyond.
