Code repository for the paper: DeltaDorsal: Enhancing Hand Pose Estimation with Dorsal Features in Egocentric Views
This work is licensed under a [Creative Commons Attribution 4.0 International License][cc-by].
DeltaDorsal is a 3D hand pose estimation model that leverages dorsal features of the hand. It uses a delta-encoding approach in which features from a base (neutral) hand image are compared with features from the current hand image to predict 3D hand pose and force. Our repo consists of the following components:
- DeltaDorsalNet: A backbone network using DINOv3 features with our delta encoding
- DINOv3 Backbone: Extracts features from input images
- Change Encoder: Computes delta features between current and base images
- Residual Pose Head: Predicts pose parameter residuals from prior
- Force Head: Classifies finger activation and force levels
- Pose Estimation: Predicts 3D hand pose parameters
- Force Estimation: Classifies finger activation and force levels
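To make the delta-encoding and residual-prediction ideas concrete, here is a minimal sketch (not the actual implementation; module names and feature/pose dimensions are assumptions):

```python
import torch
import torch.nn as nn

class DeltaEncodingSketch(nn.Module):
    """Toy illustration: features of the current hand image are compared with
    features of a neutral base image, and the resulting delta is used to
    predict a residual on top of a prior (e.g. HaMeR) pose estimate."""

    def __init__(self, backbone: nn.Module, feat_dim: int = 768, pose_dim: int = 48):
        super().__init__()
        self.backbone = backbone                      # e.g. a (mostly) frozen DINOv3 encoder
        self.change_encoder = nn.Sequential(          # fuses current and base features
            nn.Linear(2 * feat_dim, feat_dim),
            nn.GELU(),
        )
        self.residual_pose_head = nn.Linear(feat_dim, pose_dim)

    def forward(self, img: torch.Tensor, base_img: torch.Tensor, prior_pose: torch.Tensor):
        f_cur = self.backbone(img)                    # (B, feat_dim) pooled features
        f_base = self.backbone(base_img)
        delta = self.change_encoder(torch.cat([f_cur, f_cur - f_base], dim=-1))
        return prior_pose + self.residual_pose_head(delta)  # residual correction on the prior
```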
- Python >= 3.10
- uv or any other package manager of your choice
- Clone the repository:

  ```bash
  git clone --recursive [REPO_URL]
  cd wrinklesense
  ```

- Create a virtual environment and install PyTorch:

  ```bash
  uv pip install torch --torch-backend=auto
  ```

- Install all dependencies:

  ```bash
  uv sync
  ```

- Download DINOv3
Make sure to follow all the installation instructions for DINOv3 found here.
Download all model weights and place them into _DATA/dinov3
- Download MANO
Our repo requires the MANO hand model. Request access here.
You only need to put MANO_RIGHT.pkl under the _DATA/mano folder.
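To sanity-check the MANO setup, here is a minimal sketch that loads MANO_RIGHT.pkl and runs a flat hand through it. It uses the smplx package as one possible loader (an assumption; the repo may ship its own MANO wrapper):

```python
import torch
import smplx  # assumption: smplx is one convenient MANO loader

# Load the right-hand MANO model from the folder containing MANO_RIGHT.pkl.
mano = smplx.MANO(model_path="_DATA/mano", is_rhand=True, use_pca=False)

# Run a zero (flat) pose through the model and inspect the outputs.
out = mano(
    betas=torch.zeros(1, 10),          # shape parameters
    global_orient=torch.zeros(1, 3),   # root rotation (axis-angle)
    hand_pose=torch.zeros(1, 45),      # 15 joints x 3 axis-angle parameters
)
print(out.vertices.shape)  # (1, 778, 3) MANO mesh vertices
print(out.joints.shape)    # hand joints (count depends on loader configuration)
```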
- (Optional) Set up Weights & Biases for logging:

  ```bash
  wandb login
  ```

We organize our data according to the following layout, which interfaces with our prewritten dataset modules found in src/datasets/. If you want to use your own data, feel free to write your own modules.
```
.                                     # ROOT
├── bases.json                        # Bases metadata
├── trials.json                       # Metadata of each trial
├── frames.json                       # Metadata of all captured frames
├── train.json                        # (OPTIONAL) subset of frames.json for train split
├── val.json                          # (OPTIONAL) subset of frames.json for val split
├── test.json                         # (OPTIONAL) subset of frames.json for test
├── trials/                           # all captured data
│   ├── PARTICIPANT_XXX/
│   │   ├── TRIAL_XXX/
│   │   │   ├── anns/
│   │   │   │   ├── frame_XXX.npy
│   │   │   │   ├── frame_XXX.npy
│   │   │   │   └── frame_XXX.npy
│   │   │   ├── hamer/                # initial pose prediction
│   │   │   │   ├── frame_XXX.npy
│   │   │   │   ├── frame_XXX.npy
│   │   │   │   └── frame_XXX.npy
│   │   │   ├── imgs/                 # captured images
│   │   │   │   ├── frame_XXX.jpg
│   │   │   │   ├── frame_XXX.jpg
│   │   │   │   └── frame_XXX.jpg
│   │   │   ├── cropped_images/       # (OPTIONAL) Precropped images that are aligned to bases
│   │   │   │   ├── frame_XXX.jpg
│   │   │   │   ├── frame_XXX.jpg
│   │   │   │   └── frame_XXX.jpg
│   │   │   └── cropped_bases/        # (OPTIONAL) Precropped bases that are aligned to each frame
│   │   │       ├── frame_XXX.jpg
│   │   │       ├── frame_XXX.jpg
│   │   │       └── frame_XXX.jpg
│   │   └── ...
│   └── ...
└── bases/                            # all captured reference images
    ├── PARTICIPANT_XXX/
    │   ├── hamer/                    # initial pose prediction
    │   │   ├── frame_XXX.npy
    │   │   ├── frame_XXX.npy
    │   │   └── frame_XXX.npy
    │   └── imgs/
    │       ├── frame_XXX.jpg
    │       ├── frame_XXX.jpg
    │       └── frame_XXX.jpg
    └── ...
```
Each frame should have:
- Image file
- HaMeR pose predictions (or other initial 2D pose predictions) for alignment
- Ground truth pose annotations
- Force sensor readings for force estimation (optional)
bases.json
- bases_dir (str) - path to the bases dir (defaults to "bases")
- participants (array)
  - item (object)
    - p_id (int) - participant id
    - bases (array)
      - item (object)
        - base_id (int) - base id
        - img_path (str) - relative path from bases_dir to the base image .jpg
        - hamer_path (str) - relative path from bases_dir to the initial annotations .npy
      - item (object)
  - item (object)
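For reference, a minimal sketch of resolving base images and their initial annotations from bases.json, assuming the field names above (the dataset root path is illustrative):

```python
import json
from pathlib import Path

ROOT = Path("path/to/dataset")                       # dataset root as laid out above

with open(ROOT / "bases.json") as f:
    bases_meta = json.load(f)

bases_dir = ROOT / bases_meta.get("bases_dir", "bases")

# Walk every participant's base (reference) captures.
for participant in bases_meta["participants"]:
    for base in participant["bases"]:
        img_path = bases_dir / base["img_path"]      # reference image (.jpg)
        hamer_path = bases_dir / base["hamer_path"]  # initial (HaMeR) prediction (.npy)
        print(participant["p_id"], base["base_id"], img_path, hamer_path)
```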
trials.json
- trials (array)
  - item (object)
    - trial_id (int) - id of this trial
    - p_id (int) - participant id for this trial
    - motion_type (str) - type of gesture
    - hand_position (str) - orientation of the hand
    - K (array(float)) - 3x3 camera intrinsic matrix (sometimes denoted A)
    - d (array(float)) - 1x15 camera distortion coefficients (as output by OpenCV)
    - world2cam (array(float)) - 4x4 camera extrinsic matrix
  - item (object)
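As an illustration of how the per-trial camera parameters fit together, here is a sketch that reshapes them and applies a plain pinhole projection. Lens distortion d is ignored here, and it assumes the 3D points are expressed in world coordinates:

```python
import json
import numpy as np

with open("path/to/dataset/trials.json") as f:
    trials_meta = json.load(f)

trial = trials_meta["trials"][0]
K = np.asarray(trial["K"], dtype=np.float64).reshape(3, 3)                  # intrinsics
d = np.asarray(trial["d"], dtype=np.float64).reshape(-1)                    # distortion coefficients
world2cam = np.asarray(trial["world2cam"], dtype=np.float64).reshape(4, 4)  # extrinsics

def project(points_world: np.ndarray) -> np.ndarray:
    """Project Nx3 world-space points to pixel coordinates (distortion ignored)."""
    pts_h = np.concatenate([points_world, np.ones((len(points_world), 1))], axis=1)
    pts_cam = (world2cam @ pts_h.T).T[:, :3]   # world frame -> camera frame
    uv = (K @ pts_cam.T).T
    return uv[:, :2] / uv[:, 2:3]              # perspective divide
```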
frames.json, train.json, val.json, test.json
- frames_dir (str) - path to the frames dir (defaults to "trials")
- split (str) - data split; one of full, train, val, test
- frames (array)
  - item (object)
    - trial_id (int) - id of the corresponding trial
    - timestamp (float) - timestamp within the video capture
    - frame_no (int) - index of the frame in the video
    - img_path (str) - relative path from frames_dir to the captured image .jpg
    - cropped_img_path (str) - (OPTIONAL) relative path from frames_dir to a precropped and aligned image .jpg
    - cropped_base_path (str) - (OPTIONAL) relative path from frames_dir to a precropped and aligned reference image .jpg
    - cropped_base_idx (int) - index of the base image prealigned to this frame
    - ann_path (str) - relative path from frames_dir to the ground truth annotation .npy
    - hamer_path (str) - relative path from frames_dir to the initial annotations .npy
    - fsr_reading (float) - (OPTIONAL) FSR reading for this frame, used for force predictions
    - tap_label (float) - (OPTIONAL) assigned label for the force action type
    - frame_id (int) - id of this frame
  - item (object)
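A minimal sketch of iterating over a split file and resolving the per-frame paths a dataset module needs, assuming the field names above:

```python
import json
from pathlib import Path

ROOT = Path("path/to/dataset")

with open(ROOT / "train.json") as f:               # or frames.json / val.json / test.json
    split_meta = json.load(f)

frames_dir = ROOT / split_meta.get("frames_dir", "trials")

for frame in split_meta["frames"]:
    img_path = frames_dir / frame["img_path"]      # captured image (.jpg)
    ann_path = frames_dir / frame["ann_path"]      # ground truth annotation (.npy)
    hamer_path = frames_dir / frame["hamer_path"]  # initial (HaMeR) prediction (.npy)
    fsr = frame.get("fsr_reading")                 # optional force sensor reading
    print(frame["frame_id"], frame["trial_id"], img_path, ann_path, hamer_path, fsr)
```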
annotation.npy (all annotation files)
- betas (array) - 1x10 shape parameters for MANO
- global_orient (array) - 1x3 global orientation for MANO (commonly the first three pose parameters)
- hand_pose (array) - 1x15x3 pose parameters for MANO in axis-angle representation (1x15x3x3 if in rotation-matrix form)
- cam_t (array) - 1x3 camera projection parameters used to convert the 3D annotations into the 2D camera frame
- keypoints_3d (array) - 21x3 keypoint locations in xyz format (OpenPose order)
- keypoints_2d (array) - 21x2 keypoint locations projected into the camera frame (OpenPose order)
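A loading sketch, assuming each annotation .npy stores a Python dict of the fields above (adjust if your files use a different container); the path is illustrative:

```python
import numpy as np

# Assumption: each annotation .npy stores a dict of the fields listed above.
ann = np.load("path/to/anns/frame_XXX.npy", allow_pickle=True).item()

betas = np.asarray(ann["betas"]).reshape(-1)                  # (10,) MANO shape parameters
global_orient = np.asarray(ann["global_orient"]).reshape(-1)  # (3,) axis-angle root rotation
hand_pose = np.asarray(ann["hand_pose"])                      # (1, 15, 3) or (1, 15, 3, 3)
cam_t = np.asarray(ann["cam_t"]).reshape(-1)                  # (3,) camera projection parameters
kp3d = np.asarray(ann["keypoints_3d"]).reshape(21, 3)         # 3D keypoints (OpenPose order)
kp2d = np.asarray(ann["keypoints_2d"]).reshape(21, 2)         # 2D keypoints in the image
print(betas.shape, hand_pose.shape, kp3d.shape, kp2d.shape)
```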
For processing, finger labels are mapped as follows:
- 0: index
- 1: middle
- 2: ring
- 3: pinky
- 4: thumb
- 5: dorsal
- 6: palm
- 7: other
where the assignment of MANO face labels to each finger label can be found in assets/mano_face_labels.json.
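The same mapping expressed in code, for convenience; the internal structure of assets/mano_face_labels.json is not documented above, so inspect the file before relying on it:

```python
import json

# Finger/region label ids used during processing (see the list above).
FINGER_LABELS = {
    0: "index",
    1: "middle",
    2: "ring",
    3: "pinky",
    4: "thumb",
    5: "dorsal",
    6: "palm",
    7: "other",
}

# Per-face label assignment for the MANO mesh.
with open("assets/mano_face_labels.json") as f:
    mano_face_labels = json.load(f)
```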
DeltaDorsal uses Hydra for configuration management. Config files are located in configs/.
You can override any config parameter via the command line:

```bash
python src/train_pose.py model.optimizer.lr=0.0001 data.batch_size=32 max_epochs=100
```

Or create a new config file in configs/experiments/ and reference it:

```bash
python src/train_pose.py experiment=my_experiment
```

Train the pose estimation model:

```bash
python src/train_pose.py
```

Override config parameters via the command line:

```bash
# Change number of epochs
python src/train_pose.py max_epochs=50

# Change batch size
python src/train_pose.py data.batch_size=32

# Change learning rate
python src/train_pose.py model.optimizer.lr=1e-4

# Resume from checkpoint
python src/train_pose.py ckpt_path=path/to/checkpoint.ckpt

# Train only (skip testing)
python src/train_pose.py test=False

# Test only (skip training)
python src/train_pose.py train=False test=True ckpt_path=path/to/checkpoint.ckpt

# Use leave-one-out cross-validation
python src/train_pose.py data.leave_one_out=True data.val_participants=[1] data.test_participants=[3]

# Train on specific participants
python src/train_pose.py data.all_participant_ids=[1,2,3,4,5]
```

Train the force estimation model:
```bash
python src/train_force.py
```

```bash
# Change number of epochs
python src/train_force.py max_epochs=50

# Change batch size
python src/train_force.py data.batch_size=64

# Resume from checkpoint
python src/train_force.py ckpt_path=path/to/checkpoint.ckpt

# Train only
python src/train_force.py test=False

# Test only
python src/train_force.py train=False test=True ckpt_path=path/to/checkpoint.ckpt
```

After training, evaluate on the test set:

```bash
# Pose model evaluation
python src/train_pose.py train=False test=True ckpt_path=path/to/checkpoint.ckpt

# Force model evaluation
python src/train_force.py train=False test=True ckpt_path=path/to/checkpoint.ckpt
```

The checkpoint path can be:
- Explicit path: `ckpt_path=outputs/2024-01-01_12-00-00/checkpoints/best.ckpt`
- Best checkpoint from training: if `ckpt_path` is not specified and training was run, it will use the best checkpoint automatically
Test results are saved to the Hydra output directory (typically outputs/YYYY-MM-DD_HH-MM-SS/):
- Pose model: `test_outputs_epoch{N}.npy` containing:
  - frame_id (int) - frame identifiers
  - pose (array) - predicted pose parameters
  - shape (array) - hand shape parameters
  - pred_kp_3d (array) - predicted 3D keypoints
  - gt_kp_3d (array) - ground truth 3D keypoints
  - hamer_kp_3d (array) - prior (HaMeR) 3D keypoints
- Force model: `test_outputs_epoch{N}.npy` containing:
  - frame_id (int) - frame identifiers
  - pred_tap_labels (int) - predicted force labels
  - gt_tap_labels (int) - ground truth force labels
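For example, MPJPE can be recomputed offline from a saved pose-model output file; this sketch assumes the .npy stores a dict of the arrays listed above, uses an illustrative output path, and applies no root alignment:

```python
import numpy as np

out = np.load("outputs/2024-01-01_12-00-00/test_outputs_epoch0.npy", allow_pickle=True).item()

pred = np.asarray(out["pred_kp_3d"])     # (N, 21, 3) predicted 3D keypoints
gt = np.asarray(out["gt_kp_3d"])         # (N, 21, 3) ground truth 3D keypoints
prior = np.asarray(out["hamer_kp_3d"])   # (N, 21, 3) HaMeR prior keypoints

# Mean Per Joint Position Error, reported in mm assuming keypoints are in metres.
mpjpe = np.linalg.norm(pred - gt, axis=-1).mean() * 1000.0
mpjpe_prior = np.linalg.norm(prior - gt, axis=-1).mean() * 1000.0
print(f"MPJPE: {mpjpe:.2f} mm (prior: {mpjpe_prior:.2f} mm)")
```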
During validation and testing, the following metrics are logged:
Pose Model:
- `val/mpjpe`: Mean Per Joint Position Error (mm)
- `val/mpjpe-prior`: MPJPE of the prior (HaMeR) predictions
- `val/loss`: Total loss (keypoint + pose parameter loss)
- `val/val-inference-time`: Inference time per sample (ms)
Force Model:
- `val/acc`: Classification accuracy
- `val/prec_w`: Weighted precision
- `val/recall_w`: Weighted recall
- `val/f1_w`: Weighted F1 score
- `val/loss`: Cross-entropy loss
- `val/inference-time`: Inference time per sample (ms)
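The force-model numbers can likewise be recomputed from the saved labels; here is a sketch using scikit-learn as one option (an assumption; the training code logs these metrics through its own implementation), again with an illustrative output path:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

out = np.load("outputs/2024-01-01_12-00-00/test_outputs_epoch0.npy", allow_pickle=True).item()
pred = np.asarray(out["pred_tap_labels"])   # predicted force labels
gt = np.asarray(out["gt_tap_labels"])       # ground truth force labels

print("acc:     ", accuracy_score(gt, pred))
print("prec_w:  ", precision_score(gt, pred, average="weighted", zero_division=0))
print("recall_w:", recall_score(gt, pred, average="weighted", zero_division=0))
print("f1_w:    ", f1_score(gt, pred, average="weighted"))
```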
Training:
- `max_epochs`: Maximum training epochs
- `min_epochs`: Minimum training epochs
- `train`: Whether to run training
- `test`: Whether to run testing
- `ckpt_path`: Path to checkpoint for resuming/testing
- `seed`: Random seed
Data:
- `data.data_dir`: Path to dataset directory
- `data.batch_size`: Batch size per device
- `data.out_img_size`: Input image size
- `data.leave_one_out`: Use leave-one-out cross-validation
- `data.val_participants`: Participant IDs for validation
- `data.test_participants`: Participant IDs for testing
Model:
- `model.backbone.model_name`: DINOv3 model variant
- `model.backbone.n_unfrozen_blocks`: Number of unfrozen transformer blocks
- `model.optimizer.lr`: Learning rate
- `model.scheduler`: Learning rate scheduler config
The model uses a residual prediction approach where it predicts corrections to a prior pose estimate (from HaMeR), enabling more stable training and better generalization.
Parts of the code are taken or adapted from the following repositories:
If extending or using our work, please cite the following papers:
PLACEHOLDER
```bibtex
@article{MANO:SIGGRAPHASIA:2017,
  title     = {Embodied Hands: Modeling and Capturing Hands and Bodies Together},
  author    = {Romero, Javier and Tzionas, Dimitrios and Black, Michael J.},
  journal   = {ACM Transactions on Graphics, (Proc. SIGGRAPH Asia)},
  publisher = {ACM},
  month     = nov,
  year      = {2017},
  url       = {http://doi.acm.org/10.1145/3130800.3130883}
}

@misc{simeoniDINOv32025,
  title         = {{DINOv3}},
  author        = {Sim{\'e}oni, Oriane and Vo, Huy V. and Seitzer, Maximilian and Baldassarre, Federico and Oquab, Maxime and Jose, Cijo and Khalidov, Vasil and Szafraniec, Marc and Yi, Seungeun and Ramamonjisoa, Micha{\"e}l and Massa, Francisco and Haziza, Daniel and Wehrstedt, Luca and Wang, Jianyuan and Darcet, Timoth{\'e}e and Moutakanni, Th{\'e}o and Sentana, Leonel and Roberts, Claire and Vedaldi, Andrea and Tolan, Jamie and Brandt, John and Couprie, Camille and Mairal, Julien and J{\'e}gou, Herv{\'e} and Labatut, Patrick and Bojanowski, Piotr},
  year          = {2025},
  month         = aug,
  eprint        = {2508.10104},
  archiveprefix = {arXiv},
  primaryclass  = {cs},
  doi           = {10.48550/arXiv.2508.10104}
}
```
