This repository contains the official implementation of Extractor, a system for estimating 11 anthropometric features from multi-view RGB-D images.
The solution decomposes this complex problem into two sub-tasks:
- Pinna (Ear) Feature Estimation: Extracts 10 features (5 per ear) from left and right side-view images.
- Head Width Estimation: Extracts 1 feature (head width) from a frontal-view image.
Our approach combines a YOLO-based object detector for robust localization with a lightweight, camera-aware regression network called DepthNet to achieve efficient and accurate feature estimation.
The complete system, Extractor, is built from three main components:
- Pinna Estimator: Processes left and right ear regions to extract 10 features.
- Head Estimator: Processes the frontal head region to extract 1 feature.
- Pre-loader: A utility that selects the optimal views (left-most, right-most, front-most) from a given input set (supporting 3, 36, or 72 images).
Pinna estimation is a two-stage process.
- Stage 1: Ear Detection: A fine-tuned YOLO model detects the ear region in the RGB image. This localization step focuses all subsequent processing on the relevant anatomical structure.
- Stage 2: Feature Regression: The detected bounding box is used to crop the corresponding region from the depth map. This depth crop is resized to $64 \times 64$. A critical step is to update the camera intrinsics to account for this cropping and resizing: the new intrinsics ($K'$) shift the principal point by the crop offset and rescale the focal lengths and principal point by the resize factor (see Algorithm 1 below). For pinna estimation, we also compute surface normals from the depth crop and stack them with the depth channel as input to the regression network.
Input: RGB-D image I_rgb, I_depth, camera intrinsics K = {f_x, f_y, c_x, c_y}
Output: 5 pinna features f_pinna
1. Detect ear bounding box using YOLO: bbox <- YOLO(I_rgb)
2. Crop depth region: D_crop <- I_depth[bbox]
3. Resize to fixed size: D_64 <- Resize(D_crop, 64x64)
4. Update camera intrinsics:
5. c_x' <- (c_x - bbox.x) * (64 / bbox.w)
6. c_y' <- (c_y - bbox.y) * (64 / bbox.h)
7. f_x' <- f_x * (64 / bbox.w)
8. f_y' <- f_y * (64 / bbox.h)
9. K' <- {f_x', f_y', c_x', c_y'}
10. For i = 1 to 5:
11. f_i <- DepthNet_i(D_64, K') // Note: D_64 includes normals
12. EndFor
13. f_pinna <- (1/5) * sum(f_i) // Ensemble average
14. Return f_pinna
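The intrinsics update in steps 4–9 is straightforward to implement. Below is a minimal Python sketch of just this step; the function name `update_intrinsics` and the `(x, y, w, h)` bounding-box convention are illustrative, not the repository's actual API.

```python
import numpy as np

def update_intrinsics(K, bbox, out_size=64):
    """Adjust pinhole intrinsics after cropping to `bbox` and resizing to out_size x out_size.

    K    : dict with keys f_x, f_y, c_x, c_y (pixels)
    bbox : (x, y, w, h) crop rectangle in the original image (pixels)
    """
    x, y, w, h = bbox
    sx, sy = out_size / w, out_size / h      # resize factors
    return {
        "f_x": K["f_x"] * sx,                # focal lengths scale with the resize
        "f_y": K["f_y"] * sy,
        "c_x": (K["c_x"] - x) * sx,          # principal point shifts with the crop,
        "c_y": (K["c_y"] - y) * sy,          # then scales with the resize
    }
```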
Head width estimation follows a similar pipeline using frontal images.
- Stage 1: Head Detection: A YOLO model detects the entire head region.
- Stage 2: Feature Regression: The depth map is cropped, resized to $64 \times 64$, and the camera intrinsics ($K'$) are updated.
- Ensemble Prediction: The $64 \times 64$ depth map (1 channel, no normals needed) and $K'$ are passed to an ensemble of 5 DepthNet models. The final head width is the average of their predictions.
Input: RGB-D image I_rgb, I_depth, camera intrinsics K
Output: Head width feature f_head
1. Detect head bounding box: head_region <- YOLO(I_rgb)
2. Extract depth: D_crop <- I_depth[head_region]
3. D_64 <- Resize(D_crop, 64x64)
4. Update intrinsics K' (similar to Algorithm 1)
5. For i = 1 to 5:
6. f_i <- DepthNet_i(D_64, K')
7. EndFor
8. f_head <- (1/5) * sum(f_i) // Ensemble average
9. Return f_head
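As a rough illustration of Algorithm 2, the sketch below strings the steps together in Python. The names `detect_head` and `models` are placeholders for the YOLO detector and the DepthNet ensemble, not the repository's actual function names; `update_intrinsics` is the helper sketched above, and OpenCV's `resize` is used for illustration.

```python
import numpy as np
import cv2

def estimate_head_width(rgb, depth, K, detect_head, models):
    """Ensemble head-width estimate following Algorithm 2 (illustrative sketch only)."""
    x, y, w, h = detect_head(rgb)                          # Stage 1: YOLO head bounding box
    d_crop = depth[y:y + h, x:x + w]                       # crop the depth map to the box
    d_64 = cv2.resize(d_crop, (64, 64))                    # fixed-size network input
    K_new = update_intrinsics(K, (x, y, w, h), out_size=64)
    preds = [m(d_64, K_new) for m in models]               # one prediction per DepthNet
    return float(np.mean(preds))                           # ensemble average
```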
The full Extractor pipeline combines these components. To handle both ears with a single model, the right ear image is horizontally flipped before processing, effectively turning it into a left ear.
Example crops produced by the detection stage: pinna RGB crop, pinna depth crop, head RGB crop, and head depth crop.
Complete extraction pseudocode (Algorithm 3):
Input: Image set {I_1, ..., I_n} where n in {3, 36, 72}
Output: 11 features F
1. I_left, I_right, I_front <- SelectViews({I_1, ..., I_n})
2. f_left <- PinnaEstimator(I_left) // 5 features
3. I_right_flip <- HorizontalFlip(I_right)
4. f_right <- PinnaEstimator(I_right_flip) // 5 features
5. f_head <- HeadEstimator(I_front) // 1 feature
6. F <- [f_head, f_left, f_right]
7. Return F
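Putting the pieces together, the end-to-end flow of Algorithm 3 can be sketched as follows. Here `select_views`, `pinna_estimator`, `head_estimator`, and `horizontal_flip` stand in for the Pre-loader, the two estimators, and the flip step described above; they are placeholders, not the repository's actual interfaces. The concatenation order follows the pseudocode (head width first, then left and right pinna features).

```python
import numpy as np

def extract_features(images, select_views, pinna_estimator, head_estimator, horizontal_flip):
    """Full extraction pipeline (illustrative sketch of Algorithm 3)."""
    left, right, front = select_views(images)          # Pre-loader: left-most, right-most, front-most views
    f_left = pinna_estimator(left)                      # 5 pinna features (left ear)
    f_right = pinna_estimator(horizontal_flip(right))   # flip so the right ear looks like a left ear
    f_head = head_estimator(front)                      # 1 feature: head width
    return np.concatenate([[f_head], f_left, f_right])  # 11-dimensional feature vector
```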
The system is grounded in the pinhole camera model and a key scale-invariance property used for data augmentation.
The relationship between a 2D pixel $(u, v)$ with depth value $z$ and the corresponding 3D point $(X, Y, Z)$ is given by the pinhole camera model:

$$X = \frac{(u - c_x)\,z}{f_x}, \qquad Y = \frac{(v - c_y)\,z}{f_y}, \qquad Z = z$$

A key insight is that scaling a depth map by a factor $\alpha$ scales all 3D distances by the same factor. Given two 3D points $P_1$ and $P_2$ back-projected from the depth map, their Euclidean distance is $d = \lVert P_1 - P_2 \rVert$. If we scale all depth values by $\alpha$, each back-projected coordinate is scaled by $\alpha$ as well, because $X$, $Y$, and $Z$ are all proportional to $z$. The new distance is therefore $d' = \lVert \alpha P_1 - \alpha P_2 \rVert = \alpha d$. This property is the foundation of our data augmentation strategy: we can scale a depth map by $\alpha$ and scale the corresponding ground-truth measurements by the same factor to obtain a new, geometrically consistent training sample.
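The property is easy to verify numerically. The short check below back-projects two pixels with hypothetical intrinsics, scales the depths by $\alpha$, and confirms that the 3D distance scales by the same factor (all values are made up for illustration).

```python
import numpy as np

def backproject(u, v, z, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) with depth z to a 3D point."""
    return np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])

fx = fy = 500.0
cx = cy = 32.0                       # hypothetical intrinsics of a 64x64 crop
p1 = backproject(10, 20, 0.50, fx, fy, cx, cy)
p2 = backproject(40, 45, 0.55, fx, fy, cx, cy)

alpha = 1.2                          # depth scaling factor
q1 = backproject(10, 20, alpha * 0.50, fx, fy, cx, cy)
q2 = backproject(40, 45, alpha * 0.55, fx, fy, cx, cy)

d, d_scaled = np.linalg.norm(p1 - p2), np.linalg.norm(q1 - q2)
assert np.isclose(d_scaled, alpha * d)   # 3D distance scales exactly by alpha
```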
DepthNet is a lightweight convolutional network ($\sim$160KB per model) designed for efficiency. It takes the $64 \times 64$ depth input (stacked with surface normals for the pinna task) together with the updated camera intrinsics $K'$ and regresses the target measurement.
DepthNet architecture: depth features processed horizontally with FiLM conditioning (orange dashed) from intrinsics branch. SE = Squeeze-and-Excitation for channel attention.
The architecture uses:
- Residual Blocks with depthwise separable convolutions.
- Feature-wise Linear Modulation (FiLM) layers to condition the convolutional features on the camera intrinsics. This allows the network to adapt its spatial reasoning based on the viewing geometry.
- Squeeze-and-Excitation (SE) modules for channel-wise attention.
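To make the conditioning mechanism concrete, here is a minimal PyTorch sketch of one such block: a depthwise separable convolution whose output is modulated by FiLM parameters predicted from the four intrinsics values, followed by an SE gate and a residual connection. The class name, layer sizes, and reduction ratio are illustrative assumptions, not the repository's actual DepthNet definition.

```python
import torch
import torch.nn as nn

class FiLMSEBlock(nn.Module):
    """Residual block: depthwise separable conv + FiLM conditioning on K' + SE attention."""

    def __init__(self, channels: int, k_dim: int = 4, se_reduction: int = 4):
        super().__init__()
        self.depthwise = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.pointwise = nn.Conv2d(channels, channels, 1)
        self.film = nn.Linear(k_dim, 2 * channels)          # predicts per-channel (gamma, beta)
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, channels // se_reduction), nn.ReLU(),
            nn.Linear(channels // se_reduction, channels), nn.Sigmoid(),
        )
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor, intrinsics: torch.Tensor) -> torch.Tensor:
        h = self.pointwise(self.depthwise(x))
        gamma, beta = self.film(intrinsics).chunk(2, dim=-1)   # FiLM: per-channel scale and shift
        h = gamma[..., None, None] * h + beta[..., None, None]
        h = h * self.se(h)[..., None, None]                    # channel-wise SE gating
        return self.act(x + h)                                 # residual connection

# Example usage with a 16-channel feature map and a batch of intrinsics vectors:
block = FiLMSEBlock(16)
out = block(torch.randn(2, 16, 64, 64), torch.randn(2, 4))
```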
We avoid spatial augmentations (like rotation) because they require complex updates to the intrinsic matrix. Instead, we use two depth-aware augmentations:
- Depth Scaling: Based on the scale-invariance property, we sample $\alpha \sim \mathcal{U}[0.8, 1.2]$ and apply
  $$z_{aug} \gets \alpha \cdot z, \qquad y_{aug} \gets \alpha \cdot y$$
  where $y$ represents the ground-truth measurements.
- Depth Noise: We add Gaussian noise to the depth map to improve robustness to sensor noise:
  $$z_{aug} \gets z + \mathcal{N}(0, \sigma^2)$$
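A minimal sketch of both augmentations, assuming the depth map and targets share the same metric unit; the `sigma` default is a placeholder, since the noise level used in training is not specified here.

```python
import numpy as np

rng = np.random.default_rng()

def augment(depth, targets, sigma=0.002):
    """Depth-aware augmentation: joint depth/target scaling plus Gaussian depth noise."""
    alpha = rng.uniform(0.8, 1.2)                # scale factor, alpha ~ U[0.8, 1.2]
    depth_aug = alpha * depth                    # scale the depth map ...
    targets_aug = alpha * targets                # ... and the ground-truth measurements
    depth_aug = depth_aug + rng.normal(0.0, sigma, size=depth.shape)  # additive sensor noise
    return depth_aug, targets_aug
```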
To reduce prediction variance and improve robustness to noisy depth inputs, we train an ensemble of 5 DepthNet models for each task (pinna and head) using different random initializations. At inference, the predictions from all 5 models are averaged to produce the final, stable estimate.
- Why a Two-Stage (Detection + Regression) approach?
  - Initial attempts at direct landmark detection on RGB followed by 3D geometry calculations failed. Depth maps are often too noisy, and small errors in 2D landmark positions lead to large errors in 3D space, corrupting geometric estimates. Our two-stage approach isolates robust 2D detection (YOLO) from noise-tolerant depth regression (DepthNet).
- Why DepthNet instead of direct geometric formulas?
  - A learning-based regression network can implicitly handle depth noise, sensor artifacts, and complex 3D-to-feature mappings that are difficult to encode in an explicit formula.
- Why an Ensemble of 5 Models?
  - Ensemble averaging is a standard technique to reduce prediction variance. Each model learns slightly different representations, and averaging their outputs provides a more stable and accurate estimate than any single model.
- Why the small 64x64 input size?
  - This provides a balance between computational efficiency and information preservation. It standardizes the variably sized crops from YOLO, simplifying the network architecture and enabling fast inference suitable for real-time applications.
- Why flip the right ear?
  - Human ears are largely symmetric. By flipping the right ear, we can use a single model to process both left and right ears. This simplifies the architecture, reduces the number of trained models, and effectively doubles the size of our training dataset for the pinna estimator.
The goal of this challenge is to extract anthropometric data of the human head and ears from a series of RGBD images of a human subject. Anthropometric data refers to measurements of the human anatomy, in this case of the head and ears. The images are taken from various angles around the subject and include both RGB and depth information. Here is an overview of the process (for illustration, images of an artificial head are shown instead of a real human head):
The submitted models are expected to accept a fixed number of subject images taken from predefined camera positions and to output specific anthropometric measurements of the human head and pinnas.
For model training, a dataset of subject images and corresponding anthropometric measurements is provided. The subject images are taken from horizontal camera positions around the subject at a 5-degree resolution (72 images per subject). These RGBD images are provided in HEIC format. The anthropometric data contains 1 measurement of the head and 5 measurements of each pinna (left and right), resulting in an 11-dimensional vector. Details about the anthropometric features can be found below.
The model should be able to work with 3 different sets of input images:
- images from all directions (72 images)
- front images only (36 images)
- only front, left and right images (3 images)
The order of the images, i.e. their camera positions relative to the subject, will remain fixed.
The default data source for this challenge is the SONICOM dataset. While the anthropometric measurements for all subjects are provided as part of this project, the subject images need to be obtained from SONICOM.
To obtain access to the images, a data sharing permission form needs to be signed by all team members. The form can be obtained by navigating to the submission section of your team. Please download the form, fill in the names of all team members as well as their signatures, and upload the signed document on the same page. Please also provide a single email address of one of the team members to which the download information will be sent.
After submission of the signed form, you will receive an email giving you access to the SONICOM images. We try to keep the time between submission and access as short as possible, but since there is manual work involved, it might take several days before you can access the data.
Once your team has access, open the download platform and download all image folders (P0001, P0002, ..., P0317). Should the downloaded .zip file be corrupt, download the folders in smaller batches.
(Disclaimer: the images are actually mirror images of the subjects, due to the way the pictures were taken. The provided anthropometrics account for this, so left and right pinna anthropometrics correspond to the left and right ears as seen in the images. Should you consider using the SONICOM mesh data, be aware that those meshes are flipped with respect to the images.)
Set up your python environment and install all required packages.
`pip install -r requirements.txt`

If you intend to access the image metadata using PyExifTool, the exiftool command-line tool needs to be installed and made available on the PATH.
For details on how to use the provided code resources, see the Jupyter notebook.
The submitted systems will be evaluated on an undisclosed set of subjects. For each subject, all sets of input images (full circle, front only and front/left/right) will be used as input to the model. The outputs will be compared against the ground truth anthropometric measurements.
To compare the model output to the ground truth anthropometric measurements, the Euclidean distance between the standardized feature vectors is computed. For the exact implementation of the metric, see the metrics.py module.
The overall score is computed as the mean across all subjects and input image sets. Since the applied metric is a distance, a lower score corresponds to a better result.
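The scoring can be sketched as follows, assuming standardization uses per-feature mean and standard deviation statistics; the exact statistics and implementation are defined in metrics.py, so this is only an illustration.

```python
import numpy as np

def score(predictions, ground_truth, feat_mean, feat_std):
    """Mean Euclidean distance between standardized 11-dim feature vectors (lower is better)."""
    pred_std = (predictions - feat_mean) / feat_std          # standardize each feature dimension
    gt_std = (ground_truth - feat_mean) / feat_std
    return float(np.mean(np.linalg.norm(pred_std - gt_std, axis=-1)))  # average over subjects and input sets
```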
Systems need to be submitted through the challenge platform, and can be updated at any time before the end of the challenge. Only the latest submission will be considered for each team and will be displayed on the leaderboard of the challenge.
Each submitted model needs to implement the Extractor class in the extractor.py module. This class will be used for automatic evaluation on a hidden test data set, and the score will be reported on the leaderboard.
The final submission at the end of the challenge must include:
- the source code to extract anthropometric parameters from RGBD images
- a brief documentation of the algorithm
- for AI-based solutions: the training code as well as a reference to any additional datasets used
Binaural audio rendering is the process of simulating sound sources in 3D space around a listener. It is not only used for virtual reality and augmented reality applications, but has made its way into mobile devices to provide immersive user experiences when listening to audio content (music, movies, radio play, etc.).
In order to provide the illusion of sounds coming from various directions, sound source signals are convolved with so-called head-related transfer functions (HRTFs). Those HRTFs encode the relevant binaural cues that let the listener perceive the sound from a certain direction.
However, HRTFs are influenced by the human anatomy. The pinna for example causes direction-dependent sound reflections and the head causes frequency-dependent sound attenuation due to shadowing effects. Therefore, HRTFs of individuals differ due to anatomic differences. Listening to rendered audio content using HRTFs of a different individual can have a detrimental effect on the perceived sound quality, and can lead to inaccurate localization and an undesired sound color. Hence there is a demand for obtaining individual HRTFs to provide personalized audio rendering.
Obtaining accurate individual HRTFs usually requires time-consuming acoustic measurements and does not scale to a large user group. A cost-effective alternative would be to simulate these HRTFs based on anthropometric measurements extracted from user-provided images or videos.
The goal of this challenge is therefore to obtain these anthropometric measurements from a set of RGBD images.
The provided anthropometrics are based on a number of landmark points on the subject's head and pinna. Each dimension in the anthropometric feature vector represents the Euclidean distance between two landmark points. The landmark points for each pinna are defined as follows:
| Landmark | Description |
|---|---|
| A | point on the ridge of the tragus at the height of the ear canal entrance |
| B | lowest point of the intertragal notch |
| C | midpoint on the upper ridge of the antihelix |
| D | point on the ridge of the antihelix at the same height as the ear canal entrance |
| E | point on the outer ridge of the helix at the same height as the ear canal entrance |
| F | top point of the inner ridge of the helix |
| G | top point of the outer ridge of the helix |
| H | lowest point of the ear lobe |
| I | point at right ear canal entrance |
| J | point at left ear canal entrance |
Each anthropometric feature is then defined as the Euclidean distance between a specific pair of these landmark points. The overall anthropometric feature vector is formed by concatenating the head width measurement with the five pinna distances for each ear, giving the 11-dimensional vector described above.






