This repository contains the official implementation of Extractor, a system for estimating 11 anthropometric features from multi-view RGB-D images.
The solution decomposes this complex problem into two sub-tasks:
- Pinna (Ear) Feature Estimation: Extracts 10 features (5 per ear) from left and right side-view images.
- Head Width Estimation: Extracts 1 feature (head width) from a frontal-view image.
Our approach combines a YOLO-based object detector for robust localization with a lightweight, camera-aware regression network called DepthNet to achieve efficient and accurate feature estimation.
The complete system, Extractor, is built from three main components:
- Pinna Estimator: Processes left and right ear regions to extract 10 features.
- Head Estimator: Processes the frontal head region to extract 1 feature.
- Pre-loader: A utility that selects the optimal views (left-most, right-most, front-most) from a given input set (supporting 3, 36, or 72 images).
Pinna estimation is a two-stage process.
- Stage 1: Ear Detection: A fine-tuned YOLO model detects the ear region in the RGB image. This localization step focuses all subsequent processing on the relevant anatomical structure.
- Stage 2: Feature Regression: The detected bounding box is used to crop the corresponding region from the depth map. This depth crop is resized to $64 \times 64$. A critical step is to update the camera intrinsics to account for this cropping and resizing: the new intrinsics ($K'$) shift the principal point by the crop offset and rescale the focal lengths and principal point by the resize factor (see Algorithm 1 below). For pinna estimation, we also compute surface normals from the depth crop and stack them with the depth channel as input to the regression network.
Input: RGB-D image I_rgb, I_depth, camera intrinsics K = {f_x, f_y, c_x, c_y}
Output: 5 pinna features f_pinna
1. Detect ear bounding box using YOLO: bbox <- YOLO(I_rgb)
2. Crop depth region: D_crop <- I_depth[bbox]
3. Resize to fixed size: D_64 <- Resize(D_crop, 64x64)
4. Update camera intrinsics:
5. c_x' <- (c_x - bbox.x) * (64 / bbox.w)
6. c_y' <- (c_y - bbox.y) * (64 / bbox.h)
7. f_x' <- f_x * (64 / bbox.w)
8. f_y' <- f_y * (64 / bbox.h)
9. K' <- {f_x', f_y', c_x', c_y'}
10. For i = 1 to 5:
11. f_i <- DepthNet_i(D_64, K') // Note: D_64 includes normals
12. EndFor
13. f_pinna <- (1/5) * sum(f_i) // Ensemble average
14. Return f_pinna
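The intrinsics update in steps 4–9 is straightforward to implement. Below is a minimal Python sketch of just this step; the function name `update_intrinsics` and the `(x, y, w, h)` bounding-box convention are illustrative, not the repository's actual API.

```python
import numpy as np

def update_intrinsics(K, bbox, out_size=64):
    """Adjust pinhole intrinsics after cropping to `bbox` and resizing to out_size x out_size.

    K    : dict with keys f_x, f_y, c_x, c_y (pixels)
    bbox : (x, y, w, h) crop rectangle in the original image (pixels)
    """
    x, y, w, h = bbox
    sx, sy = out_size / w, out_size / h      # resize factors
    return {
        "f_x": K["f_x"] * sx,                # focal lengths scale with the resize
        "f_y": K["f_y"] * sy,
        "c_x": (K["c_x"] - x) * sx,          # principal point shifts with the crop,
        "c_y": (K["c_y"] - y) * sy,          # then scales with the resize
    }
```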
Head width estimation follows a similar pipeline using frontal images.
- Stage 1: Head Detection: A YOLO model detects the entire head region.
- Stage 2: Feature Regression: The depth map is cropped, resized to $64 \times 64$, and the camera intrinsics ($K'$) are updated.
- Ensemble Prediction: The $64 \times 64$ depth map (1 channel, no normals needed) and $K'$ are passed to an ensemble of 5 DepthNet models. The final head width is the average of their predictions.
Input: RGB-D image I_rgb, I_depth, camera intrinsics K
Output: Head width feature f_head
1. Detect head bounding box: head_region <- YOLO(I_rgb)
2. Extract depth: D_crop <- I_depth[head_region]
3. D_64 <- Resize(D_crop, 64x64)
4. Update intrinsics K' (similar to Algorithm 1)
5. For i = 1 to 5:
6. f_i <- DepthNet_i(D_64, K')
7. EndFor
8. f_head <- (1/5) * sum(f_i) // Ensemble average
9. Return f_head
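As a rough illustration of Algorithm 2, the sketch below strings the steps together in Python. The names `detect_head` and `models` are placeholders for the YOLO detector and the DepthNet ensemble, not the repository's actual function names; `update_intrinsics` is the helper sketched above, and OpenCV's `resize` is used for illustration.

```python
import numpy as np
import cv2

def estimate_head_width(rgb, depth, K, detect_head, models):
    """Ensemble head-width estimate following Algorithm 2 (illustrative sketch only)."""
    x, y, w, h = detect_head(rgb)                          # Stage 1: YOLO head bounding box
    d_crop = depth[y:y + h, x:x + w]                       # crop the depth map to the box
    d_64 = cv2.resize(d_crop, (64, 64))                    # fixed-size network input
    K_new = update_intrinsics(K, (x, y, w, h), out_size=64)
    preds = [m(d_64, K_new) for m in models]               # one prediction per DepthNet
    return float(np.mean(preds))                           # ensemble average
```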
The full Extractor pipeline combines these components. To handle both ears with a single model, the right ear image is horizontally flipped before processing, effectively turning it into a left ear.
Example crops produced by the detection stage: pinna RGB crop, pinna depth crop, head RGB crop, and head depth crop.
Complete extraction pseudocode (Algorithm 3):
Input: Image set {I_1, ..., I_n} where n in {3, 36, 72}
Output: 11 features F
1. I_left, I_right, I_front <- SelectViews({I_1, ..., I_n})
2. f_left <- PinnaEstimator(I_left) // 5 features
3. I_right_flip <- HorizontalFlip(I_right)
4. f_right <- PinnaEstimator(I_right_flip) // 5 features
5. f_head <- HeadEstimator(I_front) // 1 feature
6. F <- [f_head, f_left, f_right]
7. Return F
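Putting the pieces together, the end-to-end flow of Algorithm 3 can be sketched as follows. Here `select_views`, `pinna_estimator`, `head_estimator`, and `horizontal_flip` stand in for the Pre-loader, the two estimators, and the flip step described above; they are placeholders, not the repository's actual interfaces. The concatenation order follows the pseudocode (head width first, then left and right pinna features).

```python
import numpy as np

def extract_features(images, select_views, pinna_estimator, head_estimator, horizontal_flip):
    """Full extraction pipeline (illustrative sketch of Algorithm 3)."""
    left, right, front = select_views(images)          # Pre-loader: left-most, right-most, front-most views
    f_left = pinna_estimator(left)                      # 5 pinna features (left ear)
    f_right = pinna_estimator(horizontal_flip(right))   # flip so the right ear looks like a left ear
    f_head = head_estimator(front)                      # 1 feature: head width
    return np.concatenate([[f_head], f_left, f_right])  # 11-dimensional feature vector
```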
The system is grounded in the pinhole camera model and a key scale-invariance property used for data augmentation.
The relationship between a 2D pixel $(u, v)$ with depth value $z$ and the corresponding 3D point $(X, Y, Z)$ is given by the pinhole camera model:

$$X = \frac{(u - c_x)\,z}{f_x}, \qquad Y = \frac{(v - c_y)\,z}{f_y}, \qquad Z = z$$

A key insight is that scaling a depth map by a factor $\alpha$ scales all 3D distances by the same factor. Given two 3D points $P_1$ and $P_2$ back-projected from the depth map, their Euclidean distance is $d = \lVert P_1 - P_2 \rVert$. If we scale all depth values by $\alpha$, each back-projected coordinate is scaled by $\alpha$ as well, because $X$, $Y$, and $Z$ are all proportional to $z$. The new distance is therefore $d' = \lVert \alpha P_1 - \alpha P_2 \rVert = \alpha d$. This property is the foundation of our data augmentation strategy: we can scale a depth map by $\alpha$ and scale the corresponding ground-truth measurements by the same factor to obtain a new, geometrically consistent training sample.
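The property is easy to verify numerically. The short check below back-projects two pixels with hypothetical intrinsics, scales the depths by $\alpha$, and confirms that the 3D distance scales by the same factor (all values are made up for illustration).

```python
import numpy as np

def backproject(u, v, z, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) with depth z to a 3D point."""
    return np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])

fx = fy = 500.0
cx = cy = 32.0                       # hypothetical intrinsics of a 64x64 crop
p1 = backproject(10, 20, 0.50, fx, fy, cx, cy)
p2 = backproject(40, 45, 0.55, fx, fy, cx, cy)

alpha = 1.2                          # depth scaling factor
q1 = backproject(10, 20, alpha * 0.50, fx, fy, cx, cy)
q2 = backproject(40, 45, alpha * 0.55, fx, fy, cx, cy)

d, d_scaled = np.linalg.norm(p1 - p2), np.linalg.norm(q1 - q2)
assert np.isclose(d_scaled, alpha * d)   # 3D distance scales exactly by alpha
```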
DepthNet is a lightweight convolutional network ($\sim$160KB per model) designed for efficiency. It takes the $64 \times 64$ depth input (stacked with surface normals for the pinna task) together with the updated camera intrinsics $K'$ and regresses the target measurement.
DepthNet architecture: depth features processed horizontally with FiLM conditioning (orange dashed) from intrinsics branch. SE = Squeeze-and-Excitation for channel attention.
The architecture uses:
- Residual Blocks with depthwise separable convolutions.
- Feature-wise Linear Modulation (FiLM) layers to condition the convolutional features on the camera intrinsics. This allows the network to adapt its spatial reasoning based on the viewing geometry.
- Squeeze-and-Excitation (SE) modules for channel-wise attention.
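To make the conditioning mechanism concrete, here is a minimal PyTorch sketch of one such block: a depthwise separable convolution whose output is modulated by FiLM parameters predicted from the four intrinsics values, followed by an SE gate and a residual connection. The class name, layer sizes, and reduction ratio are illustrative assumptions, not the repository's actual DepthNet definition.

```python
import torch
import torch.nn as nn

class FiLMSEBlock(nn.Module):
    """Residual block: depthwise separable conv + FiLM conditioning on K' + SE attention."""

    def __init__(self, channels: int, k_dim: int = 4, se_reduction: int = 4):
        super().__init__()
        self.depthwise = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.pointwise = nn.Conv2d(channels, channels, 1)
        self.film = nn.Linear(k_dim, 2 * channels)          # predicts per-channel (gamma, beta)
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, channels // se_reduction), nn.ReLU(),
            nn.Linear(channels // se_reduction, channels), nn.Sigmoid(),
        )
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor, intrinsics: torch.Tensor) -> torch.Tensor:
        h = self.pointwise(self.depthwise(x))
        gamma, beta = self.film(intrinsics).chunk(2, dim=-1)   # FiLM: per-channel scale and shift
        h = gamma[..., None, None] * h + beta[..., None, None]
        h = h * self.se(h)[..., None, None]                    # channel-wise SE gating
        return self.act(x + h)                                 # residual connection

# Example usage with a 16-channel feature map and a batch of intrinsics vectors:
block = FiLMSEBlock(16)
out = block(torch.randn(2, 16, 64, 64), torch.randn(2, 4))
```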
We avoid spatial augmentations (like rotation) because they require complex updates to the intrinsic matrix. Instead, we use two depth-aware augmentations:
- Depth Scaling: Based on the scale-invariance property, we sample $\alpha \sim \mathcal{U}[0.8, 1.2]$ and apply
  $$z_{aug} \gets \alpha \cdot z, \qquad y_{aug} \gets \alpha \cdot y$$
  where $y$ represents the ground-truth measurements.
- Depth Noise: We add Gaussian noise to the depth map to improve robustness to sensor noise:
  $$z_{aug} \gets z + \mathcal{N}(0, \sigma^2)$$
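A minimal sketch of both augmentations, assuming the depth map and targets share the same metric unit; the `sigma` default is a placeholder, since the noise level used in training is not specified here.

```python
import numpy as np

rng = np.random.default_rng()

def augment(depth, targets, sigma=0.002):
    """Depth-aware augmentation: joint depth/target scaling plus Gaussian depth noise."""
    alpha = rng.uniform(0.8, 1.2)                # scale factor, alpha ~ U[0.8, 1.2]
    depth_aug = alpha * depth                    # scale the depth map ...
    targets_aug = alpha * targets                # ... and the ground-truth measurements
    depth_aug = depth_aug + rng.normal(0.0, sigma, size=depth.shape)  # additive sensor noise
    return depth_aug, targets_aug
```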
To reduce prediction variance and improve robustness to noisy depth inputs, we train an ensemble of 5 DepthNet models for each task (pinna and head) using different random initializations. At inference, the predictions from all 5 models are averaged to produce the final, stable estimate.
- Why a Two-Stage (Detection + Regression) approach?
  - Initial attempts at direct landmark detection on RGB followed by 3D geometry calculations failed. Depth maps are often too noisy, and small errors in 2D landmark positions lead to large errors in 3D space, corrupting geometric estimates. Our two-stage approach isolates robust 2D detection (YOLO) from noise-tolerant depth regression (DepthNet).
- Why DepthNet instead of direct geometric formulas?
  - A learning-based regression network can implicitly handle depth noise, sensor artifacts, and complex 3D-to-feature mappings that are difficult to encode in an explicit formula.
- Why an Ensemble of 5 Models?
  - Ensemble averaging is a standard technique to reduce prediction variance. Each model learns slightly different representations, and averaging their outputs provides a more stable and accurate estimate than any single model.
- Why the small 64x64 input size?
  - This provides a balance between computational efficiency and information preservation. It standardizes the variably sized crops from YOLO, simplifying the network architecture and enabling fast inference suitable for real-time applications.
- Why flip the right ear?
  - Human ears are largely symmetric. By flipping the right ear, we can use a single model to process both left and right ears. This simplifies the architecture, reduces the number of trained models, and effectively doubles the size of our training dataset for the pinna estimator.
The goal of this challenge is to extract anthropometric data of the human head and ears from a series of RGBD images of a human subject. Anthropometric data refers to measurements of the human anatomy, in this case of the head and ears. The images are taken from various angles around the subject and include both RGB and depth information. Here is an overview of the process (for illustration, images of an artificial head are shown instead of a real human head):
The submitted models are expected to accept a fixed number of subject images taken from predefined camera positions and to output specific anthropometric measurements of the human head and pinnas.
For model training, a dataset of subject images and corresponding anthropometric measurements is provided. The subject images are taken from horizontal camera positions around the subject at a 5-degree resolution (72 images per subject). These RGBD images are provided in HEIC format. The anthropometric data contains 1 measurement of the head and 5 measurements of each pinna (left and right), resulting in an 11-dimensional vector. Details about the anthropometric features can be found below.
The model should be able to work with 3 different sets of input images:
- images from all directions (72 images)
- front images only (36 images)
- only front, left and right images (3 images)
The order of the images, i.e. their camera positions relative to the subject, will remain fixed.
The default data source for this challenge is the SONICOM dataset. While the anthropometric measurements for all subjects are provided as part of this project, the subject images need to be obtained from SONICOM.
To obtain access to the images, a data sharing permission form needs to be signed by all team members. The form can be obtained by navigating to the submission section of your team. Please download the form, fill in the names of all team members as well as their signatures, and upload the signed document on the same page. Please also provide a single email address of one of the team members to which the download information will be sent.
After submission of the signed form, you will receive an email giving you access to the SONICOM images. We try to keep the time between submission and access as short as possible, but since there is manual work involved, it might take several days before you can access the data.
Once your team has access, open the download platform and download all image folders (P0001, P0002, ..., P0317). Should the downloaded .zip file be corrupt, download the folders in smaller batches.
(Disclaimer: the images are actually mirror images of the subjects, due to the way the pictures were taken. The provided anthropometrics account for this, so left and right pinna anthropometrics correspond to the left and right ears as seen in the images. Should you consider using the SONICOM mesh data, be aware that those meshes are flipped with respect to the images.)
Set up your python environment and install all required packages.
`pip install -r requirements.txt`

If you intend to access the image metadata using PyExifTool, the exiftool command-line tool needs to be installed and made available on the PATH.
For details on how to use the provided code resources, see the Jupyter notebook.
The submitted systems will be evaluated on an undisclosed set of subjects. For each subject, all sets of input images (full circle, front only and front/left/right) will be used as input to the model. The outputs will be compared against the ground truth anthropometric measurements.
To compare the model output to the ground truth anthropometric measurements, the Euclidean distance between the standardized feature vectors is computed. For the exact implementation of the metric, see the metrics.py module.
The overall score is computed as the mean across all subjects and input image sets. Since the applied metric is a distance, a lower score corresponds to a better result.
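The scoring can be sketched as follows, assuming standardization uses per-feature mean and standard deviation statistics; the exact statistics and implementation are defined in metrics.py, so this is only an illustration.

```python
import numpy as np

def score(predictions, ground_truth, feat_mean, feat_std):
    """Mean Euclidean distance between standardized 11-dim feature vectors (lower is better)."""
    pred_std = (predictions - feat_mean) / feat_std          # standardize each feature dimension
    gt_std = (ground_truth - feat_mean) / feat_std
    return float(np.mean(np.linalg.norm(pred_std - gt_std, axis=-1)))  # average over subjects and input sets
```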
Systems need to be submitted through the challenge platform, and can be updated at any time before the end of the challenge. Only the latest submission will be considered for each team and will be displayed on the leaderboard of the challenge.
Each submitted model needs to implement the Extractor class in the extractor.py module. This class will be used for automatic evaluation on a hidden test data set, and the score will be reported on the leaderboard.
The final submission at the end of the challenge must include:
- the source code to extract anthropometric parameters from RGBD images
- a brief documentation of the algorithm
- for AI-based solutions: the training code as well as a reference to any additional datasets used
Binaural audio rendering is the process of simulating sound sources in 3D space around a listener. It is not only used for virtual reality and augmented reality applications, but has made its way into mobile devices to provide immersive user experiences when listening to audio content (music, movies, radio play, etc.).
In order to provide the illusion of sounds coming from various directions, sound source signals are convolved with so-called head-related transfer functions (HRTFs). Those HRTFs encode the relevant binaural cues that let the listener perceive the sound from a certain direction.
However, HRTFs are influenced by the human anatomy. The pinna for example causes direction-dependent sound reflections and the head causes frequency-dependent sound attenuation due to shadowing effects. Therefore, HRTFs of individuals differ due to anatomic differences. Listening to rendered audio content using HRTFs of a different individual can have a detrimental effect on the perceived sound quality, and can lead to inaccurate localization and an undesired sound color. Hence there is a demand for obtaining individual HRTFs to provide personalized audio rendering.
Obtaining accurate individual HRTFs usually requires time-consuming acoustic measurements and does not scale to a large user group. A cost-effective alternative would be to simulate these HRTFs based on anthropometric measurements extracted from user-provided images or videos.
The goal of this challenge is therefore to obtain these anthropometric measurements from a set of RGBD images.
The provided anthropometrics are based on a number of landmark points on the subject's head and pinna. Each dimension in the anthropometric feature vector represents the Euclidean distance between two landmark points. The landmark points for each pinna are defined as follows:
| Landmark | Description |
|---|---|
| A | point on the ridge of the tragus at the height of the ear canal entrance |
| B | lowest point of the intertragal notch |
| C | midpoint on the upper ridge of the antihelix |
| D | point on the ridge of the antihelix at the same height as the ear canal entrance |
| E | point on the outer ridge of the helix at the same height as the ear canal entrance |
| F | top point of the inner ridge of the helix |
| G | top point of the outer ridge of the helix |
| H | lowest point of the ear lobe |
| I | point at right ear canal entrance |
| J | point at left ear canal entrance |
Each anthropometric feature is then defined as the Euclidean distance between a specific pair of these landmark points. The overall anthropometric feature vector is formed by concatenating the head width measurement with the five pinna distances for each ear, giving the 11-dimensional vector described above.






