The official repo for the ICCV-2021 paper "Speech Drives Templates: Co-Speech Gesture Synthesis with Learned Templates".
project / paper / video
Our paper and this repo focus on upper-body pose generation from audio. To synthesize images from poses, please refer to this Pose2Img repo.
🔔 Update:
- 2022-04-29: Upload checkpoints for all subjects.
- 2022-04-26: Change POSE2POSE.LAMBDA_KL in config/default.py from 1.0 to 0.1.
|-- config
| |-- default.py
| |-- voice2pose_s2g.yaml # baseline: speech2gesture
| |-- voice2pose_sdt_bp.yaml # ours (Backprop)
| |-- voice2pose_sdt_vae.yaml # ours (VAE)
| \-- pose2pose.yaml # gesture reconstruction
|
|-- core
| |-- datasets
| |-- networks
| |-- pipelines
| \-- utils
|
|-- datasets
| \-- speakers
| |-- oliver
| |-- kubinec
| \-- ...
|
|-- output
| \-- <date-config-tag> # A directory for each experiment
|
`-- main.py
To generate videos, you need ffmpeg on your system.
sudo apt install ffmpeg
Install Python packages:
pip install -r requirements.txt
We use a subset (Oliver and Kubinec) of the Speech2Gesture dataset and remove frames with bad human poses. We also collect data from two Mandarin speakers (Luo and Xing).
To ease later research, we pack our processed data, including 2D human pose sequences and the corresponding audio clips.
Please download it from this link and organize the data under datasets/speakers following the directory hierarchy above.
Note that you do NOT need the source video frames to run this repo. In case you still want them for your own usage:
- For Luo and Xing, we provide links to the source videos as text files alongside the above data packs.
- For Oliver and Kubinec, please refer to the Speech2Gesture dataset.
Since our method addresses the entire upper body, including the face and hands, the number of keypoints in our data is 137. For more details, please refer to this document.
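As a quick sanity check on the number 137, the sketch below assumes the default OpenPose output with face and hand detection enabled (25 body, 70 face, and 21 keypoints per hand). The group names are illustrative only, and the actual keypoint ordering in our data may differ; the linked document is the authoritative description.

```python
# Assumption: counts follow OpenPose's default BODY_25 + face + hands output.
# The group names are illustrative; the ordering in the packed data may differ.
OPENPOSE_KEYPOINT_COUNTS = {
    "body": 25,
    "face": 70,
    "left_hand": 21,
    "right_hand": 21,
}


def total_keypoints(counts):
    """Sum the number of keypoints over all groups."""
    return sum(counts.values())


print(total_keypoints(OPENPOSE_KEYPOINT_COUNTS))
```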
To build a dataset from custom videos, we provide reference scripts in data_preprocess/:
# ==== video processing ====
1_1_change_fps.py # we use fps=15 by default
1_2_video2frames.py # save each video as images
# ==== keypoint processing ====
2_1_gen_kpts.py # use openpose to obtain keypoints
2_2_remove_outlier.py # remove frames with badly predicted keypoints
(2_3_rescale_shoulder_width.py # rescale the keypoints)
# ==== npz processing ====
3_1_generate_clips.py # generate a csv file as an index and npz files for clips
3_2_split_train_val_test.py # edit the csv file for dataset division
# ==== speakers_stat processing ====
4_1_calculate_mean_std.py # save the mean and std of each keypoint (137 points) into an npy file
4_2_parse_mean_std_npz.py # parse the above npy and print out for `speakers_stat.py`
Step 2_3 is optional. It rescales the keypoints so that a new speaker has the same shoulder width as Oliver; you can then simply copy Oliver's scale_factor for the new speaker in speakers_stat.py.
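The idea behind step 2_3 can be sketched roughly as follows. This is not the code of 2_3_rescale_shoulder_width.py: the shoulder indices (2 and 5, as in OpenPose's BODY_25 layout), the function names, and the use of a single factor from the mean shoulder width are all assumptions made for illustration.

```python
import math

# Assumption: indices of the right and left shoulder in each frame
# (2 and 5 in OpenPose's BODY_25 layout); adjust to your own layout.
RIGHT_SHOULDER, LEFT_SHOULDER = 2, 5


def shoulder_width(frame):
    """Euclidean distance between the two shoulder keypoints of one frame."""
    return math.dist(frame[RIGHT_SHOULDER], frame[LEFT_SHOULDER])


def rescale_speaker(frames, reference_width):
    """Scale every (x, y) keypoint by one factor so that the speaker's mean
    shoulder width matches reference_width (e.g. the reference speaker's)."""
    mean_width = sum(shoulder_width(f) for f in frames) / len(frames)
    if mean_width == 0:
        raise ValueError("degenerate input: shoulder keypoints coincide")
    factor = reference_width / mean_width
    return [[(x * factor, y * factor) for x, y in frame] for frame in frames]


# Hypothetical single-frame example with six keypoints.
frames = [[(0.0, 0.0), (1.0, 1.0), (0.0, 0.0), (0.0, 1.0), (0.0, 2.0), (2.0, 0.0)]]
rescaled = rescale_speaker(frames, reference_width=4.0)
print(shoulder_width(rescaled[0]))
```

Once the shoulder widths agree, the same scale_factor can be reused for both speakers, which is why the step saves you from estimating a new one.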
Training from scratch
python main.py --config_file configs/voice2pose_sdt_bp.yaml \
--tag oliver \
DATASET.SPEAKER oliver \
SYS.NUM_WORKERS 32
- --tag sets the name of the experiment, which will be displayed in the output file.
- You can overwrite any parameter defined in configs/default.py by simply adding it at the end of the command. The example above sets SYS.NUM_WORKERS to 32 temporarily.
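To illustrate how trailing KEY VALUE pairs can override nested defaults, here is a minimal standard-library sketch. It is not this repo's actual config parser; apply_overrides and the default values shown are hypothetical.

```python
import ast


def apply_overrides(cfg, opts):
    """Apply ["A.B", "value", ...] pairs to a nested dict in place."""
    if len(opts) % 2 != 0:
        raise ValueError("overrides must come in KEY VALUE pairs")
    for key, raw in zip(opts[0::2], opts[1::2]):
        *parents, leaf = key.split(".")
        node = cfg
        for name in parents:
            node = node[name]  # KeyError if the section is not in the defaults
        if leaf not in node:
            raise KeyError(f"unknown config key: {key}")
        try:
            value = ast.literal_eval(raw)  # "32" -> 32, "True" -> True
        except (ValueError, SyntaxError):
            value = raw  # keep plain strings such as "oliver"
        node[leaf] = value
    return cfg


# Hypothetical defaults; the real ones live in configs/default.py.
defaults = {"DATASET": {"SPEAKER": None}, "SYS": {"NUM_WORKERS": 8}}
cfg = apply_overrides(defaults, ["DATASET.SPEAKER", "oliver", "SYS.NUM_WORKERS", "32"])
print(cfg)
```

Rejecting unknown keys, as this sketch does, helps catch typos in parameter names on the command line.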
Resume training from an interrupted experiment
python main.py --config_file configs/voice2pose_sdt_bp.yaml \
--resume_from <checkpoint-to-continue-from> \
DATASET.SPEAKER oliver
- With --resume_from, the program will load the state_dict from the checkpoint for both the model and the optimizer, and write results to the original directory that the checkpoint lies in.
Training from a pretrained model
python main.py --config_file configs/voice2pose_sdt_bp.yaml \
--pretrain_from <checkpoint-to-pretrain-from> \
--tag oliver \
DATASET.SPEAKER oliver
- With --pretrain_from, the program will only load the state_dict for the model, and write results to a new base directory.
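The difference between --resume_from and --pretrain_from can be summarized with the small sketch below. plan_restore, its return format, and the example paths are hypothetical; the sketch only restates which states are loaded and where results are written, and does not mirror the actual code in main.py.

```python
import os


def plan_restore(checkpoint_path, mode, new_run_dir):
    """Describe what is loaded and where results go for each restore mode."""
    if mode == "resume":
        # Continue the interrupted run: model and optimizer states,
        # writing into the directory that holds the checkpoint.
        return {
            "load": ("model", "optimizer"),
            "output_dir": os.path.dirname(checkpoint_path),
        }
    if mode == "pretrain":
        # Start a fresh run from pretrained weights: model state only,
        # writing into a new directory.
        return {"load": ("model",), "output_dir": new_run_dir}
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical paths, for illustration only.
print(plan_restore("output/run_a/ckpt.pth", "resume", "output/run_b"))
print(plan_restore("output/run_a/ckpt.pth", "pretrain", "output/run_b"))
```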
To evaluate a model, use --test_only and --checkpoint as follows:
python main.py --config_file configs/voice2pose_sdt_bp.yaml \
--tag oliver \
--test_only \
--checkpoint <path-to-checkpoint> \
DATASET.SPEAKER oliver
To evaluate a model on an audio file, use --demo_input and --checkpoint as follows:
python main.py --config_file configs/voice2pose_sdt_bp.yaml \
--tag oliver \
--demo_input demo_audio.wav \
--checkpoint <path-to-checkpoint> \
DATASET.SPEAKER oliver
You can find our checkpoint here.
First, you need to train the VAE by pose sequence reconstruction:
python main.py --config_file configs/pose2pose.yaml \
--tag oliver \
DATASET.SPEAKER oliver
Once the VAE is trained, you can compute FTD while training our SDT-BP model by specifying VOICE2POSE.POSE_ENCODER.AE_CHECKPOINT as follows:
python main.py --config_file configs/voice2pose_sdt_bp.yaml \
--tag oliver \
DATASET.SPEAKER oliver \
VOICE2POSE.POSE_ENCODER.AE_CHECKPOINT <path-to-VAE-checkpoint>
By changing the config file and specifying VOICE2POSE.POSE_ENCODER.AE_CHECKPOINT, you can train our SDT-VAE model, and the FTD metric will also be computed:
python main.py --config_file configs/voice2pose_sdt_vae.yaml \
--tag oliver \
DATASET.SPEAKER oliver \
VOICE2POSE.POSE_ENCODER.AE_CHECKPOINT <path-to-VAE-checkpoint>
For evaluation and demo with our SDT-VAE model, don't forget to always specify the VOICE2POSE.POSE_ENCODER.AE_CHECKPOINT parameter.
- We save a checkpoint and conduct validation after each epoch. You can change the interval in the config file.
- We generate and save 2 videos in each epoch during training. During validation, we sample 8 videos in each epoch. These videos can be saved to tensorboard (without sound) and as mp4 (with sound). You can change the SYS.VIDEO_FORMAT parameter to select one or both of them.
- For multi-GPU training, we recommend using DistributedDataParallel (DDP) because it provides SyncBN across GPU cards. To enable DDP, set SYS.DISTRIBUTED to True and set SYS.WORLD_SIZE according to the number of GPUs.
  When using DDP, ensure that the batch_size is exactly divisible by SYS.WORLD_SIZE.
- We usually set NUM_WORKERS to 32 for best performance. If you encounter any memory-related error, try lowering NUM_WORKERS.
- We also support dataset caching (DATASET.CACHING) to further speed up data loading.
  If you encounter errors in the dataloader like RuntimeError: received 0 items of ancdata, please increase ulimit by running the command ulimit -n 262144. (refer to this issue)
- To run any module other than the main files in the root directory, for example the core\datasets\gesture_dataset.py file, you should run python -m core.datasets.gesture_dataset rather than python core\datasets\gesture_dataset.py. This is due to how Python handles relative imports.
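Two of the notes above (the DDP batch-size constraint and the open-file limit) can be checked before launching a run. The helper below is a hedged sketch, not part of this repo: both function names are hypothetical, the even split of batch_size across GPUs is an assumption, and the resource module is only available on Unix-like systems.

```python
import resource


def per_gpu_batch_size(batch_size, world_size):
    """Return the per-GPU share, assuming batch_size is split evenly."""
    if batch_size % world_size != 0:
        raise ValueError(
            f"batch_size={batch_size} is not divisible by world_size={world_size}"
        )
    return batch_size // world_size


def raise_open_file_limit(target=262144):
    """Try to raise this process's soft open-file limit toward target.

    Roughly what `ulimit -n` does for a shell, but for the current process
    and the workers it starts. Returns the soft limit in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)  # the soft limit cannot exceed the hard limit
    if soft == resource.RLIM_INFINITY or soft >= target:
        return soft  # already high enough
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    except (ValueError, OSError):
        return soft  # the OS refused; keep the current limit
    return target


print(per_gpu_batch_size(32, 4))
print(raise_open_file_limit())
```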
@inproceedings{qian2021speech,
title={Speech Drives Templates: Co-Speech Gesture Synthesis with Learned Templates},
author={Qian, Shenhan and Tu, Zhi and Zhi, Yihao and Liu, Wen and Gao, Shenghua},
booktitle={2021 IEEE/CVF International Conference on Computer Vision (ICCV)},
pages={11057--11066},
year={2021},
organization={IEEE}
}
