
Speech Drives Templates 3D

This repository allows you to train a model from one or more videos and to predict a person's gestures from an audio track.
It is based on SpeechDrivesTemplates, which works with 2 coordinates (x, y); we have expanded it to 3 coordinates (x, y, z).

Organization

The main directory is split into two subdirectories:

  • Preprocessing3D: extends the SpeechDrivesTemplates preprocessing to handle a third coordinate (z), in order to create a 3D dataset.
  • SpeechDrivesTemplates: the core of the project; it contains all the files needed to train and test a 3D model.

The files changed in SpeechDrivesTemplates are:

  • core/networks/poses_recostruction/autoencoder.py
  • core/networks/keypoints_generation/generator.py
  • core/datasets/gesture_dataset.py
  • core/utils/keypoint_visualization.py
  • core/datasets/speaker_stat.py
  • core/pipelines/voice2pose.py
  • data_process/4_1_calculate_mean_std.py
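
The common thread across these files is the keypoint dimensionality: arrays the upstream code handles as (2, num_keypoints), one row for x and one for y, gain a third row for z. Below is a minimal sketch of the idea, assuming NumPy arrays; the identifiers and keypoint count are illustrative, not the repo's actual names.

    import numpy as np

    NUM_KEYPOINTS = 137  # hypothetical; the real count depends on the pose extractor

    # Upstream (2D): one row for x, one for y.
    pose_2d = np.zeros((2, NUM_KEYPOINTS), dtype=np.float32)

    # This fork (3D): a z row is added, so every consumer of the array
    # (autoencoder, generator, dataset loader, visualization, mean/std
    # computation) moves from shape (2, K) to (3, K).
    pose_3d = np.zeros((3, NUM_KEYPOINTS), dtype=np.float32)
    pose_3d[:2] = pose_2d  # x and y carried over from the 2D pipeline
    # pose_3d[2] is filled with the z coordinate by Preprocessing3D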

Dataset

We created a small 3D dataset based on a university teacher, but building a larger dataset would yield a better model; for example, you can use the speakers from Speech2Gesture.
You can download our preprocessed dataset from this link.
You can find the results at this link.
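
To sanity-check a preprocessed clip, you can inspect one of the NPZ files produced by the preprocessing scripts described below. The file and key names here are assumptions for illustration; list the real keys with data.files:

    import numpy as np

    data = np.load("clip_0001.npz")  # hypothetical file name
    print(data.files)                # the actual keys written by 3_1_generate_clips.py

    # 'keypoints' is an assumed key; replace it with one of the names printed above.
    kp = data["keypoints"]
    print(kp.shape)  # one axis should have size 3: the (x, y, z) coordinates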

Execute Preprocessing3D

To build a 3D dataset, we provide scripts in Preprocessing3D.
The scripts must be run in the following order (a complete example run follows the list):

  • 1_1_change_fps.py: takes 2 arguments, the videos directory and the target directory where the 15-FPS videos will be saved.
  • 1_2_video2frames.py: takes 2 arguments, the 15-FPS videos directory and the directory where the frames will be saved (we suggest using the same path specified in the code).
  • preprocessing.py: takes 2 arguments, the frames path and the output path.
  • fixing.py: takes one argument, the output path.
  • 3_1_generate_clips.py: generates the NPZ clip files.
  • 3_2_split_train_val_test.py: creates a CSV file splitting the clips into training, validation, and test sets.
  • 4_1_calculate_mean_std.py: calculates the mean and std for each keypoint.
  • 4_2_parse_mean_std_npz.py: reshapes the mean and std.

After that, insert the mean and std values into speaker_stat.py.
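
As a concrete example, a full preprocessing run could look like the following. The paths are placeholders, and the scripts whose arguments are not documented above are shown without arguments, so check each script's actual usage before running:

    cd Preprocessing3D
    python 1_1_change_fps.py ./raw_videos ./videos_15fps
    python 1_2_video2frames.py ./videos_15fps ./frames
    python preprocessing.py ./frames ./output
    python fixing.py ./output
    python 3_1_generate_clips.py
    python 3_2_split_train_val_test.py
    python 4_1_calculate_mean_std.py
    python 4_2_parse_mean_std_npz.py

In the upstream project, speaker_stat.py stores per-speaker statistics in a Python dict; the structure in this fork may differ, so treat the snippet below as a hypothetical illustration of where the values from 4_2_parse_mean_std_npz.py end up:

    SPEAKERS_STAT = {
        'speaker_name': {
            'mean': [...],  # per-keypoint means from 4_2_parse_mean_std_npz.py
            'std': [...],   # per-keypoint stds, same shape as 'mean'
        },
    }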

Execute Model

To run the model, we suggest using the Execute.ipynb notebook in Google Colab.
We suggest running preprocessing.py and fixing.py locally, as you may run into Python-version problems in Google Colab.
To run the code on your local machine, you need to install CUDA on your device.
You also need to create the dataset and output directories and set them in the configuration files (voice2pose_sdt_bp.yaml, default.py); see the sketch below.
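
The exact configuration keys depend on this fork, but the entries to look for are the dataset root and the output directory. The key names below are hypothetical, so match them against what default.py actually declares:

    # voice2pose_sdt_bp.yaml -- DATASET.ROOT_DIR and SYS.OUTPUT_DIR are
    # hypothetical key names; check default.py for the real ones
    DATASET:
      ROOT_DIR: /path/to/your/dataset
    SYS:
      OUTPUT_DIR: /path/to/your/output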

Training

    python main.py --config_file configs/voice2pose_sdt_bp.yaml \
        --tag speaker_name \
        DATASET.SPEAKER speaker_name \
        SYS.NUM_WORKERS 32
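
The trailing KEY VALUE pairs (DATASET.SPEAKER, SYS.NUM_WORKERS) appear to be yacs-style command-line overrides of the values in default.py and the YAML file, so you can adjust them per run without editing the configs. For example, on a machine with fewer CPU cores you might lower the worker count:

    python main.py --config_file configs/voice2pose_sdt_bp.yaml \
        --tag speaker_name \
        DATASET.SPEAKER speaker_name \
        SYS.NUM_WORKERS 4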

Authors

  • Lorenzo Cassano (mat.718331)
  • Jacopo D'Abramo (mat. 716484)
