amirbar / speech2gesture
code for training the models from the paper "Learning Individual Styles of Conversational Gestures"
Hi!
I've tried to run the inference on ellen's checkpoint by installing CUDA 9.0 on Google Colab, and running the command
!python -m audio_to_multiple_pose_gan.predict_audio --audio_path speech.wav --output_path output --checkpoint checkpoint/ckpt-step-296700.ckp --speaker ellen -ag audio_to_pose_gans --gans 1
It throws the following error:
2019-08-06 01:49:17.499788: E tensorflow/stream_executor/cuda/cuda_dnn.cc:332] could not create cudnn handle: CUDNN_STATUS_NOT_INITIALIZED
2019-08-06 01:49:17.499934: E tensorflow/stream_executor/cuda/cuda_dnn.cc:340] possibly insufficient driver version: 410.79.0
After predicting, how can I get a video of a real person?
May I ask how you used OpenPose to extract the keypoints? There are 49 keypoints; how do they correspond to OpenPose's output?
Thank you for the data you collected. However, I encountered some problems in getting the data.
We noticed that the videos of "Rock" are currently unavailable for download.
I also don't know how to download the videos of "Jon"; could you give me some help?
Hello,
When do you plan to release your code?
Thank you very much for publishing this excellent work. I am trying the inference code with your provided pre-trained model. I downloaded some audio from the Ellen show and used it as the model input, but the generated poses don't look right, especially the hands. I wonder if I messed something up, or whether I need to preprocess audio downloaded from the web.
Another question: what will I get if I use somebody else's audio as input to Ellen's model? Will I still get reasonable results?
Thank you very much!
Installing packages:
conda create -n x python=3.6
conda activate x
conda install --file requi* -y
Download audio:
youtube-dl https://www.youtube.com/watch?v=6IdXEOdRxPs -x --audio-format wav
Errors and solutions:
ModuleNotFoundError: No module named 'numba.decorators'
conda install numba==0.48
TypeError: unsupported operand type(s) for /: 'Dimension' and 'int'
change: reshaped = tf.reshape(pose_batch, (-1, 64, 2, shape[-1]/2))
to: reshaped = tf.reshape(pose_batch, (-1, 64, 2, shape[-1].value/2))
TypeError: Value passed to parameter 'shape' has DataType float32 not in list of allowed values: int32, int64
change: reshaped = tf.reshape(pose_batch, (-1, 64, 2, shape[-1].value/2))
to: reshaped = tf.reshape(pose_batch, (-1, 64, 2, int(shape[-1].value/2)))
DataLossError (see above for traceback): Unable to open table file /media/data/study/AI/speech2gesture/rock-20210114T070036Z-001/rock/ckpt-step-296700.ckp.data-00000-of-00001: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?
[[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]
change: --checkpoint '/media/data/study/AI/speech2gesture/rock-20210114T070036Z-001/rock/'
to: --checkpoint '/media/data/study/AI/speech2gesture/rock-20210114T070036Z-001/rock/ckpt-step-296700.ckp'
-- pass the checkpoint prefix, not one of the *.ckp.(index|meta|data) files
You will have to change the paths
How to run:
/home/vali/system/apps/anaconda3/envs/x/bin/python -m audio_to_multiple_pose_gan.predict_audio --audio_path '/media/data/study/AI/speech2gesture/History of Rock, Part 1 by University of Rochester-6IdXEOdRxPs.wav' --output_path '/media/data/study/AI/speech2gesture/tmp/' --checkpoint '/media/data/study/AI/speech2gesture/rock-20210114T070036Z-001/rock/ckpt-step-296700.ckp' --speaker rock -ag audio_to_pose_gans --gans 1
Hello, thanks for your great work.
I also need the following dataset files for study and research, but the provided link is invalid:
the file frames_df_10_19_19.csv, a single or multiple speakers' keypoints & frames tar file, the file containing all video links video_links.csv, and intervals_df.csv.
@amirbar I'm facing an error when I run python -m audio_to_multiple_pose_gan.predict_audio --audio_path oliver_test.wav --output_path tmp_output/ --checkpoint Gestures/pretrained_models/oliver/ckpt-step-296700.ckp.data-00000-of-00001 --speaker oliver -ag audio_to_pose_gans --gans 1
for inference on audio.
This is the output error I get:
DataLossError (see above for traceback): Unable to open table file Gestures/pretrained_models/conan/ckpt-step-296700.ckp.data-00000-of-00001: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator? [[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]
Any idea what's causing this? (Do let me know if you want the full log)
Hello, I'm trying to do work that needs the speech2gesture dataset, but it seems the link provided in dataset.md is invalid now. Could you please provide another link? Thanks!
According to the file common/consts.py, we know that
SR = 16000
AUDIO_SHAPE = 67267
FPS = 15
FRAMES_PER_SAMPLE = 64
From the first three constants, we can compute num_frames = AUDIO_SHAPE / SR * FPS = 67267 / 16000 * 15 ≈ 63.06, which is about one whole frame less than FRAMES_PER_SAMPLE.
We have encountered this problem when we were trying to test the model on a longer audio sequence, for which the misalignment is magnified.
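The arithmetic above can be checked directly (constant values copied from common/consts.py):

```python
# Constants from common/consts.py
SR = 16000              # audio sample rate (Hz)
AUDIO_SHAPE = 67267     # audio samples per training example
FPS = 15                # pose frame rate
FRAMES_PER_SAMPLE = 64  # pose frames predicted per training example

# Pose frames actually covered by one audio clip:
num_frames = AUDIO_SHAPE / SR * FPS          # ≈ 63.06

# Shortfall relative to the 64 predicted frames:
shortfall = FRAMES_PER_SAMPLE - num_frames   # ≈ 0.94 frames
print(num_frames, shortfall)
```

On a 4-second clip this is under one frame, but the per-clip shortfall accumulates over a long audio sequence, which matches the misalignment described above.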
Hello, I appreciate your work!
I have a question: after trying your pretrained models, the result is a skeletal animation, but I want an animation with a human figure. Did I do something wrong? If so, please tell me which file I should run. Thanks.
Could you please explain how to get the SPEAKER_CONFIG params in consts.py? It seems these params are used to normalize the keypoints, but I am confused about how they are determined and about the purpose of the normalization process.
Hi I tried to run this:
python -m audio_to_multiple_pose_gan.predict_audio --audio_path /content/parte1_2.wav --output_path /content/ --checkpoint ????? --speaker angelica -ag audio_to_pose_gans --gans 1
I have problems with the checkpoints, because the given folder of pretrained models contains no .ckp file!
Can anyone help?
Excuse me, I'm new to video processing and am confused by the 29.97 in this script. It looks like you resample all the videos to 29.97 Hz, but why not 30 Hz? Thanks in advance.
Hi, thank you so much for your work!
I met some problems reproducing your work. Many pre-trained models are missing from the Google Drive: all of them lack the data-00000-of-00001 files, and some contain only an .index file. https://drive.google.com/drive/folders/1yBJur-FjtMGNZTKKvEY5WuppG2yp2SJO
Then I tried to retrain your model, but frames_df_10_19_19.csv is also missing from the drive.
It seems strange that someone could reproduce the results in 2021, so did you remove those files from the drive?
Looking forward to your reply!
https://drive.google.com/drive/folders/1qvvnfGwas8DUBrwD4DoBnvj8anjSLldZ
Hi,
Thank you for your great work.
In the paper you mentioned that you train for 300K/90K iterations with and without an adversarial loss, respectively, to achieve the results presented in Tables 1-3.
I assume that you first run the train.py script with --epochs 1000 and --lambda_gan 1.0. Then, you select the best model and run the same script again with --epochs 300 and --lambda_gan 0. The best model of the second run should achieve the presented results. Is that assumption correct, or did you use a different approach?
Hello Amir,
Does your generator provide any random noise as input? or it generates the same pose based on the same audio input? I'm working on a similar sequence generation project. I'm curious how you put the randomness into sequence.
Thanks.
This is a great project! I found a few typos / missed instructions while following the dataset.md page to prepare the dataset. Hope this could be helpful to other people.
Typos:
Folder structure in Download speaker data / 4. Download the speaker videos from YouTube: there should be one video folder for each speaker instead of two.
Download crop_intervals.py rather than crop_intervals.csv
Missed Instructions
pip install --upgrade youtube_dl
"During training, we take as input spectrograms corresponding to about 4 seconds of audio and predict 64 pose vectors, which correspond to about 4 seconds at a 15Hz frame-rate. At test time we can run our network on arbitrary audio durations"(Section 4.3).
What are the details of the testing implementation?
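For reference, a small sketch of how the test-time output length would scale, assuming the network is fully convolutional over time (this is my reading of the quoted passage, not the authors' released code):

```python
# Assumption: a fully convolutional generator predicts pose frames in
# proportion to the audio duration, at the paper's 15 Hz frame rate.
SR = 16000   # audio sample rate used by the repo
FPS = 15     # pose frame rate from the paper

def expected_pose_frames(num_audio_samples, sr=SR, fps=FPS):
    """Approximate number of pose frames covering the given audio length."""
    return round(num_audio_samples / sr * fps)

print(expected_pose_frames(10 * SR))  # 10 s of audio -> 150 pose frames
```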
I notice that the loss function of the discriminator is MSE. In my view, cross-entropy is the commonly used loss for binary classification in a discriminator, so I want to know why MSE was chosen. I'm also confused by the output of the discriminator: it is a vector whose length differs from the input's. What does it denote?
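For context on the question above: an MSE discriminator objective is the "least-squares GAN" (LSGAN) formulation, and a vector-valued output usually indicates a patch-style discriminator that scores local temporal windows rather than the whole sequence. A minimal numpy sketch with made-up illustrative scores (not values from the repo):

```python
import numpy as np

# Hypothetical discriminator outputs: one score per temporal window
# (a patch-style discriminator scores local chunks of the pose sequence,
# so its output length differs from the input length).
d_real = np.array([0.9, 0.8, 0.95])  # scores on real pose sequences
d_fake = np.array([0.2, 0.1, 0.3])   # scores on generated sequences

# Least-squares (LSGAN) discriminator loss: push real scores toward 1
# and fake scores toward 0 using MSE instead of cross-entropy.
d_loss = np.mean((d_real - 1.0) ** 2) + np.mean(d_fake ** 2)

# Generator adversarial loss: push fake scores toward 1.
g_loss = np.mean((d_fake - 1.0) ** 2)
print(d_loss, g_loss)
```

The mean over the score vector averages the per-window judgments, so every local chunk of the generated sequence is pushed toward looking realistic.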
Hello, I appreciate your work!
I have a question: where is frame_df.csv.tgz? I cannot find it at the link; please tell me how and where to get it. Thanks.
Hi @amirbar,
Can the dataset be helpful for generating speech from gestures?
Sign language applications and human-computer interaction could be among the use cases.
Hi, and thanks for this work.
Could you please share the scripts or explain how you built this dataset? I am trying to build a similar dataset for another language, and there are many files (such as the jpg files) and columns in the data frames (such as interval_id and frame_id) for which it is not clear how they were extracted.
Please help me with this.