
audio_visual_speech_enhancement's Introduction

Face Landmark-based Speaker-Independent Audio-Visual Speech Enhancement in Multi-Talker Environments

Implementation of the audio-visual speech enhancement system described in the paper Face Landmark-based Speaker-Independent Audio-Visual Speech Enhancement in Multi-Talker Environments by the University of Modena and Reggio Emilia and the Istituto Italiano di Tecnologia.

If you are interested in this work, check out the project page.

Getting Started

Install requirements

All code is written for Python 3. Create a virtual environment (optional) and install all the requirements by running:

pip install -r requirements.txt

Usage

The main program is av_speech_enhancement.py. You can get a list of subcommands by typing av_speech_enhancement.py -h. Try av_speech_enhancement.py <subcommand> -h for more information about a subcommand. The audio-visual dataset must have the following directory structure:

s1
  /audio
    /file1.wav
    /file2.wav
    ...
  /video
    /file1.mpg
    /file2.mpg
    ...
s2
  /audio
    /file1.wav
    /file2.wav
    ...
  /video
    /file1.mpg
    /file2.mpg
    ...
...

Mixed-speech generation

Generate mixed speech for the training, validation and test sets separately:

av_speech_enhancement.py mixed_speech_generator
	--data_dir <data_dir>
	--base_speaker_ids <spk1> <spk2> <...>
	[--noisy_speaker_ids <spk1> <spk2> <...>]
	--audio_dir <audio_dir>
	--dest_dir <dest_dir>
	--num_samples <num_samples>
	--num_mix <num_mix>
	--num_mix_speakers <num_mix_speakers> {1,2}

The generated files are organized as follows:

TRAINING_SET
  /s1
    /file1_with_s2_file2.wav
    /file2_with_s10_file4.wav
    ...
  /s2
    /file1_with_s12_file5.wav
    /file2_with_s1_file1.wav
    ...
  ...
VALIDATION_SET
  ...
TEST_SET
  ...
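
Each generated mixed file is essentially the sum of a base utterance and an interfering utterance from another speaker. Below is a minimal sketch of that mixing step, assuming 16 kHz WAV files, the soundfile library, and plain additive mixing with peak normalization (the actual generator may pair and scale files differently):

import numpy as np
import soundfile as sf

def mix_tracks(base_path, noise_path, out_path):
    # Load both utterances; they are assumed to share one sample rate.
    base, sr = sf.read(base_path)
    noise, noise_sr = sf.read(noise_path)
    assert sr == noise_sr, "sample rates must match"
    # Zero-pad the shorter signal so both waveforms have equal length.
    n = max(len(base), len(noise))
    base = np.pad(base, (0, n - len(base)))
    noise = np.pad(noise, (0, n - len(noise)))
    # Additive mixing, rescaled to avoid clipping.
    mix = base + noise
    mix = mix / max(1.0, np.abs(mix).max())
    sf.write(out_path, mix, sr)

mix_tracks('s1/audio/file1.wav', 's2/audio/file2.wav',
           'TRAINING_SET/s1/file1_with_s2_file2.wav')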

Audio pre-processing

Compute power-law compressed spectrograms of the mixed-speech audio samples. Repeat this operation for the training, validation and test sets. Files are saved in NPY format.

av_speech_enhancement.py audio_preprocessing
	--data_dir <data_dir>
	--speaker_ids <spk1> <spk2> <...>
	--audio_dir <audio_dir>
	--dest_dir <dest_dir>
	--sample_rate <sample_rate>
	--max_wav_length <max_wav_length>
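
A power-law compressed spectrogram is simply the magnitude of the STFT raised to an exponent below 1, which compresses the dynamic range. Below is a minimal sketch of what this subcommand computes per file; the STFT parameters and the 0.3 exponent are illustrative assumptions, not necessarily the values used by the script:

import numpy as np
import librosa

def save_spectrogram(wav_path, sr=16000, n_fft=512, hop_length=160):
    samples, _ = librosa.load(wav_path, sr=sr)
    # Short-time Fourier transform, frames along the first axis.
    stft = librosa.stft(samples, n_fft=n_fft, hop_length=hop_length).T
    # Power-law compression of the magnitude spectrogram.
    spec = np.abs(stft) ** 0.3
    np.save(wav_path.replace('.wav', '.npy'), spec)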

Video pre-processing

Extract face landmarks from video using the Dlib face detector and face landmark extractor. Files are saved in TXT format (each row has 136 values that represent the flattened x-y coordinates of the 68 face landmarks).

av_speech_enhancement.py video_preprocessing
	--data_dir <data_dir>
	--speaker_ids <spk1> <spk2> <...>
	--video_dir <video_dir>
	--dest_dir <dest_dir>
	--shape_predictor <shape_predictor_file>
	--ext <video_file_extension>

<shape_predictor_file> contains the parameters of the face landmark extractor model. You can download a pre-trained model file here.
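
To get a feel for what the extractor produces, here is a minimal per-frame landmark extraction sketch with Dlib and OpenCV (file names are illustrative; the real script also handles frames where no face is detected):

import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor('shape_predictor_68_face_landmarks.dat')

def extract_landmarks(video_path, out_path):
    cap = cv2.VideoCapture(video_path)
    rows = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = detector(gray, 1)  # upsample once to catch small faces
        if faces:
            shape = predictor(gray, faces[0])
            # Flatten the 68 (x, y) points into one row of 136 values.
            rows.append([c for p in shape.parts() for c in (p.x, p.y)])
    cap.release()
    np.savetxt(out_path, np.array(rows), fmt='%d')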

If you want to check the result of the face landmark extractor, type:

av_speech_enhancement.py show_face_landmarks
	--video <video_file>
	--fps <fps>
	--shape_predictor <shape_predictor_file>

Computing Target Binary Masks

Compute TBMs from clean audio samples. For each speaker, the Long-Term Average Speech Spectrum (LTASS) is computed, and then the threshold is applied to all clean audio samples in <audio_dir>.

av_speech_enhancement.py tbm_computation
	--data_dir <data_dir>
	--speaker_ids <spk1> <spk2> <...>
	--audio_dir <audio_dir>
	--dest_dir <dest_dir>
	--sample_rate <sample_rate>
	--max_wav_length <max_wav_length>
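
Conceptually, the TBM marks a time-frequency bin as target-dominated when the clean-speech magnitude in that bin exceeds a speaker-specific threshold derived from the LTASS. A minimal sketch of that idea (the exact threshold rule used by the script may differ):

import numpy as np

def compute_ltass(spectrograms):
    # LTASS: mean magnitude per frequency bin over all clean
    # utterances of one speaker; each input is [frames, freq].
    return np.concatenate(spectrograms, axis=0).mean(axis=0)

def compute_tbm(spec, ltass, factor=1.0):
    # spec: [frames, freq] magnitude spectrogram of one clean utterance.
    # A bin is 1 when it exceeds the per-frequency threshold, else 0.
    return (spec > factor * ltass).astype(np.float32)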

TFRecords generation

Before training, you have to generate TFRecords of the mixed-speech dataset. <data_dir>/<mix_dir> must have three subdirectories named TRAINING_SET, VALIDATION_SET and TEST_SET, created with the mixed_speech_generator subcommand. Pre-computed spectrograms (NPY format) must be located in the same directory as the audio files. Set <tfrecords_mode> to "fixed" if all samples of the dataset have the same length (as in the GRID corpus); otherwise use "var" (as in the TCD-TIMIT corpus).

av_speech_enhancement.py tfrecords_generator
	--data_dir <data_dir>
	--num_speakers <number_speakers_mixed> {2,3}
	--mode <tfrecords_mode> {fixed,var}
	--dest_dir <dest_dir>
	--base_audio_dir <base_audio_dir>
	--video_dir <video_dir>
	--tbm_dir <tbm_dir>
	--mix_audio_dir <mix_audio_dir>
	--delta <delta_video_feat> {0,1,2}
	--norm_data_dir <normalization_data_dir>
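
Each TFRecord stores one mixed-speech sample as a tf.train.SequenceExample whose feature keys include base_audio_wav, mix_audio_wav, other_audio_wav, the corresponding file paths and sequence_length (these keys appear in the issues below). A rough sketch of writing such a record follows; the exact key layout and framing are assumptions, not taken from the actual script:

import numpy as np
import tensorflow as tf

def floats(values):
    return tf.train.Feature(float_list=tf.train.FloatList(value=values))

def write_sample(writer, base_wav, mix_wav, other_wav):
    # One feature holding each whole waveform (layout assumed).
    example = tf.train.SequenceExample(
        feature_lists=tf.train.FeatureLists(feature_list={
            'base_audio_wav': tf.train.FeatureList(feature=[floats(base_wav)]),
            'mix_audio_wav': tf.train.FeatureList(feature=[floats(mix_wav)]),
            'other_audio_wav': tf.train.FeatureList(feature=[floats(other_wav)]),
        }))
    writer.write(example.SerializeToString())

# Example usage with dummy waveforms of equal length:
base = np.zeros(48000, dtype=np.float32)
mix = np.zeros(48000, dtype=np.float32)
other = np.zeros(48000, dtype=np.float32)
with tf.io.TFRecordWriter('sample_00000.tfrecords') as writer:
    write_sample(writer, base, mix, other)

Note the parsing errors reported in the issues below ("Number of float values != expected"): with mode "fixed" the parser expects every waveform feature to hold exactly <num_audio_samples> values, so all clips must be padded or cut to the same length before this step.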

Training

Train one of the audio-visual speech enhancement models described in the paper. You can choose between the VL2M, VL2M_ref, Audio-Visual Concat and Audio-Visual Concat-ref models.

av_speech_enhancement.py training
	--data_dir <data_dir>
	--train_set <training_set_subdir>
	--validation_set <validation_set_subdir>
	--exp <experiment_id>
	--mode <tfrecords_mode> {fixed,var}
	--audio_dim <audio_frame_dimension>
	--video_dim <video_frame_dimension>
	--num_audio_samples <num_audio_samples>
	--model <model_selection> {vl2m,vl2m_ref,av_concat_mask,av_concat_mask_ref}
	--opt <optimizer_choice> {sgd,adam,momentum}
	--learning_rate <learning_rate>
	--updating_step <updating_step>
	--learning_decay <learning_decay>
	--batch_size <batch_size>
	--epochs <num_epochs>
	--hidden_units <num_hidden_lstm_units>
	--layers <num_lstm_layers>
	--dropout <dropout_rate>
	--regularization <regularization_weight>
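
The flags --learning_rate, --updating_step and --learning_decay together suggest a staircase schedule: every <updating_step> training steps the learning rate is multiplied by <learning_decay>. The sketch below is inferred from the flag names and the training logs in the issues further down; check training.py for the exact rule:

def learning_rate_at(step, initial_lr, updating_step, learning_decay):
    # Staircase decay: multiply by learning_decay every updating_step steps.
    return initial_lr * learning_decay ** (step // updating_step)

# e.g. learning_rate_at(2500, 1e-3, 1000, 0.1) == 1e-5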

Testing

Test your trained model. Enhanced speech samples and estimated masks are saved in <data_dir>/<output_dir>. Estimated masks are saved in the <mask_dir> subdirectory of each speaker directory.

av_speech_enhancement.py testing
	--data_dir <data_dir>
	--test_set <test_set_subdir>
	--exp <experiment_id>
	--ckp <model_checkpoint>
	--mode <tfrecords_mode> {fixed,var}
	--audio_dim <audio_frame_dimension>
	--video_dim <video_frame_dimension>
	--num_audio_samples <num_audio_samples>
	--output_dir <output_dir>
	--mask_dir <mask_dir>

Reference

If this project is useful for your research, please cite:

@inproceedings{morrone2019face,
  title={Face Landmark-based Speaker-Independent Audio-Visual Speech Enhancement in Multi-Talker Environments},
  author={Morrone, Giovanni and Bergamaschi, Sonia and Pasa, Luca and Fadiga, Luciano and Tikhanoff, Vadim and Badino, Leonardo},
  booktitle={2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={6900--6904},
  year={2019},
  organization={IEEE}
}

audio_visual_speech_enhancement's People

Contributors

dr-pato


audio_visual_speech_enhancement's Issues

Training Model

Hello sir,
I am facing some problems while running the program.
Currently I am working with 10 speakers: 6 for training, 2 for validation and 2 for testing.
My directory structure is:

Data
  /s1
    /audio (containing .wav and .npy files)
    /video (containing .mpg and .txt files)
    /TBM
  /s2
  ...
  /s10
  /mix
    /Training_set (containing .wav and .npy files; 6 samples each of 6 speakers, 36 .wav files in total)
    /Test_Set (4 .wav)
    /Validation_Set (4 .wav)
  /tfrecords
    /Training_set (212 files)
    /Test_set (8 files)
    /Validation_Set (8 files)

I am working with the training.py file and using the av_concat_mask_ref model; while running it I am facing some errors.
I am passing arguments based on these function calls:
1. config = Configuration(args.learning_rate, args.updating_step, args.learning_decay, args.dropout, args.batch_size, args.opt, args.video_dim, args.audio_dim, args.num_audio_samples, args.epochs, args.hidden_units, args.layers, args.regularization, args.mask_threshold)
2. train(args.model, args.data_dir, args.train_set, args.val_set, config, args.exp, args.mode)
The actual parameters I am passing are:
config = Configuration(10^-3, 1000, 1.0, 1, 366, 'adam', 136, 257, 366, 5, 250, 3, 0, -1)
train('av_concat_mask_ref', '/content/drive/My Drive/project1', 'tfrecords/TRAINING_SET', 'tfrecords/VALIDATION_SET', config, '0', 'fixed')

I am getting this error:
InvalidArgumentError: {{function_node __inference_Dataset_map_DataManager.read_data_format_fixed_32951}} Name: , Key: base_audio_wav, Index: 0. Number of float values != expected. values size: 31040 but output shape: [216]
[[{{node ParseSingleSequenceExample/ParseSingleSequenceExample}}]]
[[validation_batch/IteratorGetNext]]

During handling of the above exception, another exception occurred:

InvalidArgumentError Traceback (most recent call last)

/tensorflow-1.15.2/python3.6/tensorflow_core/python/client/session.py in _do_call(self, fn, *args)
1382 '\nsession_config.graph_options.rewrite_options.'
1383 'disable_meta_optimizer = True')
-> 1384 raise type(e)(node_def, op, message)
1385
1386 def _extend_graph(self):

InvalidArgumentError: Name: , Key: base_audio_wav, Index: 0. Number of float values != expected. values size: 31040 but output shape: [216]
[[{{node ParseSingleSequenceExample/ParseSingleSequenceExample}}]]
[[validation_batch/IteratorGetNext]]

Can you please help me solve this issue?
Thank you.

mean and standard deviation file

Hello sir,
Thank you for helping me with the previous issue.

Where should the norm_data_dir created when running the create_dataset_tfrecords.py file be located? Does norm_data_dir need to be stored somewhere for future processing?

Facial landmark extractor not working

Hi, thanks for your nice & clean repo, dr. pato.
I was trying to reproduce the results, but I'm having several problems.
I guess the facial landmark extractor somehow failed to compute the facial landmarks.
While computing facial landmarks, I got these OpenCV exceptions for quite a few videos.

[ERROR:0] global /tmp/pip-install-vl9sdgay/opencv-python/opencv/modules/videoio/src/cap.cpp (142) 
open VIDEOIO(CV_IMAGES): raised OpenCV exception:
OpenCV(4.4.0) /tmp/pip-install-vl9sdgay/opencv-python/opencv/modules/videoio/src/cap_images.cpp:253: error: (-5:Bad argument) 
CAP_IMAGES: can't find starting number (in the name of file): /tf/data/GRID/s1/video/lbbezn.mpg in function 'icvExtractPattern'

And I found out that the produced TXT files were all empty.

These are the scripts I have run so far.

# Generate mixed speech
python av_speech_enhancement.py mixed_speech_generator --data_dir /tf/data/GRID --base_speaker_ids 1 2 3 --audio_dir . --dest_dir MIXED/TRAINING_SET --num_samples 3 --num_mix 4 --num_mix_speakers 1
python av_speech_enhancement.py mixed_speech_generator --data_dir /tf/data/GRID --base_speaker_ids 1 2 3 --audio_dir . --dest_dir MIXED/VALIDATION_SET --num_samples 3 --num_mix 4 --num_mix_speakers 1
python av_speech_enhancement.py mixed_speech_generator --data_dir /tf/data/GRID --base_speaker_ids 1 2 3 --audio_dir . --dest_dir MIXED/TEST_SET --num_samples 3 --num_mix 4 --num_mix_speakers 1

# Compute power-law compressed spectrograms of mixed-speech audio samples.
python av_speech_enhancement.py audio_preprocessing --data_dir /tf/data/GRID/MIXED/TRAINING_SET --speaker_ids 1 2 3 --audio_dir . --dest_dir . -ml 48000
python av_speech_enhancement.py audio_preprocessing --data_dir /tf/data/GRID/MIXED/VALIDATION_SET --speaker_ids 1 2 3 --audio_dir . --dest_dir . -ml 48000
python av_speech_enhancement.py audio_preprocessing --data_dir /tf/data/GRID/MIXED/TEST_SET --speaker_ids 1 2 3 --audio_dir . --dest_dir . -ml 48000

# Extract face landmarks from video
python av_speech_enhancement.py video_preprocessing --data_dir /tf/data/GRID --speaker_ids 1 2 3 --video_dir video --dest_dir face_landmark --shape_predictor /tf/data/GRID/shape_predictor_68_face_landmarks.dat --ext mpg

Below is the directory structure of my dataset (/tf/data/GRID):

|-- MIXED
|   |-- TEST_SET
|   |-- TRAINING_SET
|   `-- VALIDATION_SET
|-- s1
|   |-- audio
|   |-- face_landmark
|   |-- tbm
|   `-- video
|-- s2
|   |-- audio
|   |-- face_landmark
|   |-- tbm
|   `-- video
|-- s3
|   |-- audio
|   |-- face_landmark
|   |-- tbm
|   `-- video
|-- shape_predictor_68_face_landmarks.dat
`-- tfrecords
    |-- TEST_SET
    |-- TRAINING_SET
    `-- VALIDATION_SET

memory leak on gpu

Hi, I got a memory issue (in RAM, not VRAM) when running this code on GPU.
After training 1-2 epochs (15000 test cases with batch size 128), the process was killed, either with a segmentation fault or with no error message at all.

I checked that the metric computation (the SNR/SDR stuff) makes CPU utilization and RAM grow, so I removed it, but the memory leak still didn't disappear.

Any clues why the memory leak happens?

ubuntu 16.04 / tensorflow(gpu) 1.14 (also checked with 1.15) / GTX 1080 Ti

Problem with files inside testing results

Hi Dr. Pato, it's me again...
(I really wish I could close this issue by myself again...)

I have issues with the testing results.
I successfully ran the training/testing code with no errors, but when I checked the results by listening to the wav files inside SAVED/enhanced, the results were poor. The audio was not separated at all.

I first thought that maybe the number of epochs was not enough or something
(I saw that you mentioned 10 is enough, so I just set the epochs to 10),
so I trained for several more epochs, but the results were still poor.

However, I found this problem:
the audio files inside SAVED/mixed and SAVED/target were strange.
As far as I understood, the files inside SAVED/target should be the base (clean) audio, and those inside SAVED/mixed should be the mixed audio.
However, I found that the SAVED/target files were mostly mixed up, and some files inside SAVED/mixed were base audio...

The mixed speech generator worked well; I literally listened to every audio file inside MIXED/TRAINING_SET, MIXED/VALIDATION_SET and MIXED/TEST_SET.

But I found that several files inside SAVED/mixed were not mixed at all. There were several "base wav" files inside SAVED/mixed.

For example, MIXED/TEST_SET/s1/lwbf3s_with_s3_lraq8n.wav was mixed audio, but SAVED/mixed/s1_lwbf3s_with_s3_lraq8n.wav was clean audio.

So the problem is that, SOMEHOW, some of the SAVED/mixed files are not mixed audio but clean audio. I followed your testing.py file,
and it points to the problem possibly residing in the TFRecord files themselves.

However, I went through your create_dataset_tfrecords.py file line by line, printing the data path directories, and found no problems there:
file_base_audio /tf/data/GRID/s3/audio/pgaq5s.wav
file_mix_audio /tf/data/GRID/MIXED/TEST_SET/s3/pgaq5s_with_s2_sgbp3n.wav
file_other_audio /tf/data/GRID/s2/audio/sgbp3n.wav
At this point, I really can't make any sense of what my problem is, so
I'm just guessing that something strange happened in the TFRecord generation part?

FYI, These are the scripts I ran.

# Generate mixed speech
python av_speech_enhancement.py mixed_speech_generator --data_dir /tf/data/GRID --base_speaker_ids 1 2 3 --audio_dir . --dest_dir MIXED/TRAINING_SET --num_samples 3 --num_mix 4 --num_mix_speakers 1
python av_speech_enhancement.py mixed_speech_generator --data_dir /tf/data/GRID --base_speaker_ids 1 2 3 --audio_dir . --dest_dir MIXED/VALIDATION_SET --num_samples 3 --num_mix 4 --num_mix_speakers 1
python av_speech_enhancement.py mixed_speech_generator --data_dir /tf/data/GRID --base_speaker_ids 1 2 3 --audio_dir . --dest_dir MIXED/TEST_SET --num_samples 3 --num_mix 4 --num_mix_speakers 1

# Compute power-law compressed spectrograms of mixed-speech audio samples.
python av_speech_enhancement.py audio_preprocessing --data_dir /tf/data/GRID/MIXED/TRAINING_SET --speaker_ids 1 2 3 --audio_dir . --dest_dir . -ml 48000
python av_speech_enhancement.py audio_preprocessing --data_dir /tf/data/GRID/MIXED/VALIDATION_SET --speaker_ids 1 2 3 --audio_dir . --dest_dir . -ml 48000
python av_speech_enhancement.py audio_preprocessing --data_dir /tf/data/GRID/MIXED/TEST_SET --speaker_ids 1 2 3 --audio_dir . --dest_dir . -ml 48000

# Extract face landmarks from video
python av_speech_enhancement.py video_preprocessing --data_dir /tf/data/GRID --speaker_ids 1 2 3 --video_dir video --dest_dir face_landmark --shape_predictor /tf/data/GRID/shape_predictor_68_face_landmarks.dat --ext mpg

# Computing Target Binary Masks from clean audio
python av_speech_enhancement.py tbm_computation --data_dir /tf/data/GRID --speaker_ids 1 2 3 --audio_dir audio --dest_dir tbm -ml 48000

# Generate TF Records <<FIXED>>
python av_speech_enhancement.py tfrecords_generator --data_dir /tf/data/GRID --num_speakers 2 --dest_dir tfrecords --base_audio_dir audio --video_dir face_landmark --tbm_dir tbm --mix_audio_dir MIXED --norm_data_dir /tf/data/GRID/NORM --mode fixed

# TRAIN <<FIXED>>
python av_speech_enhancement.py training --data_dir /tf/data/GRID --train_set tfrecords/TRAINING_SET --val_set tfrecords/VALIDATION_SET --exp 2 --mode fixed --num_audio_samples 48000 --model vl2m --opt adam --learning_rate 0.005 --batch_size 8 --epochs 10 -nl 1 -nh 1

# TEST <<FIXED>>
python av_speech_enhancement.py testing --data_dir /tf/data/GRID --test_set tfrecords/TEST_SET --exp 3 --num_audio_samples 48000 --mode fixed --output_dir /tf/data/GRID/eval-3 --mask_dir estimated_masks --ckp 10_44

Any advice would be helpful.

Training.py

Hello sir,
While running the training.py file with this command:

av_speech_enhancement.py training --data_dir dataset1 --train_set tfrecords/TRAINING_SET --val_set tfrecords/VALIDATION_SET --exp '0' --mode var --video_dim 136 --audio_dim 257 --num_audio_samples 75000 --model av_concat_mask_ref --opt adam --learning_rate 0.001 --updating_step 50 --learning_decay 0.9 --batch_size 16 --epochs 20 --hidden_units 250 --layers 3 --dropout 1 --regularization 0000e-04

I am getting this error:

Traceback (most recent call last):
File "/home/pradnya/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
return fn(*args)
File "/home/pradnya/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
target_list, run_metadata)
File "/home/pradnya/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: {{function_node __inference_Dataset_map_DataManager.read_data_format_var_143}} Name: , Key: tbm, Index: 0. Number of float values != expected. values size: 513 but output shape: [257]
[[{{node ParseSingleSequenceExample/ParseSingleSequenceExample}}]]
[[validation_batch/IteratorGetNext]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "av_speech_enhancement.py", line 226, in
main()
File "av_speech_enhancement.py", line 216, in main
train(args.model, args.data_dir, args.train_set, args.val_set, config, args.exp, args.mode)
File "/home/pradnya/audio_visual_speech_enhancement-master(1)/audio_visual_speech_enhancement-master/training.py", line 133, in train
val_mixed_audio, val_base_paths, val_other_paths, val_mixed_paths = sess.run(next_val_batch)
File "/home/pradnya/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 956, in run
run_metadata_ptr)
File "/home/pradnya/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
feed_dict_tensor, options, run_metadata)
File "/home/pradnya/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
run_metadata)
File "/home/pradnya/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Name: , Key: tbm, Index: 0. Number of float values != expected. values size: 513 but output shape: [257]
[[{{node ParseSingleSequenceExample/ParseSingleSequenceExample}}]]
[[validation_batch/IteratorGetNext]]

Am I doing anything wrong?
Thank you.

Training has "None values not supported"

I have tried using tensorflow-gpu 1.15.0 and have also converted your training script to use tensorflow-gpu 2.10.0.

My data structure is as follows:

/data
       /TRAINING_SET
           /s2_l_bgwj2n_with_s4_s4_l_pwiv4a.npy
           /s2_l_bgwj2n_with_s4_s4_l_pwiv4a.wav
           ...
       /VALIDATION_SET
       ...
       /TEST_SET
      ...

My TFrecords get produced as expected:

/tf_records
       /TRAINING_SET
           /sample_00000.tfrecords
           /sample_00002.tfrecords
           ...
       /VALIDATION_SET
       ...
       /TEST_SET
      ...

But when it comes to training using VL2M, I get "None values not supported".
Have you come across this before?

little more detail about how to train AV c-ref model

Now I have a question regarding training the AV c-ref model.

  1. What does the -mt option do during training? Should it be 0 (rather than -1) for the vl2m model?
    training_parser.add_argument('-mt', '--mask_threshold', type=float, default=-1,
    help='Threshold on estimated TBM for reconstruction ("vl2m" model only). If -1 (default) thresholding is not applied')

  2. How many epochs did you train vl2m for?
    Since the metrics (the SNR/SDR stuff) of vl2m don't grow as fast as those of the other models,
    I don't know whether it is training well, or how many epochs I should run.

Below are your comments from another issue:

---------------------------------------------------------------------------------------------------
If you want to train the AV c-ref you have to do the following steps (described in the paper in section 3.2):

1. Train the "vl2m" model (adam optimizer with 10^-4 as initial learning rate).
2. Save estimated TBMs of training, validation and test set with "testing" subcommand.
3. Train "av_concat_mask_ref" with oracle TBMs (adam optimizer with 10^-3 as initial learning rate).
4. Generate TFRecords using the estimated TBMs instead of oracle TBMs.
5. Select the best epoch of trained "av_concat_mask_ref" model and fine-tune the model with TFRecords of point 4.

Let me know if you are able to do all steps correctly.
---------------------------------------------------------------------------------------------------

On the issue of model training

Here is my training log.
I ran:

python av_speech_enhancement.py training --data_dir ./data3 --train_set TF/initial/TRAINING_SET --val_set TF/initial/VALIDATION_SET --exp 538 --mode fixed --audio_dim 257 --video_dim 136 --num_audio_samples 48000 --model av_concat_mask --opt adam --learning_rate 0.00001 --updating_step 100 --learning_decay 1.0 --batch_size 8 --epochs 10 --hidden_units 300 --layers 5 --dropout 0.5 --regularization 0

and obtained the following training log:
+-- EXPERIMENT NUMBER - 538 --+

optimizer: adam

number of hidden layers (other): 5

number of hidden units: 300

initial learning rate: 0.001000

regularization: 0.000000

dropout keep probability (no dropout if 1): 0.500000

training size: 96

validation size: 24

batch size: 8

approx number of steps: 120

approx number of steps per epoch: 12

Epoch LR Train[Cost|L2-SPEC|SNR|SDR|SIR|SAR] Val[Cost|L2-SPEC|SNR|SDR|SIR|SAR]
0 [0.001000][31570019.88236|1709225.63460|-13.12816|0.18333|0.55682|17.89409] [1975891.51622|158995.56250|0.00015|-0.51246|-0.03308|14.50995]
1 [0.001000][2636708.38466|225585.16838|-0.00033|0.21334|0.56131|17.34965] [1924917.82028|154274.00000|-0.00150|1.40241|1.53057|19.73535]
2 [0.001000][2140906.74620|152231.50675|-0.12245|0.50816|0.61324|21.01079] [1127515.98006|89183.06250|-0.26976|2.36294|2.43035|23.03181]
3 [0.001000][1722309.84827|103804.95866|-0.22557|0.25222|0.29530|23.95723] [880734.01623|75290.49219|-0.20167|0.26640|0.31491|23.15888]
4 [0.001000][1519283.73804|87616.62866|-0.16919|0.23113|0.28726|22.74769] [777480.35918|71305.63281|-0.16929|0.54480|0.59922|22.95043]
5 [0.001000][1443420.57966|86095.03585|-0.17768|0.32173|0.36417|24.09551] [767667.54653|71888.60156|-0.18584|0.81632|0.85498|24.58647]
6 [0.001000][1400788.72074|85683.11458|-0.20549|0.32629|0.37772|23.44259] [727436.30309|68974.75781|-0.23568|1.59682|1.64908|23.65309]
7 [0.001000][1373982.10674|85067.37577|-0.21270|0.32271|0.36764|23.89874] [717339.86095|68491.68750|-0.26695|1.11602|1.17088|23.10323]
8 [0.001000][1328416.01693|81485.82133|-0.26452|0.32197|0.37998|22.72361] [708650.96137|68050.54688|-0.29006|0.68897|0.74409|22.81329]
9 [0.001000][1302718.35020|80514.23511|-0.26054|0.31579|0.36549|23.40774] [729637.53720|69496.58594|-0.21461|0.96021|1.00348|24.11726]
10 [0.000100][1281801.58735|82254.72701|-0.22952|0.33012|0.37483|23.95600] [708477.12235|68036.82812|-0.27129|1.03879|1.08591|23.76498]
11 [0.000100][1275627.79961|79284.57194|-0.29610|0.32261|0.36980|23.63214] [698638.96618|67444.65625|-0.32302|0.84844|0.89582|23.60907]
12 [0.000100][1267918.67399|79293.26168|-0.29410|0.33657|0.38365|23.64460] [701850.88089|67644.44531|-0.30031|0.90971|0.95642|23.71244]
13 [0.000100][1269136.25578|79197.69096|-0.29316|0.33315|0.37936|23.71920] [701300.80841|67630.75000|-0.30473|0.84122|0.88648|23.81616]
14 [0.000100][1260134.10445|79627.33919|-0.27986|0.33905|0.38384|23.87830] [703340.61619|67773.17969|-0.29527|0.81707|0.86083|23.96222]
15 [0.000100][1261865.75887|79141.21179|-0.29393|0.33404|0.37916|23.84493] [699860.65311|67549.57812|-0.30905|0.83365|0.87842|23.87263]
16 [0.000100][1252664.10120|79255.17277|-0.29193|0.33842|0.38439|23.75614] [699368.99738|67490.92969|-0.30399|0.93634|0.98177|23.86592]
17 [0.000100][1253516.99967|79154.08610|-0.29474|0.33794|0.38333|23.81782] [697506.26763|67394.03906|-0.31475|0.89868|0.94387|23.85658]
18 [0.000100][1246361.52609|79118.05432|-0.29539|0.34490|0.38981|23.91663] [696388.32105|67310.22656|-0.31471|0.94217|0.98734|23.88559]
19 [0.000100][1240519.73664|78834.66744|-0.30179|0.34777|0.39270|23.88205] [694920.96278|67218.67969|-0.32202|0.91532|0.95951|23.97451]
20 [0.000100][1246233.74705|78866.22001|-0.30074|0.34691|0.39048|24.02649] [694796.32193|67250.82031|-0.32852|0.84411|0.88889|23.84819]
21 [0.000100][1233846.03362|78229.35763|-0.31758|0.34438|0.38991|23.75011] [692615.70580|67100.71875|-0.33782|0.86155|0.90657|23.83718]
22 [0.000010][1233310.14358|78084.23238|-0.32119|0.34684|0.39194|23.81030] [692574.29280|67102.21094|-0.33903|0.84609|0.89084|23.85644]
23 [0.000010][1237074.54296|78170.68587|-0.31848|0.34786|0.39243|23.87702] [692993.30367|67117.94531|-0.33470|0.86234|0.90634|23.95278]
24 [0.000010][1232479.62149|78292.84021|-0.31485|0.34975|0.39376|23.95652] [693375.77422|67135.53906|-0.33133|0.86902|0.91236|24.03470]

After many epochs, the loss is still very large (about one million) and the SDR numbers are not very good, so the model is unable to separate speech. May I ask whether my training parameters are incorrect, and whether such a large loss is normal? How should I train?
If the author could help, I would be very grateful.

Can't Understand the directory structure

I am very new to neural networks and Python programming in general.
I cloned the repository and created the audio and video files.
I do not understand how to run the following command and substitute its parameters:
av_speech_enhancement.py mixed_speech_generator --data_dir <data_dir> --base_speaker_ids <spk1> <spk2> <...> [--noisy_speaker_ids <spk1> <spk2> <...>] --audio_dir <audio_dir> --dest_dir <dest_dir> --num_samples <num_samples> --num_mix <num_mix> --num_mix_speakers <num_mix_speakers> {1,2}

I tried the following way, with no luck:
python3 av_speech_enhancement.py mixed_speech_generator --data_dir data --base_speaker_ids S1 S2 S3 S4 --audio_dir data/S1/audio --dest_dir dest --num_samples 4 --num_mix 4 --num_mix_speakers 4 {1,2}

Please try to help me and give me some hints to solve the issue.

Thanks in advance

TFRecords generation

The last argument is norm_data_dir, but where do we get it?
--norm_data_dir <normalization_data_dir>

I got errors like this: FileNotFoundError: [Errno 2] No such file or directory: 'norm/bbal5n_with_s2_bbap7s.npy_video_mean.npy'

But I don't see where we create those files beforehand.

Parameter setting

Hi @dr-pato
I have some questions about training the model.
Could you please show me what parameter values would be suitable for training:
--hidden_units ('Number of units of BLSTM cells')
and
--layers ('Number of stacked BLSTM cells')
for the 'vl2m' or 'av_concat_mask_ref' models?
Thank you so much.
Best,
Yuyue.

Training function arguments problem

Hello sir,
I am executing this code in Google Colab.
I am referring to the following functions and parameters:

config = Configuration(args.learning_rate, args.updating_step, args.learning_decay, args.dropout, args.batch_size,args.opt, args.video_dim, args.audio_dim, args.num_audio_samples, args.epochs, args.hidden_units,args.layers, args.regularization, args.mask_threshold)

train(args.model, args.data_dir, args.train_set, args.val_set, config, args.exp, args.mode)

I passed the following args to these functions:

config = Configuration(0.5, 10000, 1.0, 1, 1,'adam',136, 257, 100, 60, 250,5, 0, -1)
train('vl2m', '/content/drive/My Drive/data_dir', 'tfr/TRAINING_SET', 'tfr/VALIDATION_SET', 'config',' 0', 'fixed')

The problem:
-> train_dataset, train_it = train_data_manager.get_iterator(train_dataset, batch_size=config.batch_size, n_epochs=config.num_epochs, train=True)

AttributeError: 'str' object has no attribute 'batch_size'

I also faced the same errors in the DataManager function for
contrib.audio_dim, contrib.video_dim and num_audio_samples.

Please help me solve this issue.

Input format

Could you please show me what format I should use after "--base_speaker_ids" for a speaker list like (s1 s2 s3 s4 s5 ...)?
I have tried:
--base_speaker_ids s1 s2 s3 s4
--base_speaker_ids s1+s2+ s3+s4
--base_speaker_ids s1&s2&s3&s4
--base_speaker_ids (s1 s2 s3 s3 s4)
--base_speaker_ids s1,s2,s3,s4
But it doesn't work.
So what should I input?
Thank you so much.

Parameter passing confusion

Thank you sir for replying.

Can you please help me with parameter passing for the following arguments:

for sample_rate and max_wav_length in the spectrogram function,
for the mix tracks function,
and for the Configuration parameter values.

Please give me an example of the parameters, because I think I passed wrong values while running these functions, which generated such errors while training the model.

I have passed the following parameters; I think I did something wrong:
create_mixed_tracks_data('/content/drive/My Drive/project/data1', [1,2],[1,2],'audio','mix/test', 2, 2, 1)

save_spectrograms('/content/drive/My Drive/project/data1', [1,2], 'audio', ' audio',16e3, 48000)

config = Configuration(0.001, 50, 0.9, 1,16,'adam', 136,257, 50653, 20 , 250,3,1.0000e-04, -1)
train('vl2m', '/content/drive/My Drive/project_data_final', 'tfrecords/TRAINING_SET','tfrecords/VALIDATION_SET', config, '0', 'fixed')

I am confused about how these map to the argument-passing calls:

create_mixed_tracks_data(args.data_dir, args.base_speaker_ids, args.noisy_speaker_ids, args.audio_dir,args.dest_dir, args.num_samples, args.num_mix, args.num_mix_speakers)

save_spectrograms(args.data_dir, args.speaker_ids, args.audio_dir, args.dest_dir,args.sample_rate, args.max_wav_length)

config = Configuration(args.learning_rate, args.updating_step, args.learning_decay, args.dropout, args.batch_size, args.opt, args.video_dim, args.audio_dim, args.num_audio_samples, args.epochs, args.hidden_units,args.layers, args.regularization, args.mask_threshold)

train(args.model, args.data_dir, args.train_set, args.val_set, config, args.exp, args.mode)

Originally posted by @malineha in #12 (comment)

Training: Values size X by output shape Y

Running the training command:

python3 av_speech_enhancement.py training --data_dir data/tf_records/ --train_set TRAINING_SET --val_set VALIDATION_SET --exp 1 --mode fixed -ns 48000 --model vl2m --opt adam -lr 0.005 -nl 1 -nh 1 -d 1

This gives me the error below:

 Traceback (most recent call last):
  File "av_speech_enhancement.py", line 227, in <module>
    main()
  File "av_speech_enhancement.py", line 217, in main
    train(args.model, args.data_dir, args.train_set, args.val_set, config, args.exp, args.mode)
  File "/home/git/audio_visual_speech_enhancement/training.py", line 130, in train
    val_mixed_audio, val_base_paths, val_other_paths, val_mixed_paths = sess.run(next_val_batch)
  File "/home/git/audio_visual_speech_enhancement/env/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/home/git/audio_visual_speech_enhancement/env/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/git/audio_visual_speech_enhancement/env/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/home/git/audio_visual_speech_enhancement/env/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError:  Name: , Key: base_audio_wav, Index: 0.  Number of float values != expected.  values size: 29440 but output shape: [48000]
         [[{{node ParseSingleSequenceExample/ParseSingleSequenceExample}}]]
         [[validation_batch/IteratorGetNext]]

Am I doing something wrong?

I have already created the TF records ready for training.

Am I right in assuming that num_audio_samples is the length of the output audio when training?

Value error when trying to cut audio

I have found an issue while using the GRID dataset. The audio/video files vary in length. In the audio_preprocessing script, the 'max_wav_length' option seems to fail when setting the desired length of the audio.

So for example, if I run it with 42000 as the max wav length, I get this error:

audio_samples[i, n_fft//2: len(samples) + n_fft//2] = samples
ValueError: could not broadcast input array from shape (42240) into shape (42000)

I'm assuming that it should cut the audio down to 42000 samples? Am I correct in thinking this?

Training/Testing/Validation Set Split

Hello sir,
in the face landmark research paper,
the resulting dataset was split into disjoint sets of 25/4/4 speakers for training/validation/testing respectively.

But in my case, I am working on a dataset of only 2 speakers,
in which each speaker has 1000 records (1000 audio, 1000 video).

How can I split these records into training/validation/test sets?
I am confused, please help me.
I am also confused about how to split the mixed speech into training/validation/test sets for this dataset.
Please help me.

Question about training vl2m with fixed TFRecord type

Hello, thanks for your work. I'm training the vl2m model using the GRID dataset. I set the TFRecord type to 'fixed', num_audio_samples=48000 and batch_size=10. But there is an error when I start training:

tensorflow.python.framework.errors_impl.InvalidArgumentError: Name: , Key: base_audio_wav, Index: 0. Number of float values != expected. values size: 19200 but output shape: [48000]

I tried to change num_audio_samples but it didn't work. Does 19200 mean the length of the 0th wav, while 48000 means the expected number of audio samples?
Hope for your reply.

Error from training script

Hi, thanks for your nice & clean repo, dr. pato.
I had problems while running the training script, which says:

Traceback (most recent call last):
  File "av_speech_enhancement.py", line 227, in <module>
    main()
  File "av_speech_enhancement.py", line 217, in main
    train(args.model, args.data_dir, args.train_set, args.val_set, config, args.exp, args.mode)
  File "/tf/audio_visual_speech_enhancement/training.py", line 130, in train
    val_mixed_audio, val_base_paths, val_other_paths, val_mixed_paths = sess.run(next_val_batch)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 950, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1173, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1350, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1370, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Name: , Feature list 'base_audio_path' is required but could not be found.  Did you mean to include it in feature_list_dense_missing_assumed_empty or feature_list_dense_defaults?
         [[{{node ParseSingleSequenceExample/ParseSingleSequenceExample}}]]
         [[validation_batch/IteratorGetNext]]

While searching for this error, I found some discussion implying that the problem was with the TFRecords.
However, when I ran TFRecord inspection code, I found that base_audio_path was among the TFRecord keys
(printed keys: dict_keys(['base_audio_path', 'mix_audio_path', 'mix_audio_wav', 'base_audio_wav', 'sequence_length', 'other_audio_wav', 'other_audio_path'])).
Do you have any idea?

FYI,
Following is my training script

python av_speech_enhancement.py training --data_dir /tf/data/GRID --train_set tfrecords/TRAINING_SET --val_set tfrecords/VALIDATION_SET --exp 1 --mode var --num_audio_samples 48000 --model vl2m --opt adam --learning_rate 0.005 --batch_size 4 --epochs 10 -nl 1 -nh 1

Following is my dataset directory structure

|-- MIXED
|   |-- TEST_SET
|   |-- TRAINING_SET
|   `-- VALIDATION_SET
|-- check_tfrecords.py
|-- logs
|   |-- checkpoints
|   |-- tensorboard
|   `-- training_logs
|-- s1
|   |-- audio
|   |-- face_landmark
|   |-- tbm
|   `-- video
|-- s2
|   |-- audio
|   |-- face_landmark
|   |-- tbm
|   `-- video
|-- s3
|   |-- audio
|   |-- face_landmark
|   |-- tbm
|   `-- video
|-- shape_predictor_68_face_landmarks.dat
`-- tfrecords
    |-- TEST_SET
    |-- TRAINING_SET
    |-- VALIDATION_SET
    `-- logs
