facebookresearch / visualvoice
Audio-Visual Speech Separation with Cross-Modal Consistency
License: Other
I've trained the model for up to 200 thousand epochs, but SDR performance is only 7.4 on the unseen_unheard test set. I wonder how long the model was trained for the paper. Due to my limited number of GPUs, I can't use the configuration from the paper, so any advice would help me.
My training config is below:

```shell
--gpu_ids 0,1 \
--batchSize 10 \
--nThreads 16 \
--decay_factor 0.5 \
--num_batch 400000 \
--lr_steps 40000 80000 120000 160000 200000 \
--coseparation_loss_weight 0.01 \
--mixandseparate_loss_weight 1 \
--crossmodal_loss_weight 0.01 \
--lr_lipreading 0.0001 \
--lr_facial_attributes 0.00001 \
--lr_unet 0.0001 \
--lr_vocal_attributes 0.00001 \
--
```
Dear Dr. Gao:
Hi!
Thank you for your excellent work on the paper "VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency". I would like to ask about the experimental section. Do you have the code for Audio-Only [79] in Table 1? When I removed the visual information from the VisualVoice code, audio-only separation could not reach the Audio-Only [79] result in Table 1. If you have the Audio-Only [79] code, could you send it to me so I can study it? I would appreciate any help, thanks!
Hello, how can I generate the pre-trained cross-modal matching models facial.pth and vocal.pth?
Hello, I have been studying the effect of net_vocal_attributes in the overall model framework.
Currently, for embeddings extracted from the predicted sounds, the distance between a negative pair (audio_embedding_A1_pred and audio_embedding_B1_pred) can reach 2, and the distance between a positive pair (audio_embedding_A1_pred and audio_embedding_A2_pred) can reach about 0.
But after I changed the input of net_vocal to pure ground-truth sound, the distance between negative pairs (audio_embedding_A1_gt and audio_embedding_B_gt) only reaches 1. In other words, the voice feature extraction is poor when I train net_vocal alone.
It stands to reason that features should be easier to extract from clean ground-truth voices than from predicted ones. I modified the training parameters (batch size, learning rate, etc.), but none of that solved the problem. May I ask what the reason is?
Looking forward to your reply!
I assumed the numbers of files in mp4, audio, and mouth_roi for the training set would be identical.
But when I checked, there are 1092003 videos in mp4/train and audio/train, while 84730 of them are missing from mouth_roi/train: there is no .h5 file of the same name, for example:
video_path: mp4/train/id04262/96JSsr9Q00k/00009.mp4
mouthroi_path: mouth_roi/train/id04262/96JSsr9Q00k/00009.h5
audio_path: audio/train/id04262/96JSsr9Q00k/00009.wav
video_path: mp4/train/id04262/PX8fGdzDlEs/00011.mp4
mouthroi_path: mouth_roi/train/id04262/PX8fGdzDlEs/00011.h5
audio_path: audio/train/id04262/PX8fGdzDlEs/00011.wav
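For anyone hitting the same mismatch, a small script along these lines can list the videos whose mouth-ROI file is missing (a sketch assuming the directory layout shown above; adjust the glob pattern if your tree differs):

```python
import glob
import os

def missing_mouthrois(mp4_root="mp4/train", roi_root="mouth_roi/train"):
    """List videos that have no matching mouth-ROI .h5 file.

    Assumes the mp4/train/<speaker>/<clip>/<segment>.mp4 layout shown above.
    """
    missing = []
    for video_path in glob.glob(os.path.join(mp4_root, "*", "*", "*.mp4")):
        rel = os.path.relpath(video_path, mp4_root)
        roi_path = os.path.join(roi_root, os.path.splitext(rel)[0] + ".h5")
        if not os.path.exists(roi_path):
            missing.append(video_path)
    return missing
```

Filtering the training list down to the videos this returns nothing for avoids crashes in the data loader.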
Hi, thank you for releasing the code. I wonder why you set the value of num_classes to 500? I read the code, the paper, and the supplementary file carefully, but I couldn't figure it out. Please advise. Thank you so much!
Nice work and impressive results. In the ablation study of your paper, I saw some variants of your model that take only the static face (identity features) or only the lip motion features as the visual signal.
I am interested in those ablation models, and I wonder whether I can ask for the pre-trained weights of the static-face version. I assume the weights you currently provide are for the full model; if I change the configuration to use only the identity feature as input, the current U-Net weights do not support that input.
When I tried to zero out the lip motion feature and run a demo video, i.e. `visual_feature = torch.cat((identity_feature, lipreading_feature * 0), dim=1)`, it did not work.
Hello, could you please explain the meaning of the weights here?
This coefficient is not mentioned in the paper, and I noticed that test.py does not compute it.

```python
# calculate loss weighting coefficient
if self.opt.weighted_loss:
    weight1 = torch.log1p(torch.norm(audio_mix_spec1[:,:,:-1,:], p=2, dim=1)).unsqueeze(1).repeat(1,2,1,1)
    weight1 = torch.clamp(weight1, 1e-3, 10)
    weight2 = torch.log1p(torch.norm(audio_mix_spec2[:,:,:-1,:], p=2, dim=1)).unsqueeze(1).repeat(1,2,1,1)
    weight2 = torch.clamp(weight2, 1e-3, 10)
else:
    weight1 = None
    weight2 = None
```
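For what it's worth, the coefficient weights each time-frequency bin of the loss by the log-compressed magnitude of the mixture spectrogram, so loud bins count more and near-silent bins are floored at 1e-3. A minimal NumPy sketch of the same computation (shapes simplified; the real code also drops the last frequency bin and repeats the weight across the real/imaginary channels):

```python
import numpy as np

# Toy "spectrogram" shaped (batch, 2, freq, time): the channel dim holds
# real and imaginary parts, mirroring audio_mix_spec in the code above.
spec = np.zeros((1, 2, 4, 3))
spec[0, :, 0, 0] = [3.0, 4.0]  # one loud bin with magnitude 5

magnitude = np.linalg.norm(spec, ord=2, axis=1)    # L2 over real/imag = magnitude
weight = np.clip(np.log1p(magnitude), 1e-3, 10.0)  # log-compress, then clamp

print(weight[0, 0, 0])  # log1p(5), roughly 1.79, for the loud bin
print(weight[0, 1, 1])  # 0.001 floor for the silent bins
```

Since it only rescales the training loss, it would not be needed at test time, which matches the observation about test.py.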
Hi, thanks for your great work.
Could you please release the face landmarks for VoxCeleb2? The mouth ROI files are too large to download. If landmarks were provided, we could crop the mouth area on our own machines and save a lot of time.
Thank you!
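If landmarks were released, cropping could indeed be done locally. As a sketch of what that might look like, assuming 68-point iBUG/dlib-style landmarks where indices 48-67 outline the mouth (the repo's actual preprocessing may differ):

```python
import numpy as np

def mouth_bbox(landmarks, margin=12):
    """Bounding box around the mouth from 68-point landmarks.

    landmarks: (68, 2) array of (x, y); points 48-67 are the mouth in the
    iBUG convention. margin: extra context in pixels (a guess, tune as needed).
    """
    mouth = landmarks[48:68]
    x0, y0 = mouth.min(axis=0) - margin
    x1, y1 = mouth.max(axis=0) + margin
    return int(x0), int(y0), int(x1), int(y1)

# The crop frame[y0:y1, x0:x1] would then be resized to the model's input size.
```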
When I ran inference on the test demo video, I got the following error, which I traced to `audioVisual_feature = torch.cat((visual_feat, audio_conv8feature), dim=1)`:

```
RuntimeError: Given groups=1, weight of size [512, 1152, 3, 3], expected input[1, 1792, 2, 64] to have 1152 channels, but got 1792 channels instead
```
Hi @rhgao,
I'm trying to train the model on a new dataset, but I'm wondering what the structure of the h5 files is.
Thanks a lot!
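Until the author answers, you can inspect one of the provided .h5 files directly to see what keys and shapes the loader expects. A quick sketch with h5py (I don't know the exact dataset names; this just prints whatever is inside):

```python
import h5py

def show_h5(path):
    """Print every entry in an HDF5 file with its shape and dtype."""
    entries = []
    with h5py.File(path, "r") as f:
        def visit(name, obj):
            shape = getattr(obj, "shape", None)   # groups have no shape
            dtype = getattr(obj, "dtype", None)
            entries.append((name, shape, dtype))
            print(name, shape, dtype)
        f.visititems(visit)
    return entries

# e.g. show_h5("mouth_roi/train/id04262/96JSsr9Q00k/00009.h5")
```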
Hello, I recently wanted to try running your code, but after finding and downloading the VoxCeleb2 dataset, it comes as one whole archive. How should I match the dataset against the preprocessed mouth ROIs you provide?
Thank you very much for your excellent work.
One thing I am confused about is the definition of the crossmodal and coseparation loss functions. In train.py, why are a random number and opt.gt_percentage used to select which audio feature (audio_embedding_A1_pred or audio_embedding_A1_gt) is used? According to the method in the paper, shouldn't the predicted features be used?
```python
def get_coseparation_loss(output, opt, loss_triplet):
    if random.random() > opt.gt_percentage:
        audio_embeddings_A1 = output['audio_embedding_A1_pred']
        audio_embeddings_A2 = output['audio_embedding_A2_pred']
        audio_embeddings_B1 = output['audio_embedding_B1_pred']
        audio_embeddings_B2 = output['audio_embedding_B2_pred']
    else:
        audio_embeddings_A1 = output['audio_embedding_A1_gt']
        audio_embeddings_A2 = output['audio_embedding_A2_gt']
        audio_embeddings_B1 = output['audio_embedding_B_gt']
        audio_embeddings_B2 = output['audio_embedding_B_gt']
    coseparation_loss = loss_triplet(audio_embeddings_A1, audio_embeddings_A2, audio_embeddings_B1) + loss_triplet(audio_embeddings_A1, audio_embeddings_A2, audio_embeddings_B2)
    return coseparation_loss

def get_crossmodal_loss(output, opt, loss_triplet):
    identity_feature_A = output['identity_feature_A']
    identity_feature_B = output['identity_feature_B']
    if random.random() > opt.gt_percentage:
        audio_embeddings_A1 = output['audio_embedding_A1_pred']
        audio_embeddings_A2 = output['audio_embedding_A2_pred']
        audio_embeddings_B1 = output['audio_embedding_B1_pred']
        audio_embeddings_B2 = output['audio_embedding_B2_pred']
    else:
        audio_embeddings_A1 = output['audio_embedding_A1_gt']
        audio_embeddings_A2 = output['audio_embedding_A2_gt']
        audio_embeddings_B1 = output['audio_embedding_B_gt']
        audio_embeddings_B2 = output['audio_embedding_B_gt']
    crossmodal_loss = loss_triplet(audio_embeddings_A1, identity_feature_A, identity_feature_B) + loss_triplet(audio_embeddings_A2, identity_feature_A, identity_feature_B) + loss_triplet(audio_embeddings_B1, identity_feature_B, identity_feature_A) + loss_triplet(audio_embeddings_B2, identity_feature_B, identity_feature_A)
    return crossmodal_loss
```
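On what the two losses compute: both are margin triplet losses that pull the anchor toward the positive embedding and push it away from the negative one. A toy NumPy sketch of that idea (assuming L2 distance and margin 1.0; the repo may use torch.nn.TripletMarginLoss with different settings):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge triplet loss on L2 distances: max(d(a,p) - d(a,n) + margin, 0)."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(d_pos - d_neg + margin, 0.0)

# Hypothetical 2-D embeddings for illustration only.
a1 = np.array([1.0, 0.0])   # speaker A, segment 1 (anchor)
a2 = np.array([0.9, 0.1])   # speaker A, segment 2 (positive)
b1 = np.array([-1.0, 0.0])  # speaker B (negative)

print(triplet_loss(a1, a2, b1))  # 0.0 -- already separated by more than the margin
print(triplet_loss(a1, b1, a2))  # positive -- swapped roles violate the margin
```

The random `gt_percentage` switch simply decides, per batch, whether the triplets are formed from predicted or ground-truth embeddings; only the author can confirm the intent behind mixing the two.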
Hello, I would like to ask a question.
Regarding the mouth data in the dataset, it is stored as an h5 file.
Could you please explain how it was generated? Is there a pre-trained model available?
If I want to replace VoxCeleb2 with a different dataset, how can I generate the mouth h5 files?
Looking forward to your answer! Thank you very much!!
I recently tried av-enhancement and found that the provided pretrained models include only the classifier, identity, unet, and lipreading_best.pth models.
I could not find the vocal_best.pth and facial_best.pth pretrained models, so I tried using the ones from the original repository, but the result was not as good as what the demo video shows.
Could you please add both pretrained models, or tell me how to solve this problem?
Thank you so much for your help.
Hi @rhgao,
Thank you for releasing the code. When I run `tar zxvf mouth_roi_train.tar.gz`, it shows:

```
tar: Skipping to next header
tar: A lone zero block at 3218058299
tar: Exiting with failure status due to previous errors
```

The size of the file is 1155287823247 bytes. Is the file broken? Thanks.
Thanks for your great work.
I was curious about the num_frames parameter. Why do we only take 64 frames of mouth ROI for 2.55 seconds? Are the last 10 frames discarded? I can't figure it out. Thanks again.
The mouth_roi dataset is split into four directories: train, val, seen, and unseen, but the raw audio and mp4 datasets only contain train and test directories. How do I extract the seen and unseen splits from the raw dataset? Thanks very much.
```
RuntimeError: Given groups=1, weight of size [512, 1152, 3, 3], expected input[1, 1792, 2, 64] to have 1152 channels, but got 1792 channels instead
```
Hello!
Thanks for sharing this code with us.
When testing your two-speaker speech separation pre-trained models, I found that performance deteriorates when extracting a specific single speaker. Only when I feed both speakers' mouth RoIs and faces into the model at the same time do I get a satisfactory separation result. I think this deterioration is caused by the separation model, not the enhancement model.
In a real scene, the number of speakers is unknown, and we need to extract only one specific person. Could you provide a speech enhancement model for testing, such as the model structure or pre-trained weights?
We would appreciate it if you could.
Thanks again for your contribution.
Hi,
After I downloaded mouth_roi_train.tar.gz, I ran into a problem while decompressing it. I'm not sure whether it is a data problem or an operation problem; the other files decompress correctly. Could you please take a look? Thank you.

```
$ tar -xf mouth_roi_train.tar.gz
gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Error is not recoverable: exiting now
$ tar -zxf mouth_roi_train.tar.gz
gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Error is not recoverable: exiting now
$ ll -h
total 1.1T
-rw-rw-r-- 1 **** **** 1.1T Jan 20 11:59 mouth_roi_train.tar.gz
$ file mouth_roi_train.tar.gz
mouth_roi_train.tar.gz: data
```
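`file` reporting plain "data" means the archive starts with neither a gzip header nor a tar header, which suggests either a corrupted/incomplete download or an uncompressed tar with a misleading .gz name. Before re-downloading 1.1 TB, you can check the magic bytes yourself (a diagnostic sketch using the filename above):

```shell
# gzip files start with the two bytes 1f 8b:
head -c 2 mouth_roi_train.tar.gz | od -An -tx1

# a POSIX/GNU tar archive carries the magic string "ustar" at byte offset 257:
dd if=mouth_roi_train.tar.gz bs=1 skip=257 count=5 2>/dev/null; echo

# if "ustar" shows up, the file is an uncompressed tar despite its name:
tar -xf mouth_roi_train.tar.gz
# otherwise the download is likely truncated or corrupted -- compare the
# size/checksum against the release page before trying to extract again.
```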
Hi, thanks for your great work. How can I generate the pre-trained cross-modal matching models facial.pth and vocal.pth? I want to train facial.pth and vocal.pth on the VoxCeleb1 dataset. Is that possible, and how should I do it?
Hi, thanks for your great work. I am wondering which face alignment algorithm you used and how many landmarks it outputs.
Hi, I downloaded the dataset and all the audio files are in m4a format.
But the code in audioVisual_dataset uses `wavfile.read()` directly.
Does that mean I have to convert the audio files from m4a to wav myself?
Or am I missing something important?
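The VoxCeleb2 download does ship m4a audio, so a conversion pass is needed before `wavfile.read()` will work. A sketch that mirrors each m4a to a wav next to it via ffmpeg (the 16 kHz mono target is my assumption; check what sample rate the dataset code expects):

```python
import os
import subprocess

def wav_path(m4a_path):
    """Map audio/train/.../00009.m4a -> audio/train/.../00009.wav."""
    root, ext = os.path.splitext(m4a_path)
    assert ext == ".m4a", m4a_path
    return root + ".wav"

def convert(m4a_path, sr=16000):
    """Convert one file with ffmpeg (must be on PATH): mono, sr Hz."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", m4a_path, "-ac", "1", "-ar", str(sr), wav_path(m4a_path)],
        check=True,
    )
```

Looping `convert` over a glob of the audio tree preserves the directory structure the loader expects.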
Hello, thanks for your great work.
I've been trying to reproduce the enhancement performance on the VoxCeleb2 test set, but the performance of the given pre-trained model was much lower than in the paper.
(I used evaluateSeparation.py from the main directory to compute the metrics.)
And when I tried test_synthetic_script.sh, the outputs sounded bad to my ears.
The off-screen noise in the mixture (audio_mixed.wav) was much louder than the voice, so I felt the enhancement task would be too difficult for the model.
I have 2 questions regarding this.
Is the model in the av-enhancement directory your best model for speech enhancement, not separation?
Thanks in advance.
I found from the code that you use visual and auditory networks with shared parameters for the visual and auditory features of the two speakers. But I'm not sure my reading is correct, as this doesn't seem to be stated in the paper.
When I want to preprocess the demo videos, I can't find the test_videos directory. Where can I download it?