facebookresearch / visualvoice
Audio-Visual Speech Separation with Cross-Modal Consistency
License: Other
I've trained the model for up to 200 thousand epochs, but SDR performance is only 7.4 on the unseen_unheard test set. I wonder how long the model was trained for the paper. Due to my limited number of GPUs, I can't use the configuration from the paper, so any advice would help me.
My training config is below:

```shell
--gpu_ids 0,1 \
--batchSize 10 \
--nThreads 16 \
--decay_factor 0.5 \
--num_batch 400000 \
--lr_steps 40000 80000 120000 160000 200000 \
--coseparation_loss_weight 0.01 \
--mixandseparate_loss_weight 1 \
--crossmodal_loss_weight 0.01 \
--lr_lipreading 0.0001 \
--lr_facial_attributes 0.00001 \
--lr_unet 0.0001 \
--lr_vocal_attributes 0.00001 \
--
```
Dear Dr. Gao:
Hi!
Thank you for your excellent work on the paper "VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency". I would like to ask about the experimental section. Do you have the code for Audio-Only [79] in Table 1? When I removed the visual information from the VisualVoice code, audio-only separation could not reach the Audio-Only [79] result in Table 1. If you have the Audio-Only [79] code, could you send it to me so I can study it? I would appreciate any help, thanks!
Hello, how can I generate the pre-trained cross-modal matching models facial.pth and vocal.pth?
Hello, I have been studying the effect of net_vocal_attributes in the overall model framework.
Currently, for embeddings extracted from the predicted sounds, the distance between a negative pair (audio_embedding_A1_pred and audio_embedding_B1_pred) can reach 2, and the distance between a positive pair (audio_embedding_A1_pred and audio_embedding_A2_pred) can reach about 0.
But after I changed the input of net_vocal to pure ground-truth sound, the distance between negative pairs (audio_embedding_A1_gt and audio_embedding_B_gt) only reaches 1. In other words, the voice feature extraction is poor when I train net_vocal alone.
It stands to reason that features should be easier to extract from clean ground-truth voices than from predicted ones. I modified the training parameters (batch size, learning rate, etc.), but none of that solved the problem. May I ask what the reason is?
Looking forward to your reply!
I assumed the numbers of files in mp4, audio, and mouth_roi for the training set would be identical.
But when I checked, there are 1092003 videos in mp4/train and audio/train, while 84730 of them are missing from mouth_roi/train: there is no .h5 file of the same name, for example:
video_path: mp4/train/id04262/96JSsr9Q00k/00009.mp4
mouthroi_path: mouth_roi/train/id04262/96JSsr9Q00k/00009.h5
audio_path: audio/train/id04262/96JSsr9Q00k/00009.wav
video_path: mp4/train/id04262/PX8fGdzDlEs/00011.mp4
mouthroi_path: mouth_roi/train/id04262/PX8fGdzDlEs/00011.h5
audio_path: audio/train/id04262/PX8fGdzDlEs/00011.wav
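For anyone hitting the same mismatch, a small script along these lines can list the videos whose mouth-ROI file is missing (a sketch assuming the directory layout shown above; adjust the glob pattern if your tree differs):

```python
import glob
import os

def missing_mouthrois(mp4_root="mp4/train", roi_root="mouth_roi/train"):
    """List videos that have no matching mouth-ROI .h5 file.

    Assumes the mp4/train/<speaker>/<clip>/<segment>.mp4 layout shown above.
    """
    missing = []
    for video_path in glob.glob(os.path.join(mp4_root, "*", "*", "*.mp4")):
        rel = os.path.relpath(video_path, mp4_root)
        roi_path = os.path.join(roi_root, os.path.splitext(rel)[0] + ".h5")
        if not os.path.exists(roi_path):
            missing.append(video_path)
    return missing
```

Filtering the training list down to the videos this returns nothing for avoids crashes in the data loader.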
Hi, thank you for releasing the code. I wonder why you set the value of num_classes to 500? I read the code, the paper, and the supplementary file carefully, but I couldn't figure it out. Please advise. Thank you so much!
Nice work and impressive results. In the ablation study of your paper, I saw some variants of your model that take only the static face (identity features) or only the lip motion features as the visual signal.
I am interested in those ablation models, and I wonder whether I can ask for the pre-trained weights of the static-face version. I assume the weights you currently provide are for the full model; if I change the configuration to use only the identity feature as input, the current U-Net weights do not support that input.
When I tried to zero out the lip motion feature and run a demo video, i.e. `visual_feature = torch.cat((identity_feature, lipreading_feature * 0), dim=1)`, it did not work.
Hello, could you please explain the meaning of the weights here?
This coefficient is not mentioned in the paper, and I noticed that test.py does not compute it.

```python
# calculate loss weighting coefficient
if self.opt.weighted_loss:
    weight1 = torch.log1p(torch.norm(audio_mix_spec1[:,:,:-1,:], p=2, dim=1)).unsqueeze(1).repeat(1,2,1,1)
    weight1 = torch.clamp(weight1, 1e-3, 10)
    weight2 = torch.log1p(torch.norm(audio_mix_spec2[:,:,:-1,:], p=2, dim=1)).unsqueeze(1).repeat(1,2,1,1)
    weight2 = torch.clamp(weight2, 1e-3, 10)
else:
    weight1 = None
    weight2 = None
```
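For what it's worth, the coefficient weights each time-frequency bin of the loss by the log-compressed magnitude of the mixture spectrogram, so loud bins count more and near-silent bins are floored at 1e-3. A minimal NumPy sketch of the same computation (shapes simplified; the real code also drops the last frequency bin and repeats the weight across the real/imaginary channels):

```python
import numpy as np

# Toy "spectrogram" shaped (batch, 2, freq, time): the channel dim holds
# real and imaginary parts, mirroring audio_mix_spec in the code above.
spec = np.zeros((1, 2, 4, 3))
spec[0, :, 0, 0] = [3.0, 4.0]  # one loud bin with magnitude 5

magnitude = np.linalg.norm(spec, ord=2, axis=1)    # L2 over real/imag = magnitude
weight = np.clip(np.log1p(magnitude), 1e-3, 10.0)  # log-compress, then clamp

print(weight[0, 0, 0])  # log1p(5), roughly 1.79, for the loud bin
print(weight[0, 1, 1])  # 0.001 floor for the silent bins
```

Since it only rescales the training loss, it would not be needed at test time, which matches the observation about test.py.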
Hi, thanks for your great work.
Could you please release the face landmarks for VoxCeleb2? The mouth ROI files are too large to download. If landmarks were provided, we could crop the mouth area on our own machines and save a lot of time.
Thank you!
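If landmarks were released, cropping could indeed be done locally. As a sketch of what that might look like, assuming 68-point iBUG/dlib-style landmarks where indices 48-67 outline the mouth (the repo's actual preprocessing may differ):

```python
import numpy as np

def mouth_bbox(landmarks, margin=12):
    """Bounding box around the mouth from 68-point landmarks.

    landmarks: (68, 2) array of (x, y); points 48-67 are the mouth in the
    iBUG convention. margin: extra context in pixels (a guess, tune as needed).
    """
    mouth = landmarks[48:68]
    x0, y0 = mouth.min(axis=0) - margin
    x1, y1 = mouth.max(axis=0) + margin
    return int(x0), int(y0), int(x1), int(y1)

# The crop frame[y0:y1, x0:x1] would then be resized to the model's input size.
```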
When I ran inference on the test demo video, I got the following error, which I traced to `audioVisual_feature = torch.cat((visual_feat, audio_conv8feature), dim=1)`:

```
RuntimeError: Given groups=1, weight of size [512, 1152, 3, 3], expected input[1, 1792, 2, 64] to have 1152 channels, but got 1792 channels instead
```
Hi @rhgao,
I'm trying to train the model on a new dataset, but I'm wondering what the structure of the h5 files is.
Thanks a lot!
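Until the author answers, you can inspect one of the provided .h5 files directly to see what keys and shapes the loader expects. A quick sketch with h5py (I don't know the exact dataset names; this just prints whatever is inside):

```python
import h5py

def show_h5(path):
    """Print every entry in an HDF5 file with its shape and dtype."""
    entries = []
    with h5py.File(path, "r") as f:
        def visit(name, obj):
            shape = getattr(obj, "shape", None)   # groups have no shape
            dtype = getattr(obj, "dtype", None)
            entries.append((name, shape, dtype))
            print(name, shape, dtype)
        f.visititems(visit)
    return entries

# e.g. show_h5("mouth_roi/train/id04262/96JSsr9Q00k/00009.h5")
```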
Hello, I recently wanted to try running your code, but after finding and downloading the VoxCeleb2 dataset, it comes as one whole archive. How should I match the dataset against the preprocessed mouth ROIs you provide?
Thank you very much for your excellent work.
One thing I am confused about is the definition of the crossmodal and coseparation loss functions. In train.py, why are a random number and opt.gt_percentage used to select which audio feature (audio_embedding_A1_pred or audio_embedding_A1_gt) is used? According to the method in the paper, shouldn't the predicted features be used?
```python
def get_coseparation_loss(output, opt, loss_triplet):
    if random.random() > opt.gt_percentage:
        audio_embeddings_A1 = output['audio_embedding_A1_pred']
        audio_embeddings_A2 = output['audio_embedding_A2_pred']
        audio_embeddings_B1 = output['audio_embedding_B1_pred']
        audio_embeddings_B2 = output['audio_embedding_B2_pred']
    else:
        audio_embeddings_A1 = output['audio_embedding_A1_gt']
        audio_embeddings_A2 = output['audio_embedding_A2_gt']
        audio_embeddings_B1 = output['audio_embedding_B_gt']
        audio_embeddings_B2 = output['audio_embedding_B_gt']
    coseparation_loss = loss_triplet(audio_embeddings_A1, audio_embeddings_A2, audio_embeddings_B1) + loss_triplet(audio_embeddings_A1, audio_embeddings_A2, audio_embeddings_B2)
    return coseparation_loss

def get_crossmodal_loss(output, opt, loss_triplet):
    identity_feature_A = output['identity_feature_A']
    identity_feature_B = output['identity_feature_B']
    if random.random() > opt.gt_percentage:
        audio_embeddings_A1 = output['audio_embedding_A1_pred']
        audio_embeddings_A2 = output['audio_embedding_A2_pred']
        audio_embeddings_B1 = output['audio_embedding_B1_pred']
        audio_embeddings_B2 = output['audio_embedding_B2_pred']
    else:
        audio_embeddings_A1 = output['audio_embedding_A1_gt']
        audio_embeddings_A2 = output['audio_embedding_A2_gt']
        audio_embeddings_B1 = output['audio_embedding_B_gt']
        audio_embeddings_B2 = output['audio_embedding_B_gt']
    crossmodal_loss = loss_triplet(audio_embeddings_A1, identity_feature_A, identity_feature_B) + loss_triplet(audio_embeddings_A2, identity_feature_A, identity_feature_B) + loss_triplet(audio_embeddings_B1, identity_feature_B, identity_feature_A) + loss_triplet(audio_embeddings_B2, identity_feature_B, identity_feature_A)
    return crossmodal_loss
```
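On what the two losses compute: both are margin triplet losses that pull the anchor toward the positive embedding and push it away from the negative one. A toy NumPy sketch of that idea (assuming L2 distance and margin 1.0; the repo may use torch.nn.TripletMarginLoss with different settings):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge triplet loss on L2 distances: max(d(a,p) - d(a,n) + margin, 0)."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(d_pos - d_neg + margin, 0.0)

# Hypothetical 2-D embeddings for illustration only.
a1 = np.array([1.0, 0.0])   # speaker A, segment 1 (anchor)
a2 = np.array([0.9, 0.1])   # speaker A, segment 2 (positive)
b1 = np.array([-1.0, 0.0])  # speaker B (negative)

print(triplet_loss(a1, a2, b1))  # 0.0 -- already separated by more than the margin
print(triplet_loss(a1, b1, a2))  # positive -- swapped roles violate the margin
```

The random `gt_percentage` switch simply decides, per batch, whether the triplets are formed from predicted or ground-truth embeddings; only the author can confirm the intent behind mixing the two.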
Hello, I would like to ask a question.
Regarding the mouth data in the dataset, it is stored as an h5 file.
Could you please explain how it was generated? Is there a pre-trained model available?
If I want to replace VoxCeleb2 with a different dataset, how can I generate the mouth h5 files?
Looking forward to your answer! Thank you very much!!
I recently tried av-enhancement and found that the provided pretrained models include only the classifier, identity, unet, and lipreading_best.pth models.
I could not find the vocal_best.pth and facial_best.pth pretrained models, so I tried using the ones from the original repository, but the result was not as good as what the demo video shows.
Could you please add both pretrained models, or tell me how to solve this problem?
Thank you so much for your help.
Hi @rhgao,
Thank you for releasing the code. When I run `tar zxvf mouth_roi_train.tar.gz`, it shows:

```
tar: Skipping to next header
tar: A lone zero block at 3218058299
tar: Exiting with failure status due to previous errors
```

The size of the file is 1155287823247 bytes. Is the file broken? Thanks.
Thanks for your great work.
I was curious about the num_frames parameter. Why do we only take 64 frames of mouth ROI for 2.55 seconds? Are the last 10 frames discarded? I can't figure it out. Thanks again.
The mouth_roi dataset is split into four directories: train, val, seen, and unseen, but the raw audio and mp4 datasets only contain train and test directories. How do I extract the seen and unseen splits from the raw dataset? Thanks very much.
```
RuntimeError: Given groups=1, weight of size [512, 1152, 3, 3], expected input[1, 1792, 2, 64] to have 1152 channels, but got 1792 channels instead
```
Hello!
Thanks for sharing this code with us.
When testing your two-speaker speech separation pre-trained models, I found that performance deteriorates when extracting a specific single speaker. Only when I feed both speakers' mouth RoIs and faces into the model at the same time do I get a satisfactory separation result. I think this deterioration is caused by the separation model, not the enhancement model.
In a real scene, the number of speakers is unknown, and we need to extract only one specific person. Could you provide a speech enhancement model for testing, such as the model structure or pre-trained weights?
We would appreciate it if you could.
Thanks again for your contribution.
Hi,
After I downloaded mouth_roi_train.tar.gz, I ran into a problem while decompressing it. I'm not sure whether it is a data problem or an operation problem; the other files decompress correctly. Could you please take a look? Thank you.

```
$ tar -xf mouth_roi_train.tar.gz
gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Error is not recoverable: exiting now
$ tar -zxf mouth_roi_train.tar.gz
gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Error is not recoverable: exiting now
$ ll -h
total 1.1T
-rw-rw-r-- 1 **** **** 1.1T Jan 20 11:59 mouth_roi_train.tar.gz
$ file mouth_roi_train.tar.gz
mouth_roi_train.tar.gz: data
```
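`file` reporting plain "data" means the archive starts with neither a gzip header nor a tar header, which suggests either a corrupted/incomplete download or an uncompressed tar with a misleading .gz name. Before re-downloading 1.1 TB, you can check the magic bytes yourself (a diagnostic sketch using the filename above):

```shell
# gzip files start with the two bytes 1f 8b:
head -c 2 mouth_roi_train.tar.gz | od -An -tx1

# a POSIX/GNU tar archive carries the magic string "ustar" at byte offset 257:
dd if=mouth_roi_train.tar.gz bs=1 skip=257 count=5 2>/dev/null; echo

# if "ustar" shows up, the file is an uncompressed tar despite its name:
tar -xf mouth_roi_train.tar.gz
# otherwise the download is likely truncated or corrupted -- compare the
# size/checksum against the release page before trying to extract again.
```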
Hi, thanks for your great work. How can I generate the pre-trained cross-modal matching models facial.pth and vocal.pth? I want to train facial.pth and vocal.pth on the VoxCeleb1 dataset. Is that possible, and how should I do it?
Hi, thanks for your great work. I am wondering which face alignment algorithm you used and how many landmarks it outputs.
Hi, I downloaded the dataset and all the audio files are in m4a format.
But the code in audioVisual_dataset uses `wavfile.read()` directly.
Does that mean I have to convert the audio files from m4a to wav myself?
Or am I missing something important?
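The VoxCeleb2 download does ship m4a audio, so a conversion pass is needed before `wavfile.read()` will work. A sketch that mirrors each m4a to a wav next to it via ffmpeg (the 16 kHz mono target is my assumption; check what sample rate the dataset code expects):

```python
import os
import subprocess

def wav_path(m4a_path):
    """Map audio/train/.../00009.m4a -> audio/train/.../00009.wav."""
    root, ext = os.path.splitext(m4a_path)
    assert ext == ".m4a", m4a_path
    return root + ".wav"

def convert(m4a_path, sr=16000):
    """Convert one file with ffmpeg (must be on PATH): mono, sr Hz."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", m4a_path, "-ac", "1", "-ar", str(sr), wav_path(m4a_path)],
        check=True,
    )
```

Looping `convert` over a glob of the audio tree preserves the directory structure the loader expects.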
Hello, thanks for your great work.
I've been trying to reproduce the enhancement performance on the VoxCeleb2 test set, but the performance of the given pre-trained model was much lower than in the paper.
(I used evaluateSeparation.py from the main directory to compute the metrics.)
And when I tried test_synthetic_script.sh, the outputs sounded bad to my ears.
The off-screen noise in the mixture (audio_mixed.wav) was much louder than the voice, so I felt the enhancement task would be too difficult for the model.
I have 2 questions regarding this.
Is the model in the av-enhancement directory your best model for speech enhancement, not separation?
Thanks in advance.
I found from the code that you use visual and auditory networks with shared parameters for the visual and auditory features of the two speakers. But I'm not sure my reading is correct, as this doesn't seem to be stated in the paper.
When I want to preprocess the demo videos, I can't find the test_videos directory. Where can I download it?