joonson / syncnet_trainer
Disentangled Speech Embeddings using Cross-Modal Self-Supervision
License: MIT License
Hi,
I am looking through this repo and I am confused about the choice of loss function. I am using SyncNet to measure lip-sync error, and since this could be considered a binary classification problem, I am confused as to why CrossEntropyLoss is used rather than Binary Cross Entropy Loss.
Any clarification would be highly appreciated.
Thanks
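For context, one common reading of this design (not confirmed in this thread) is that SyncNet-style training is framed as an N-way matching problem rather than per-pair binary classification: one video feature is scored against N candidate audio segments, and the model must pick the matching one. CrossEntropyLoss covers that N-way case directly, while BCE only covers the two-class case. A minimal sketch under that assumption, with illustrative shapes that are not the repo's exact tensors:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical shapes: one video feature scored against N candidate
# audio features (index 0 taken as the true match for illustration).
N, dim = 5, 512
video_feat = torch.randn(1, dim)
audio_feats = torch.randn(N, dim)

# Similarity of the video feature to each audio candidate -> N logits.
logits = video_feat @ audio_feats.t()   # shape (1, N)
target = torch.tensor([0])              # index of the matching pair

# CrossEntropyLoss handles the N-way matching case;
# BCELoss would only express the special case N == 2.
loss = nn.CrossEntropyLoss()(logits, target)
```

With this formulation, the "negatives" are simply the other N-1 candidates in the same batch or window, which is why no explicit 0/1 labels appear.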
The pre-trained model cannot be downloaded
Hi,
I'm a little confused about the meaning of "offset" in the txt files.
Could anyone please explain what it means?
Thank you.
Hello,
I have a couple of questions regarding the 75.8% synchronization accuracy reported in https://ieeexplore.ieee.org/abstract/document/9067055/
Evaluation protocol (as described in the paper): the task is to determine the correct synchronisation within a ±15 frame window, and the synchronisation is determined to be correct if the predicted offset is within 1 video frame of the ground truth. A random prediction would therefore yield 9.7% accuracy.
Thank you!
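The 9.7% random-prediction figure follows directly from the protocol as stated: a ±15 frame window gives 31 candidate offsets, and a prediction counts as correct when it lands within ±1 frame of the ground truth, i.e. 3 of the 31 offsets:

```python
# +/-15 frame search window -> 31 candidate offsets;
# correct if within +/-1 frame of ground truth -> 3 acceptable offsets.
window = 2 * 15 + 1       # 31 candidates
acceptable = 2 * 1 + 1    # 3 offsets count as correct
random_accuracy = acceptable / window
print(f"{random_accuracy:.1%}")  # 9.7%
```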
Most of the video-audio pairs are not synchronised, so they are ignored.
But you mentioned there should be 1000k+ lines?
Hi,
I'm trying to convert the voxceleb2 test set using makeFileList.py.
But it raises an error like this:
raise ValueError(f"File format {repr(str1)} not understood. Only "
ValueError: File format b'\x00\x00\x00\x18' not understood. Only 'RIFF' and 'RIFX' supported.
I think the above error was caused by trying to read the m4a file in wav format.
makeFileList.py appears to assume that the audio files are in wav format.
But the downloaded audio is in m4a format.
Do I have to convert the audio format before using this code?
Thanks
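One workaround (not from the repo itself) is to convert the downloaded .m4a files to PCM .wav with ffmpeg before running makeFileList.py. A sketch using Python's subprocess, assuming ffmpeg is on PATH and that the pipeline expects 16 kHz mono audio (worth checking the repo's audio loader for the actual sample rate):

```python
import pathlib
import subprocess

def m4a_to_wav(src_dir, dst_dir, sample_rate=16000):
    """Convert every .m4a under src_dir to a mono PCM .wav under dst_dir,
    mirroring the directory layout. Assumes ffmpeg is installed."""
    src_dir, dst_dir = pathlib.Path(src_dir), pathlib.Path(dst_dir)
    for src in src_dir.rglob("*.m4a"):
        dst = dst_dir / src.relative_to(src_dir).with_suffix(".wav")
        dst.parent.mkdir(parents=True, exist_ok=True)
        subprocess.run(
            ["ffmpeg", "-y", "-i", str(src),
             "-ac", "1", "-ar", str(sample_rate), str(dst)],
            check=True,
        )
```

This keeps the VoxCeleb2 id/session/clip folder structure intact, so the wav paths line up with the mp4 paths that makeFileList.py expects.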
Hi
I am trying to finetune the model on a smaller dataset of 450 samples of greyhead renders that look like this
My training loss seems to be converging but the validation loss is diverging.
I am freezing everything but the final layers but otherwise using the exact same code as the repo.
When I plot the euclidean distance of the FC layer outputs I find that the distribution is pretty overlapping for matched (green) and unmatched pairs (red).
Ignore this
Why do you define the label in the loss function to be:

```python
def sync_loss(self, out_v, out_a, criterion):
    batch_size = out_a.size()[0]
    time_size = out_a.size()[2]
    label = torch.arange(time_size).cuda()
    nloss = 0
    prec1 = 0
```
Shouldn't the label be 0 or 1 depending on whether the data is synchronised or not?
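A plausible reading of the `arange` label (consistent with a multi-way matching formulation, though not confirmed in this thread) is that each video time step is scored against every audio time step in the window, and the correct "class" for video step i is audio step i. Cross-entropy over the resulting similarity matrix then needs `arange(time_size)` as its target, not 0/1 labels. A sketch with illustrative shapes, run on CPU (the repo's code uses `.cuda()`):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
time_size, dim = 10, 512
out_v = torch.randn(time_size, dim)  # video features per time step
out_a = torch.randn(time_size, dim)  # audio features per time step

# Row i of the similarity matrix scores video step i against every
# audio step; the matching audio step for row i is column i, hence
# label = arange(time_size) rather than binary sync/no-sync targets.
logits = out_v @ out_a.t()           # (time_size, time_size)
label = torch.arange(time_size)
loss = nn.CrossEntropyLoss()(logits, label)
```

Under this view, the off-diagonal entries of each row act as the negative (unsynchronised) pairs, so binary labels never need to be constructed explicitly.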
It is 00048.txt instead. Is there anything wrong with the dataset?
Hey
I was wondering if you could shed some light on why you added learnable weights and biases to the sync and identity losses. It seemed like you were trying to scale and shift the losses, but I don't understand why the model doesn't train without them.
Also, if you use learnable weights and biases, what stops the weight from going to 0 and making the loss 0?
I am using voxceleb dataset for training.
Below are the loss curves for when I removed the weights. The sync loss seems to be stagnant while the identity loss is increasing.
I want to ask about VOX2_PATH/dev/txt. Where are the txt files?
Thanks a lot for your help
I am trying the repo for the first time. While preparing the data, I found that we need the text annotations for the voxceleb files, but the dataset appears to contain only mp4s and wavs.
Is there a way to get the annotations and the offsets? Is it available elsewhere?
Related command -
python ./makeFileList.py --output data/test.txt --mp4_dir VOX2_PATH/test/mp4 --txt_dir VOX2_PATH/test/txt --wav_dir VOX2_PATH/test/wav
Where are the negative audio samples generated for the M-way matching problem? I only see the load_wav function sampling the audio corresponding to the starting index of the video frames.
I only see positive samples.
In the speaker recognition task, there are two parameters, test_list and test_path. What path is test_path?
The code uses "inp1 = loadWAV(os.path.join(test_path,file), self.max_frames*4, evalmode=True, num_eval=num_eval).cuda()", but some of the entries in the file are a single string of numbers, while others are full audio paths.
Hi,
I'm about to train the model with the Voxceleb2 dataset, just to check the training pipeline.
I ran this command on the dev set:
python ./makeFileList.py --output data/dev.txt --mp4_dir VOX2_PATH/dev/mp4 --txt_dir VOX2_PATH/dev/txt --wav_dir VOX2_PATH/dev/wav
But for all files I'm getting this error:
Skipped ./vox2celeb/dev/mp4/id04484/ex2J3Oq2CAE/00084.mp4 - audio and video lengths different
It is worth mentioning that I extracted the .wav files from the mp4 files using ffmpeg, instead of downloading the audio files.
Could anyone help please?
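When every file is skipped with "audio and video lengths different", a likely cause is that the ffmpeg extraction produced a different duration or sample rate than the script expects. A small diagnostic sketch for comparing the two durations, assuming 16 kHz wav audio and 25 fps video (VoxCeleb2's nominal rates; these are assumptions worth verifying against the repo):

```python
import wave

def check_lengths(wav_path, num_video_frames, fps=25):
    """Return (audio_seconds, video_seconds) for one clip so that
    mismatches can be inspected. Assumes a PCM wav file."""
    with wave.open(wav_path, "rb") as w:
        audio_sec = w.getnframes() / w.getframerate()
    video_sec = num_video_frames / fps
    return audio_sec, video_sec
```

If the two durations differ consistently by the same factor, the extraction sample rate is probably wrong; re-extracting with an explicit rate (e.g. `ffmpeg -i clip.mp4 -ac 1 -ar 16000 clip.wav`) may resolve it.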
The link to the pretrained model is broken; could you provide it again? Thank you.
Hi, the paper mentions using a disentanglement loss, but it is not used in the code. Could you provide the code for the disentanglement loss? Thank you.
Hi, I am wondering what the reasoning behind the evaluation implemented in evaluateFromListSave is - it seems to me this is loading in 2 audio files, running the audio feature extractor on them, and computing the feature-wise cosine distance between them. Where is the video pipeline in this? How is this a good evaluation metric without using the visual stream?
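As described in the question, the evaluation compares embeddings of two audio files; this resembles a speaker-verification protocol, where each utterance yields several embeddings and the score is a mean pairwise cosine distance. A sketch of that scoring step under those assumptions (illustrative shapes, not the repo's exact code):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Hypothetical: num_eval embeddings per utterance, as in a
# speaker-verification style evaluation.
num_eval, dim = 10, 512
emb1 = F.normalize(torch.randn(num_eval, dim), dim=1)
emb2 = F.normalize(torch.randn(num_eval, dim), dim=1)

# Mean pairwise cosine distance between all embedding pairs:
# identical speakers -> near 0, unrelated speakers -> larger.
score = 1 - torch.mean(emb1 @ emb2.t())
```

If that reading is right, the audio-only evaluation measures the identity embedding (a speaker-verification metric), while the visual stream would only enter a separate synchronisation evaluation.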
Hi, I'd like to know how to add the disentanglement loss to the training process so as to measure the true effect of disentangling. It seems that adding it requires significant adjustments to the code released here.
Thanks
What is the text dir?
Hi, joonson