syncnet_trainer's People

Contributors

joonson


syncnet_trainer's Issues

Choice of loss function used

Hi,

I am looking through this repo and I am confused by the choice of loss function. I am using SyncNet to measure lip-sync error, and since this could be framed as a binary classification problem, I am unsure why CrossEntropyLoss is used rather than binary cross-entropy (BCELoss).

Any clarification would be highly appreciated.
Thanks
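One observation that may help here: for a two-class problem the two losses coincide numerically, because a softmax over the logits [z, 0] equals the sigmoid of z. A minimal sketch (function names are mine, not from the repo):

```python
import math

def bce_positive(z):
    # binary cross-entropy of a positive example, given sigmoid logit z
    return -math.log(1.0 / (1.0 + math.exp(-z)))

def ce_two_class(z):
    # 2-class cross-entropy with logits [z, 0] and target class 0
    # softmax of [z, 0] at index 0 is e^z / (e^z + 1) = sigmoid(z)
    return -math.log(math.exp(z) / (math.exp(z) + 1.0))

print(bce_positive(1.7), ce_two_class(1.7))  # identical values
```

So the choice is mostly a matter of convenience: CrossEntropyLoss generalises to the M-way case without changing the code path.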

Weird behaviour of identity loss

I am training SyncNet on the VoxCeleb dataset and I see a weird behaviour: the model overfits on the identity loss, while the sync loss seems to be working fine.
The weight assigned to each loss is 1.0.

(Screenshots: sync and identity loss curves.)

Evaluation Protocol for synchronization accuracy in Perfect Match Paper

Hello,

I have a couple of questions regarding the 75.8% synchronization accuracy reported in https://ieeexplore.ieee.org/abstract/document/9067055/

Perfect Match evaluation protocol: The task is to determine the correct synchronisation within a ±15 frame window, and the synchronisation is determined to be correct if the predicted offset is within 1 video frame of the ground truth. A random prediction would therefore yield 9.7% accuracy.
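The 9.7% figure can be reproduced by simple counting, assuming the ground-truth offset is not at the edge of the search window:

```python
# Sketch of the 9.7% random-prediction accuracy quoted above,
# assuming the ground-truth offset is 0 (i.e. not at a window edge).
offsets = range(-15, 16)   # 31 candidate offsets within a ±15 frame window
tolerance = 1              # a prediction counts as correct within ±1 frame
hits = sum(1 for o in offsets if abs(o) <= tolerance)  # 3 correct answers
random_accuracy = hits / len(offsets)
print(round(100 * random_accuracy, 1))  # → 9.7
```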

  1. How does changing M affect the model?
  2. The training is a 46-way classification. How exactly do you go from the 46-way training classification to the ±15 frame evaluation?
  3. Do you have the class-split for your evaluation data? Aren't all the test samples in sync? Where do you get out of sync ground truth frames from?
  4. The accuracy for N-way classification reported here is 49%, but your numbers are much higher. I'm wondering why there is such a large discrepancy between the two.
  5. The visual stream uses whole face pixels and not just mouth crops. Is that correct?

Thank you!

makeFileList.py doesn't work.

Hi,

I'm trying to convert voxceleb2 test set using makeFileList.py.
But it raises an error like this:

```
raise ValueError(f"File format {repr(str1)} not understood. Only "
ValueError: File format b'\x00\x00\x00\x18' not understood. Only 'RIFF' and 'RIFX' supported.
```

I think this error is caused by trying to read an m4a file as wav. makeFileList.py appears to assume that the audio files are in wav format, but the downloaded audio is in m4a format.

Do I have to convert the audio format before using this code?

Thanks
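For reference, a conversion along these lines can be scripted with ffmpeg; the 16 kHz mono 16-bit PCM settings below are my assumption about what the pipeline expects, not something taken from the repo:

```python
import subprocess

def m4a_to_wav_cmd(src, dst):
    # Hypothetical conversion command: 16 kHz mono 16-bit PCM wav,
    # which SyncNet-style audio pipelines typically expect.
    return ["ffmpeg", "-y", "-i", src,
            "-ac", "1", "-ar", "16000", "-acodec", "pcm_s16le", dst]

cmd = m4a_to_wav_cmd("00001.m4a", "00001.wav")
# subprocess.run(cmd, check=True)  # run this when ffmpeg is installed
print(" ".join(cmd))
```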

Model overfitting when finetuned on smaller data

Hi
I am trying to fine-tune the model on a smaller dataset of 450 samples of greyhead renders that look like this:
(Screenshot: example greyhead render.)

My training loss converges, but the validation loss diverges.
I am freezing everything except the final layers, but otherwise using the exact same code as the repo.

(Screenshots: training and validation loss curves.)

When I plot the Euclidean distance between the FC-layer outputs, I find that the distributions for matched (green) and unmatched (red) pairs overlap heavily.

(Plot: matched and unmatched pair distance distributions.)

label

Why do you define the label in the loss function to be:

```python
def sync_loss(self, out_v, out_a, criterion):

    batch_size  = out_a.size()[0]
    time_size   = out_a.size()[2]

    label       = torch.arange(time_size).cuda()

    nloss = 0
    prec1 = 0
```

Shouldn't the label be 0 or 1, depending on whether the data is synchronised or not?
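For what it's worth, my reading of the torch.arange target (an assumption, not confirmed by the author) is that the criterion is an N-way cross-entropy over a time-by-time similarity matrix: row t scores video step t against every candidate audio step, and the in-sync audio step for video step t is t itself, hence labels 0..T-1 rather than a 0/1 flag. A toy numpy sketch:

```python
import numpy as np

T = 4
# Toy similarity matrix: in-sync (diagonal) pairs score highest.
scores = np.full((T, T), 0.1) + np.eye(T)
labels = np.arange(T)            # "the positive for video step t is audio step t"
pred = scores.argmax(axis=1)     # what the cross-entropy pushes towards
print((pred == labels).all())    # → True
```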

Intuition behind trainable weights and biases for losses

Hey
I was wondering if you could shed some light on why you added learnable weights and biases to the sync and identity losses. It looks like you are scaling and shifting, but I don't understand why the model doesn't train without it.
Also, with learnable weights and biases, what stops the weight from going to 0 and making the loss 0?
I am using voxceleb dataset for training.
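One possible answer to the "why can't the weight go to 0" part, assuming the learnable weight scales the logits rather than the final loss value (my assumption, acting like a temperature): at w = 0 the softmax collapses to uniform and the cross-entropy equals log(N) > 0, so the optimiser is pushed away from w = 0 rather than towards it. A numpy sketch:

```python
import numpy as np

def cross_entropy(logits, label):
    # numerically stable cross-entropy of a single example
    z = logits - logits.max()
    p = np.exp(z) / np.exp(z).sum()
    return -np.log(p[label])

logits = np.array([3.0, 0.5, -1.0])
# w = 0 collapses the softmax to uniform: the loss is log(N), not 0.
print(cross_entropy(0.0 * logits, 0), np.log(len(logits)))
```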

Below are the loss curves from when I removed the weights. The sync loss is stagnant, while the identity loss is increasing.
(Screenshots: sync and identity loss curves without learnable weights.)

What are the txt files?

I want to ask about VOX2_PATH/dev/txt. Where are the txt files?
Thanks a lot for your help

Where do I find the txt files?

I am trying the repo for the first time. While preparing the data, I find that we need the text annotations for the VoxCeleb files, but the dataset appears to contain only mp4s and wavs.

Is there a way to get the annotations and the offsets? Is it available elsewhere?

Related command:

```
python ./makeFileList.py --output data/test.txt --mp4_dir VOX2_PATH/test/mp4 --txt_dir VOX2_PATH/test/txt --wav_dir VOX2_PATH/test/wav
```

Negative audio samples for M way matching

Where are the negative audio samples generated for the M-way matching problem? I just see the load_wav function sampling the audio corresponding to the starting index of the video frame.

I only see positive samples.
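My reading (an assumption, not confirmed by the repo) is that the M-way negatives need not be loaded separately: audio windows at other temporal offsets of the same clip act as the negatives, with only the frame-aligned window being positive. A hypothetical sketch:

```python
import numpy as np

def candidate_windows(audio, start, win, num_offsets, hop):
    # window 0 is the frame-aligned (positive) one;
    # the remaining shifted copies serve as the M-way negatives
    return [audio[start + k * hop : start + k * hop + win]
            for k in range(num_offsets)]

audio = np.arange(16000)  # 1 s of 16 kHz audio (dummy samples)
wins = candidate_windows(audio, start=0, win=640, num_offsets=5, hop=640)
print(len(wins), len(wins[0]))  # → 5 640
```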

What does test_path mean?

In the speaker recognition task, there are two parameters: test_list and test_path. What path is test_path?
The code uses `inp1 = loadWAV(os.path.join(test_path,file), self.max_frames*4, evalmode=True, num_eval=num_eval).cuda()`, but some of the entries in the file are a single string of numbers, while others are full audio paths.

makeFileList.py skips every dev video

Hi,

I'm about to train the model with the VoxCeleb2 dataset, just to verify the training pipeline.
I run this command on the dev set:

```
python ./makeFileList.py --output data/dev.txt --mp4_dir VOX2_PATH/dev/mp4 --txt_dir VOX2_PATH/dev/txt --wav_dir VOX2_PATH/dev/wav
```

But for every file I get this error:

```
Skipped ./vox2celeb/dev/mp4/id04484/ex2J3Oq2CAE/00084.mp4 - audio and video lengths different
```

It is worth mentioning that I extracted the .wav files from the mp4 files using ffmpeg, instead of downloading the audio files.

Could anyone help please?
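For debugging, a consistency check along the lines the skip message suggests might look like this; the 25 fps / 16 kHz numbers are assumptions, not read from the repo, and ffmpeg-extracted wavs can differ from the originals by a few frames of padding:

```python
def lengths_match(num_video_frames, num_audio_samples, fps=25, sample_rate=16000):
    # assumed convention: 16000 / 25 = 640 audio samples per video frame
    samples_per_frame = sample_rate // fps
    return num_audio_samples // samples_per_frame == num_video_frames

print(lengths_match(250, 160000))   # 10 s of video vs 10 s of audio → True
print(lengths_match(250, 163200))   # a few extra frames of padding → False
```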

Disentanglement loss

Hi, the paper mentions using a disentanglement loss, but it is not used in the code. Can you provide the code for the disentanglement loss? Thank you.

Evaluation on list save

Hi, I am wondering what the reasoning is behind the evaluation implemented in evaluateFromListSave. It seems to load two audio files, run the audio feature extractor on them, and compute the feature-wise cosine distance between them. Where is the video pipeline in this? How is this a good evaluation metric without the visual stream?
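For context, the evaluation as described reads like a speaker-verification (audio-only) protocol, in which case the visual stream is deliberately absent. A sketch of such a feature-wise cosine score, consistent with the questioner's description but not taken from the repo:

```python
import numpy as np

def mean_cosine_distance(f1, f2):
    # mean pairwise cosine distance between two sets of embeddings,
    # computed on L2-normalised feature rows
    a = f1 / np.linalg.norm(f1, axis=1, keepdims=True)
    b = f2 / np.linalg.norm(f2, axis=1, keepdims=True)
    return 1.0 - (a @ b.T).mean()

x = np.random.default_rng(1).normal(size=(10, 512))  # dummy utterance features
print(mean_cosine_distance(x, x))
```

Thresholding this score over a list of same-speaker and different-speaker pairs is how verification metrics such as EER are typically computed.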

Usage of disentanglement loss

Hi, I'd like to know how I can add the disentanglement loss to the training process, so as to measure the true value of disentangling. It seems that adding the disentanglement loss to training requires big adjustments to the code released here.
Thanks

Preprocessing for wav

Hi, joonson

  1. When I run python makeFileList.py, it always skips files with "audio and video lengths different". I extract the wav from the mp4 with ffmpeg; how do you get the wav from m4a or mp4?
  2. In the dataLoader, frames are read with cap = cv2.VideoCapture(filename). Is the frame here the whole image, or only the face rectangle?
