joonson / syncnet_trainer
Disentangled Speech Embeddings using Cross-Modal Self-Supervision
License: MIT License
Hi,
I am looking through this repo and I am confused about the choice of loss function. I am using SyncNet to measure lip-sync error, and since this could be considered a binary classification problem, I am confused as to why CrossEntropyLoss is used rather than Binary Cross Entropy Loss.
Any clarification would be highly appreciated.
Thanks
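For context, one common reading of this design (not confirmed in this thread) is that SyncNet-style training is framed as an N-way matching problem rather than per-pair binary classification: one video feature is scored against N candidate audio segments, and the model must pick the matching one. CrossEntropyLoss covers that N-way case directly, while BCE only covers the two-class case. A minimal sketch under that assumption, with illustrative shapes that are not the repo's exact tensors:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical shapes: one video feature scored against N candidate
# audio features (index 0 taken as the true match for illustration).
N, dim = 5, 512
video_feat = torch.randn(1, dim)
audio_feats = torch.randn(N, dim)

# Similarity of the video feature to each audio candidate -> N logits.
logits = video_feat @ audio_feats.t()   # shape (1, N)
target = torch.tensor([0])              # index of the matching pair

# CrossEntropyLoss handles the N-way matching case;
# BCELoss would only express the special case N == 2.
loss = nn.CrossEntropyLoss()(logits, target)
```

With this formulation, the "negatives" are simply the other N-1 candidates in the same batch or window, which is why no explicit 0/1 labels appear.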
The pre-trained model cannot be downloaded
Hi,
I'm a little confused about the meaning of "offset" in the txt files.
Could anyone please explain what it means?
Thank you.
Hello,
I have a couple of questions regarding the 75.8% synchronization accuracy reported in https://ieeexplore.ieee.org/abstract/document/9067055/
Evaluation protocol (as described in the paper): the task is to determine the correct synchronisation within a ±15 frame window, and the synchronisation is determined to be correct if the predicted offset is within 1 video frame of the ground truth. A random prediction would therefore yield 9.7% accuracy.
Thank you!
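The 9.7% random-prediction figure follows directly from the protocol as stated: a ±15 frame window gives 31 candidate offsets, and a prediction counts as correct when it lands within ±1 frame of the ground truth, i.e. 3 of the 31 offsets:

```python
# +/-15 frame search window -> 31 candidate offsets;
# correct if within +/-1 frame of ground truth -> 3 acceptable offsets.
window = 2 * 15 + 1       # 31 candidates
acceptable = 2 * 1 + 1    # 3 offsets count as correct
random_accuracy = acceptable / window
print(f"{random_accuracy:.1%}")  # 9.7%
```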
Most of the video-audio pairs are not synchronised, so they are ignored.
But you mentioned there should be 1000k+ lines?
Hi,
I'm trying to convert the voxceleb2 test set using makeFileList.py.
But it raises an error like this:
raise ValueError(f"File format {repr(str1)} not understood. Only "
ValueError: File format b'\x00\x00\x00\x18' not understood. Only 'RIFF' and 'RIFX' supported.
I think the above error was caused by trying to read the m4a file in wav format.
makeFileList.py appears to assume that the audio files are in wav format.
But the downloaded audio is in m4a format.
Do I have to convert the audio format before using this code?
Thanks
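One workaround (not from the repo itself) is to convert the downloaded .m4a files to PCM .wav with ffmpeg before running makeFileList.py. A sketch using Python's subprocess, assuming ffmpeg is on PATH and that the pipeline expects 16 kHz mono audio (worth checking the repo's audio loader for the actual sample rate):

```python
import pathlib
import subprocess

def m4a_to_wav(src_dir, dst_dir, sample_rate=16000):
    """Convert every .m4a under src_dir to a mono PCM .wav under dst_dir,
    mirroring the directory layout. Assumes ffmpeg is installed."""
    src_dir, dst_dir = pathlib.Path(src_dir), pathlib.Path(dst_dir)
    for src in src_dir.rglob("*.m4a"):
        dst = dst_dir / src.relative_to(src_dir).with_suffix(".wav")
        dst.parent.mkdir(parents=True, exist_ok=True)
        subprocess.run(
            ["ffmpeg", "-y", "-i", str(src),
             "-ac", "1", "-ar", str(sample_rate), str(dst)],
            check=True,
        )
```

This keeps the VoxCeleb2 id/session/clip folder structure intact, so the wav paths line up with the mp4 paths that makeFileList.py expects.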
Hi
I am trying to finetune the model on a smaller dataset of 450 samples of greyhead renders that look like this
My training loss seems to be converging but the validation loss is diverging.
I am freezing everything but the final layers but otherwise using the exact same code as the repo.
When I plot the euclidean distance of the FC layer outputs I find that the distribution is pretty overlapping for matched (green) and unmatched pairs (red).
Ignore this
Why do you define the label in the loss function to be:

```python
def sync_loss(self, out_v, out_a, criterion):
    batch_size = out_a.size()[0]
    time_size = out_a.size()[2]
    label = torch.arange(time_size).cuda()
    nloss = 0
    prec1 = 0
```
Shouldn't the label be 0 or 1 depending on whether the data is synchronised or not?
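A plausible reading of the `arange` label (consistent with a multi-way matching formulation, though not confirmed in this thread) is that each video time step is scored against every audio time step in the window, and the correct "class" for video step i is audio step i. Cross-entropy over the resulting similarity matrix then needs `arange(time_size)` as its target, not 0/1 labels. A sketch with illustrative shapes, run on CPU (the repo's code uses `.cuda()`):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
time_size, dim = 10, 512
out_v = torch.randn(time_size, dim)  # video features per time step
out_a = torch.randn(time_size, dim)  # audio features per time step

# Row i of the similarity matrix scores video step i against every
# audio step; the matching audio step for row i is column i, hence
# label = arange(time_size) rather than binary sync/no-sync targets.
logits = out_v @ out_a.t()           # (time_size, time_size)
label = torch.arange(time_size)
loss = nn.CrossEntropyLoss()(logits, label)
```

Under this view, the off-diagonal entries of each row act as the negative (unsynchronised) pairs, so binary labels never need to be constructed explicitly.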
It is 00048.txt instead. Is there anything wrong with the dataset?
Hey
I was wondering if you could shed some light on why you added learnable weights and biases to the sync and identity losses. It seemed like you were trying to scale and shift the losses, but I don't understand why the model doesn't train without them.
Also, if you use learnable weights and biases, what stops the weight from going to 0 and making the loss 0?
I am using voxceleb dataset for training.
Below are the loss curves for when I removed the weights. The sync loss seems to be stagnant while the identity loss is increasing.
I want to ask about VOX2_PATH/dev/txt. Where are the txt files?
Thanks a lot for your help
I am trying the repo for the first time. While preparing the data, I found that we need the text annotations for the voxceleb files, but the dataset appears to contain only mp4s and wavs.
Is there a way to get the annotations and the offsets? Is it available elsewhere?
Related command -
python ./makeFileList.py --output data/test.txt --mp4_dir VOX2_PATH/test/mp4 --txt_dir VOX2_PATH/test/txt --wav_dir VOX2_PATH/test/wav
Where are the negative audio samples generated for the M-way matching problem? I only see the load_wav function sampling the audio corresponding to the starting index of the video frames.
I only see positive samples.
In the speaker recognition task, there are two parameters, test_list and test_path. What path is test_path?
The code uses "inp1 = loadWAV(os.path.join(test_path,file), self.max_frames*4, evalmode=True, num_eval=num_eval).cuda()", but some of the entries in the file are a single string of numbers, while others are full audio paths.
Hi,
I'm about to train the model with the Voxceleb2 dataset, just to check the training pipeline.
I ran this command on the dev set:
python ./makeFileList.py --output data/dev.txt --mp4_dir VOX2_PATH/dev/mp4 --txt_dir VOX2_PATH/dev/txt --wav_dir VOX2_PATH/dev/wav
But for all files I'm getting this error:
Skipped ./vox2celeb/dev/mp4/id04484/ex2J3Oq2CAE/00084.mp4 - audio and video lengths different
It is worth mentioning that I extracted the .wav files from the mp4 files using ffmpeg, instead of downloading the audio files.
Could anyone help please?
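When every file is skipped with "audio and video lengths different", a likely cause is that the ffmpeg extraction produced a different duration or sample rate than the script expects. A small diagnostic sketch for comparing the two durations, assuming 16 kHz wav audio and 25 fps video (VoxCeleb2's nominal rates; these are assumptions worth verifying against the repo):

```python
import wave

def check_lengths(wav_path, num_video_frames, fps=25):
    """Return (audio_seconds, video_seconds) for one clip so that
    mismatches can be inspected. Assumes a PCM wav file."""
    with wave.open(wav_path, "rb") as w:
        audio_sec = w.getnframes() / w.getframerate()
    video_sec = num_video_frames / fps
    return audio_sec, video_sec
```

If the two durations differ consistently by the same factor, the extraction sample rate is probably wrong; re-extracting with an explicit rate (e.g. `ffmpeg -i clip.mp4 -ac 1 -ar 16000 clip.wav`) may resolve it.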
The link to the pretrained model is broken; could you provide it again? Thank you.
Hi, the paper mentions using a disentanglement loss, but it is not used in the code. Could you provide the code for the disentanglement loss? Thank you.
Hi, I am wondering what the reasoning behind the evaluation implemented in evaluateFromListSave is - it seems to me this is loading in 2 audio files, running the audio feature extractor on them, and computing the feature-wise cosine distance between them. Where is the video pipeline in this? How is this a good evaluation metric without using the visual stream?
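As described in the question, the evaluation compares embeddings of two audio files; this resembles a speaker-verification protocol, where each utterance yields several embeddings and the score is a mean pairwise cosine distance. A sketch of that scoring step under those assumptions (illustrative shapes, not the repo's exact code):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Hypothetical: num_eval embeddings per utterance, as in a
# speaker-verification style evaluation.
num_eval, dim = 10, 512
emb1 = F.normalize(torch.randn(num_eval, dim), dim=1)
emb2 = F.normalize(torch.randn(num_eval, dim), dim=1)

# Mean pairwise cosine distance between all embedding pairs:
# identical speakers -> near 0, unrelated speakers -> larger.
score = 1 - torch.mean(emb1 @ emb2.t())
```

If that reading is right, the audio-only evaluation measures the identity embedding (a speaker-verification metric), while the visual stream would only enter a separate synchronisation evaluation.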
Hi, I'd like to know how to add the disentanglement loss to the training process so as to measure the true effect of disentangling. It seems that adding it requires significant adjustments to the code released here.
Thanks
What is the text dir?
Hi, joonson