
Localizing-Visual-Sounds-the-Hard-Way

Code and Dataset for "Localizing Visual Sounds the Hard Way".

The repo contains code and our pre-trained model.

Environment

  • Python 3.6.8
  • PyTorch 1.3.0

Flickr-SoundNet

We provide the pretrained model here.

To test the model, download the test data and ground truth from learning to localize sound source.

Then run

python test.py --data_path "path to downloaded data with structure below/" --summaries_dir "path to pretrained models" --gt_path "path to ground truth" --testset "flickr"

VGG-Sound Source

We provide the pretrained model here.

To test the model, run

python test.py --data_path "path to downloaded data with structure below/" --summaries_dir "path to pretrained models" --testset "vggss"

(Note: some ground-truth bounding boxes were updated recently, so all results on VGG-SS show a 2~3% difference in IoU.)

Both test sets should be placed in the following directory structure.

data path
│
└───frames
│   │   image001.jpg
│   │   image002.jpg
│   │
└───audio
    │   audio011.wav
    │   audio012.wav

Citation

@InProceedings{Chen21,
              title        = "Localizing Visual Sounds the Hard Way",
              author       = "Honglie Chen and Weidi Xie and Triantafyllos Afouras and Arsha Nagrani and Andrea Vedaldi and Andrew Zisserman",
              booktitle    = "CVPR",
              year         = "2021"}


Issues

The first test result doesn't match the one in your paper, is that OK?

Hey @hche11, I tested the pre-trained model you provided with test.py, but the results differ somewhat from the tabular data in the paper. Specifically, I tested on the SoundNet-Flickr test set and completed all the steps, but there were some bugs at runtime; after modifying two places in the code, I ran test.py successfully. The results, shown in the figure, are only close to the paper's results for the model trained on the full VGG-Sound training set. I wonder if this gap is normal?

By the way, the two changes I made are: 1. line 56 in model.py: the Tensor object does not have a T attribute, so I replaced it with aud.t(); 2. line 110 in dataloader.py: an "axes don't match" error occurred during the aid_spectrogram call, so I expanded the spectrogram's dimensions, i.e. spectrogram = np.expand_dims(spectrogram, axis=2). Both workarounds are sketched below.
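
A sketch of those two workarounds (line numbers refer to model.py and dataloader.py in this repo; variable names follow the surrounding code):

# model.py, line 56: if your PyTorch build lacks the Tensor.T attribute,
# transpose the 2-D audio embedding with .t() instead:
aud_t = aud.t()  # previously: aud.T

# dataloader.py, line 110: give the spectrogram an explicit channel axis
# so the axes match in the subsequent aid_spectrogram call:
import numpy as np
spectrogram = np.expand_dims(spectrogram, axis=2)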


Pretrained models

Hi, thanks for your great work. I've run into a problem with the pretrained models: I can't unzip the archive you provided. I downloaded it again several times, but it still doesn't work, so I'm not sure whether the source file is corrupted or something else is wrong. I would appreciate it if you could look into this. Thanks a lot.

Image frame position

In your paper, do you mean the middle frame of each video clip? That is, for a 10-second video it would be frame 151. I'd like to confirm that my understanding is correct.

Code

Hi,

Will the code be available soon?

Random threshold

I think there is a problem with the random threshold in the model: it is supposed to be exposed as an argument (I think), but it has no set value. Please clarify this for me.

Missing data from VGG-SS

Hi, I am trying to download all videos from the test portion of VGG-SS, but many samples are missing. Do you have all of the videos? If so, how can I get access to them? Thanks.

Spectrogram Dimension

Hi,

I am wondering how we can get a 257x300 tensor for the spectrograms. I only get 257x200 with a 16000 Hz sample rate and 3 s of audio.
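
For what it's worth, here is a minimal sketch of STFT settings that produce that shape; the parameters are my assumption, not necessarily the repo's exact ones:

# Hedged sketch: 257 frequency bins implies n_fft = 512 (512 // 2 + 1 = 257),
# and ~300 frames from 3 s at 16 kHz implies hop_length = 160 (48000 / 160 = 300).
import numpy as np
import librosa

samples, sr = librosa.load("audio011.wav", sr=16000, duration=3.0)  # 48000 samples
spec = np.abs(librosa.stft(samples, n_fft=512, hop_length=160))     # (257, 301) with default centering
spec = spec[:, :300]                                                # trim to 257 x 300 if needed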

Some questions about loss implementation

Hi, thank you for sharing your awesome code.

I'm having some trouble understanding your code.


# Join them
A = torch.einsum('ncqa,nchw->nqa', [img, aud.unsqueeze(2).unsqueeze(3)]).unsqueeze(1)
A0 = torch.einsum('ncqa,ckhw->nkqa', [img, aud.T.unsqueeze(2).unsqueeze(3)])
# trimap
Pos = self.m((A - self.epsilon) / self.tau)
if self.trimap:
    Pos2 = self.m((A - self.epsilon2) / self.tau)
    Neg = 1 - Pos2
else:
    Neg = 1 - Pos
Pos_all = self.m((A0 - self.epsilon) / self.tau)
# positive
sim1 = (Pos * A).view(*A.shape[:2], -1).sum(-1) / (Pos.view(*Pos.shape[:2], -1).sum(-1))
# negative
sim = ((Pos_all * A0).view(*A0.shape[:2], -1).sum(-1) / Pos_all.view(*Pos_all.shape[:2], -1).sum(-1)) * self.mask
sim2 = (Neg * A).view(*A.shape[:2], -1).sum(-1) / Neg.view(*Neg.shape[:2], -1).sum(-1)
if self.Neg:
    logits = torch.cat((sim1, sim, sim2), 1) / 0.07
else:
    logits = torch.cat((sim1, sim), 1) / 0.07
return A, logits, Pos, Neg

Referring to the code and your paper,
I understand that sim1 at L#69 represents P_i, and sim2 at L#72 implements the left term of N_i.

It seems sim at L#71 corresponds to the right term of N_i.
However, I cannot understand why the Pos_all variable is a thresholded version of A0;
I thought it should be an all-ones matrix, according to the paper.
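
For reference, this is my reading of the paper's objective (paraphrased notation, so treat it as an assumption rather than a verbatim quote):

% S_{i \to j}: similarity map between image i and audio j
% m_{ip} = \sigma((S_{i \to i} - \epsilon)/\tau): positive mask
% m_{in} = 1 - \sigma((S_{i \to i} - \epsilon_2)/\tau): (trimap) negative mask
P_i = \frac{1}{|m_{ip}|} \langle m_{ip}, S_{i \to i} \rangle
N_i = \frac{1}{|m_{in}|} \langle m_{in}, S_{i \to i} \rangle
    + \sum_{j \ne i} \frac{1}{hw} \langle \mathbf{1}, S_{i \to j} \rangle
\mathcal{L} = -\frac{1}{k} \sum_{i=1}^{k} \log \frac{\exp(P_i)}{\exp(P_i) + \exp(N_i)}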

  • Could you clarify where sim belongs to in the loss objective?

One more question, please.

  • Regarding L#74-77, should I scale the logits by the temperature value (0.07)?
    I am a little confused, as scaling the logits by a temperature value is not directly stated in the paper.

It would be very helpful if you could release the code of your loss function.
Thank you very much.
Have a nice day!

problem in downloading the VGGSS dataset

Thanks for sharing the code.

Have you annotated the raw videos from YouTube, or the processed videos of the VGG-Sound dataset? I am asking because the names of the videos in the vggss.json file are not identical to those in the .csv files of the VGG-Sound dataset. Alternatively, if you processed the raw videos from YouTube, did you select the middle frame (as stated in the paper) independently of the VGG-Sound dataset?

I am confused about how to find the middle frame. I would appreciate it if you could share the code used to download and prepare the dataset.
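
In case it helps, a minimal sketch of extracting the middle frame of a clip with OpenCV (file names are hypothetical, and this reflects my reading of "middle frame", not a confirmed pipeline):

# Seek to the middle frame of a video and save it as a JPEG.
import cv2

cap = cv2.VideoCapture("clip.mp4")
n_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
cap.set(cv2.CAP_PROP_POS_FRAMES, n_frames // 2)  # jump to the middle frame
ok, frame = cap.read()
if ok:
    cv2.imwrite("image001.jpg", frame)
cap.release()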

The reason for using the threshold (0.5) in cal_CIoU

Hi!

Why do you use the threshold (0.5) in cal_CIoU, given that training doesn't involve 0.5 anywhere? In other words, is it just from hyper-parameter tuning, or is it reasoned from mathematical properties of the contrastive loss?

The reason I'm asking is that recent papers, Learning Audio-Visual Source Localization via False Negative Aware Contrastive Learning and A Closer Look at Weakly-Supervised Audio-Visual Source Localization, use a relative prediction that always chooses the top 50% region as the result, without any threshold, so I just became curious :)
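
For context, a hedged sketch of what a thresholded cIoU computation looks like (simplified; not necessarily the repo's exact cal_CIoU):

# Binarize the normalized localization map at the threshold, then
# compare it against the binary ground-truth mask.
import numpy as np

def cal_ciou(heatmap, gt_mask, thres=0.5):
    pred = (heatmap >= thres).astype(np.float32)
    inter = np.sum(pred * gt_mask)
    union = np.sum(gt_mask) + np.sum(pred * (gt_mask == 0))
    return inter / union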

loss function

Hello @hche11, thanks for your awesome work. I am very interested in this work.

Could you release the loss function, if you don't mind? I would really appreciate it!

training code

Hi, thanks for your awesome work. Do you plan to release the training code?

Audio & Image ConvNet

Hello, thank you for the great work. I'm wondering where and how models/audio_convnet.py and models/image_convnet.py are used? Thank you.
