
Localizing-Visual-Sounds-the-Hard-Way

Code and Dataset for "Localizing Visual Sounds the Hard Way".

The repo contains code and our pre-trained model.

Environment

  • Python 3.6.8
  • PyTorch 1.3.0

Flickr-SoundNet

We provide the pretrained model here.

To test the model, download the test data and ground truth from learning to localize sound source.

Then run

python test.py --data_path "path to downloaded data with structure below/" --summaries_dir "path to pretrained models" --gt_path "path to ground truth" --testset "flickr"

VGG-Sound Source

We provide the pretrained model here.

To test the model, run

python test.py --data_path "path to downloaded data with structure below/" --summaries_dir "path to pretrained models" --testset "vggss"

(Note: some ground-truth bounding boxes were updated recently, so all results on VGG-SS show a 2~3% difference in IoU.)

Both test sets should be placed in the following directory structure.

data path
│
└───frames
│   │   image001.jpg
│   │   image002.jpg
│   │
└───audio
    │   audio011.wav
    │   audio012.wav

Citation

@InProceedings{Chen21,
              title        = "Localizing Visual Sounds the Hard Way",
              author       = "Honglie Chen and Weidi Xie and Triantafyllos Afouras and Arsha Nagrani and Andrea Vedaldi and Andrew Zisserman",
              booktitle    = "CVPR",
              year         = "2021"}


Issues

The first test result doesn't match the one in your paper, is that OK?

Hey @hche11, I tested the pre-trained model you provided with test.py, but the results differ somewhat from the tabular data in the paper. Specifically, I tested on the SoundNet-Flickr test set and completed all the steps, but there were some bugs at runtime; after modifying two places in the code, I ran test.py successfully. The results, shown in the figure, are only close to the paper's results for the model trained on the full VGG-Sound training set. I wonder if this gap is normal?

By the way, the two changes I made are: 1. line 56 in model.py: the Tensor object does not have a T attribute, so I replaced it with aud.t(); 2. line 110 in dataloader.py: an "axes don't match" error occurred during the aid_spectrogram call, so I expanded the spectrogram's dimensions, i.e. spectrogram = np.expand_dims(spectrogram, axis=2). Both workarounds are sketched below.
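
A sketch of those two workarounds (line numbers refer to model.py and dataloader.py in this repo; variable names follow the surrounding code):

# model.py, line 56: if your PyTorch build lacks the Tensor.T attribute,
# transpose the 2-D audio embedding with .t() instead:
aud_t = aud.t()  # previously: aud.T

# dataloader.py, line 110: give the spectrogram an explicit channel axis
# so the axes match in the subsequent aid_spectrogram call:
import numpy as np
spectrogram = np.expand_dims(spectrogram, axis=2)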


Pretrained models

Hi, thanks for your great work. I've run into a problem with the pretrained models: I can't unzip the archive you provided. I downloaded it again several times, but it still doesn't work, so I'm not sure whether the source file is corrupted or something else is wrong. I would appreciate it if you could look into this. Thanks a lot.

Image frame position

In your paper, do you mean the middle frame of each video clip? That is, for a 10-second video it would be frame 151. I'd like to confirm that my understanding is correct.

Code

Hi,

Will the code be available soon?

Random threshold

I think there is a problem with the random threshold in the model: it is supposed to be exposed as an argument (I think), but it has no set value. Please clarify this for me.

Missing data from VGG-SS

Hi, I am trying to download all videos from the test portion of VGG-SS, but many samples are missing. Do you have all of the videos? If so, how can I get access to them? Thanks.

Spectrogram Dimension

Hi,

I am wondering how we can get a 257x300 tensor for the spectrograms. I only get 257x200 with a 16000 Hz sample rate and 3 s of audio.
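
For what it's worth, here is a minimal sketch of STFT settings that produce that shape; the parameters are my assumption, not necessarily the repo's exact ones:

# Hedged sketch: 257 frequency bins implies n_fft = 512 (512 // 2 + 1 = 257),
# and ~300 frames from 3 s at 16 kHz implies hop_length = 160 (48000 / 160 = 300).
import numpy as np
import librosa

samples, sr = librosa.load("audio011.wav", sr=16000, duration=3.0)  # 48000 samples
spec = np.abs(librosa.stft(samples, n_fft=512, hop_length=160))     # (257, 301) with default centering
spec = spec[:, :300]                                                # trim to 257 x 300 if needed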

Some questions about loss implementation

Hi, thank you for sharing your awesome code.

I'm having some trouble understanding your code.


# Join them
A = torch.einsum('ncqa,nchw->nqa', [img, aud.unsqueeze(2).unsqueeze(3)]).unsqueeze(1)
A0 = torch.einsum('ncqa,ckhw->nkqa', [img, aud.T.unsqueeze(2).unsqueeze(3)])
# trimap
Pos = self.m((A - self.epsilon) / self.tau)
if self.trimap:
    Pos2 = self.m((A - self.epsilon2) / self.tau)
    Neg = 1 - Pos2
else:
    Neg = 1 - Pos
Pos_all = self.m((A0 - self.epsilon) / self.tau)
# positive
sim1 = (Pos * A).view(*A.shape[:2], -1).sum(-1) / (Pos.view(*Pos.shape[:2], -1).sum(-1))
# negative
sim = ((Pos_all * A0).view(*A0.shape[:2], -1).sum(-1) / Pos_all.view(*Pos_all.shape[:2], -1).sum(-1)) * self.mask
sim2 = (Neg * A).view(*A.shape[:2], -1).sum(-1) / Neg.view(*Neg.shape[:2], -1).sum(-1)
if self.Neg:
    logits = torch.cat((sim1, sim, sim2), 1) / 0.07
else:
    logits = torch.cat((sim1, sim), 1) / 0.07
return A, logits, Pos, Neg

Referring to the code and your paper,
I understand that sim1 at L#69 represents P_i, and sim2 at L#72 implements the left term of N_i.

It seems sim at L#71 corresponds to the right term of N_i.
However, I cannot understand why the Pos_all variable is a thresholded version of A0;
I thought it should be an all-ones matrix, according to the paper.
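
For reference, this is my reading of the paper's objective (paraphrased notation, so treat it as an assumption rather than a verbatim quote):

% S_{i \to j}: similarity map between image i and audio j
% m_{ip} = \sigma((S_{i \to i} - \epsilon)/\tau): positive mask
% m_{in} = 1 - \sigma((S_{i \to i} - \epsilon_2)/\tau): (trimap) negative mask
P_i = \frac{1}{|m_{ip}|} \langle m_{ip}, S_{i \to i} \rangle
N_i = \frac{1}{|m_{in}|} \langle m_{in}, S_{i \to i} \rangle
    + \sum_{j \ne i} \frac{1}{hw} \langle \mathbf{1}, S_{i \to j} \rangle
\mathcal{L} = -\frac{1}{k} \sum_{i=1}^{k} \log \frac{\exp(P_i)}{\exp(P_i) + \exp(N_i)}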

  • Could you clarify where sim belongs to in the loss objective?

One more question, please.

  • Regarding L#74-77, should I scale the logits by the temperature value (0.07)?
    I am a little confused, as scaling the logits by a temperature value is not directly stated in the paper.

It would be very helpful if you could release the code of your loss function.
Thank you very much.
Have a nice day!

problem in downloading the VGGSS dataset

Thanks for sharing the code.

Have you annotated the raw videos from YouTube, or the processed videos of the VGG-Sound dataset? I am asking because the names of the videos in the vggss.json file are not identical to those in the .csv files of the VGG-Sound dataset. Alternatively, if you processed the raw videos from YouTube, did you select the middle frame (as stated in the paper) independently of the VGG-Sound dataset?

I am confused about how to find the middle frame. I would appreciate it if you could share the code used to download and prepare the dataset.
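
In case it helps, a minimal sketch of extracting the middle frame of a clip with OpenCV (file names are hypothetical, and this reflects my reading of "middle frame", not a confirmed pipeline):

# Seek to the middle frame of a video and save it as a JPEG.
import cv2

cap = cv2.VideoCapture("clip.mp4")
n_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
cap.set(cv2.CAP_PROP_POS_FRAMES, n_frames // 2)  # jump to the middle frame
ok, frame = cap.read()
if ok:
    cv2.imwrite("image001.jpg", frame)
cap.release()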

The reason for using the threshold (0.5) in cal_CIoU

Hi!

Why do you use the threshold (0.5) in cal_CIoU, given that training doesn't involve 0.5 anywhere? In other words, is it just from hyper-parameter tuning, or is it reasoned from mathematical properties of the contrastive loss?

The reason I'm asking is that recent papers, Learning Audio-Visual Source Localization via False Negative Aware Contrastive Learning and A Closer Look at Weakly-Supervised Audio-Visual Source Localization, use a relative prediction that always chooses the top 50% region as the result, without any threshold, so I just became curious :)
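
For context, a hedged sketch of what a thresholded cIoU computation looks like (simplified; not necessarily the repo's exact cal_CIoU):

# Binarize the normalized localization map at the threshold, then
# compare it against the binary ground-truth mask.
import numpy as np

def cal_ciou(heatmap, gt_mask, thres=0.5):
    pred = (heatmap >= thres).astype(np.float32)
    inter = np.sum(pred * gt_mask)
    union = np.sum(gt_mask) + np.sum(pred * (gt_mask == 0))
    return inter / union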

loss function

Hello @hche11, thanks for your awesome work. I am very interested in this work.

Could you release the loss function, if you don't mind? I would really appreciate it!

training code

Hi, thanks for your awesome work. Do you plan to release the training code?

Audio & Image ConvNet

Hello, thank you for the great work. I'm wondering where and how models/audio_convnet.py and models/image_convnet.py are used? Thank you.
