yapengtian / ave-eccv18

Audio-Visual Event Localization in Unconstrained Videos, ECCV 2018

Home Page: https://sites.google.com/view/audiovisualresearch

Python 100.00%

audio-visual-events audio-visual-learning ave-dataset eccv-2018

ave-eccv18's Issues

ValueError: could not broadcast input array from shape (10,6,4,512) into shape (10,128)

Hi, thanks for your great work.
When generating audio embeddings with the provided code, the audio feature buffer is allocated as np.zeros([len_data, 10, 128]), but the output of the VGGish network (i.e. the shape of embedding_tensor) is (10, 6, 4, 512).

For the input audio, I converted the mp4 file to .wav; the shape of input_batch is (10, 96, 64).

Could you help me run the script correctly so that I can generate results for my own video?
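
The shapes in the error suggest that the tensor being read is a convolutional feature map rather than the final 128-dimensional embedding: in VGGish, a 96x64 input patch shrinks to 6x4 with 512 channels after the last pooling layer. A minimal sketch of the mismatch, using random arrays as stand-ins (the helper `store` and the variable names are illustrative, not taken from the repository):

```python
import numpy as np

len_data = 1
audio_features = np.zeros([len_data, 10, 128])  # buffer shape quoted in the issue

conv_output = np.random.rand(10, 6, 4, 512)  # stand-in for the tensor reported in the issue
embedding = np.random.rand(10, 128)          # stand-in for one 128-D embedding per second


def store(buffer, index, features):
    """Copy one clip's features into the buffer, failing early with a clear message."""
    expected = buffer.shape[1:]
    if features.shape != expected:
        raise ValueError(f"expected features of shape {expected}, got {features.shape}")
    buffer[index] = features


store(audio_features, 0, embedding)  # shapes match, so this succeeds

try:
    store(audio_features, 0, conv_output)  # wrong tensor: rejected before NumPy broadcasts
except ValueError as err:
    message = str(err)
```

If the check fails in a real run, it may be worth confirming which tensor name is fetched from the VGGish graph; the embedding layer, not a pooling layer, is the one that yields 128 values per second.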

Questions about the ave dataset

Hi @YapengTian ,
First of all, thanks for your great work!

I wonder whether the numbers in train_order.h5 are row indices into Annotations.txt, so that they can be used to look up the corresponding training data and labels.
Thanks for your reply~

The AVEDataset only contains 4097 videos

Thank you for sharing your excellent work.
I downloaded the AVEDataset from the provided link. After unzipping, the AVE folder contains only 4097 videos.
I would really appreciate it if you could check this again.

list index out of range error in feature_extractor

I use imageio to read the video in feature_extractor. At extract_frame.append(imgs[n]), a list index out of range error is raised. I expected 160 frames to be extracted from vid (16 frames per second over 10 seconds), but some videos yield 301 frames and others 251, so extract_frame.shape ends up as (1, 224, 224, 3) while imgs.shape is (301, 224, 224, 3). How can I solve this?
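
One way to avoid indexing past the end of the decoded frame list is to derive the sample positions from the actual frame count instead of assuming a fixed frame rate. The sketch below is an assumption about how such sampling could be done, not the repository's own logic; `sample_frame_indices` is a hypothetical helper.

```python
def sample_frame_indices(num_frames, seconds=10, per_second=16):
    """Spread seconds * per_second sample positions evenly over the decoded frames."""
    if num_frames <= 0:
        raise ValueError("video has no decoded frames")
    total = seconds * per_second
    indices = []
    for k in range(total):
        # Scale sample k onto the real frame range, then clamp to the last frame.
        idx = int(k * num_frames / total)
        indices.append(min(idx, num_frames - 1))
    return indices


# The frame counts mentioned in the issue (301 and 251) both stay in range.
indices_301 = sample_frame_indices(301)
indices_251 = sample_frame_indices(251)
```

With this approach a 301-frame and a 251-frame video both produce 160 valid indices; videos shorter than 160 frames simply repeat some frames.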

Request for the extra noisy visual feature package

I have downloaded the audio and visual feature packages, but I can't find visual_feature_noisy.h5, which is used in your script for weakly supervised training. How can I produce it myself? Thanks.

About the number of video files

Hi, when I downloaded the dataset from your shared link, I found only 4097 video files. Could you confirm whether this is correct? The paper mentions that the dataset has over 4143 files. Thank you very much.

How to detach the audio file from the original mp4?

I have downloaded the feature_extractor package, but audio extractor.py takes a "path of wav file" argument. I don't know how to separate the audio track from the video. Could you give me some instructions? Thanks.
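
A common way to get a .wav file out of an mp4 is to call the ffmpeg command-line tool. The sketch below assumes ffmpeg is installed and on the PATH; the helper names and the 16 kHz mono setting are assumptions chosen for illustration, not values confirmed by the repository.

```python
import subprocess


def build_ffmpeg_command(video_path, wav_path, sample_rate=16000):
    """Build an ffmpeg call that drops the video stream and writes 16-bit PCM audio."""
    return [
        "ffmpeg", "-y",          # overwrite the output file if it exists
        "-i", video_path,        # input video
        "-vn",                   # ignore the video stream
        "-acodec", "pcm_s16le",  # 16-bit little-endian PCM, i.e. plain WAV
        "-ac", "1",              # mono
        "-ar", str(sample_rate), # output sample rate in Hz (assumed value)
        wav_path,
    ]


def extract_wav(video_path, wav_path, sample_rate=16000):
    """Run ffmpeg and raise CalledProcessError if the conversion fails."""
    subprocess.run(build_ffmpeg_command(video_path, wav_path, sample_rate), check=True)


command = build_ffmpeg_command("example.mp4", "example.wav")
```

Calling `extract_wav("example.mp4", "example.wav")` would then produce the file to pass as the "path of wav file" argument.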

Problem generating audio-guided visual attention maps

Hello, thank you for uploading the source code here. I tried to generate audio-guided visual attention maps by running attention_visualization.py. I used the provided dataset and features, but an error is raised on line 116: "extract_frames[cc, :, :, :] = imgs[n] IndexError: list index out of range". I am still working on this problem, so any ideas would be appreciated.

Question about dataloader and visual_feature_extractor

Hi,
Recently I have been trying to use visual_feature_extractor.py to extract AVE dataset features with ResNet-152, but the visual features I get have shape [4097, 10, 7, 7, 512], which differs from the shape of the visual features you provided ([4143, 10, 7, 7, 512]) and causes an IndexError: index 4118 is out of bounds for axis 0 with size 4097.
The AVE dataset has 4097 video files, and according to your replies in other issues one video may carry multiple labels. In visual_feature_extractor.py, however, dim 0 is set directly from the length of the video dataset (lines 31-35, line 72). I wonder whether there are further operations on the extracted video features, not included in the current code, that change their shape from [4097, 10, 7, 7, 512] to [4143, 10, 7, 7, 512].
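
If the 4143 rows correspond to annotation entries and the 4097 files to unique videos (an assumption based on the remark that one video may carry multiple labels), the per-video features could be expanded by repeating a video's features once for each annotation row that refers to it. A minimal sketch with toy data; the function name and the video IDs are hypothetical:

```python
def expand_to_annotations(annotation_video_ids, features_by_video):
    """Return one feature entry per annotation row, repeating videos that appear more than once."""
    missing = [v for v in annotation_video_ids if v not in features_by_video]
    if missing:
        raise KeyError(f"no features for videos: {sorted(set(missing))}")
    return [features_by_video[v] for v in annotation_video_ids]


# Toy example: three annotation rows that refer to only two distinct videos.
annotation_video_ids = ["vidA", "vidB", "vidA"]
features_by_video = {"vidA": [0.1, 0.2], "vidB": [0.3, 0.4]}
rows = expand_to_annotations(annotation_video_ids, features_by_video)
```

The result has one entry per annotation row, so its length matches the label file rather than the number of mp4 files.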

Only 4097 videos in Dataset?

Hi,

Thanks for the dataset and the code. The Drive link you shared for the AVE Dataset contains only 4097 mp4 files, whereas the paper mentions 4143 videos. Could you please help me find the other 46 videos?

Thanks

File visual_feature_vec.h5

Hello, thanks for your great work. I ran into an issue here:

with h5py.File('data/visual_feature_vec.h5', 'r') as hf:
    video_features = hf['avadataset'][:]

Is visual_feature_vec.h5 the same file as visual_feature.h5?

Thanks.

Request for help: error when running attention_visualization.py

Thank you for the fantastic research.
I use PyTorch 1.7.1 because I have to match my CUDA version. I got the error below. Could you give me some tips on which part of the code I should modify?

The dataset contains 4143 samples
402
Traceback (most recent call last):
  File "attention_visualization.py", line 90, in <module>
    h_x = att_model(audio_inputs, video_inputs)
  File "/home/jiang/anaconda3/envs/hw/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/jiang/桌面/AVE-ECCV18-master/models.py", line 68, in forward
    self.lstm_video.flatten_parameters()
  File "/home/jiang/anaconda3/envs/hw/lib/python3.7/site-packages/torch/nn/modules/rnn.py", line 117, in flatten_parameters
    if len(self._flat_weights) != len(self._flat_weights_names):
  File "/home/jiang/anaconda3/envs/hw/lib/python3.7/site-packages/torch/nn/modules/module.py", line 779, in __getattr__
    type(self).__name__, name))
torch.nn.modules.module.ModuleAttributeError: 'LSTM' object has no attribute '_flat_weights'

Request to Provide VideoIDs for Noisy Dataset

Hi,

Can you please provide the YouTube VideoIDs corresponding to visual_feature_noisy.h5?

Thanks!

Why are the attention visualization maps not being generated?

Failed to download audio_feature.h5

Does anyone have a Baidu Netdisk or Thunder (Xunlei) link for audio_feature.h5? I can only download it with the Google browser, and because the file is so large the download fails every time.

RuntimeError: cudnn RNN backward can only be called in training mode

Hi,

Thanks for your great work.
However, I ran into a problem when running this code.
I followed the instructions, put the required files in the /data folder, and ran the training command.

My environment: PyTorch 0.3.1, CUDA 9.0, cuDNN 7.1.2.
Could you help me run the script correctly?

➜  ave_code git:(master) ✗ source activate pytorch0.3.1
(pytorch0.3.1) ➜  ave_code git:(master) ✗ python weak_supervised_main.py --train
/home/wuyu07/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
3517
=== Epoch {0}   Loss: {0.7096}  Running time: {2.684252}
0.06890547263681591
Traceback (most recent call last):
  File "weak_supervised_main.py", line 171, in <module>
    train(args)
  File "weak_supervised_main.py", line 93, in train
    loss.backward()
  File "/home/wuyu07/anaconda3/lib/python3.6/site-packages/torch/tensor.py", line 102, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/wuyu07/anaconda3/lib/python3.6/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: cudnn RNN backward can only be called in training mode
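
This error is raised when backward() runs through a cuDNN RNN while the module is in evaluation mode. In the log above, a number that may be a validation score is printed just before the failing call, which would be consistent with the model being switched to eval mode for validation and not switched back. A minimal sketch of the usual pattern, written against a recent PyTorch (torch.no_grad() does not exist in 0.3.1); `TinyModel` and `validate` are hypothetical stand-ins, not the repository's classes:

```python
import torch
from torch import nn


class TinyModel(nn.Module):
    """Small LSTM classifier used only to illustrate the train/eval toggle."""

    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(input_size=4, hidden_size=8, batch_first=True)
        self.fc = nn.Linear(8, 2)

    def forward(self, x):
        out, _ = self.lstm(x)
        return self.fc(out[:, -1])


def validate(model, x):
    """Switch to eval mode for inference; the caller must restore train mode."""
    model.eval()
    with torch.no_grad():
        return model(x).argmax(dim=1)


model = TinyModel()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()
x = torch.randn(6, 10, 4)
y = torch.randint(0, 2, (6,))

for epoch in range(2):
    model.train()  # restore training mode, since validate() leaves the model in eval mode
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    predictions = validate(model, x)
```

Placing `model.train()` at the top of every epoch keeps the backward pass in training mode regardless of what the validation step did.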

I can't reproduce the results in your paper; my configuration is as follows

I used the visual_feature.h5 and audio_feature.h5 that you provided. The test result for AV_att is 61.5, versus 72.7 in your paper.

I use nb_epoch = 500
pytorch version is 1.0.1
The operating system version is Ubuntu 18.04.4 LTS (GNU/Linux 5.3.0-61-generic x86_64)
Tensorflow version is 1.15.1
cuda version is 10.0

The PyTorch 0.3.0 version you specify is too old for my computer to support.
I hope you can help me work out why the A+V-att result of 72.7 reported in your paper cannot be reproduced.
Thank you.

How do I generate my own labels.h5 files?

Hi @YapengTian ,
First of all, thanks for your great work!

I want to make my own data set. How do I generate my own labels.h5 files?

Thanks for your reply~
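
Another issue in this list mentions a target of shape (N, 10, 29), which suggests one one-hot row per second. The sketch below builds such rows for a single video under assumptions that the source does not confirm: that the last index (28) is a background class, and that an event with a given start and end second marks the seconds in [start, end). The function name and the category index are hypothetical.

```python
def segment_labels(category_index, start, end,
                   num_segments=10, num_classes=29, background_index=28):
    """Return num_segments one-hot rows: the event class inside [start, end), background elsewhere."""
    rows = []
    for second in range(num_segments):
        row = [0] * num_classes
        active = start <= second < end
        row[category_index if active else background_index] = 1
        rows.append(row)
    return rows


# Example: an event of (assumed) class 3 that lasts from second 2 to second 7.
labels = segment_labels(category_index=3, start=2, end=7)
```

Stacking these per-video blocks gives an array of shape (N, 10, 29), which could then be written with h5py under the 'avadataset' key that the snippets quoted in other issues read from.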

How did you extract visual_feature_noisy.h5?

I swapped my own features into the network, and for Weakly-Supervised Event Localization I am not sure how visual_feature_noisy.h5 was extracted. I hope you can reply, thank you!

How to understand heatmap?

Hello, thank you for the paper you recommended. I still have doubts about your project, and after thinking it over I would like to raise them with you. I use your model and code. The results seem to show that the detection has nothing to do with sound: the heatmap highlights faces even when I am not speaking, and even when I have my back to the camera and am silent. The heatmap always detects faces. Have you seen this behaviour? What is the difference between the heatmap for a talking person and for a silent one, and how should it be interpreted?

file labels_closs.h5

Regarding the file labels_closs.h5 that is required in the Cross-modality localization experiments:

with h5py.File('data/labels_closs.h5', 'r') as hf:
    closs_labels = hf['avadataset'][:]

Can we download it?

How to extract audio features with VGGish, and how to use them

Hello authors, I appreciate your wonderful contribution, but I have a few questions about how to extract the audio features.
You said you obtain the audio features with VGGish. Could you explain the processing steps?
Once we have the features, how is localization performed with them?

Loss Calculation: MultiLabelSoftMarginLoss on top of Softmax Output?

Hi,

Thanks for your work. I have a question about how you calculate the loss during training. nn.MultiLabelSoftMarginLoss operates on your softmax output. According to the PyTorch source code, nn.MultiLabelSoftMarginLoss performs BCELoss over the 2nd dimension (https://github.com/pytorch/pytorch/blob/f2af07d7f6ca775baf3abfb058cc5bb4aac7cc3a/torch/nn/functional.py#L2733). In this case, if your target has shape (N, 10, 29), you will be applying a sigmoid operation (from the BCE) to your original softmax output (https://github.com/YapengTian/AVE-ECCV18/blob/master/models.py#L79) and summing the loss across the sequence length (10) instead of over the batch dimension.

My questions are:

1. Does applying a sigmoid (implicitly, through nn.MultiLabelSoftMarginLoss) to a softmax output have a theoretical foundation? The softmax output is never negative, so the sigmoid output can never go below 0.5.
2. Applying nn.MultiLabelSoftMarginLoss to an (N, 10, 29) output will average the loss over the sequence dimension (10) and not the batch dimension (N).

I am not sure whether I am missing something obvious that requires a deeper understanding of the underlying mechanics of PyTorch. Could you please explain the above 2 points?

Thanks a lot for your help!
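
The first point can be checked numerically without PyTorch. The sketch below reimplements the per-class formula in plain Python (a simplified stand-in for nn.MultiLabelSoftMarginLoss on a single sample, with made-up logits) and compares feeding it softmax probabilities against feeding it raw logits.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def multilabel_soft_margin(inputs, targets):
    """Mean over classes of binary cross-entropy applied to sigmoid(inputs)."""
    terms = [
        t * math.log(sigmoid(x)) + (1 - t) * math.log(1 - sigmoid(x))
        for x, t in zip(inputs, targets)
    ]
    return -sum(terms) / len(terms)


logits = [8.0, -3.0, 0.5]  # made-up scores for three classes
target = [1, 0, 0]

probs = softmax(logits)
squashed = [sigmoid(p) for p in probs]  # what the loss sees when fed probabilities
loss_on_probs = multilabel_soft_margin(probs, target)
loss_on_logits = multilabel_soft_margin(logits, target)
```

Because softmax outputs lie in [0, 1], every value in `squashed` falls between 0.5 and sigmoid(1), roughly 0.73, so the loss on probabilities cannot approach zero even for a confident, correct prediction.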

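How to obtain 4143 sets of features instead of 4097

I want to use a different network to extract visual and audio features from the videos, but when processing the videos I found there are only 4097 of them. I am not sure how visual_feature.h5 comes to contain 4143 visual entries (the dataset notes mention that some videos contain different audio-visual events, so there are not 4143 videos). Could you tell me how this was handled? I hope you can reply, thank you!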

How to get the reported results?

Hi,
Thanks for your great work and happy new year!
When I run your code with the parameters you uploaded, I get the reported results. But when I run your training code to obtain new parameters and evaluate with them, I cannot reproduce the reported results. Could you please tell me how to train your network?
Best,
Amose.

How to use it in real time when it takes 10 s non-overlapping sequences?

My understanding is that the input to this model is a 10 s video and audio sequence, so how do I modify this for real-time use?
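
One possible approach, offered as an assumption rather than something the authors describe, is to keep a sliding window of the most recent ten one-second segments and re-run the model each time a new second arrives. This replaces non-overlapping clips with overlapping ones and still has a ten-second warm-up. The class name and the placeholder strings below are hypothetical.

```python
from collections import deque


class SlidingWindow:
    """Keep the most recent `size` one-second segments and expose them as one model input."""

    def __init__(self, size=10):
        self.size = size
        self.segments = deque(maxlen=size)  # the oldest segment is dropped automatically

    def push(self, segment):
        """Add a segment; return the full window once enough segments have arrived."""
        self.segments.append(segment)
        if len(self.segments) < self.size:
            return None
        return list(self.segments)


window = SlidingWindow(size=10)
outputs = []
for second in range(12):
    batch = window.push(f"features@{second}s")  # placeholder for real audio/visual features
    if batch is not None:
        # A real system would call the model on `batch` here.
        outputs.append((batch[0], batch[-1]))
```

After the first ten seconds, each new second yields a fresh ten-second window, so predictions can be refreshed once per second at the cost of running the model ten times as often.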

Request for information on mapping h5 content to video ids

Hi,

The labels in the h5 file are numerical, while Annotations.txt contains text only. Can you please let me know the mapping scheme? For example, is Churchbell=label0 and Mandolin=label28, or is it mapped some other way?

Thanks!
