ave-eccv18's Issues

ValueError: could not broadcast input array from shape (10,6,4,512) into shape (10,128)

Hi, thanks for your great work.
When generating audio embeddings with the provided code, the audio feature array is allocated as np.zeros([len_data, 10, 128]), but the output of the VGGish network (i.e., the shape of embedding_tensor) is (10, 6, 4, 512).

For the input audio, I converted the mp4 file to .wav; the input_batch shape is (10, 96, 64).

Could you help me run the script correctly to generate results for my own video?

Questions about the ave dataset

Hi @YapengTian ,
Thanks for your great work first!

I wonder whether the number in file train_order.h5 is the row index of the file Annotations.txt, which can be used to get the corresponding training data and labels.
Thanks for your reply~

The AVEDataset only contains 4097 videos

Thank you for sharing your perfect work.
I downloaded the AVEDataset from the provided link. After unzipping, the AVE folder contains only 4097 videos.
I would really appreciate it if you could check this again.

list index out of range error in feature_extractor

I use imageio to read the video in feature_extractor. At extract_frame.append(imgs[n]), a list index out of range error appears. In principle, 160 frames should be extracted from vid (16 frames per second, 10 seconds in total), but in fact some videos yield 301 frames and some yield 251 frames, so extract_frame.shape is (1, 224, 224, 3) while imgs.shape is (301, 224, 224, 3). How can I solve this?
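One way to avoid this IndexError is to derive the sampling indices from the number of frames actually decoded, instead of assuming a fixed frame count. The following is a minimal sketch, assuming the goal is 16 evenly spaced frames for each of 10 equal segments; `sample_frame_indices` and its parameters are hypothetical names and are not part of feature_extractor.

```python
def sample_frame_indices(num_frames, num_segments=10, frames_per_segment=16):
    """Return num_segments lists of frame indices, each index in [0, num_frames)."""
    if num_frames <= 0:
        raise ValueError("video has no decoded frames")
    # Segment length in frames; may be fractional (e.g. 30.1 for 301 frames).
    seg_len = num_frames / num_segments
    step = seg_len / frames_per_segment
    indices = []
    for s in range(num_segments):
        start = s * seg_len
        # Clamp to the last valid frame so short or odd-length videos never overflow.
        seg = [min(int(start + k * step), num_frames - 1)
               for k in range(frames_per_segment)]
        indices.append(seg)
    return indices
```

With indices computed this way, a lookup such as `imgs[n]` stays in range whether the video decodes to 251 or 301 frames; very short videos simply repeat some frames.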

Request for an extra noisy visual feature package

I have downloaded the audio and visual feature package, but I can't find visual_feature_noisy.h5, which is used in your script for weakly supervised training. How can I produce it myself? Thanks.

About the number of video files

Hi, when I downloaded the dataset from your shared link, I found only 4097 video files. Could you tell me whether this is correct? The paper mentions that the dataset has over 4143 files. Thank you very much.

How to detach the audio file from the original mp4?

I have downloaded the feature_extractor package, but the audio extractor.py takes a "path of wav file" argument. I don't know how to separate the audio track from the video. Please give me some instructions, thanks.
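A common way to do this is the ffmpeg command-line tool; that choice is an assumption here, since the thread does not say what the author used. The sketch below builds (and optionally runs) an ffmpeg command that drops the video stream and writes mono 16-bit PCM audio. The 16 kHz sample rate and single channel are assumptions, and `build_ffmpeg_cmd` and `extract_wav` are hypothetical helper names.

```python
import subprocess

def build_ffmpeg_cmd(mp4_path, wav_path, sample_rate=16000):
    """Build an ffmpeg argument list that extracts the audio track as a WAV file."""
    return [
        "ffmpeg", "-y",            # overwrite the output file if it exists
        "-i", mp4_path,            # input video
        "-vn",                     # ignore the video stream
        "-acodec", "pcm_s16le",    # 16-bit little-endian PCM
        "-ar", str(sample_rate),   # output sample rate (assumed value)
        "-ac", "1",                # mono (assumed)
        wav_path,
    ]

def extract_wav(mp4_path, wav_path, sample_rate=16000):
    """Run ffmpeg; requires the ffmpeg binary to be installed and on PATH."""
    subprocess.run(build_ffmpeg_cmd(mp4_path, wav_path, sample_rate), check=True)
```

Calling `extract_wav("video.mp4", "video.wav")` would then produce a file that can be passed as the wav path argument.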

Problem generating audio-guided visual attention maps

Hello, thank you for uploading the source code here. I tried to generate audio-guided visual attention maps by running attention_visualization.py. I used the provided dataset and features, but an error comes up on line 116: "extract_frames[cc, :, :, :] = imgs[n] IndexError: list index out of range". I am still working on this problem; please let me know if you have any ideas.

Question about dataloader and visual_feature_extractor

Hi,
Recently I have been trying to use visual_feature_extractor.py to extract AVE dataset features with ResNet-152, but the visual features I got have shape [4097, 10, 7, 7, 512], which differs from the shape of the visual features you provided ([4143, 10, 7, 7, 512]) and thus causes an IndexError: index 4118 is out of bounds for axis 0 with size 4097.
The AVE dataset has 4097 videos, and one video might have multiple labels according to your replies in other issues, while in visual_feature_extractor.py dim 0 is defined directly by the length of the video dataset (lines 31-35, line 72). I wonder whether some other operation, not included in the current code, is applied to the features extracted by visual_feature_extractor.py to change their shape from [4097, 10, 7, 7, 512] to [4143, 10, 7, 7, 512].
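If the 4143 samples correspond to annotation rows rather than unique files (so a video with several annotated events appears more than once), per-video features could be expanded with an index map. This is only a sketch under that assumption, not the author's confirmed procedure; `build_sample_index` and the example IDs are hypothetical.

```python
def build_sample_index(annotation_video_ids, extracted_video_ids):
    """Map each annotation row to the row of its video in the extracted features."""
    row_of_video = {vid: i for i, vid in enumerate(extracted_video_ids)}
    return [row_of_video[vid] for vid in annotation_video_ids]

# Toy example: 3 unique videos, 4 annotation rows ("vidB" is annotated twice).
extracted = ["vidA", "vidB", "vidC"]
annotations = ["vidA", "vidB", "vidB", "vidC"]
sample_index = build_sample_index(annotations, extracted)
```

Indexing a NumPy feature array of shape [4097, 10, 7, 7, 512] with such a list of length 4143 would yield an array of shape [4143, 10, 7, 7, 512], with rows repeated for videos that carry more than one annotation.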

Only 4097 videos in Dataset?

Hi,

Thanks for the dataset and the code. The Drive link you shared for the AVE Dataset contains only 4097 mp4 files, whereas the paper mentions that 4143 videos are present. Could you please help me find the other 46 videos?

Thanks

File visual_feature_vec.h5

Hello, thanks for your great work. I ran into an issue here:

with h5py.File('data/visual_feature_vec.h5', 'r') as hf:
    video_features = hf['avadataset'][:]

Is this visual_feature_vec.h5 just the same as visual_feature.h5?

Thanks.

Humble request: error when running attention_visualization.py

Thank you for the fantastic research.
I used PyTorch 1.7.1 because I have to match my CUDA version. I got the error below; could you give me some tips on which part of the code I should modify?

The dataset contains 4143 samples
402
Traceback (most recent call last):
  File "attention_visualization.py", line 90, in <module>
    h_x = att_model(audio_inputs, video_inputs)
  File "/home/jiang/anaconda3/envs/hw/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/jiang/桌面/AVE-ECCV18-master/models.py", line 68, in forward
    self.lstm_video.flatten_parameters()
  File "/home/jiang/anaconda3/envs/hw/lib/python3.7/site-packages/torch/nn/modules/rnn.py", line 117, in flatten_parameters
    if len(self._flat_weights) != len(self._flat_weights_names):
  File "/home/jiang/anaconda3/envs/hw/lib/python3.7/site-packages/torch/nn/modules/module.py", line 779, in __getattr__
    type(self).__name__, name))
torch.nn.modules.module.ModuleAttributeError: 'LSTM' object has no attribute '_flat_weights'

Failed to download audio_feature.h5

Does anyone have a Baidu disk or Thunder link for audio_feature.h5? I can only download it with the Google browser, and because the file is so large, the download fails every time.

RuntimeError: cudnn RNN backward can only be called in training mode

Hi,

Thanks for your great work.
However, I ran into a problem when running this code.
I followed the instructions, put the required files in the /data folder, and ran the training command.

My environment: PyTorch 0.3.1, CUDA 9.0, cuDNN 7.1.2.
Could you help me run the script correctly?

➜  ave_code git:(master) ✗ source activate pytorch0.3.1
(pytorch0.3.1) ➜  ave_code git:(master) ✗ python weak_supervised_main.py --train
/home/wuyu07/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
3517
=== Epoch {0}   Loss: {0.7096}  Running time: {2.684252}
0.06890547263681591
Traceback (most recent call last):
  File "weak_supervised_main.py", line 171, in <module>
    train(args)
  File "weak_supervised_main.py", line 93, in train
    loss.backward()
  File "/home/wuyu07/anaconda3/lib/python3.6/site-packages/torch/tensor.py", line 102, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/wuyu07/anaconda3/lib/python3.6/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: cudnn RNN backward can only be called in training mode

I can't reproduce the results in your paper. How can I reproduce them? My configuration is as follows

I used the visual_feature.h5 and audio_feature.h5 that you provided. My test result for AV_att is 61.5, versus 72.7 in your paper.

I use nb_epoch = 500
pytorch version is 1.0.1
The operating system version is Ubuntu 18.04.4 LTS (GNU/Linux 5.3.0-61-generic x86_64)
Tensorflow version is 1.15.1
cuda version is 10.0

The PyTorch 0.3.0 version you specified is too old, and my computer does not support it.
I hope you can help me work out why the A+V-att result of 72.7 reported in your paper cannot be reproduced.
Thank you

How to interpret the heatmap?

Hello, thank you for the paper you recommended. I still have doubts about your project, and after thinking about it I would like to raise them with you. I used your model and code. The results seem to show that the detection has nothing to do with sound: faces are detected even when I am not speaking, and even when I have my back to the camera. The heatmap always highlights faces. Have you seen this kind of situation? What is the difference between the heatmaps for speaking and for not speaking? How should I interpret them?

file labels_closs.h5

Regarding the file labels_closs.h5 that is required in the Cross-modality localization experiments:

with h5py.File('data/labels_closs.h5', 'r') as hf:
    closs_labels = hf['avadataset'][:]

Can we download it?

Loss Calculation: MultilabelSoftMarginLoss on top of Softmax Output?

Hi,

Thanks for your work. I have a question about how you calculate the loss during training. The nn.MultilabelSoftMarginLoss operates on your softmax output. According to the PyTorch source code, nn.MultilabelSoftMarginLoss performs BCELoss over the 2nd dimension (https://github.com/pytorch/pytorch/blob/f2af07d7f6ca775baf3abfb058cc5bb4aac7cc3a/torch/nn/functional.py#L2733). In this case, if your target is of shape (N, 10, 29), you will be applying a sigmoid operation (from the BCE) to your original softmax output (https://github.com/YapengTian/AVE-ECCV18/blob/master/models.py#L79), and summing the loss across the sequence length (10) instead of over the batch dimension.

My questions are:

  1. Does applying a sigmoid (implicitly through the nn.MultilabelSoftMarginLoss) on a softmax output have some theoretical foundation? The softmax output would never accumulate a negative value and hence, your sigmoid output can never go below 0.5.

  2. The use of nn.MultilabelSoftMarginLoss, on (N, 10, 29) output will average the loss over the sequence (10) dimension and not the batch dimension (N).
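The first point can be checked numerically. The sketch below is a plain-Python re-implementation for illustration only, not the repository's code: it applies a sigmoid to softmax probabilities and evaluates the per-sample formula documented for `nn.MultiLabelSoftMarginLoss` (mean over classes of binary cross-entropy on `sigmoid(input)`). The logits and helper names are made up for the example.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def multilabel_soft_margin(inputs, targets):
    """Per-sample loss: mean over classes of BCE applied to sigmoid(input)."""
    terms = [
        -(y * math.log(sigmoid(x)) + (1 - y) * math.log(1.0 - sigmoid(x)))
        for x, y in zip(inputs, targets)
    ]
    return sum(terms) / len(terms)

logits = [4.0, -2.0, 0.5, -6.0]            # arbitrary example values
probs = softmax(logits)                    # each value lies in [0, 1]
squashed = [sigmoid(p) for p in probs]     # each value lies in [0.5, sigmoid(1)]

# Loss when an ideal one-hot "softmax output" is scored against its own target.
perfect = multilabel_soft_margin([1.0, 0.0, 0.0, 0.0], [1, 0, 0, 0])
```

Because the inputs are confined to [0, 1], the sigmoid values stay between 0.5 and about 0.731, and even the ideal one-hot prediction above gives a loss of roughly 0.60 rather than a value near zero.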

I am not sure whether I am missing something obvious that requires a deeper understanding of the underlying mechanics of PyTorch. Could you please explain the two points above?

Thanks a lot for your help!

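How to obtain 4143 sets of data features instead of 4097

I want to switch to a different network to extract the visual and audio features from the videos, but while processing the videos I found there are only 4097. I don't quite understand how visual_feature.h5 contains 4143 visual samples (the dataset notes mention that some videos contain different audio-visual events, so there are not 4143 videos). Could you tell me how this was handled? I hope to hear from you, thanks!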

How to get the reported results?

Hi,
Thanks for your great work and happy new year!
While I am running your code, I can get the reported results using the parameters you uploaded. But when I run your training code to get the new parameters and use them to get the results, I cannot get the reported results. So could you please tell me how to train your network?
Best,
Amose.

2 instruments in the video with 1 label

Hi, I found that the "-PE8geTWt-g.mp4" file contains 2 instruments (guitar and mandolin), but the annotation file gives only the "mandolin" label.

Mail box request check

Hi, I have sent an email to your mailbox at the University of Rochester. Please check it. Thanks a lot.

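Annotation file for weak supervision

I would like to substitute my own dataset in the weakly supervised AVE task. Could you provide the annotation file for weak supervision? Thank you!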
