co-separation's People

Contributors

dgynick, rhgao


co-separation's Issues

Evaluation set

To calculate SDR, SIR, and SAR, did you consider only two videos, or the whole test set? evaluateSeparation.py only takes two audio files; it does not evaluate the whole test set.
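
A minimal sketch, not the repository's evaluateSeparation.py, of how one might aggregate BSS-Eval SDR/SIR/SAR over a whole test set of two-source mixtures. It assumes mir_eval and librosa are available and that each pair's ground-truth and estimated separations are stored as mono wav files at hypothetical paths:

# Illustrative only: average SDR/SIR/SAR over many two-source mixtures.
import numpy as np
import librosa
from mir_eval.separation import bss_eval_sources

def load_sources(paths, sr=11025):
    # Load each wav at a common sample rate and truncate to the shortest.
    wavs = [librosa.load(p, sr=sr)[0] for p in paths]
    n = min(len(w) for w in wavs)
    return np.stack([w[:n] for w in wavs])

def evaluate_pair(gt_paths, est_paths, sr=11025):
    refs = load_sources(gt_paths, sr)
    ests = load_sources(est_paths, sr)
    n = min(refs.shape[1], ests.shape[1])
    sdr, sir, sar, _ = bss_eval_sources(refs[:, :n], ests[:, :n])
    return sdr, sir, sar

def evaluate_testset(pairs):
    # pairs: list of (gt_paths, est_paths) tuples, one per mixture.
    all_sdr, all_sir, all_sar = [], [], []
    for gt, est in pairs:
        sdr, sir, sar = evaluate_pair(gt, est)
        all_sdr.extend(sdr); all_sir.extend(sir); all_sar.extend(sar)
    return np.mean(all_sdr), np.mean(all_sir), np.mean(all_sar)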

About pretrained model

Hi Ruohan,
Was this pre-trained model trained on solo videos or on multi-source videos (duets, ...)?
I have tested this pretrained model, but the multi-source results do not match those reported in the paper.
Could you please share a pre-trained model for the multi-source setting?

Thanks

Originally posted by @manhnguyen1998 in #1 (comment)

About the format of the .npy file of the object detector output

Hi!
I was trying to run test.py on my own images, so I ran the pre-trained detector on my images first and got the results. But I am a little confused: the README.md says each video should have a .npy file storing the object detection results. What format can be loaded successfully?
For example, it seems that the format for one object could be
[[frame_name1, ?, confidence_1, [xmin, ymin, xmax, ymax]],
[frame_name2, ?, confidence_2, [xmin, ymin, xmax, ymax]],
.....]
but how should I define the format if there are several objects?

I am new to this, so it confuses me a lot. Is there anyone who can help? Thanks a lot.
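
For illustration only, here is one way such a per-video .npy might be assembled when several objects are detected. The column layout, and in particular the second (unknown "?") field, is a guess extrapolated from the single-object example above, not a confirmed description of what getDetectionResults.py actually writes:

# Illustrative only: one possible per-video detection file with several objects.
# One row per detected object; two objects in the same frame simply become
# two rows sharing the same frame name. The second column stands in for the
# unknown "?" field above (guessed here to be a class id).
import numpy as np

detections = [
    ['1.png', 4, 0.92, [120, 35, 480, 610]],   # first object in frame 1
    ['1.png', 9, 0.88, [500, 40, 900, 640]],   # second object in the same frame
    ['2.png', 4, 0.90, [118, 33, 475, 608]],   # first object again in frame 2
]
np.save('my_video.npy', np.array(detections, dtype=object))
loaded = np.load('my_video.npy', allow_pickle=True)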

Is the validation set the same as the training set?

Hi

Are the validation examples taken from the training data itself? I could not spot any differences in how you handle training vs. validation data, except for the total number of examples used in each case. Since you just randomly sample a pair of videos in audioVisual_dataset.py, and the data loaders for training and validation use the same file paths, isn't it possible that some training videos are reused for validation?

Thanks for the great work.
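
As an aside, a small sketch (not taken from the repository) of one way to keep validation videos disjoint from training videos: split the list of file paths once, up front, and hand each data loader its own subset.

# Illustrative only: a disjoint train/validation split over video paths.
import random

def split_videos(video_paths, val_fraction=0.1, seed=0):
    paths = sorted(video_paths)
    random.Random(seed).shuffle(paths)       # deterministic shuffle
    n_val = int(len(paths) * val_fraction)
    return paths[n_val:], paths[:n_val]      # (train_paths, val_paths)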

About the instrument image dataset

"The object detector is trained on ∼30k images of 15 object categories from the Open Images dataset."
Where can I find this Open Images subset?
Would you be willing to make this dataset public?

About the config: --scene_path /your_root/hdf5/ADE.h5

Hi!
I want to test the pre-trained model, but I'm confused about the option --scene_path /your_root/hdf5/ADE.h5. Could you please tell me what ADE.h5 is and how I can obtain this file?

Thank you very much!
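
For illustration only, here is how a folder of scene images could be packed into an HDF5 file with h5py. The dataset key 'image', the ADE20K source folder, and the assumption that the file stores image paths rather than pixels are all guesses, not a confirmed description of what the repository expects:

# Illustrative only: build an HDF5 file listing scene-image paths.
import glob
import h5py

image_paths = sorted(glob.glob('/your_root/ADE20K/images/**/*.jpg', recursive=True))
with h5py.File('/your_root/hdf5/ADE.h5', 'w') as f:
    # Store the paths as byte strings under a hypothetical dataset key.
    f.create_dataset('image', data=[p.encode('utf8') for p in image_paths])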

Audio visual speech separation

@rhgao Will it work on videos that contain multiple speakers, isolating them into as many output files as there are speakers in the video?

Class Label indices

Hi @rhgao ,

Sorry for bugging you with another query.

So I was trying to understand how the ground-truth labels are assigned for the object-consistency classifier and had some questions about that. It seems the pretrained Faster R-CNN detector predicts over a space of 16 classes, with the background as the first class (getDetectionResults.py, line 170). When loading these detections in the loader, we first shift the ground-truth labels down by 1, so that they now lie in [-1, ..., 14] rather than [0, ..., 15] (audioVisual_dataset.py, line 144). However, when constructing the label for the additional (background) image, we assign it (self.opt.number_of_classes - 1), i.e. the last class, which is index 15 - 1 = 14 (audioVisual_dataset.py, line 162). Should this not conflict with the 15th object class?
P.S.: I see that opt.number_of_classes is increased by 1 for the additional-image case in train.py (line 253), but this happens after the loaders have been defined (line 219).

Is this a typo or am I missing something? Would really appreciate your inputs on this.
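
To make the arithmetic concrete, here is a worked example of the indexing described above (illustrative only, not repository code):

# Detector output space: 16 classes, with 0 = background and 1..15 = objects.
detector_label = 15                              # the 15th (last) object class
shifted_label = detector_label - 1               # loader shifts by 1 -> 14

number_of_classes = 15                           # before train.py adds 1
background_image_label = number_of_classes - 1   # also 14

# Both end up at index 14, which is the apparent collision described above.
assert shifted_label == background_image_label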

Confusion about Audioset-Instrument data

Dear author,
I have a question about the AudioSet-Instrument data. In "Co-Separating Sounds of Visual Objects", you mention 113,756/456 unbalanced/balanced instrument clips, respectively. But in "Learning to Separate Object Sounds by Watching Unlabeled Video", there are 104k + 2.9k/1k for unbalanced/balanced, respectively. I also tried downloading the balanced AudioSet-Instrument set and got 938 videos in total. Why are they not the same? Is it possible to provide the list of video names?

Select object with higher confidence score

After reading the test.py file, I realized that if a frame contains two objects corresponding to two instruments, the model takes the object with the higher confidence score as input. I wonder whether this affects the loss function. I also read the train.py file, but I don't see the code where you choose the object with the higher confidence score. Can you tell me where it is, whether it really works the way I think it does, or whether I am misunderstanding?

Thank you so much !!!
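
For reference, a small sketch (not test.py itself) of what keeping only the highest-confidence detection per class could look like, given hypothetical (class_id, confidence, box) tuples:

# Illustrative only: keep the best-scoring detection for each class.
def top_detection_per_class(detections):
    best = {}
    for class_id, conf, box in detections:
        if class_id not in best or conf > best[class_id][0]:
            best[class_id] = (conf, box)
    return best   # {class_id: (confidence, box)}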

Variable number of objects

Hi,
First of all, great project!
I would like to ask about your dataset. It seems to me that each sample returned by AudioVisualMUSICDataset's __getitem__ method can have a variable number of objects (the "visuals" variable here:

data['visuals'] = visuals
).
A question that arises from this is how the data loader will batch these samples together.
I can see in the attached link that you address this by padding the labels with -1 (here:
labels = np.vstack(objects_labels) #labels for each object, -1 denotes padded object
), but I didn't figure out where exactly this padding is done. Moreover, what about the variable-number-of-objects problem mentioned above?

Thank you in advance
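
For illustration, a sketch (not the repository's loader) of one way to batch samples with a variable number of objects: pad each sample's visuals and labels up to a fixed maximum, marking padded objects with the label -1, and stack them in a custom collate function.

# Illustrative only: pad-and-stack batching for a variable number of objects.
import numpy as np
import torch

def pad_sample(visuals, labels, max_objects):
    # visuals: (N, C, H, W) array, labels: (N,) array, with N <= max_objects.
    n, c, h, w = visuals.shape
    padded_visuals = np.zeros((max_objects, c, h, w), dtype=visuals.dtype)
    padded_labels = np.full((max_objects,), -1, dtype=np.int64)  # -1 = padded
    padded_visuals[:n] = visuals
    padded_labels[:n] = labels
    return padded_visuals, padded_labels

def collate(batch, max_objects=4):
    # batch: list of (visuals, labels) pairs with varying numbers of objects.
    vis, lab = zip(*(pad_sample(v, l, max_objects) for v, l in batch))
    return torch.from_numpy(np.stack(vis)), torch.from_numpy(np.stack(lab))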

Evaluate on multi-source setting

Is the SDR in the multi-source setting computed on mixtures of two solos, or on random mixtures (which could contain solos or duets)?
During training in the multi-source setting, is the number of sound sources in each sample known (e.g., pretrain on solos, then train on solos + duets)?

Object Class Label Mismatch

Hi @rhgao ,

Really appreciate your effort in sharing the code! But it seems there might be an issue.

In particular, the pre-trained object detector is trained on these classes: Banjo, Cello, Drum, Guitar, Harp, Harmonica, Oboe, Piano, Saxophone, Trombone, Trumpet, Violin, Flute, Accordion, and Horn,
while the classes of the MUSIC dataset (solo) are: accordion, acoustic guitar, cello, clarinet, erhu, flute, saxophone, trumpet, tuba, violin, and xylophone.

How are we to deal with classes that appear in MUSIC but not in the pre-trained detector, such as erhu, xylophone, etc.?

Any help would be much appreciated.
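
As a quick illustrative check (not repository code), intersecting the two class lists shows which MUSIC (solo) categories the pre-trained detector covers directly under a strict string match:

# Illustrative only: which MUSIC (solo) classes overlap with the detector's.
detector_classes = {'banjo', 'cello', 'drum', 'guitar', 'harp', 'harmonica',
                    'oboe', 'piano', 'saxophone', 'trombone', 'trumpet',
                    'violin', 'flute', 'accordion', 'horn'}
music_solo_classes = {'accordion', 'acoustic guitar', 'cello', 'clarinet',
                      'erhu', 'flute', 'saxophone', 'trumpet', 'tuba',
                      'violin', 'xylophone'}

covered = music_solo_classes & detector_classes
uncovered = music_solo_classes - detector_classes
print(sorted(covered))    # classes the detector may detect directly
print(sorted(uncovered))  # e.g. erhu, xylophone, tuba, clarinet, acoustic guitar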

How to pre-process the music dataset?

Hi,
Thank you very much for sharing the source code.
Following the code and the paper, I understand that each video in the dataset should be processed into 10 s clips, and the Faster R-CNN model should be run to produce a .npy file storing the detection results for the 10 frames of each 10 s clip.
But I am still confused about how to process the audio and how to number the frames (1.png-10.png for each clip, or numbering all the images across all clips, e.g. 1.png ... 500.png ... 1000.png).
About the audio: is it downsampled to 11025 Hz when cutting the 10 s clips?
It would be great if there were a script for pre-processing the dataset.
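
For illustration, here is a sketch of one possible pre-processing pipeline under the assumptions stated in the question (10 s clips, audio resampled to 11025 Hz, 10 frames per clip named 1.png onwards). The paths, naming scheme, and ffmpeg invocation are guesses, not the authors' actual script:

# Illustrative only: cut 10 s clips, resample audio, extract 1 frame per second.
import os
import subprocess

def preprocess(video_path, out_root, clip_len=10, sr=11025):
    name = os.path.splitext(os.path.basename(video_path))[0]
    for start in range(0, 60, clip_len):    # e.g. the first minute of the video
        clip_dir = os.path.join(out_root, f'{name}_{start}')
        os.makedirs(clip_dir, exist_ok=True)
        # Audio: cut a 10 s segment and resample to 11025 Hz mono.
        subprocess.run(['ffmpeg', '-y', '-ss', str(start), '-t', str(clip_len),
                        '-i', video_path, '-ar', str(sr), '-ac', '1',
                        os.path.join(clip_dir, 'audio.wav')], check=True)
        # Frames: 1 fps over the same 10 s segment -> 1.png ... 10.png.
        subprocess.run(['ffmpeg', '-y', '-ss', str(start), '-t', str(clip_len),
                        '-i', video_path, '-vf', 'fps=1',
                        os.path.join(clip_dir, '%d.png')], check=True)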
