pytorch-mfnet's Introduction

Multi-Fiber Networks for Video Recognition

This repository contains the code and trained models of:

Yunpeng Chen, Yannis Kalantidis, Jianshu Li, Shuicheng Yan, Jiashi Feng. "Multi-Fiber Networks for Video Recognition" (PDF).

Implementation

We use MXNet @92053bd for image classification and PyTorch 0.4.0a0@a83c240 for video classification.

Normalization

The inputs are first subtracted by the mean RGB value [ 124, 117, 104 ], and then multiplied by 0.0167.
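
For reference, a minimal sketch of this preprocessing in PyTorch (the constant and function names are illustrative, not taken from this repository):

import torch

MEAN_RGB = torch.tensor([124.0, 117.0, 104.0])  # per-channel RGB mean stated above
SCALE = 0.0167                                   # multiplicative scale stated above

def normalize(clip):
    # clip: float tensor of shape [..., 3, H, W] with RGB values in [0, 255]
    return (clip - MEAN_RGB.view(3, 1, 1)) * SCALE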

Usage

Train a model from scratch:

python train_kinetics.py

Fine-tune with pre-trained model:

python train_ucf101.py

or

python train_hmdb51.py

Evaluate the trained model:

cd test
# the default setting is to test the trained model on UCF-101 (split1)
python evaluate_video.py

Results

Image Recognition (ImageNet-1k)

Single Model, Single Crop Validation Accuracy:

Model                      Params   FLOPs   Top-1    Top-5    MXNet Model
ResNet-18 (reproduced)     11.7 M   1.8 G   71.4 %   90.2 %   GoogleDrive
ResNet-18 (MF embedded)     9.6 M   1.6 G   74.3 %   92.1 %   GoogleDrive
MF-Net (N=16)               5.8 M   861 M   74.6 %   92.0 %   GoogleDrive

Video Recognition (UCF-101, HMDB51, Kinetics)

Model         Params   Target Dataset   Top-1
MF-Net (3D)    8.0 M   Kinetics         72.8 %
MF-Net (3D)    8.0 M   UCF-101          96.0 %*
MF-Net (3D)    8.0 M   HMDB51           74.6 %*

* Accuracy averaged over split1, split2, and split3.

Trained Models

Model         Target Dataset      PyTorch Model
MF-Net (2D)   ImageNet-1k         GoogleDrive
MF-Net (3D)   Kinetics            GoogleDrive
MF-Net (3D)   UCF-101 (split1)    GoogleDrive
MF-Net (3D)   HMDB51 (split1)     GoogleDrive

Other Resources

ImageNet-1k Training/Validation List:

ImageNet-1k category name mapping table:

Kinetics Dataset:

UCF-101 Dataset:

HMDB51 Dataset:

FAQ

Do I need to convert the raw videos to a specific format?

  • Our `dataiter` supports reading from raw videos and can tolerate corrupted videos.

How can I make the training faster?

  • Decoding frames from compressed videos consumes a lot of CPU resources, which is the main speed bottleneck. You can try converting the downloaded videos to another format or reducing their quality (see the ffmpeg commands below and the Python sketch after this list). For example:
# convert to short_edge_length = 360
ffmpeg -y -i ${SRC_VID} -c:v mpeg4 -filter:v "scale=min(iw\,(360*iw)/min(iw\,ih)):-1" -b:v 640k -an ${DST_VID}
# or, convert to short_edge_length = 256
ffmpeg -y -i ${SRC_VID} -c:v mpeg4 -filter:v "scale=min(iw\,(256*iw)/min(iw\,ih)):-1" -b:v 512k -an ${DST_VID}
# or, convert to short_edge_length = 160
ffmpeg -y -i ${SRC_VID} -c:v mpeg4 -filter:v "scale=min(iw\,(160*iw)/min(iw\,ih)):-1" -b:v 240k -an ${DST_VID}
  • Use a machine with a faster CPU.
  • The group convolution may not be well optimized.
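
A minimal sketch of batch re-encoding via Python's subprocess, assuming ffmpeg is on the PATH; src_dir and dst_dir are placeholder paths, and the filter string is the 360-pixel short-edge variant from the commands above:

import subprocess
from pathlib import Path

def convert(src, dst, short_edge=360, bitrate="640k"):
    # Same scale filter as above: resize so the short edge becomes `short_edge` pixels.
    vf = "scale=min(iw\\,({s}*iw)/min(iw\\,ih)):-1".format(s=short_edge)
    subprocess.run(["ffmpeg", "-y", "-i", str(src), "-c:v", "mpeg4",
                    "-filter:v", vf, "-b:v", bitrate, "-an", str(dst)], check=True)

src_dir, dst_dir = Path("raw_videos"), Path("converted_videos")  # placeholder paths
dst_dir.mkdir(exist_ok=True)
for vid in src_dir.rglob("*.avi"):
    convert(vid, dst_dir / vid.name)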

Citation

If you use our code or models in your work, or find them helpful, please cite the paper:

@inproceedings{chen2018multifiber,
  title={Multi-Fiber Networks for Video Recognition},
  author={Chen, Yunpeng and Kalantidis, Yannis and Li, Jianshu and Yan, Shuicheng and Feng, Jiashi},
  booktitle={European Conference on Computer Vision (ECCV)},
  year={2018}
}

pytorch-mfnet's People

Contributors

cypw

pytorch-mfnet's Issues

When using the train_ucf101.py file for model training, the model evaluation loop is never entered.

Traceback (most recent call last):
File "train_ucf101.py", line 146, in
train_model(sym_net=net, **kwargs)
File "/PyTorch-MFNet/train_model.py", line 117, in train_model
epoch_end=end_epoch,)
File "/PyTorch-MFNet/train/model.py", line 328, in fit
self.callback_kwargs['sample_elapse'] = sum_sample_elapse / sum_sample_inst
ZeroDivisionError: float division by zero

The error output is shown above. When I use the train_ucf101.py file for model training, the evaluation loop in /train/model.py is never entered, so the sum_sample_inst variable stays zero and the subsequent division raises an error. I checked the loop body and the dataset-reading code and found that all variables have corresponding values. How can I modify the code to make this error disappear?

2D Model

Hi, where can I find the 2D model? I don't mean the weights, I mean the code.
I only see the mfnet_3d.py file...

Thanks.

Why is `softmax` not applied after the last layer for classification?

I didn't find `softmax` applied after the last layer (after `classifier` in mfnet_3d.py) or in model.fit (in model.py), yet you still use `CrossEntropyLoss`. However, in evaluate_video_ucf101_split1.py I found that `softmax` is used. Did I overlook anything? If you really did not use `softmax` during training, why did you do so? Thanks!
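
For context, PyTorch's nn.CrossEntropyLoss applies log-softmax internally, so training on raw logits and applying softmax only at evaluation time to obtain normalized scores is the usual pattern. A minimal self-contained illustration (not code from this repository):

import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.randn(4, 101)              # raw classifier outputs: 4 clips, 101 classes
target = torch.randint(0, 101, (4,))

# Training: CrossEntropyLoss = log-softmax + negative log-likelihood, so no softmax layer is needed.
loss = nn.CrossEntropyLoss()(logits, target)
loss_manual = F.nll_loss(F.log_softmax(logits, dim=1), target)
assert torch.allclose(loss, loss_manual)

# Evaluation: softmax turns logits into probabilities before per-clip scores are accumulated.
probs = F.softmax(logits, dim=1)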

Higher accuracy

What an outstanding job. I think you could reach even higher accuracy by using the (4 corner crops + 1 center crop) × 2 flips transform at test time, as used in TSN (see the sketch below).
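
A minimal sketch of the (4 corner crops + 1 center crop) × 2 flips test-time transform the comment refers to, assuming the clip is a [C, T, H, W] tensor with spatial size at least 224 (illustrative code, not from this repository):

import torch

def ten_crop(clip, size=224):
    # clip: [C, T, H, W]; returns [10, C, T, size, size]
    _, _, h, w = clip.shape
    i, j = (h - size) // 2, (w - size) // 2
    crops = torch.stack([
        clip[..., :size, :size],            # top-left
        clip[..., :size, w - size:],        # top-right
        clip[..., h - size:, :size],        # bottom-left
        clip[..., h - size:, w - size:],    # bottom-right
        clip[..., i:i + size, j:j + size],  # center
    ])
    flips = torch.flip(crops, dims=[-1])    # horizontal flips of the five crops
    return torch.cat([crops, flips])        # scores from the 10 views are then averaged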

Training tricks

Do you have any training tricks for UCF-101? When I train MFNet on UCF-101, it overfits easily. Is pre-training on a large dataset like Kinetics necessary, and what are your key tricks for improving accuracy?

VideoIter:: >> frame [] is error & backup is inavailable

VideoIter:: >> frame [] is error & backup is inavailable. [./dataset/UCF101/raw/data/CricketBowling/v_CricketBowling_g22_c07.avi]'
2019-03-09 22:30:13: >> I/O error(None): None
2019-03-09 22:30:13: VideoIter:: ERROR!! (Force using another index:
3279)
VideoIter:: >> frame [] is error & backup is inavailable. [./dataset/UCF101/raw/data/TableTennisShot/v_TableTennisShot_g11_c05.avi]'
2019-03-09 22:30:13: VideoIter:: ERROR!! (Force using another index:
2801)
VideoIter:: >> frame [] is error & backup is inavailable. [./dataset/UCF101/raw/data/ApplyEyeMakeup/v_ApplyEyeMakeup_g08_c05.avi]'

Why do I get these errors?

Testing on Kinetics takes a long time?

These are the main configs:

"batch_size": 8,
"clip_length": 16,
"dataset": "Kinetics",
"debug_mode": true,
"frame_interval": 2,
"model_prefix": "././../exps/models/MFNet3D_Kinetics-400_72.8.pth",
"network": "mfnet_3d",
"task_name": "./../exps/models/MFNet3D_Kinetics-400_72.8.pth"

I use 4 Tesla P40 GPUs and 48 CPU cores to test the model on the Kinetics val set (19,761 videos) with 10 rounds.
It takes about 27 hours (10 rounds).
Is this a normal speed?
Are there any tricks to speed up the dataloader?

Where to start from a beginner's point of view?

Hi
I am a beginner in the ML field. So far I have just created my own models in Jupyter notebooks and run them, but using a baseline model is new for me, and I don't know where to start.
I understand I need to compile the models first, but could someone lay out some beginner steps for getting started with this code?

Video level accuracy - small question

Hi, I'm trying to understand your video level accuracy calculations.
Generally, and please correct me if I'm wrong, you take a video, sample N clips, accumulate the outputs, and then take the top-1, right? Let's say we have 2 classes and the results are:
0.3 , 0.7
1 , 0
0.4, 0.6
Will the predicted label be 1?
I'm also trying to understand your calculations: I couldn't work out why you multiply one part by 0.92 and another part by 0.08 in the test file.
I'll be happy for a short explanation.

Thanks!
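
For the plain-accumulation reading of the question above, a quick self-contained check (this ignores the 0.92/0.08 weighting in the test script, which is a separate detail):

import torch

clip_scores = torch.tensor([[0.3, 0.7],
                            [1.0, 0.0],
                            [0.4, 0.6]])   # N clips x 2 classes, from the example above
video_score = clip_scores.sum(dim=0)       # tensor([1.7, 1.3])
pred = video_score.argmax().item()         # 0 under plain summation of clip scores
print(video_score, pred)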

Getting lower top-1 accuracy on UCF-101

Model: the MF-Net (3D) model (split1) provided on Google Drive
Data: UCF101 split1 testlist01.txt
Code: python evaluate_video_ucf101_split1.py --task-name ./../exps/models/MFNet3D_UCF-101_Split-1_96.3.pth
I tested the model on that data, but got a lower top-1 accuracy:
[
[
"top1",
0.9333862014274386
]
],
[
[
"top5",
0.9949775310600053
]

The top-1 accuracy is 0.933, much lower than the average accuracy of 0.96 reported in the paper.
What crop method is used in the paper?
Is there something I missed that causes the lower accuracy?

Initial weights

Hi.

I see that the code requires initialisation weights from kinetics (vY5_866M_Kinetics_v50_fm16-it123_ep-0019.pth).

Where can we find it?

Thanks

Fine tune Kinetics on HMDB51 split2 - Can't achieve same results

Hi, as mentioned in the paper, and in most papers in the action recognition field, the result is averaged over the 3 splits. Because only the split1 fine-tuning was shared, I'm trying to fine-tune the other splits myself, but I can't get the same results: the training is very "noisy". I'm using the uploaded Kinetics pre-trained model. Is there anything else I need to pay attention to?
(By the way, in the annotation files, does "others" mean the validation set?)
Thanks!

Why do you need loops in testing?

Hi, my friend!
Your code is well written. I'm a novice in action recognition.
I want to ask why 'for i_round in range(total_round):' is set in the test.py file.
Isn't it enough to test once?
Although I don't understand why, I tested it with this loop and found that the final accuracy was affected by the number of rounds.
This makes me even more confused: for a fixed set of weights, how can the final accuracy be different?
Is it influenced by 'duplication = 0.92 * duplication + 0.08 * avg_score[video_subpath_i][3]'?
I'm ashamed to say that I didn't understand this line of code.

Question about counting the parameter number and FLOPs

I counted the parameter number and FLOPs of MFNet. The parameter number is computed by the code

from network.mfnet_3d import MFNET_3D  # model definition lives in network/mfnet_3d.py

model = MFNET_3D(num_classes=101)
params = sum(p.numel() for p in model.parameters())

which outputs 7996368, the same result as shown in the paper: 8.0 M.

But the FLOPs I got are different from the result in the paper. Could you show me the code you used to compute the FLOPs, especially for the nn.Conv3d layers? I think I made some mistake when computing the FLOPs of the nn.Conv3d layers.
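
For reference, a common convention counts multiply-accumulate operations of an nn.Conv3d layer as output_elements × (in_channels / groups) × kernel_volume; some papers double this to count multiplies and adds separately, which by itself can explain a 2× gap. A hedged sketch using forward hooks (the MFNET_3D import path follows this repository's layout and is an assumption):

import torch
import torch.nn as nn

total_macs = 0

def count_conv3d(module, inputs, output):
    # MACs = output elements x (input channels per group) x kernel volume
    global total_macs
    kt, kh, kw = module.kernel_size
    total_macs += output.numel() * (module.in_channels // module.groups) * kt * kh * kw

# from network.mfnet_3d import MFNET_3D    # assumed import path
# model = MFNET_3D(num_classes=101)
# hooks = [m.register_forward_hook(count_conv3d)
#          for m in model.modules() if isinstance(m, nn.Conv3d)]
# with torch.no_grad():
#     model(torch.randn(1, 3, 16, 224, 224))
# print("Conv3d MACs: %.2f G" % (total_macs / 1e9))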

Augmentations on GPU

Hi, great code!
I have noticed that GPU usage is a bit low (around 40%) and am trying to optimize.
HLSTransform in particular is very CPU intensive.
Are you aware of any way to have it executed on the GPU instead of the CPU?
Do you think it could help?
Thanks

Results on the Kinetics dataset

Hi.

I see in the paper that the accuracy on kinetics dataset is 72.8%.
As seen in the results table in the paper (image omitted).

But in the training graph (image omitted), it seems that the results are presented on the training set and not the validation set.

So I wanted to know whether I misunderstood something, and whether the aforementioned result is on the training set or the validation set. If you reported the accuracy on the training set, what accuracy did you get on the validation set?

Thanks in advance.

Some questions and a suggestion

Hi Yunpeng, just read your paper and have a couple of quick questions:

  • Why do you implement batchnorm + relu before the convolution in the BN_AC_CONV3D class here?

  • In your paper you say: we set the number of the first-layer output channels to be 4 times smaller than its input channels, ...
    Hence, shouldn't this line be written as shown below?

# current version: num_ix = int(num_mid/4)
num_ix = int(num_in/4)

I believe self.conv_i1 and self.conv_i2 are the layers for the multiplexer. Or am I getting it wrong?

  • This is a theoretical question: is the role of multiplexer to rearrange information across channels so as to minimize information loss for each fiber?

Lastly, a small suggestion: since you are using PyTorch v0.4, you need not use Variable anymore. Hence you can write this line as:

data = torch.randn(1,3,16,224,224)

Thank you.

net.load_checkpoint(epoch=args.load_epoch) fails?

While executing python evaluate_video_ucf101_split1.py, I get:

File "evaluate_video_ucf101_split1.py", line 107, in
net.load_checkpoint(epoch=args.load_epoch)
File "../train/model.py", line 62, in load_checkpoint
assert os.path.exists(load_path), "Failed to load: {} (file not exist)".format(load_path)
AssertionError: Failed to load: ./../exps/<your_tesk_name>_ep-0000.pth (file not exist)

Kindly help

Question

How to solve this problem:
File "MFNet/train/metric.py", line122, in update
self.num_inst += loss.shape[0]
IndexError: tuple index out of range
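
For context, with the default reduction recent PyTorch returns the loss as a 0-dim tensor, so loss.shape is empty and indexing it raises exactly this IndexError. A self-contained illustration with a guard (adding an equivalent check to the metric update is an assumption on my part, not the repository's fix):

import torch
import torch.nn as nn

logits = torch.randn(4, 101)
target = torch.randint(0, 101, (4,))
loss = nn.CrossEntropyLoss()(logits, target)

print(loss.shape)                                # torch.Size([]) -- a 0-dim tensor
batch = loss.shape[0] if loss.dim() > 0 else 1   # guard that avoids the IndexError
print(batch)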

High GPU memory usage

I've noticed higher memory usage for MFNet compared to ResNet in image processing.

Settings:
Framework: pytorch
resnet model: resnet18 from torchvision
mfnet model: modification of /network/mfnet_3d.py for 2d processing
input size: 128x3x224x224
with gradient computation enabled.

Results:
GPU memory consumption observed: 8GB for mfnet vs 3.4GB for resnet

The number of params matches the paper for each model, but the actual memory consumption of MFNet doesn't reflect the reduced FLOPs. Have you observed the same behaviour, or could this be caused by my 2D conversion of the model? I'm confident that the modifications I made follow the description of the 2D architecture in Table 2, and they shouldn't be tricky. Any ideas?
(By the way I still appreciate if you could release an official 2D version of MFNet, even though that's not the main point of your work.)
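
As a side note, parameter count and FLOPs say little about activation memory, which usually dominates during training with gradients enabled. Measuring peak allocation directly makes the comparison concrete; a minimal sketch (torchvision's resnet18 stands in for one side of the comparison, and the 2D MFNet variant can be swapped in for the other):

import torch
from torchvision.models import resnet18

model = resnet18().cuda()
x = torch.randn(128, 3, 224, 224, device="cuda")    # the input size from the settings above

out = model(x)
out.sum().backward()                                 # gradients enabled, as in the comparison
torch.cuda.synchronize()
print("peak allocated: %.2f GiB" % (torch.cuda.max_memory_allocated() / 1024 ** 3))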

Out of memory

@cypw
Hello!
Out-of-memory errors occur during training and testing on the UCF-101 data. I only have 32 GB of RAM. Is this a normal phenomenon? How much memory is needed to train the network properly?
Thank you very much!

Where does MFNet output the features of input videos?

I executed your pre-trained model and tested it on some videos of HMDB51. I want to get the output features of the videos I feed into it. Where can I get them (the resulting feature vector, output, or anything like that)?

2019-03-15 15:54:10 INFO: VideoIter:: found 32 videos in `../dataset/HMDB51/raw/list_cvt/testlist01.txt'
2019-03-15 15:54:10 INFO: VideoIter:: iterator initialized (phase: 'test', num: 32)
2019-03-15 15:54:10 INFO: round #0/5
2019-03-15 15:54:24 INFO: 0.0%, 1.0 	| Batch [0,0]    	Avg: loss-ce = 4.13500, top1 = 0.00000, top5 = 0.12500
2019-03-15 15:54:25 INFO: round #1/5
2019-03-15 15:54:35 INFO: 0.0%, 1.5 	| Batch [0,0]    	Avg: loss-ce = 4.67265, top1 = 0.00000, top5 = 0.06250
2019-03-15 15:54:36 INFO: round #2/5
2019-03-15 15:54:46 INFO: 0.0%, 2.5 	| Batch [0,0]    	Avg: loss-ce = 4.67265, top1 = 0.00000, top5 = 0.03125
2019-03-15 15:54:47 INFO: round #3/5
2019-03-15 15:54:57 INFO: 0.0%, 3.4 	| Batch [0,0]    	Avg: loss-ce = 4.67265, top1 = 0.00000, top5 = 0.09375
2019-03-15 15:54:59 INFO: round #4/5
2019-03-15 15:55:07 INFO: 0.0%, 4.4 	| Batch [0,0]    	Avg: loss-ce = 4.67265, top1 = 0.00000, top5 = 0.06250
2019-03-15 15:55:09 INFO: Evaluation Finished!
2019-03-15 15:55:09 INFO: Total time cost: 7.2 sec
2019-03-15 15:55:09 INFO: Speed: 22.3596 samples/sec
2019-03-15 15:55:09 INFO: Accuracy:
2019-03-15 15:55:09 INFO: [
    [
        [
            "loss-ce",
            4.672654986381531
        ]
    ],
    [
        [
            "top1",
            0.0
        ]
    ],
    [
        [
            "top5",
            0.0625
        ]
    ]
]
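
A common way to obtain clip-level features without modifying the network is a forward hook on the layer just before the classification head. A minimal sketch, assuming the loaded model exposes a final module named `classifier` as in mfnet_3d.py (the exact attribute name should be checked against the model definition):

import torch

def extract_features(net, clip):
    # clip: [N, 3, 16, 224, 224]; returns the tensor fed into the final classifier layer
    captured = {}

    def save_input(module, inputs, output):
        captured["feat"] = inputs[0].detach()

    handle = net.classifier.register_forward_hook(save_input)
    with torch.no_grad():
        net(clip)
    handle.remove()
    return captured["feat"]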

Some Questions about the Training Process

Hi Yunpeng, I am new to video recognition tasks. I ran the code and have some questions about the whole procedure.

  1. For training, do you randomly sample 16 frames from the whole video for classification? And might it be a different 16 frames for the same video each time?

  2. When I try to run train_hmdb51, there are many logs like 'frame[30] is error, use backup item XXX.avi'. What does this mean? Does it mean there are errors in my video data? (I downloaded it from the official website.)

  3. It seems that train_hmdb51 does both training and evaluation after each epoch, so why do we need a separate evaluation script like evaluate_video.py for testing?

Thanks a lot for your help!

Load model warning

2018-11-01 12:03:35 WARNING: >> Failed to load: ['module.conv1.bn.num_batches_tracked', 'module.conv2.B01.conv_i1.bn.num_batches_tracked', 'module.conv2.B01.conv_i2.bn.num_batches_tracked', 'module.conv2.B01.conv_m1.bn.num_batches_tracked', 'module.conv2.B01.conv_m2.bn.num_batches_tracked', 'module.conv2.B01.conv_w1.bn.num_batches_tracked', 'module.conv2.B02.conv_i1.bn.num_batches_tracked', 'module.conv2.B02.conv_i2.bn.num_batches_tracked', 'module.conv2.B02.conv_m1.bn.num_batches_tracked', 'module.conv2.B02.conv_m2.bn.num_batches_tracked', 'module.conv2.B03.conv_i1.bn.num_batches_tracked', 'module.conv2.B03.conv_i2.bn.num_batches_tracked', 'module.conv2.B03.conv_m1.bn.num_batches_tracked', 'module.conv2.B03.conv_m2.bn.num_batches_tracked', 'module.conv3.B01.conv_i1.bn.num_batches_tracked', 'module.conv3.B01.conv_i2.bn.num_batches_tracked', 'module.conv3.B01.conv_m1.bn.num_batches_tracked', 'module.conv3.B01.conv_m2.bn.num_batches_tracked', 'module.conv3.B01.conv_w1.bn.num_batches_tracked', 'module.conv3.B02.conv_i1.bn.num_batches_tracked', 'module.conv3.B02.conv_i2.bn.num_batches_tracked', 'module.conv3.B02.conv_m1.bn.num_batches_tracked', 'module.conv3.B02.conv_m2.bn.num_batches_tracked', 'module.conv3.B03.conv_i1.bn.num_batches_tracked', 'module.conv3.B03.conv_i2.bn.num_batches_tracked', 'module.conv3.B03.conv_m1.bn.num_batches_tracked', 'module.conv3.B03.conv_m2.bn.num_batches_tracked', 'module.conv3.B04.conv_i1.bn.num_batches_tracked', 'module.conv3.B04.conv_i2.bn.num_batches_tracked', 'module.conv3.B04.conv_m1.bn.num_batches_tracked', 'module.conv3.B04.conv_m2.bn.num_batches_tracked', 'module.conv4.B01.conv_i1.bn.num_batches_tracked', 'module.conv4.B01.conv_i2.bn.num_batches_tracked', 'module.conv4.B01.conv_m1.bn.num_batches_tracked', 'module.conv4.B01.conv_m2.bn.num_batches_tracked', 'module.conv4.B01.conv_w1.bn.num_batches_tracked', 'module.conv4.B02.conv_i1.bn.num_batches_tracked', 'module.conv4.B02.conv_i2.bn.num_batches_tracked', 'module.conv4.B02.conv_m1.bn.num_batches_tracked', 'module.conv4.B02.conv_m2.bn.num_batches_tracked', 'module.conv4.B03.conv_i1.bn.num_batches_tracked', 'module.conv4.B03.conv_i2.bn.num_batches_tracked', 'module.conv4.B03.conv_m1.bn.num_batches_tracked', 'module.conv4.B03.conv_m2.bn.num_batches_tracked', 'module.conv4.B04.conv_i1.bn.num_batches_tracked', 'module.conv4.B04.conv_i2.bn.num_batches_tracked', 'module.conv4.B04.conv_m1.bn.num_batches_tracked', 'module.conv4.B04.conv_m2.bn.num_batches_tracked', 'module.conv4.B05.conv_i1.bn.num_batches_tracked', 'module.conv4.B05.conv_i2.bn.num_batches_tracked', 'module.conv4.B05.conv_m1.bn.num_batches_tracked', 'module.conv4.B05.conv_m2.bn.num_batches_tracked', 'module.conv4.B06.conv_i1.bn.num_batches_tracked', 'module.conv4.B06.conv_i2.bn.num_batches_tracked', 'module.conv4.B06.conv_m1.bn.num_batches_tracked', 'module.conv4.B06.conv_m2.bn.num_batches_tracked', 'module.conv5.B01.conv_i1.bn.num_batches_tracked', 'module.conv5.B01.conv_i2.bn.num_batches_tracked', 'module.conv5.B01.conv_m1.bn.num_batches_tracked', 'module.conv5.B01.conv_m2.bn.num_batches_tracked', 'module.conv5.B01.conv_w1.bn.num_batches_tracked', 'module.conv5.B02.conv_i1.bn.num_batches_tracked', 'module.conv5.B02.conv_i2.bn.num_batches_tracked', 'module.conv5.B02.conv_m1.bn.num_batches_tracked', 'module.conv5.B02.conv_m2.bn.num_batches_tracked', 'module.conv5.B03.conv_i1.bn.num_batches_tracked', 'module.conv5.B03.conv_i2.bn.num_batches_tracked', 'module.conv5.B03.conv_m1.bn.num_batches_tracked', 
'module.conv5.B03.conv_m2.bn.num_batches_tracked', 'module.tail.bn.num_batches_tracked']
2018-11-01 12:03:35 INFO: Only model state resumed from: ././../exps/models/MFNet3D_UCF-101_Split-1_96.3.pth_ep-0000.pth'
2018-11-01 12:03:35 WARNING: >> Epoch information inconsistant: 30 vs 0
2018-11-01 12:03:35 WARNING: VideoIter:: >> `check_video' is off, `tolerant_corrupted_video' is automatically activated.

Should I ignore these warnings while loading models?
