
YOLObile

This is the implementation of YOLObile: Real-Time Object Detection on Mobile Devices via Compression-Compilation Co-Design using ultralytics/yolov3. Thanks to the original author.

arXiv: https://arxiv.org/abs/2009.05697 (in Proceedings of AAAI 2021)

For those who may be interested in the compiler code (how to deploy it onto Android?): The compiler source code was developed with our collaborators at William & Mary and involves joint IP, so we have no plans to open-source this part at the moment. Sorry for the inconvenience.

For iOS developers: We build and test the compiler only on the Android platform because of its openness. We believe the same techniques can be applied on Apple's iOS platform, but we haven't tested this yet.

Image of YOLObile

Introduction

The rapid development and wide adoption of object detection techniques have drawn attention to both the accuracy and the speed of object detectors. However, current state-of-the-art object detectors are either accuracy-oriented, using a large model at the cost of high latency, or speed-oriented, using a lightweight model at the cost of accuracy. In this work, we propose the YOLObile framework, real-time object detection on mobile devices via compression-compilation co-design. A novel block-punched pruning scheme is proposed that works for any kernel size. To improve computational efficiency on mobile devices, a GPU-CPU collaborative scheme is adopted along with advanced compiler-assisted optimizations. Experimental results indicate that our pruning scheme achieves a 14x compression rate of YOLOv4 with 49.0 mAP. Under our YOLObile framework, we achieve 17 FPS inference speed using the GPU on a Samsung Galaxy S20. By incorporating our proposed GPU-CPU collaborative scheme, the inference speed is increased to 19.1 FPS, a 5x speedup over the original YOLOv4.

Environments

Python 3.7 or later with all packages from pip install -U -r requirements.txt, including torch == 1.4. Docker images come with all dependencies preinstalled. Docker requirements are:

  • Nvidia Driver >= 440.44
  • Docker Engine - CE >= 19.03

Download the COCO dataset (18 GB):

cd ../ && sh YOLObile/data/get_coco2014.sh

By default, the coco data folder lives outside the project root folder:

/Project
/Project/YOLObile (Project root)
/Project/coco (coco data)

Download Model Checkpoints:

Google Drive: Google Drive Download

Baidu Netdisk: Baidu Netdisk Download (code: r3nk)

After downloading, put the weight file under the ./weights folder.

Docker build instructions

1. Install Docker and Nvidia-Docker

Docker images come with all dependencies preinstalled; however, Docker itself requires installation and relies on an Nvidia driver installation in order to interact properly with local GPU resources. The requirements are listed under Environments above.

2. Build the project

# Build and Push
t=YOLObile && sudo docker build -t $t .

3. Run Container

# Pull and Run with local directory access
t=YOLObile && sudo docker run -it --gpus all --ipc=host -v /your/cocodata/path:/usr/src/coco $t bash

4. Run Commands

Once the container is launched and you are inside it, you will have a terminal window in which you can run all regular bash commands, such as:

  • ls .
  • ls ../coco
  • python train.py
  • python test.py
  • python detect.py

Configurations:

Train Options and Model Config:

./cfg/csdarknet53s-panet-spp.cfg (model configuration)
./cfg/darknet_admm.yaml (pruning configuration)
./cfg/darknet_retrain.yaml (retrain configuration)

Weights:

./weights/yolov4dense.pt (dense model)
./weights/best8x-514.pt (pruned model)

Prune Config

./prune_config/config_csdarknet53pan_v*.yaml

Training

The training process includes two steps:

Pruning: python train.py --img-size 320 --batch-size 64 --device 0,1,2,3 --epoch 25 --admm-file darknet_admm --cfg cfg/csdarknet53s-panet-spp.cfg --weights weights/yolov4dense.pt --data data/coco2014.data

The pruning process does NOT support resume.

Masked Retrain: python train.py --img-size 320 --batch-size 64 --device 0,1,2,3 --epoch 280 --admm-file darknet_retrain --cfg cfg/csdarknet53s-panet-spp.cfg --weights weights/yolov4dense.pt --data data/coco2014.data --multi-scale

The masked retrain process supports resume.

You can run the whole process via sh ./runprune.sh

Check model weight parameters & FLOPs:

python check_compression.py
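For reference, the kind of statistic check_compression.py reports can be sketched in a few lines of numpy. The layer dictionary below is a hypothetical stand-in for illustration; the real script reads the layers from the .pt checkpoint:

```python
import numpy as np

# Hypothetical stand-in for a pruned state dict: layer name -> weight array.
# The real checkpoint's layers are loaded from the .pt file instead.
layers = {
    "conv1.weight": np.array([[0.5, 0.0, 0.0, 0.2],
                              [0.0, 0.0, 0.1, 0.0]]),
    "conv2.weight": np.zeros((4, 4)),
}

# Compression rate = total parameters / parameters that survived pruning.
total = sum(w.size for w in layers.values())
nonzero = sum(np.count_nonzero(w) for w in layers.values())
print(f"total={total} nonzero={nonzero} "
      f"compression={total / max(nonzero, 1):.1f}x")
# -> total=24 nonzero=3 compression=8.0x
```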

Test model mAP:

python test.py --img-size 320 --batch-size 64 --device 0 --cfg cfg/csdarknet53s-panet-spp.cfg --weights weights/best8x-514.pt --data data/coco2014.data
               Class    Images   Targets         P         R   mAP@0.5        F1
                 all     5e+03  3.51e+04     0.501     0.544     0.508     0.512
              person     5e+03  1.05e+04     0.643     0.697     0.698     0.669
             bicycle     5e+03       313     0.464     0.409     0.388     0.435
                 car     5e+03  1.64e+03     0.492     0.547     0.503     0.518
          motorcycle     5e+03       388     0.602     0.635     0.623     0.618
            airplane     5e+03       131     0.676     0.786     0.804     0.727
                 bus     5e+03       259      0.67     0.788     0.792     0.724
               train     5e+03       212     0.731     0.797     0.805     0.763
               truck     5e+03       352     0.414     0.526     0.475     0.463
          toothbrush     5e+03        77      0.35     0.301     0.269     0.323
Speed: 3.6/1.4/5.0 ms inference/NMS/total per 320x320 image at batch-size 64

COCO mAP with pycocotools...
loading annotations into memory...
Done (t=3.87s)
creating index...
index created!
Loading and preparing results...
DONE (t=3.74s)
creating index...
index created!
Running per image evaluation...
Evaluate annotation type *bbox*
DONE (t=83.06s).
Accumulating evaluation results...
DONE (t=9.39s).
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.334
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.514
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.350
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.117
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.374
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.519
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.295
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.466
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.504
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.240
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.583
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.727

FPS vs mAP on COCO dataset

Image of YOLObile

Known Issues

  • The accuracy printed during the retraining process is not accurate. Please run test.py separately to check the accuracy. I raised this issue in an old version of the Ultralytics/YOLOv3 repository, and I am not sure whether it has been fixed yet.

  • When you use multi-card training (4 cards or more), the training process may stop after a few hours without printing any errors. I suggest using docker if you train on 4 cards or more. The docker build instructions can be found above.

  • PyTorch 1.5+ might have multi-card issues.

Acknowledgements

https://github.com/ultralytics/yolov3

https://github.com/AlexeyAB/darknet

Contact Me

Github: https://github.com/nightsnack

Email : [email protected]


YOLObile's Issues

train.py

Hi, thank you for providing this efficient pruning method.

I implemented the first of the two pruning steps (pruning & masked retraining). While training with train.py, batch size 4, and 25 epochs, it runs steps 0-24 over and over again. When will it stop by itself, and what does this iteration mean?

Speeds of GPU 8x, 14x, and yolov4dense running on a desktop GPU (RTX2080Ti) are the same

I run:
detect.py --weights weights/best14x-49.pt --img-size 512 --> running time (11 ms on RTX2080Ti)
detect.py --weights weights/best8x-514.pt --img-size 512 --> running time (11 ms on RTX2080Ti)
detect.py --weights weights/yolov4dense.pt --img-size 512 --> running time (11 ms on RTX2080Ti)

But when using check_compression.py, I see that the FLOPs of these weights are indeed reduced.

I just ran pip install -U -r requirements.txt, without a docker build.

Can you explain this problem?

Google Drive link is not working

The Drive link given under model checkpoints is not working. Another link was given there, but Baidu is not supported in my country.

Problem when I tried to train my single-class dataset

I made my own dataset, which has only a single class, and changed csdarknet53s-panet-spp.cfg accordingly. When I tried pruning, the initial weights were not compatible. How can I get compatible initial weights? And I don't know how to train or prune without initial weights. Should I also change the pruning config?
In general, when training my own single-class dataset, what should I pay attention to?
I am new to this, sorry for the dumb question.

Problem when computing COCO mAP

      hair drier     5e+03        11         0         0    0.0385         0
      toothbrush     5e+03        57     0.335     0.386     0.337     0.359

Speed: 43.3/3.9/47.2 ms inference/NMS/total per 320x320 image at batch-size 64

COCO mAP with pycocotools...
WARNING: pycocotools must be installed with numpy==1.17 to run correctly. See cocodataset/cocoapi#356

I have numpy 1.19 installed; should I downgrade to 1.17?

About model size

Hello!

After reading your paper on YOLObile, I have a question. The pruning method in the paper sets part of the weight values to 0 to cut off connections, so the size of the overall weight matrix is unchanged. Shouldn't the storage size become smaller after saving the model? I saved directly with PyTorch's save function and found that the model size doesn't change. This has bothered me for a long time. The paper mentions that a sparse matrix format is used for storage, but I cannot find that part in the code. I hope you can help me, thank you!
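The sparse packing happens in the compiler, which is not open-sourced (see the note at the top of this README), but the idea the paper describes, storing only the nonzero values plus their coordinates, can be sketched with numpy. This is a rough illustration of why the on-disk size shrinks, not the project's actual storage format:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)
w[rng.random(w.shape) < 0.9] = 0.0   # prune ~90% of weights to zero

# Dense storage keeps every entry, zeros included.
dense_bytes = w.nbytes

# Sparse (COO-style) storage keeps only nonzero values + int32 coordinates.
idx = np.argwhere(w != 0).astype(np.int32)
vals = w[w != 0]
sparse_bytes = vals.nbytes + idx.nbytes

print(dense_bytes, sparse_bytes)  # sparse is several times smaller here
```

Saving the raw state dict with torch.save keeps the dense layout, which is why the file size does not change after pruning.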

Problem with running get_coco2014.sh

When I executed the get_coco2014.sh script, the following message appeared, possibly indicating that there is an issue with the file downloaded from Google Drive. It seems the file on Google Drive no longer exists. Is there any way to resolve this?
(screenshot)

Layer size not divisible by block_size

Hello:
While debugging your YOLObile code with the VOC dataset, a layer of shape (75, 256) triggered "the layer size is not divisible". Could you advise how to handle layers that are not divisible by block_size?

What model is csdarknet53s-panet-spp.cfg?

csdarknet53s-panet-spp.cfg differs from the YOLOv4 config. The mAP computed with yolov4dense.pt is 45.7 (python test.py --img-size 320 --batch-size 64 --device 0 --cfg cfg/csdarknet53s-panet-spp.cfg --weights weights/yolov4dense.pt --data data/coco2017.data), while YOLOv4's mAP is only 38.2. So this model is not YOLOv4; what model is it?

Prune ratios

Hi, I have a question about the prune ratios. How did you determine the prune ratios for the different layers, as specified in configuration files such as "config_csdarknet53pan_v2.yaml"? There are so many layers that determining the ratios by trial and error seems infeasible. Sorry for the dumb question, but I am relatively new to this kind of study.

The problem of speed on a TX2

When I run your .cfg and weights on an NVIDIA Jetson TX2, the speed is only 5 FPS, but when I run the YOLOv3-tiny .cfg and .weights, the speed reaches 30 FPS. May I ask the reason? Thanks.

Model files are too large

The four model files produced by the first training step are 256 MB each, and the model file produced by the second step is around 500 MB. Why are they so large? Isn't the model supposed to be deployed on mobile phones? How can model files this large be deployed? I would appreciate a reply from the author, many thanks.

How to keep the total sparsity ratio at a certain number

Thanks for your great work. I have a question: as you mentioned in another issue, the sparsity ratio of each layer is set manually, first with the same ratio for every layer and then adjusted layer by layer. But how do you keep the total sparsity ratio the same during the adjustment?

Question on config_csdarknet53pan_v*.yaml and yolov4dense.pt

Thanks for your great work,

I am wondering about the meaning of the following code in the yaml file, and how you decide which Conv2d layers to prune and the number on each line (e.g. 0.4):

module_list.1.Conv2d.weight:
    0.4

Finally, what is the purpose of yolov4dense.pt (the dense model)? Does it serve as a pre-trained weight? What is the difference between it and a yolov4.pt file trained by another PyTorch YOLOv4 project?
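For what it's worth, each yaml entry maps a named layer to the fraction of its weights to prune, so 0.4 means 40% of that Conv2d layer's weights are zeroed. A rough numpy illustration of what such a ratio does (element-wise magnitude pruning here for simplicity; YOLObile actually prunes in fixed-size blocks):

```python
import numpy as np

def apply_ratio(weight, ratio):
    """Zero out the smallest-magnitude `ratio` fraction of entries.
    Illustration only: YOLObile prunes whole blocks, not single elements,
    but the ratio in the yaml file has the same meaning."""
    k = int(weight.size * ratio)
    if k == 0:
        return weight.copy()
    # k-th smallest absolute value becomes the pruning threshold.
    thresh = np.partition(np.abs(weight).ravel(), k - 1)[k - 1]
    pruned = weight.copy()
    pruned[np.abs(pruned) <= thresh] = 0.0
    return pruned

w = np.arange(1, 11, dtype=np.float32)   # 10 weights: 1..10
p = apply_ratio(w, 0.4)                  # ratio 0.4 -> 4 weights pruned
print(p)  # [0. 0. 0. 0. 5. 6. 7. 8. 9. 10.]
```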
