
MCUNet: Tiny Deep Learning on IoT Devices

This is the official implementation of the MCUNet series.

[demo video]

News

If you are interested in getting updates, please sign up here to get notified!

Overview

Microcontrollers are low-cost, low-power hardware. They are deployed at massive scale and serve a wide range of applications.


But their tight memory budget (around 50,000x smaller than that of GPUs) makes deep learning deployment difficult.


MCUNet is a system-algorithm co-design framework for tiny deep learning on microcontrollers. It consists of TinyNAS and TinyEngine. They are co-designed to fit the tight memory budgets.

With system-algorithm co-design, we can significantly improve the deep learning performance on the same tiny memory budget.


Our TinyEngine inference engine can serve as useful infrastructure for MCU-based AI applications. Compared to existing libraries such as TF-Lite Micro, CMSIS-NN, and MicroTVM, it improves inference speed by 1.5-3x and reduces peak memory by 2.7-4.8x.


Model Zoo

Usage

You can build the pre-trained PyTorch fp32 models or download the int8 quantized models in TF-Lite format.

from mcunet.model_zoo import net_id_list, build_model, download_tflite
print(net_id_list)  # the list of models in the model zoo

# pytorch fp32 model
model, image_size, description = build_model(net_id="mcunet-in3", pretrained=True)  # you can replace net_id with any other option from net_id_list

# download tflite file to tflite_path
tflite_path = download_tflite(net_id="mcunet-in3")
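
For quick testing, here is a minimal sketch of single-image inference with the fp32 PyTorch model. The preprocessing (resize, center crop, ImageNet normalization) and the image path are illustrative assumptions, not necessarily the exact pipeline used in eval_torch.py:

import torch
from PIL import Image
from torchvision import transforms

from mcunet.model_zoo import build_model

model, image_size, description = build_model(net_id="mcunet-in3", pretrained=True)
model.eval()

# standard ImageNet-style preprocessing (assumed)
preprocess = transforms.Compose([
    transforms.Resize(int(image_size * 256 / 224)),
    transforms.CenterCrop(image_size),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = Image.open("path/to/image.jpg").convert("RGB")  # hypothetical path
batch = preprocess(image).unsqueeze(0)  # shape: (1, 3, image_size, image_size)

with torch.no_grad():
    logits = model(batch)
print(int(logits.argmax(dim=1)))  # predicted ImageNet class index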

Evaluate

To evaluate the accuracy of PyTorch fp32 models, run:

python eval_torch.py --net_id mcunet-in2 --dataset {imagenet/vww} --data-dir PATH/TO/DATA/val

To evaluate the accuracy of TF-Lite int8 models, run:

python eval_tflite.py --net_id mcunet-in2 --dataset {imagenet/vww} --data-dir PATH/TO/DATA/val
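
Under the hood, evaluating an int8 TF-Lite model amounts to running the TF-Lite interpreter on quantized inputs. Below is a minimal sketch using the standard tf.lite.Interpreter API; the random placeholder input and the quantization handling are illustrative assumptions, and the actual preprocessing in eval_tflite.py may differ:

import numpy as np
import tensorflow as tf

from mcunet.model_zoo import download_tflite

tflite_path = download_tflite(net_id="mcunet-in2")
interpreter = tf.lite.Interpreter(model_path=tflite_path)
interpreter.allocate_tensors()

inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

# quantize a float image in [0, 1] using the input tensor's scale/zero point
scale, zero_point = inp["quantization"]
image = np.random.rand(*inp["shape"][1:]).astype(np.float32)  # placeholder input
q_image = np.clip(np.round(image / scale + zero_point), -128, 127).astype(np.int8)

interpreter.set_tensor(inp["index"], q_image[None, ...])
interpreter.invoke()
pred = interpreter.get_tensor(out["index"])
print(int(pred.argmax()))  # predicted class index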

Model List

  • Note that all latency, SRAM, and Flash usage numbers are profiled with TinyEngine on STM32F746.
  • Here we only provide the int8 quantized models. int4 quantized models (as shown in the paper) can further push the accuracy-memory trade-off, but they lack general format support.
  • For accuracy (top-1, top-5), we report the results of the fp32/int8 models respectively.

The ImageNet model list:

net_id         | MACs   | #Params | SRAM  | Flash  | Res. | Top-1 (fp32/int8) | Top-5 (fp32/int8)
# baseline models
mbv2-w0.35     | 23.5M  | 0.75M   | 308kB | 862kB  | 144  | 49.7%/49.0%       | 74.6%/73.8%
proxyless-w0.3 | 38.3M  | 0.75M   | 292kB | 892kB  | 176  | 57.0%/56.2%       | 80.2%/79.7%
# mcunet models
mcunet-in0     | 6.4M   | 0.75M   | 266kB | 889kB  | 48   | 41.5%/40.4%       | 66.3%/65.2%
mcunet-in1     | 12.8M  | 0.64M   | 307kB | 992kB  | 96   | 51.5%/49.9%       | 75.5%/74.1%
mcunet-in2     | 67.3M  | 0.73M   | 242kB | 878kB  | 160  | 60.9%/60.3%       | 83.3%/82.6%
mcunet-in3     | 81.8M  | 0.74M   | 293kB | 897kB  | 176  | 62.2%/61.8%       | 84.5%/84.2%
mcunet-in4     | 125.9M | 1.73M   | 456kB | 1876kB | 160  | 68.4%/68.0%       | 88.4%/88.1%

The VWW model list:

Note that the VWW dataset might be hard to prepare. You can download our pre-built minival set from here (around 380MB).

net_id      | MACs  | #Params | SRAM  | Flash | Res. | Top-1 (fp32/int8)
mcunet-vww0 | 6.0M  | 0.37M   | 146kB | 617kB | 64   | 87.4%/87.3%
mcunet-vww1 | 11.6M | 0.43M   | 162kB | 689kB | 80   | 88.9%/88.9%
mcunet-vww2 | 55.8M | 0.64M   | 311kB | 897kB | 144  | 91.7%/91.8%

For TF-Lite int8 models, we do not use quantization-aware training (QAT), so some results are slightly lower than the paper numbers.

Detection Model

We also share the person detection model used in the demo. To visualize the model's prediction on a sample image, please run the following command:

python eval_det.py

It will visualize the prediction here: assets/sample_images/person_det_vis.jpg.

The model takes in a small input resolution of 128x160 to reduce memory usage. It does not achieve state-of-the-art performance due to the limited image and model size but should provide decent performance for tinyML applications (please check the demo for a video recording). We will also release the deployment code in the upcoming TinyEngine release.

Requirement

  • Python 3.6+

  • PyTorch 1.4.0+

  • TensorFlow 1.15 (if you want to test TF-Lite models; CPU support only)

Acknowledgement

We thank the MIT-IBM Watson AI Lab, Intel, Amazon, Sony, Qualcomm, and NSF for supporting this research.

Citation

If you find the project helpful, please consider citing our paper:

@article{lin2020mcunet,
  title={MCUNet: Tiny Deep Learning on IoT Devices},
  author={Lin, Ji and Chen, Wei-Ming and Lin, Yujun and Gan, Chuang and Han, Song},
  journal={Advances in Neural Information Processing Systems},
  volume={33},
  year={2020}
}

@inproceedings{lin2021mcunetv2,
  title={MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning},
  author={Lin, Ji and Chen, Wei-Ming and Cai, Han and Gan, Chuang and Han, Song},
  booktitle={Annual Conference on Neural Information Processing Systems (NeurIPS)},
  year={2021}
}

@article{lin2022ondevice,
  title={On-Device Training Under 256KB Memory},
  author={Lin, Ji and Zhu, Ligeng and Chen, Wei-Ming and Wang, Wei-Chen and Gan, Chuang and Han, Song},
  journal={arXiv:2206.15472 [cs]},
  url={https://arxiv.org/abs/2206.15472},
  year={2022}
}

Related Projects

On-Device Training Under 256KB Memory (NeurIPS'22)

TinyTL: Reduce Memory, Not Parameters for Efficient On-Device Learning (NeurIPS'20)

Once for All: Train One Network and Specialize it for Efficient Deployment (ICLR'20)

ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware (ICLR'19)

AutoML for Architecting Efficient and Specialized Neural Networks (IEEE Micro)

AMC: AutoML for Model Compression and Acceleration on Mobile Devices (ECCV'18)

HAQ: Hardware-Aware Automated Quantization (CVPR'19, oral)

Contributors

lyken17, raymondwang0, songhan, tonylins


Issues

How to train MCUNet from Scratch (and perform search with TinyNAS)

As mentioned in the title, I'm having trouble running the TinyNAS search code and training the searched network. I think all I see in the TinyNAS folder is code for the model structure but not the NAS code (presumably it works in a similar fashion to ProxylessNAS). Any help would be appreciated.

The MCU-side software platform for deployment

Thanks for your great work. I am trying to deploy MCUNet to the STM32F746G-DISCO, so I want to know what software platform you use to generate the runtime code (TF-Micro, TVM), or do you just write it line by line?

Different inference results on my own model using TinyEngine compared to Python

Hi, @meenchen. Thanks for your great work. As the title says, when I implemented my own task in STM32CubeIDE and checked the network inference results, I found that they show some bias compared to the results of running the TFLite model in Python, and the deeper the network, the larger the bias. I would like to ask whether these biases are caused by slight differences between the ops in TinyEngine and the ops in TFLite, or whether you have encountered this problem before? I would appreciate it if you could provide some help. The device I am using is the STM32F746G-DISCO, and my TensorFlow version is 2.11.0.

Code for MCUNet-V3

I got excited by the Tiny Training Engine (TTE) described in the V3 paper. I was wondering if the code for MCUNet-V3 has been released somewhere else or if it is yet to be released.

Thanks

Is it necessary to include CMSIS?

Hi guys, thanks so much for this great project.
The CMSIS-NN library is required in the generated code, but my device (RISC-V) doesn't support it.

Is there any other way to replace these references?

Clarification of ImageNet dataset used

Hi,

I am trying to evaluate the mcunet-320kB model implemented in TFLite using your eval_tflite.py script. I followed the steps provided in this Git issue ( #11 ), but I am still getting a very low accuracy of 0.08%. I downloaded the 2012 version of the ImageNet dataset from this link https://image-net.org/data/ILSVRC/2012/ILSVRC2012_img_val.tar and prepared it using this script https://github.com/pytorch/examples/blob/main/imagenet/extract_ILSVRC.sh . Am I doing something wrong?

Code on training supernetworks

Hello! I am impressed by your excellent work in the field of TinyML and have studied the source code of MCUNet, but I didn't find the code to retrain the supernetwork or the evolutionary algorithm to search for subnetworks. Could you please release this part of the code? It would be very helpful for my next work, and I would greatly appreciate a reply.

MCUNet-v2 analytical PMU computation for MbV2 on ImageNet r224

Hi, thanks for publishing MCUNet-v2, it is very interesting work!

I had a question about how you computed the 172kB per-patch peak SRAM usage for the MbV2 model in Table 1 here: https://arxiv.org/pdf/2110.15352.pdf

If I understood correctly, you are considering 4 patches in X and Y dimensions: a total of 4x4=16 patches of size 75 x 75 x 3. The patches are read from the input 224 x 224 x 3. Then, per-patch execution ends with a patch of size 7 x 7 x 32 which is then written to the final 28 x 28 x 32 tensor.

I have made my calculations (happy to share in a follow-up comment) which give a higher peak memory usage. I suspect the mismatch happened because the full input image may not be included in the numbers reported in the paper. Could you confirm if you include the full image tensor (224x224x3 = 147kB @ int8) in your reported peak memory usage?

You mention the input doesn't have to be fully stored because it can be partially decoded from JPEG. If you assume this in reported numbers, could you share additional memory usage coming from the microcontroller receiving and storing a JPEG-compressed input instead, and then extra latency/MACs required to decode it?

Thanks for your help!
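
As a quick sanity check on the shapes discussed above, the int8 tensor sizes can be computed directly. This is back-of-the-envelope arithmetic over the shapes mentioned in this issue, not the paper's exact accounting:

# int8 tensor sizes (1 byte per element) for the MbV2 patch-based example above
def kib(*shape):
    n = 1
    for d in shape:
        n *= d
    return n / 1024.0  # size in kB at int8

print(f"full input   224x224x3: {kib(224, 224, 3):6.1f} kB")  # ~147.0 kB
print(f"one patch     75x75x3 : {kib(75, 75, 3):6.1f} kB")    # ~16.5 kB
print(f"patch output   7x7x32 : {kib(7, 7, 32):6.1f} kB")     # ~1.5 kB
print(f"full output  28x28x32 : {kib(28, 28, 32):6.1f} kB")   # ~24.5 kB
# whether the full ~147 kB input must be resident in SRAM (vs. streamed or
# partially decoded from JPEG) dominates the peak-memory estimate, which is
# exactly the question raised above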

Training codes of MCUNetv2

Hi, it's great to see the excellent work you have done. However, I can only find the evaluation code for your pre-trained models. When will you release the training code of MCUNetv2, especially for the Inference Scheduling Search part? Thanks a lot!

How to profile memory allocation of each layer

Hey there,
In the MCUNet paper, it is mentioned that the peak memory of the first several layers is reduced. My question is how to profile this during the experiments.

MCUNet for detection on custom dataset

Greetings,
First of all, your research is truly fascinating and groundbreaking, I enjoyed reading your papers and I'm looking forward to your future works.
I would like to use your MCUNet model for human detection from a UAV camera. However, I couldn't figure out how to train the model on a custom dataset, or how to import the custom dataset (images + annotations) for training.
I sincerely hope you can enlighten me on how to achieve that.

Thank you

Training and testing on different datasets

Thanks @tonylins for answering my previous question. If I want to test these MCUNet models on the tiny-imagenet-200 dataset, where images are of size 64x64, can I directly use the provided trained models, or would I have to train from scratch?

The repo does not include training or testing on datasets other than those used in the paper.

Also, I fed validation data of tiny-imagenet-200 to the 512kB and other models, and the accuracy was almost zero. What could be the reason? Even if the input resolution is 64x64, shouldn't there be some accuracy, since the model is trained on the ImageNet-1000 dataset and should retain some accuracy on the same but lower-resolution data?

Would you tell me how to make a baseline model in the MCUNetV2 paper?

Hello, thanks for your work.

I want to know how to build the baseline model from the MCUNetV2 paper's experiments (NOT the MCUNetV2 NAS model).

Exactly speaking, I want to reproduce the MobileNetV2 r144 w0.5 model shown in figure 6 of the MCUNetV2 paper.

I generated the MbV2 model with the TensorFlow library, but it is too big to use on my board (F746G-DISCO).

I know the paper's model used QAT, but I need other information for model generation.

Here is my model's information: after training fp32 MbV2 for only 1 epoch on the ImageNet-1000 dataset, I used post-training quantization to quantize the model with int8 weights and activations (but it came out to 2.2MB, not under 1MB).

Could you please tell me what I need to do to make this model less than 1 MB like the MCUNetV2 paper?

*board information: [screenshot]

*my model information: [screenshot]

I have also uploaded my model's TFLite file and TensorFlow code.

*tflite model file: mobilenet_v2_quantized.zip

  • Here is my model generation code
import os
import pathlib

import numpy as np
import tensorflow as tf
from tensorflow.keras.applications.mobilenet_v2 import MobileNetV2

os.environ['CUDA_VISIBLE_DEVICES'] = '1'

checkpoint_path = "training_1/cp.ckpt"
checkpoint_dir = os.path.dirname(checkpoint_path)

data_dir = pathlib.Path("/home/poweroverwhelming/modelgen/ImageNet_ResNet_Tensorflow2.0/imagenet10/train")
image_count = len(list(data_dir.glob('*/*.JPEG')))

# training params
batch_size = 32
img_height = 144
img_width = 144
num_classes = 1000

# build the training/validation splits from the image directory
train_ds = tf.keras.utils.image_dataset_from_directory(
    data_dir,
    validation_split=0.2,
    subset="training",
    seed=123,
    image_size=(img_height, img_width),
    batch_size=batch_size)

val_ds = tf.keras.utils.image_dataset_from_directory(
    data_dir,
    validation_split=0.2,
    subset="validation",
    seed=123,
    image_size=(img_height, img_width),
    batch_size=batch_size)

# normalization: rescale pixel values to [0, 1]
normalization_layer = tf.keras.layers.Rescaling(1. / 255)
normalized_ds = train_ds.map(lambda x, y: (normalization_layer(x), y))

AUTOTUNE = tf.data.AUTOTUNE
train_ds = train_ds.cache().prefetch(buffer_size=AUTOTUNE)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)

# training: MobileNetV2 with width multiplier 0.5 at 144x144 input
model = MobileNetV2(input_shape=(img_height, img_width, 3), alpha=0.5,
                    weights=None, classes=num_classes)
model.compile(
    optimizer='adam',
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'])

cp_callback = tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_path,
                                                 save_weights_only=True,
                                                 verbose=1)

model.fit(train_ds,
          validation_data=val_ds,
          epochs=50,
          callbacks=[cp_callback])

# save the entire model as a SavedModel
model.save('saved_model/my_model')

# representative dataset for post-training quantization (random data)
def representative_dataset():
    for _ in range(100):
        data = np.random.rand(1, img_height, img_width, 3)
        yield [data.astype(np.float32)]

# PTQ: convert the model to TF-Lite with int8 weights and activations
converter = tf.lite.TFLiteConverter.from_saved_model('saved_model/my_model')
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
converter.representative_dataset = representative_dataset

# save the quantized model
tflite_model = converter.convert()
open("mobilenet_v2_imgnet10_r144_w05_quantized.tflite", "wb").write(tflite_model)

Once again, thank you for your work.

Validation dataset clarification for evaluation

I am confused about the val dataset: is it ImageNet (1000 classes) or tiny-imagenet-200? Also, what input size were the images trained at? Is it 64x64?

I tested the models and ran the script given in the eval section of the GitHub page; my dataset is tiny-imagenet-200, and the accuracy shown is very low.

Code Generator for TinyEngine

I was impressed with MCUNet's approach to solving the memory limitations of microcontrollers. Could you please let me know when TinyEngine will be released?

When I checked the generated code using an old version of the tinyml repository (https://github.com/mit-han-lab/tinyml), there were no header files (genNN.h, kernel_buffer.h, tinyengine_function.h), and there was no implementation for the functions below. I would appreciate it if you could let me know how you implemented them.

  • convolve_s8_kernel3_inputch3_stride2_pad1
  • fast_depthwise_conv_s8_kernel3_stride1_pad1_a8w8_8bit_HWC_inplace
  • convolve_1x1_s8
  • fast_depthwise_conv_s8_kernel3_stride2_pad1_a8w8_8bit_HWC_inplace

Where is MCUNetV2?

Hello! First of all, thanks for your work 🙏🏻

Having reviewed the MCUNetV2 paper, I looked around this repository but can't find MCUNetV2. I think the models named mcunet-in0~4 in model_zoo.py are sub-models of MCUNet, and MCUNetV2 has not yet been released here.
Is that right?

If they have been released here, I'm sorry for bothering you 😢

Training a custom mcunet model

Hello,

I'd like to train a custom model using mcunet. Is there any tutorial/documentation on how to do this? Have the relevant parts of the code been uploaded or are they still unreleased?

Evaluation of mcunet-320kB (ImageNet)

Thanks for the great work.
I ran this line to evaluate the performance of this model:

python eval_torch.py --net_id mcunet-320kB --dataset imagenet --data-dir PATH/TO/DATA/val

But the accuracy is only about 11%.

I used this GitHub gist to prepare the ImageNet dataset:
https://gist.github.com/antoinebrl/7d00d5cb6c95ef194c737392ef7e476a
The validation set is split into 1000 folders, each with about 50 images.

Could you tell me the possible reason? Or did I split the ImageNet validation set the wrong way?

Retrain the model and Using custom data

Hello,

I would like to train a custom model using mcunet with custom data. Is there any tutorial/documentation on how to do this?

In addition to this, how can I find your default training code?
