This is a curated list of awesome projects and papers related to real-time AI and DNN inference.
- Edge Intelligence: Architectures, Challenges, and Applications by Xu, Dianlei, et al., arxiv 2020
- A Survey of Multi-Tenant Deep Learning Inference on GPU by Yu, Fuxun, et al., arxiv 2022
- Machine Learning in Real-Time Internet of Things (IoT) Systems: A Survey by Bian, Jiang, et al., IOTJ 2022
- AI Augmented Edge and Fog Computing: Trends and Challenges by Tuli, Shreshth, et al., arxiv 2022
- Enable deep learning on mobile devices: Methods, systems, and applications by Cai, Han, et al., TODAES 2022
- Multi-DNN Accelerators for Next-Generation AI Systems by Venieris, Stylianos I., Christos-Savvas Bouganis, and Nicholas D. Lane., arxiv 2022
- A Survey of GPU Multitasking Methods Supported by Hardware Architecture by Zhao, Chen, et al., IEEE TPDS 2021
- TASO: The Tensor Algebra SuperOptimizer for Deep Learning by Zhihao Jia et al., SOSP 2019
- AStitch: Enabling a New Multi-dimensional Optimization Space for Memory-Intensive ML Training and Inference on Modern SIMT Architectures by Zhen Zheng et al., ASPLOS 2022
- PET: Optimizing Tensor Programs with Partially Equivalent Transformations and Automated Corrections by Haojie Wang et al., OSDI 2021
- Rammer: Enabling Holistic Deep Learning Compiler Optimizations with rTasks by Lingxiao Ma et al., OSDI 2020
- Bolt: Bridging the Gap between Auto-tuners and Hardware-native Performance by Jiarong Xing et al., MLSys 2022
- Ansor: Generating High-Performance Tensor Programs for Deep Learning by Lianmin Zheng et al., OSDI 2020
- TenSet: A Large-scale Program Performance Dataset for Learned Tensor Compilers by Lianmin Zheng et al., NeurIPS 2021
- Romou: Rapidly Generate High-Performance Tensor Kernels for Mobile GPUs by Liang, Rendong, et al., MobiCom 2022
- AsyMo: Scalable and efficient deep-learning inference on asymmetric mobile CPUs by Wang, Manni, et al., MobiCom 2021
- IOS: Inter-operator scheduler for CNN acceleration by Ding, Yaoyao, et al., MLSys 2021
- Moses: Efficient Exploitation of Cross-device Transferable Features for Tensor Program Optimization by Zhao, Zhihe, et al., arxiv 2022
- DeepCuts: A Deep Learning Optimization Framework for Versatile GPU Workloads by Jung, Wookeun, Thanh Tuan Dao, and Jaejin Lee., PLDI 2021
- CASE: a compiler-assisted SchEduling framework for multi-GPU systems by Chen, Chao, Chris Porter, and Santosh Pande., PPoPP 2022
- Chameleon: Adaptive code optimization for expedited deep neural network compilation by Ahn, Byung Hoon, et al., arxiv 2020
- Analytical characterization and design space exploration for optimization of CNNs by Li, Rui, et al., ASPLOS 2021
- DNNFusion: accelerating deep neural networks execution with advanced operator fusion by Niu, Wei, et al., PLDI 2021
- AutoGTCO: Graph and Tensor Co-Optimize for Image Recognition with Transformers on GPU by Bai, Yang, et al., ICCAD 2021
- DietCode: Automatic Optimization for Dynamic Tensor Programs by Zheng, Bojian, et al., MLSys 2022
- ROLLER: Fast and Efficient Tensor Compilation for Deep Learning by Zhu, Hongyu, et al., OSDI 2022
- FamilySeer: Towards Optimized Tensor Codes by Exploiting Computation Subgraph Similarity by Zhang, Shanjun, et al., arxiv 2022
- Reusing Auto-Schedules for Efficient DNN Compilation by Gibson, Perry, and José Cano., arxiv 2022
- Hidet: Task Mapping Programming Paradigm for Deep Learning Tensor Programs by Ding, Yaoyao, et al., arxiv 2022
- EdgeML: An AutoML framework for real-time deep learning on the edge by Zhao, Zhihe, et al., IoTDI 2021
- SPINN: synergistic progressive inference of neural networks over device and cloud by Laskaridis, Stefanos, et al., MobiCom 2020
- Clio: Enabling automatic compilation of deep learning pipelines across IoT and cloud by Huang, Jin, et al., MobiCom 2020
- Neurosurgeon: Collaborative intelligence between the cloud and mobile edge by Kang, Yiping, et al., ASPLOS 2017
- Mistify: Automating DNN Model Porting for On-Device Inference at the Edge by Guo, Peizhen et al., NSDI 2021
- Deep compressive offloading: Speeding up neural network inference by trading edge computation for network latency by Yao, Shuochao, et al., SenSys 2020
- Elf: accelerate high-resolution mobile deep vision with content-aware parallel offloading by Zhang, Wuyang, et al., MobiCom 2021
- Edge assisted real-time object detection for mobile augmented reality by Liu, Luyang, Hongyu Li, and Marco Gruteser., MobiCom 2019
- VELTAIR: towards high-performance multi-tenant deep learning services via adaptive compilation and scheduling by Liu, Zihan, et al., ASPLOS 2022
- RT-mDL: Supporting Real-Time Mixed Deep Learning Tasks on Edge Platforms by Ling, Neiwen, et al., SenSys 2021
- Horus: Interference-aware and prediction-based scheduling in deep learning systems by Yeung, Gingfung, et al., IEEE TPDS 2021
- Automated Runtime-Aware Scheduling for Multi-Tenant DNN Inference on GPU by Yu, Fuxun, et al., ICCAD 2021
- Interference-aware scheduling for inference serving by Mendoza, Daniel, et al., EuroMLSys 2021
- Microsecond-scale Preemption for Concurrent GPU-accelerated DNN Inferences by Han, Mingcong, et al., OSDI 2022
- Planaria: Dynamic architecture fission for spatial multi-tenant acceleration of deep neural networks by Ghodrati, Soroush, et al., MICRO 2020
- Heimdall: mobile GPU coordination platform for augmented reality applications by Yi, Juheon, and Youngki Lee., MobiCom 2020
- DeepEye: Resource efficient local execution of multiple deep vision models using wearable commodity hardware by Mathur, Akhil, et al., MobiSys 2017
- PipeSwitch: Fast Pipelined Context Switching for Deep Learning Applications by Bai, Zhihao, et al., OSDI 2020
- Enable simultaneous DNN services based on deterministic operator overlap and precise latency prediction by Cui, Weihao, et al., SC 2021
- LegoDNN: block-grained scaling of deep neural networks for mobile vision by Han, Rui, et al., MobiCom 2021
- NeuOS: A Latency-Predictable Multi-Dimensional Optimization Framework for DNN-driven Autonomous Systems by Bateni, Soroush, and Cong Liu., ATC 2020
- Multi-Neural Network Acceleration Architecture by Baek, Eunjin, Dongup Kwon, and Jangwoo Kim., ISCA 2020
- Pipelined data-parallel CPU/GPU scheduling for multi-DNN real-time inference by Xiang, Yecheng, and Hyoseung Kim., RTSS 2019
- NestDNN: Resource-aware multi-tenant on-device deep learning for continuous mobile vision by Fang, Biyi, Xiao Zeng, and Mi Zhang., MobiCom 2018
- FLEP: Enabling flexible and efficient preemption on GPUs by Wu, Bo, et al., ASPLOS 2017
- Prophet: Precise QoS prediction on non-preemptive accelerators to improve utilization in warehouse-scale computers by Chen, Quan, et al., ASPLOS 2017
- PAME: precision-aware multi-exit DNN serving for reducing latencies of batched inferences by Zhang, Shulai, et al., ICS 2022
- Layerweaver: Maximizing resource utilization of neural processing units via layer-wise scheduling by Oh, Young H., et al., HPCA 2021
- LiteReconfig: cost and content aware reconfiguration of video object detection systems for mobile GPUs by Xu, Ran, et al., EuroSys 2022
- ApproxNet: Content and contention-aware video object classification system for embedded clients by Xu, Ran, et al.
- LaLaRAND: Flexible layer-by-layer CPU/GPU scheduling for real-time DNN tasks by Kang, Woosung, et al., RTSS 2021
- DUET: A Compiler-Runtime Subgraph Scheduling Approach for Tensor Programs on a Coupled CPU-GPU Architecture by Zhang, Minjia, Zehua Hu, and Mingqin Li., IPDPS 2021
- Band: coordinated multi-DNN inference on heterogeneous mobile processors by Jeong, Joo Seong, et al., MobiSys 2022
- ODMDEF: On-Device Multi-DNN Execution Framework Utilizing Adaptive Layer-Allocation on General Purpose Cores and Accelerator by Lim, Cheolsun, and Myungsun Kim., IEEE Access 2021
- μLayer: Low latency on-device inference using cooperative single-layer acceleration and processor-friendly quantization by Kim, Youngsok, et al., EuroSys 2019
- OPTiC: Optimizing collaborative CPU–GPU computing on mobile devices with thermal constraints by Wang, Siqi, Gayathri Ananthanarayanan, and Tulika Mitra., TCAD 2019
- Accelerating Sequence-to-Graph Alignment on Heterogeneous Processors by Feng, Zonghao, and Qiong Luo., ICPP 2021
- Efficient Execution of Deep Neural Networks on Mobile Devices with NPU by Tan, Tianxiang, and Guohong Cao., IPSN 2021
- CoDL: efficient CPU-GPU co-execution for deep learning inference on mobile devices by Jia, Fucheng, et al., MobiSys 2022
- CODA: Improving resource utilization by slimming and co-locating DNN and CPU jobs by Zhao, Han, et al., ICDCS 2020
- GPUReplay: a 50-KB GPU stack for client ML by Park, Heejin, and Felix Xiaozhu Lin., ASPLOS 2022
- Real-time high performance computing using a Jetson Xavier AGX by Cetre, Cyril, et al., ERTS 2022
- GPU scheduling on the NVIDIA TX2: Hidden details revealed by Amert, Tanya, et al., RTSS 2017
- Nimble: Lightweight and parallel gpu task scheduling for deep learning by Kwon, Woosuk, et al., NeurIPS 2020
- Addressing GPU on-chip shared memory bank conflicts using elastic pipeline by Gou, Chunyang, and Georgi N. Gaydadjiev., IJPP 2013
- A study of persistent threads style GPU programming for GPGPU workloads by Gupta, Kshitij, Jeff A. Stuart, and John D. Owens., InPar 2012
- Demystifying the placement policies of the NVIDIA GPU thread block scheduler for concurrent kernels by Gilman, Guin, et al., ACM SIGMETRICS Performance Evaluation Review 2021
- Exploiting Intra-SM Parallelism in GPUs via Persistent and Elastic Blocks by Zhao, Han, et al., ICCD 2021
- Online Thread Auto-Tuning for Performance Improvement and Resource Saving by Luan, Guangqiang, et al., IEEE TPDS 2021
- HSM: A hybrid slowdown model for multitasking GPUs by Zhao, Xia, Magnus Jahre, and Lieven Eeckhout., ASPLOS 2020
- Enabling and exploiting flexible task assignment on GPU through SM-centric program transformations by Wu, Bo, et al., ACM ICS 2015
- Warped-Slicer: Efficient Intra-SM Slicing through Dynamic Resource Partitioning for GPU Multiprogramming by Xu, Qiumin, et al., ISCA 2016
- Kernelet: High-Throughput GPU Kernel Executions with Dynamic Slicing and Scheduling by Zhong, Jianlong, and Bingsheng He., IEEE TPDS 2013
- Improving GPGPU concurrency with elastic kernels by Pai, Sreepathi, Matthew J. Thazhuthaveetil, and Ramaswamy Govindarajan., ACM SIGARCH Computer Architecture News 2013
- Neither More Nor Less: Optimizing Thread-level Parallelism for GPGPUs by Kayıran, Onur, et al., PACT 2013
- Orion: A framework for GPU occupancy tuning by Hayes, Ari B., et al., Middleware 2016
- Efficient performance estimation and work-group size pruning for OpenCL kernels on GPUs by Wang, Xiebing, et al., IEEE TPDS 2019
- Online evolutionary batch size orchestration for scheduling deep learning workloads in GPU clusters by Bian, Zhengda, et al., SC 2021
- Autotuning GPU kernels via static and predictive analysis by Lim, Robert, Boyana Norris, and Allen Malony., IEEE ICPP 2017
- GSLICE: controlled spatial sharing of GPUs for a scalable inference platform by Dhakal, Aditya, Sameer G. Kulkarni, and K. K. Ramakrishnan., SOCC 2020
- MAPLE-X: Latency Prediction with Explicit Microprocessor Prior Knowledge by Abbasi, Saad, Alexander Wong, and Mohammad Javad Shafiee., arxiv 2022
- MAPLE-Edge: A Runtime Latency Predictor for Edge Devices by Nair, Saeejith, et al., CVPR 2022
- MAPLE: Microprocessor a priori for latency estimation by Abbasi, Saad, Alexander Wong, and Mohammad Javad Shafiee., CVPR 2022
- nn-Meter: towards accurate latency prediction of deep-learning model inference on diverse edge devices by Zhang, Li Lyna, et al., MobiSys 2021
- Predicting and reining in application-level slowdown on spatial multitasking GPUs by Wei, Mengze, et al., JPDC 2020
- MCUNet: Tiny deep learning on IoT devices by Lin, Ji, et al., NeurIPS 2020
- TinyML: Current Progress, Research Challenges, and Future Roadmap by Shafique, Muhammad, et al., DAC 2021
- Benchmarking TinyML systems: Challenges and direction by Banbury, Colby R., et al., arxiv 2020
- μNAS: Constrained Neural Architecture Search for Microcontrollers by Liberis, Edgar, Łukasz Dudziak, and Nicholas D. Lane., EuroMLSys 2021
- Memory-efficient Patch-based Inference for Tiny Deep Learning by Lin, Ji, et al., NeurIPS 2021
- Deep Learning on Microcontrollers: A Study on Deployment Costs and Challenges by Filip Svoboda, Javier Fernandez-Marques, Edgar Liberis, Nicholas D Lane, EuroMLSys 2022
- Dynamic Multimodal Fusion by Xue, Zihui, and Radu Marculescu., arxiv 2022
- LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action by Shah, Dhruv, et al., arxiv 2022
- Accelerating mobile audio sensing algorithms through on-chip gpu offloading by Georgiev, Petko, et al., MobiSys 2017
- SparTA: Deep-Learning Model Sparsity via Tensor-with-Sparsity-Attribute by Zheng, Ningxin, et al., OSDI 2022
- ESCALATE: Boosting the Efficiency of Sparse CNN Accelerator with Kernel Decomposition by Li, Shiyu, et al., MICRO 2021
- A high-performance sparse tensor algebra compiler in Multi-Level IR by Tian, Ruiqin, et al., arxiv 2021
- Efficient Sparse Matrix Kernels based on Adaptive Workload-Balancing and Parallel-Reduction by Huang, Guyue, et al., arxiv 2021
- COEXE: An Efficient Co-execution Architecture for Real-Time Neural Network Services by Liu, Chubo, et al., DAC 2020
- TorchSparse: Efficient Point Cloud Inference Engine by Tang, Haotian, et al., MLSys 2022
- Understanding and Optimizing Deep Learning Cold-Start Latency on Edge Devices by Yi, Rongjie, et al., arxiv 2022
- Towards efficient vision transformer inference: a first study of transformers on mobile devices by Wang, Xudong, et al., HotMobile 2022
- EdgeBERT: Sentence-level energy optimizations for latency-aware multi-task NLP inference by Tambe, Thierry, et al., MICRO 2021
- EDGEWISE: A Better Stream Processing Engine for the Edge by Fu, Xinwei, et al., ATC 2019
- LiteFlow: towards high-performance adaptive neural networks for kernel datapath by Zhang, Junxue, et al., SIGCOMM 2022
- CoCoPIE: Making Mobile AI Sweet As PIE--Compression-Compilation Co-Design Goes a Long Way by Liu, Shaoshan, et al., arxiv 2020
- Beyond Data and Model Parallelism for Deep Neural Networks by Jia, Zhihao, Matei Zaharia, and Alex Aiken, MLSys 2019
- Discovering faster matrix multiplication algorithms with reinforcement learning by Fawzi, Alhussein, et al., Nature 2022