# papers-notebook-with-scheduling
This project collects the papers I have read, recording research methods and key takeaways along the way. The papers are categorized to help me stay on course as I dig deeper into my research direction.
- AS: Auto Scaling
- DL: Deep Learning
- DS: Distributed System
- NE: Network Efficient
- PS: Parameter Server
- RM: Resource Management
- RU: Resource Utilization
- RC: Resource Contention
- RS: Resource Scheduling
- DMLCS: Distributed Machine Learning Centralized Scheduling
- PA: Performance Analysis
- PT: Parallelized Training
| Keywords | Paper Title | PDF | Slide | Year |
| --- | --- | --- | --- | --- |
| DL, Scheduling | Gandiva: Introspective Cluster Scheduling for Deep Learning | [pdf] | [slide] | 2018 |
| DL, CPU, RS | Scheduling CPU for GPU-based Deep Learning Jobs | [pdf] | [slide] | 2018 |
| DL, NE, Scheduling | DLTAP: A Network-efficient Scheduling Method for Distributed Deep Learning Workload in Containerized Cluster Environment | [pdf] | [slide] | 2018 |
| DL, Training System | Project Adam: Building an Efficient and Scalable Deep Learning Training System | [pdf] | [Video] | 2014 |
| DL, PS, Rack-Scale | Parameter Hub: a Rack-Scale Parameter Server for Distributed Deep Neural Network Training | [pdf] | [slide] | 2018 |
| ML, PS | Scaling Distributed Machine Learning with the Parameter Server | [pdf] | [slide] | 2014 |
| ML, Infra | Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective | [pdf] | [slide] | 2014 |
| RM | Optimus: An Efficient Dynamic Resource Scheduler for Deep Learning Cluster | [pdf] | [slide] | 2018 |
| DS, PS | Scaling Distributed Machine Learning with the Parameter Server | [pdf] | [slide][Video] | 2014 |
| Scheduling, GPU, PA, RC | Topology-Aware GPU Scheduling for Learning Workloads in Cloud Environments | [pdf] | [slide] | 2017 |
| DL, GPU | Multi-tenant GPU Clusters for Deep Learning Workloads: Analysis and Implications | [pdf] | [slide] | 2018 |
| DL, DS, GPU | Tiresias: A GPU Cluster Manager for Distributed Deep Learning | [pdf] | [slide] | 2019 |
| DL, GPU | Salus: Fine-Grained GPU Sharing Primitives for Deep Learning Applications | [pdf] | [slide] | 2019 |
| Keywords | Paper Title | PDF | Slide | Year |
| --- | --- | --- | --- | --- |
| DL, AS, Kubernetes | Deep Learning Based Auto-Scaling Load Balancing Mechanism for Distributed Software-Defined Storage Service | [pdf] | [slide] | 2018 |
| ML, Benchmarking, Kubernetes | Kubebench: A Benchmarking Platform for ML Workloads | [pdf] | [slide] | 2018 |
| RM, DMLCS, RU, Kubernetes | GAI: A Centralized Tree-Based Scheduler for Machine Learning Workload in Large Shared Clusters | [pdf] | [slide] | 2018 |
| DL, Scheduling, Algorithm | Online Job Scheduling in Distributed Machine Learning Clusters | [pdf] | [slide] | 2018 |
| Autoscaling, Kubernetes | Containers Orchestration with Cost-Efficient Autoscaling in Cloud Computing Environments | [pdf] | [slide] | 2018 |
| DL, PT, Kubernetes | Parallelized Training of Deep NN – Comparison of Current Concepts and Frameworks | [pdf] | [slide] | 2018 |
| DL, Resource Orchestration, Job Scheduling, Autoscaling | DRAGON: A Dynamic Scheduling and Scaling Controller for Managing Distributed Deep Learning Jobs in Kubernetes Cluster | [pdf] | [slide] | 2019 |
| Keywords | Paper Title | PDF | Slide | Year |
| --- | --- | --- | --- | --- |
| DL, DS | Multi-tenant GPU Clusters for Deep Learning Workloads: Analysis and Implications | [pdf] | [slide] | 2018 |
| DL, DS | GeePS: Scalable deep learning on distributed GPUs with a GPU-specialized parameter server | [pdf] | [slide] | 2015 |
| DL | Poseidon: A system architecture for efficient GPU-based deep learning on multiple machines | [pdf] | [slide] | 2015 |
| Mesos, Marathon, Ceph | Toward High-Availability Container as a Service on Mesos Cluster with Distributed Shared Volumes | [pdf] | [slide] | 2015 |
- Traditional scheduling architecture
- Machine learning distributed cluster
- Model training
  - Framework
  - Parameter Server / AllReduce
  - Combination of both
- Scheduler affinity
- Scheduler policy
- Hardware GPU topology
- Kube-batch
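As a rough illustration of the "Parameter Server / AllReduce" distinction in the notes above, here is a minimal single-process sketch in plain Python. All names are hypothetical and no real framework is used; it only shows that both patterns compute the same averaged gradient, differing in where the update happens (a central server vs. every worker's local replica).

```python
# Hypothetical sketch: parameter-server vs. all-reduce gradient aggregation.
# Not taken from any paper or framework listed above.

def parameter_server_step(worker_grads, params, lr=0.1):
    """Centralized: workers push gradients to one server, which
    averages them and updates the single shared parameter copy."""
    avg = [sum(g) / len(worker_grads) for g in zip(*worker_grads)]
    return [p - lr * g for p, g in zip(params, avg)]

def allreduce_step(worker_grads, worker_params, lr=0.1):
    """Decentralized: every worker obtains the same averaged gradient
    (simulated in-process here) and updates its own parameter replica."""
    avg = [sum(g) / len(worker_grads) for g in zip(*worker_grads)]
    return [[p - lr * g for p, g in zip(params, avg)]
            for params in worker_params]

# Two workers, one parameter vector of length 2.
grads = [[1.0, 2.0], [3.0, 4.0]]                  # per-worker gradients
ps = parameter_server_step(grads, [0.0, 0.0])     # one shared copy updated
ar = allreduce_step(grads, [[0.0, 0.0], [0.0, 0.0]])
# After all-reduce, every worker replica matches the server's result.
print(ps, ar)
```

The "combination of both" bullet refers to hybrid designs (e.g., all-reduce within a rack, parameter server across racks), which this sketch does not attempt to model.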