
Alibaba Cluster Trace Program

Overview

The Alibaba Cluster Trace Program is published by Alibaba Group. By providing cluster traces from real production, the program helps researchers, students, and anyone interested in the field gain a better understanding of the characteristics of modern internet data centers (IDCs) and their workloads.

So far, six versions of traces have been released:

  • cluster-trace-v2017 includes about 1300 machines over a period of 12 hours. trace-v2017 was the first to introduce the colocation of online services (aka long-running applications) and batch workloads. To learn more about this trace, see the related documents (trace_2017). The download link is available after a short survey (survey link).
  • cluster-trace-v2018 includes about 4000 machines over a period of 8 days. Besides being larger in scale than trace-v2017, this trace also contains the DAG information of our production batch workloads. See the related documents for more details (trace_2018). The download link is available after a survey (less than a minute, survey link).
  • cluster-trace-gpu-v2020 includes over 6500 GPUs (on ~1800 machines) over a period of 2 months. It describes the AI/ML workloads in the MLaaS (Machine-Learning-as-a-Service) offering provided by Alibaba PAI (Platform for Artificial Intelligence) on GPU clusters. See the subdirectory (pai_gpu_trace_2020) for the released data, schema, and scripts for processing and visualization. Our analysis paper, published at USENIX NSDI '22, is available here.
  • cluster-trace-microservices-v2021 contains 20000+ microservices over a period of 12 hours. These are the first traces released to cover the runtime metrics of microservices in a production cluster, including call dependencies, response times, call rates, and so on. See the subdirectory (trace_2021) for more details. Our analysis paper, accepted by SoCC '21, is available here.
  • cluster-trace-microarchitecture-v2022 provides AMTrace (Alibaba Microarchitecture Trace), the first fine-granularity, large-scale collection of microarchitectural metrics from Alibaba's colocation datacenters. Based on AMTrace, researchers can analyze CPU performance, microarchitecture contention, memory bandwidth contention, and so on. Our paper was accepted by ICPP '22. See the subdirectory (trace_2022) for more details.
  • cluster-trace-gpu-v2023 includes over 6200 GPUs (on ~1200 machines). It describes the AI/ML workloads with diverse resource specifications in a heterogeneous GPU cluster. In our "Beware of Fragmentation" paper (published in USENIX ATC '23), we modeled this trace in a Kubernetes Scheduler Simulator and demonstrated that our proposed Fragmentation Gradient Descent (FGD) policy outperforms classic scheduling policies like Best-Fit, Dot-Product, etc. See fgd_gpu_trace_2023 for the released data, schema, and scripts for processing.

We encourage anyone to use the traces for study or research purposes. If you have any questions while using the traces, please contact us via email (alibaba-clusterdata) or file an issue on GitHub. Filing an issue is recommended, as the discussion can help the whole community. Note that the more clearly you ask your question, the more likely you are to get a clear answer.

It would be much appreciated if you could tell us once any publication using our traces becomes available, as we maintain a list of related publications so that researchers can better communicate with each other.

In the future, we will try to release new traces at a regular pace, so please stay tuned.

Our motivation

As said at the beginning, our motivation in publishing this data is to help people in related fields get a better understanding of modern data centers and to provide production data for researchers to verify their ideas. You may use the traces however you want, as long as it is for research or study purposes.

From our perspective, the data is provided to address the challenges Alibaba faces in IDCs where online services and batch jobs are colocated. We distill these challenges into the following topics:

  1. Workload characterization. How to characterize Alibaba workloads in a way that lets us simulate production workloads representatively for studies of scheduling and resource-management strategies.
  2. New algorithms to assign workloads to machines. How to assign and reschedule workloads across machines for better resource utilization while ensuring the performance SLA of different applications (e.g., by reducing resource contention and defining proper priorities).
  3. Collaboration between the online service scheduler (Sigma) and the batch job scheduler (Fuxi). How to adjust resource allocation between online services and batch jobs to improve the throughput of batch jobs while maintaining acceptable QoS (Quality of Service) and fast failure recovery for online services. As the scale of colocation (workloads managed by different schedulers) keeps growing, the design of the collaboration mechanism becomes more and more critical.

Last but not least, we are always open to working together with researchers to improve the efficiency of our clusters, and there are positions open for research interns. If you have any ideas in mind, please contact us via alibaba-clusterdata or Haiyang Ding (Haiyang maintains this cluster trace and works in Alibaba's resource management & scheduling group).

Outcomes from the trace

Papers using Alibaba cluster trace

The fundamental idea behind releasing our cluster data is to enable researchers and practitioners to do research and simulation with more realistic data, thus bringing the results closer to industry adoption. It is a huge encouragement to us to see more work using our data. Here is a list of existing works using Alibaba cluster data. If your paper uses our traces, it would be great if you let us know by sending us an email (alibaba-clusterdata).

Tech reports and projects on analysing the trace

So far this section is empty. In the future, we are going to link some reports and open-source repositories on how to analyze the traces here, with the permission of their owners.

The purpose of this is to help more beginners get started on learning either basic data analysis or how to inspect a cluster from a statistical perspective.

clusterdata's People

Contributors

andreadetti, changzihao, cll24, danibachar, felidsche, furykerry, haiyangding, kmmelcher, lioncruise, mygoditsfull0fstars, niewuya, packagewjx, qzweng, violet-guo, wang-kangjin, yzs981130


clusterdata's Issues

Memory bandwidth usage value (mem_gps)

Glad to see the new trace includes memory bandwidth usage information. I've checked several machine_usage entries and found non-empty values.
I'm somewhat confused by its description, "Normalized to maximum memory bandwidth usage of all machines". What exactly does this value mean? For example, if mem_gps is 5, does it mean that 5% of this machine's memory bandwidth is used, or that the bandwidth is 5 GB/s?

BTW, two minor concerns.

  1. This value is a float, but it is said to be an integer in 'trace_2018.md';
  2. In the name 'mem_gps' for memory bandwidth, what does 'gps' mean?

On preparing new version of data.

The purpose of this issue is to collect and discuss "what to include" in the next version of the released data.

As stated before, we have plans to release a version of the data that fulfills needs from both the academic and industry fields.

Currently we have heard of the following suggestions / requirements:

  • DAG info: #3
  • GPU and/or heterogeneous cluster data: #1

Please feel free to post your ideas here on what kind of information should be included in the next version of the data; we will evaluate these ideas and try to fulfill them depending on the difficulty of collecting the related information.

We are looking forward to hearing more from the community.

@gaocegege , @lioncruise , @uchuhimo , @CoffeeCandy , @xiandong79 , @lastweek, @allenbunny

Question about JobID and TaskID in batch_instance.csv

Hi, I found that 211305 records (1.31%) in batch_instance.csv have no JobID or TaskID.

Here are some examples:

['65351', '65517', '', '', '269', 'Running', '1', '1', '', '', '', '']
['65351', '65517', '', '', '160', 'Running', '1', '1', '', '', '', '']
['65351', '65517', '', '', '224', 'Running', '1', '1', '', '', '', '']
['65351', '65517', '', '', '656', 'Running', '1', '1', '', '', '', '']
['65351', '65517', '', '', '1061', 'Running', '1', '1', '', '', '', '']
['65351', '65517', '', '', '416', 'Running', '1', '1', '', '', '', '']
['65351', '65517', '', '', '1014', 'Running', '1', '1', '', '', '', '']
['65351', '65517', '', '', '963', 'Running', '1', '1', '', '', '', '']
['65351', '65517', '', '', '1067', 'Running', '1', '1', '', '', '', '']

Their CPU and memory info is missing, too. Is there any documentation about this kind of record?

I'd appreciate it if you could help me :)
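Rows like these can be counted with a short script. Below is a Python sketch over a few sample rows; the 12-column layout and the positions of JobID/TaskID are assumptions taken from the examples quoted above, and you would point it at the real batch_instance.csv to reproduce the 1.31% figure.

```python
import csv
import io

# A few sample rows from the issue above; assumed layout:
# start, end, job_id, task_id, instance, status, ... (12 columns).
sample = """65351,65517,,,269,Running,1,1,,,,
65351,65517,,,160,Running,1,1,,,,
6302,6339,10771,66551,427,Terminated,1,1,7.66,0.99,0.019,0.012
"""

def count_missing_ids(fileobj):
    """Count rows whose job_id (col 3) or task_id (col 4) is empty."""
    missing = total = 0
    for row in csv.reader(fileobj):
        total += 1
        if row[2] == "" or row[3] == "":
            missing += 1
    return missing, total

missing, total = count_missing_ids(io.StringIO(sample))
print(f"{missing}/{total} rows lack a JobID/TaskID")  # 2/3 for this sample
```

Replacing the in-memory sample with `open("batch_instance.csv")` gives the cluster-wide count.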

Max usage (CPU/mem) lower than average usage

I find records in batch_instance.csv for which the maximum-utilization attributes (real_cpu_max, real_mem_max) are lower than the average-utilization attributes (real_cpu_avg, real_mem_avg).

For instance, line 4475 of batch_instance.csv:

22556,22557,2632,15822,459,Terminated,1,1,0.92,0.93,0.002039706986255274,0.002153599254523335

Can you explain whether this is normal, and if not, why it happens? I would expect the maximum utilization values to be equal to or larger than the average.
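One way to find such records is to scan the file and compare the max and avg columns. A minimal Python sketch follows; the column indices are inferred from the single row quoted above, so treat them as assumptions.

```python
import csv
import io

# Assumed layout (from the row quoted in this issue):
# real_cpu_max at index 8, real_cpu_avg at index 9.
sample = """22556,22557,2632,15822,459,Terminated,1,1,0.92,0.93,0.002039706986255274,0.002153599254523335
6302,6339,10771,66551,427,Terminated,1,1,7.66,0.99,0.019,0.012
"""

def rows_with_max_below_avg(fileobj):
    """Return rows where the reported CPU max is below the CPU average."""
    bad = []
    for row in csv.reader(fileobj):
        try:
            cpu_max, cpu_avg = float(row[8]), float(row[9])
        except ValueError:
            continue  # skip rows with missing usage fields
        if cpu_max < cpu_avg:
            bad.append(row)
    return bad

bad = rows_with_max_below_avg(io.StringIO(sample))
print(len(bad))  # 1: the row with 0.92 < 0.93
```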

Adding Server Capacity and Attributes

I am wondering if machine attributes such as resource capacity and hardware information (number of cores, number of disks, number of CPUs, kernel version, clock speed, eth_speed, architecture, etc.) can be provided? The Google trace log includes similar descriptions, and they are quite important when investigating compaction effectiveness and scheduling quality. Many thanks.

Any plan for publishing job DAG information?

The scheduler may use DAG information in the job to achieve better resource utilization, which has been explored in Graphene. Publishing job DAG information will inspire further exploration to use DAG information in scheduling.

I cannot download the data and it shows "AccessDenied"

I open the v2018 download link and it shows the following information:
This XML file does not appear to have any style information associated with it. The document tree is shown below.

AccessDenied

You have no right to access this object because of bucket acl.

5C51C2AB307432057CAAF29B
clusterdata2018pubcn.oss-cn-beijing.aliyuncs.com

Unit of "mem_size" in container_meta.csv

The doc says mem_size is "normalized to the largest memory size of all machines", but the values are >1, while the normalized plan_mem in batch_task.csv is in [0, 1]. Are different normalization schemes used for the two datasets? Is it correct to assume that mem_size in container_meta.csv is the actual normalized value multiplied by 100? Thanks!

About dependences between tasks in the DAG of one job, and their start time.

"M5_3_4: means that task5 depends on both task3 and task4, that is, task5 cannot start before all instances of both task3 and task4 are completed." This is from "trace_2018.md (2.3 DAG of batch workloads)", which explains the DAG of a job and emphasizes that task5 cannot start before task3 and task4 have terminated.
However, many of the real records in batch_task.csv do not obey this dependence rule. For example, "R2_1,1,j_2,1,Terminated,87076,87086,50,0.2" and "M1,1,j_2,1,Terminated,87076,87083,50,0.2" are two tasks in the same job (j_2). We can see that task2 (R2_1) depends on task1 (M1), but the start time of task2 is the same as task1's (both are 87076), which confuses me.
Could you please clarify this?

A job contains multiple tasks

I don't quite understand "a job contains multiple tasks". Can anyone give an example of what a job is and what a task is? Thanks in advance.

[Confirmed] Unit Problem about CPU in batch_task.csv

Hi, I am trying to find ideas for my research paper via Alibaba's cluster data, and I have some trouble understanding the CPU figures for batch tasks in batch_task.csv.

I found that all machines have 64 cores in server_event.csv, but there are many batch tasks that need 100 cores. For example, the task with ID 79304:

['59581', '59592', '12702', '79304', '29', 'Terminated', '100', '0.005404705821447985']

It has only one record in batch_task.csv, whose state is Terminated and whose CPU is 100. But each of the task's 29 instances requires only 1 CPU, so I am not sure whether I have misunderstood something.

['59581', '59592', '12702', '79304', '446', 'Terminated', '1', '1', '1.01', '0.12', '0.035296249320529084', '0.03176558900421919']
['59581', '59592', '12702', '79304', '928', 'Terminated', '1', '1', '1.14', '0.13', '0.035296249320529084', '0.03171381979137007']
['59581', '59592', '12702', '79304', '724', 'Terminated', '1', '1', '0.87', '0.09', '0.026573136955452593', '0.022990707426293583']
['59581', '59592', '12702', '79304', '163', 'Terminated', '1', '1', '0.86', '0.14', '0.035296249320529084', '0.03171381979137007']
['59581', '59592', '12702', '79304', '43', 'Terminated', '1', '1', '0.85', '0.1', '0.026573136955452593', '0.022990707426293583']
['59581', '59592', '12702', '79304', '426', 'Terminated', '1', '1', '1.05', '0.21', '0.06459244687184532', '0.05919291797168224']
['59581', '59592', '12702', '79304', '1123', 'Terminated', '1', '1', '0.87', '0.16', '0.035296249320529084', '0.03144461988455466']
['59581', '59592', '12702', '79304', '1130', 'Terminated', '1', '1', '0.98', '0.14', '0.035296249320529084', '0.03144461988455466']
['59581', '59592', '12702', '79304', '1132', 'Terminated', '1', '1', '1', '0.13', '0.03310641161701136', '0.027753475008412496']
['59581', '59592', '12702', '79304', '1170', 'Terminated', '1', '1', '0.91', '0.19', '0.050402505629901895', '0.04456811534180623']
['59581', '59592', '12702', '79304', '485', 'Terminated', '1', '1', '1.05', '0.1', '0.03305464240416225', '0.029523982087852355']
['59581', '59592', '12702', '79304', '168', 'Terminated', '1', '1', '1', '0.14', '0.03638340279036057', '0.031030466181761706']
['59581', '59592', '12702', '79304', '175', 'Terminated', '1', '1', '1', '0.1', '0.026583490798022417', '0.023001061268863407']
['59581', '59592', '12702', '79304', '792', 'Terminated', '1', '1', '0.82', '0.14', '0.03638340279036057', '0.032531773354386145']
['59581', '59592', '12702', '79304', '1240', 'Terminated', '1', '1', '0.98', '0.22', '0.05653715735252245', '0.051137628452359384']
['59581', '59592', '12702', '79304', '180', 'Terminated', '1', '1', '0.9', '0.13', '0.035296249320529084', '0.03144461988455466']
['59581', '59592', '12702', '79304', '569', 'Terminated', '1', '1', '1.17', '0.25', '0.05863898739419667', '0.05146895141459374']
['59581', '59592', '12702', '79304', '781', 'Terminated', '1', '1', '0.72', '0.12', '0.030155566484611603', '0.026303937048637177']
['59581', '59592', '12702', '79304', '790', 'Terminated', '1', '1', '0.93', '0.12', '0.035301426241813996', '0.029948489633215128']
['59581', '59592', '12702', '79304', '1217', 'Terminated', '1', '1', '2.5', '0.09', '0.026303937048637177', '0.022219346154841717']
['59581', '59592', '12702', '79304', '366', 'Terminated', '1', '1', '1', '0.1', '0.026619729247016798', '0.02303729971785779']
['59581', '59592', '12702', '79304', '271', 'Terminated', '1', '1', '0.78', '0.15', '0.03594854140242797', '0.0317707659255041']
['59581', '59592', '12702', '79304', '534', 'Terminated', '1', '1', '0.86', '0.08', '0.026045090984391582', '0.02412445318768928']
['59581', '59592', '12702', '79304', '1280', 'Terminated', '1', '1', '0.98', '0.17', '0.04632826857867626', '0.0409753319700774']
['59581', '59592', '12702', '79304', '641', 'Terminated', '1', '1', '2', '0.1', '0.029197836046902906', '0.02520642973623586']
['59581', '59592', '12702', '79304', '412', 'Terminated', '1', '1', '0.93', '0.1', '0.02659384464059224', '0.022742215204617815']
['59581', '59592', '12702', '79304', '1030', 'Terminated', '1', '1', '0.92', '0.12', '0.035296249320529084', '0.03171381979137007']
['59581', '59592', '12702', '79304', '992', 'Terminated', '1', '1', '1.18', '0.12', '0.03642481816063987', '0.032852742474050685']
['59581', '59592', '12702', '79304', '815', 'Terminated', '1', '1', '0.85', '0.1', '0.02659384464059224', '0.022742215204617815']

I'd appreciate it if you could answer.

Ce Gao


Data Inconsistency in container_event and container_usage

Hi, I tried to analyze the data about long-running jobs in the Alibaba trace, but I found that there are some records in container_usage.csv whose instance_id does not exist in container_event.csv.

For instance, there is an entry in container_usage.csv:

['44700', '9088', '9.199999999999992', '35.70000076289956', '5.659999847414268', '0.3200000047682117', '0.3600000083444044', '0.4000000059600084', '0.2258861563823704', '0.2572154391695187', '2.60268044472', '2.83470249176']

The instance_id is 9088, but there is no entry in container_event.csv whose instance_id is 9088.

There are 1480906 entries in container_usage.csv in total, and 24570 of them are missing from container_event.csv. What does this mean?

I'd appreciate it if you could help me. @furykerry

Best,
Ce Gao

Question about CPU allocation on containerized online service

According to container_event.csv, there are only "Create" events during the 12 h, without any "Remove" events. I then used the following SQL to compute each machine's total CPU allocation for containerized online services (to facilitate analysis, I imported the csv file intact into a MySQL table).

SELECT machine_id, sum(cpu_requested) AS total_cpu_requested  
FROM container_event 
GROUP BY machine_id 
ORDER BY total_cpu_requested DESC 
LIMIT 10;

query result:

+------------+---------------------+
| machine_id | total_cpu_requested |
+------------+---------------------+
|        676 |                 124 |
|        673 |                 120 |
|        679 |                 116 |
|        813 |                  84 |
|        829 |                  72 |
|        797 |                  72 |
|       1134 |                  72 |
|        102 |                  68 |
|         69 |                  68 |
|        671 |                  68 |
+------------+---------------------+

According to server_event.csv, each machine has 64 CPU cores. So how can more than 64 CPUs be allocated on a single machine, as shown above?

In case a machine crashed during the 12 h period, which could cause anomalies in resource allocation for containerized online services, I specifically queried these machines in server_event.csv, but they seem to have no problems.

SELECT timestamp,machine_id,event_type, event_detail
FROM server_event
WHERE machine_id
IN ( SELECT machine_id
      FROM container_event
      GROUP BY machine_id
      HAVING sum(cpu_requested) > 64
)

query result:

+-----------+------------+------------+--------------+
| timestamp | machine_id | event_type | event_detail |
+-----------+------------+------------+--------------+
|         0 |         19 | add        | NULL         |
|         0 |         56 | add        | NULL         |
|         0 |         69 | add        | NULL         |
|         0 |        102 | add        | NULL         |
|         0 |        323 | add        | NULL         |
|         0 |        671 | add        | NULL         |
|         0 |        673 | add        | NULL         |
|         0 |        676 | add        | NULL         |
|         0 |        679 | add        | NULL         |
|         0 |        797 | add        | NULL         |
|         0 |        813 | add        | NULL         |
|         0 |        829 | add        | NULL         |
|         0 |       1241 | add        | NULL         |
|         0 |       1251 | add        | NULL         |
|         0 |       1295 | add        | NULL         |
|         0 |       1120 | add        | NULL         |
|         0 |       1134 | add        | NULL         |
+-----------+------------+------------+--------------+

I would be very grateful if anyone could give an explanation.

About batch workload

  1. The instances of the batch workload are VMs; may I ask what type of VM the Alibaba cluster uses (KVM, Xen, or others)?
  2. As mentioned in "Papers using Alibaba cluster data", the resource management and job scheduling system for batch workloads is Fuxi, and it can handle different applications (MapReduce, Spark, etc.). May I ask what types of applications are in trace_201708, and what batch processing system is mainly used in the Alibaba cluster?

How is the container CPU usage calculated?

Hi All

I have a question regarding the container CPU usage ("cpu_util_precent" in container_usage.csv). How is this value calculated?

I am a bit confused about this value. For example, the machine "m_644" at timestamp "90080" hosts 7 containers:

container_id  machine_id  timestamp  cpu_util_precent
c_31920       m_644       90080      13
c_5644        m_644       90080      8
c_28792       m_644       90080      7
c_21826       m_644       90080      7
c_35230       m_644       90080      7
c_58607       m_644       90080      19
c_3           m_644       90080      18

The total CPU usage of the containers at time 90080 is 79, while cpu_util_precent in machine_usage.csv at the same timestamp "90080" is 42. Can anyone explain how "cpu_util_precent" in container_usage.csv is calculated?
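The 79-vs-42 discrepancy described in this issue can be reproduced by summing the per-container figures. A small Python sketch using only the rows quoted above (the tuple layout here is an assumption for illustration):

```python
from collections import defaultdict

# Per-container samples quoted in the issue (machine m_644, timestamp 90080):
# (container_id, machine_id, timestamp, cpu_util_precent)
container_rows = [
    ("c_31920", "m_644", 90080, 13),
    ("c_5644",  "m_644", 90080, 8),
    ("c_28792", "m_644", 90080, 7),
    ("c_21826", "m_644", 90080, 7),
    ("c_35230", "m_644", 90080, 7),
    ("c_58607", "m_644", 90080, 19),
    ("c_3",     "m_644", 90080, 18),
]

def total_container_cpu(rows):
    """Sum per-container cpu_util_precent per (machine, timestamp) key."""
    totals = defaultdict(int)
    for _cid, machine, ts, cpu in rows:
        totals[(machine, ts)] += cpu
    return totals

totals = total_container_cpu(container_rows)
print(totals[("m_644", 90080)])  # 79, vs. 42 reported in machine_usage.csv
```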

Thank you

Description of different states

Hi,
Is it possible to give a short description about the following instance status?

  • Ready
  • Waiting
  • Running
  • Terminated
  • Failed
  • Cancelled
  • Interrupted

Add license

This dataset is very valuable, not only for academic purposes but also for further developing the online economy. Thus I think this dataset should have a license to govern its use. I suggest CC-BY-4.0, as recommended by GitHub's guide.

Unable to download

Why does nothing download when I click the download link? Or is there a specific method needed to download it?

Resource usage for task can be higher than resources requested

There are many tasks in the dataset that use more resources than were requested.

For instance, job_id:10771 task_id:66551 has plan_cpu:0.75 [1] from the following entry in batch_task.csv:

6301,6352,10771,66551,137,Terminated,75,0.01600704061294748

However, this task utilizes 7.66 (Max) and 0.99 (average) CPU as can be seen in batch_instance.csv:

6302,6339,10771,66551,427,Terminated,1,1,7.66,0.99,0.019309916392721248,0.012926772448424922

Can you clarify whether the amount of resources used by a task can be higher than the amount requested? If not, what explains these numbers?

Can I interpret the amount of resources requested as resources allocated by the scheduler?

[1] I divided the plan_cpu value by 100 as explained in this issue: #11

very slow to download

it will take me 12 days to download (from US). Is there a better way to host the data? e.g., via DropBox? of host a copy in US or so?

Missing column in Trace_2018 machine_usage.csv

Hi Alibaba experts,
I notice that each record in machine_usage.csv contains 9 columns, as shown below:

m_1932,388730,37,83,,,43.08,33.12,3
m_1932,388770,34,83,,,43.08,33.12,2

However, trace_2018.md says there should be 10 columns.
Is something missing?
Thanks
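A quick way to confirm the column count is to tally field counts per row. A Python sketch over the two rows quoted above (note that empty fields still count as columns, so a row that parses to fewer fields really is missing a column):

```python
import csv
import io
from collections import Counter

# The two machine_usage.csv rows quoted in this issue.
sample = """m_1932,388730,37,83,,,43.08,33.12,3
m_1932,388770,34,83,,,43.08,33.12,2
"""

def column_counts(fileobj):
    """Tally how many rows have each number of columns."""
    return Counter(len(row) for row in csv.reader(fileobj))

counts = column_counts(io.StringIO(sample))
print(counts)  # Counter({9: 2}) -> 9 columns, not the 10 the doc describes
```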

About the batch_task.csv

  1. About the 2nd field in batch_task.csv:

Task table (batch_task.csv)

create_timestamp: the create time of a task
modify_timestamp: latest state modification time

in schema.csv: (screenshot of schema.csv omitted)

So is the 2nd field the "latest state modification time" or the "task end time"?

  2. The task create time is a timestamp, and it seems to span longer than 24 hours.

Adding Network data

The current data set includes CPU, mem, and disk, would it be possible to add network information?
The NIC speeds and the network utilization?

Lifecycle of Batch Task

In the task table (batch_task.csv), each task can have a status of Ready | Waiting | Running | Terminated | Failed | Cancelled.

With some analysis, it seems that each job-id/task-id pair appears only once in batch_task.csv.

Shouldn't tasks go through different stages, for instance Waiting -> Ready -> Running -> Terminated, and thus appear multiple times in this file?

What do create_timestamp and modify_timestamp mean for an entry with Running status vs. one with Terminated status? Do tasks with 'Running' status ever finish? If so, when does that happen?

Server information

  1. In the server event table, all machines have 64 CPUs; are these actual physical cores or virtual CPUs?
  2. In the server event table, some machines have memory listed as 0.6. Is this the free available memory, with the remainder occupied by software?
  3. Can you tell me the type of machine (like Dell PowerEdge) used in Alibaba's servers?

Question about sigma.png

Hello guys, can anyone give a description of the architecture diagram below? I wonder whether the scheduling of the 1.3k-machine cluster is done by two separate schedulers, SigmaMaster and FuxiMaster, where SigmaMaster is responsible for containerized online service instances and FuxiMaster for traditional batch jobs, and the two schedulers have their own restrictions and allocation rules on the cluster's resources and do not affect each other. I'm not quite sure about my guess; I'd be very grateful if someone could resolve my doubt.

(sigma.png architecture diagram omitted)

machines

Do "Machines" and "Server" mean physical machines here?

Question about State machine of batch task and instance

As given in trace_201708.md, both tasks and instances can have the status "Waiting". The documentation says:
task -> Waiting: the task is not initialized yet
instance -> Waiting: the instance can't run because some of its dependencies have not finished
IIUC, if an instance's status is "Waiting", we can be sure that some dependency among tasks has not been satisfied, so once a task's status is no longer "Waiting", its instances' status can change.
So do "Waiting" tasks mean they are waiting for other tasks to finish?

In addition, I found that some instances restart after reaching "Failed" status, but others do not. Is there any mechanism for deciding whether an instance should restart?

I would appreciate it if someone could help.

Adding LLC, Memory Bandwidth and Network Usage

The current dataset only includes CPU, memory, and disk usage data. I'm wondering if more kinds of resource usage data, such as LLC, memory bandwidth, and network, can be provided. This is important for investigating how batch jobs influence online services across multiple resource dimensions. Many thanks.

Is there some problem with the time_stamp in v2018?

8 days is 691200 seconds, and in machine_usage.csv a machine's time_stamp runs exactly from 0 to 691190.
But strangely, most containers in container_usage.csv have time_stamps larger than 691200. For example, for c_5911 (container_id = 5911), the time_stamp runs from 86400 to 777590.
And 777590 - 86400 = 691190 = 8 days.
This means the information for this container covers day 2 to day 9.
Why is there no information about the container on day 1? And why is there an extra day (day 9) with only container information? So, should we shift the time_stamp of container_usage?
@HaiyangDING
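The day arithmetic in this issue can be checked directly. A Python sketch using the timestamp spans quoted above:

```python
# Timestamp spans quoted in the issue (seconds since trace start).
machine_span = (0, 691190)        # machine_usage.csv
container_span = (86400, 777590)  # container_usage.csv, e.g. c_5911

DAY = 86400  # seconds per day

def span_in_days(start, end):
    """Convert a (start, end) timestamp span to 1-based (first_day, last_day)."""
    return start // DAY + 1, end // DAY + 1

print(span_in_days(*machine_span))    # (1, 8)
print(span_in_days(*container_span))  # (2, 9) -> shifted by exactly one day
```

Both spans cover the same 691190-second window; the container span is simply offset by one day, which is the shift the question points out.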

Negative timestamps in batch_instance.csv

It seems there are records with negative timestamps (both start and end) in the batch_instance.csv.

For instance:

-792,-788,6850,40552,546,Terminated,1,1,2,0.58,,

I didn't see any mention of this in the documentation (only about timestamps of 0 if the events occur before the start of the trace).

Can you please clarify what these entries mean?

Repeated entries in container_event.csv and resources requests above machine capacity

In container_event.csv I find that 1) some online instances seem to be created twice at the same timestamp on the same machine, and 2) values of plan_mem can be higher than the machine's capacity.

For instance, in the file container_event.csv I find the following consecutive lines:

0,Create,10772,85,4,0.042409339165997983,0.0340851218221338,32|33|34|35,
0,Create,10772,85,4,0.9999633571040303,0.0340851218221338,32|33|34|35,

The lines are the same except for the plan_mem value. Also, when looking at the capacity of machine #85 in server_event.csv, I see:

0,85,add,,64,0.6899697150104833,1

It seems this machine doesn't have enough memory for the creation of the service (0.68 < 0.99).

My questions are:

  1. Can a service be created twice (like the example above)?
  2. Is this request for 0.99 of memory valid considering that the machine only has 0.68?

Can you please explain this?

Thank you in advance for the clarifications.
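Duplicate creations like the pair quoted above can be found by grouping Create events. A Python sketch over the two quoted lines; the exact column meanings are assumptions from context, with the third field read as the instance id and the fourth as the machine id.

```python
from collections import Counter

# The two consecutive container_event.csv lines quoted in this issue
# (trailing columns omitted): timestamp, event, instance?, machine?, cpus, plan_mem
rows = [
    (0, "Create", 10772, 85, 4, 0.042409339165997983),
    (0, "Create", 10772, 85, 4, 0.9999633571040303),
]

def duplicate_creates(rows):
    """Return (timestamp, instance, machine) keys with more than one Create."""
    keys = Counter((ts, inst, machine)
                   for ts, ev, inst, machine, *_ in rows
                   if ev == "Create")
    return [k for k, n in keys.items() if n > 1]

print(duplicate_creates(rows))  # [(0, 10772, 85)]
```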

Question Regarding Normalized Memory Usage

My question is about the plan_mem column in batch_task.csv. I am still a little confused about the normalization standard. The schema says this column, plan_mem, specifies the normalized memory requested by each instance of the task, and that it is "normalized to the largest memory size of all machines".

A few tasks have values even larger than 15 in the plan_mem column (e.g., 15.45, 17.17).

  1. task_NDg2ODM2NDIyMDczNDQ4NzMzOA==,50,j_3003670,11,Terminated,233037,234282,700,17.17
  2. task_NDg2ODM2NDIyMDczNDQ4NzMzOA==,50,j_3685110,11,Terminated,471423,493080,700,15.45

Does this mean each instance in the task takes around 17.17 times the largest memory of any physical machine? If I assume the biggest memory on a physical machine in the cluster is 128 GB, will each instance take about 2 TB of memory, and the whole task eventually about 100 TB?

Thank you very much.

CPU numbers

In the batch instance table, the 9th column is "real_cpu_max: maximum cpu numbers of actual instance running".

Most of the values in this column are less than 1, yet it is also mentioned that CPU is not normalized. Can anyone explain the reason for this?

question about the time of trace

I tested the trace and found that the time in the server usage csv runs from 39600 to 82500, almost 12 hours, but the README says there are 24 hours. I then tested batch_instance, and its time is also 39600 to 82500. May I ask whether I tested it wrongly?

cpu utilization vs linux cpu load

What is the difference between CPU utilization and the Linux CPU load average? Both fields are present in the machine usage table, named util:CPU and load1 (the Linux CPU load average over 1 minute).
