
Alibaba Cluster Trace Program

Overview

The Alibaba Cluster Trace Program is published by Alibaba Group. By providing cluster traces from real production, the program helps researchers, students, and anyone interested in the field gain a better understanding of the characteristics of modern internet data centers (IDCs) and their workloads.

So far, six versions of traces have been released:

  • cluster-trace-v2017 includes about 1300 machines over a period of 12 hours. trace-v2017 was the first to introduce the colocation of online services (aka long-running applications) and batch workloads. To learn more about this trace, see the related documents (trace_2017). The download link is available after a short survey (survey link).
  • cluster-trace-v2018 includes about 4000 machines over a period of 8 days. Besides being larger in scale than trace-v2017, this trace also contains the DAG information of our production batch workloads. See the related documents for more details (trace_2018). The download link is available after a survey (less than a minute, survey link).
  • cluster-trace-gpu-v2020 includes over 6500 GPUs (on ~1800 machines) over a period of 2 months. It describes the AI/ML workloads in the MLaaS (Machine-Learning-as-a-Service) offering provided by Alibaba PAI (Platform for Artificial Intelligence) on GPU clusters. See the subdirectory (pai_gpu_trace_2020) for the released data, schema, and scripts for processing and visualization. Our analysis paper, published at USENIX NSDI '22, is available here.
  • cluster-trace-microservices-v2021 contains 20000+ microservices over a period of 12 hours. These are the first traces released to cover the runtime metrics of microservices in a production cluster, including call dependencies, response times, call rates, and so on. See the subdirectory (trace_2021) for more details. Our analysis paper, accepted by SoCC '21, is available here.
  • cluster-trace-microarchitecture-v2022 provides AMTrace (Alibaba Microarchitecture Trace), the first fine-granularity, large-scale collection of microarchitectural metrics from Alibaba's colocation datacenters. Based on AMTrace, researchers can analyze CPU performance, microarchitecture contention, memory bandwidth contention, and so on. Our paper was accepted by ICPP '22. See the subdirectory (trace_2022) for more details.
  • cluster-trace-gpu-v2023 includes over 6200 GPUs (on ~1200 machines). It describes the AI/ML workloads with diverse resource specifications in a heterogeneous GPU cluster. In our "Beware of Fragmentation" paper (published in USENIX ATC '23), we modeled this trace in a Kubernetes Scheduler Simulator and demonstrated that our proposed Fragmentation Gradient Descent (FGD) policy outperforms classic scheduling policies like Best-Fit, Dot-Product, etc. See fgd_gpu_trace_2023 for the released data, schema, and scripts for processing.

We encourage anyone to use the traces for study or research purposes. If you have any questions while using the traces, please contact us via email (alibaba-clusterdata) or file an issue on GitHub. Filing an issue is recommended, as the discussion can help the whole community. Note that the more clearly you ask your question, the more likely you are to get a clear answer.

It would be much appreciated if you could tell us once any publication using our traces becomes available, as we maintain a list of related publications so that researchers can better communicate with each other.

In the future, we will try to release new traces at a regular pace, so please stay tuned.

Our motivation

As said at the beginning, our motivation in publishing this data is to help people in related fields get a better understanding of modern data centers and to provide production data for researchers to verify their ideas. You may use the traces however you want, as long as it is for research or study purposes.

From our perspective, the data is provided to address the challenges Alibaba faces in IDCs where online services and batch jobs are colocated. We distill these challenges into the following topics:

  1. Workload characterization. How to characterize Alibaba workloads in a way that lets us simulate production workloads representatively for studies of scheduling and resource-management strategies.
  2. New algorithms to assign workloads to machines. How to assign and reschedule workloads across machines for better resource utilization while ensuring the performance SLA of different applications (e.g., by reducing resource contention and defining proper priorities).
  3. Collaboration between the online service scheduler (Sigma) and the batch job scheduler (Fuxi). How to adjust resource allocation between online services and batch jobs to improve the throughput of batch jobs while maintaining acceptable QoS (Quality of Service) and fast failure recovery for online services. As the scale of colocation (workloads managed by different schedulers) keeps growing, the design of the collaboration mechanism becomes more and more critical.

Last but not least, we are always open to working together with researchers to improve the efficiency of our clusters, and there are positions open for research interns. If you have any ideas in mind, please contact us via alibaba-clusterdata or Haiyang Ding (Haiyang maintains this cluster trace and works in Alibaba's resource management & scheduling group).

Outcomes from the trace

Papers using Alibaba cluster trace

The fundamental idea behind releasing our cluster data is to enable researchers and practitioners to do research and simulation with more realistic data, thus bringing the results closer to industry adoption. It is a huge encouragement to us to see more work using our data. Here is a list of existing works using Alibaba cluster data. If your paper uses our traces, it would be great if you let us know by sending us an email (alibaba-clusterdata).

Tech reports and projects on analysing the trace

So far this section is empty. In the future, we are going to link some reports and open-source repositories on how to analyze the traces here, with the permission of their owners.

The purpose of this is to help more beginners get started on learning either basic data analysis or how to inspect a cluster from a statistical perspective.

clusterdata's People

Contributors

andreadetti, changzihao, cll24, danibachar, felidsche, furykerry, haiyangding, kmmelcher, lioncruise, mygoditsfull0fstars, niewuya, packagewjx, qzweng, violet-guo, wang-kangjin, yzs981130


clusterdata's Issues

Memory bandwidth usage value (mem_gps)

Glad to see the new trace includes memory bandwidth usage information. I've checked several machine_usage entries and found non-empty values.
I'm somewhat confused by its description, "Normalized to maximum memory bandwidth usage of all machines". What exactly does this value mean? For example, if mem_gps is 5, does it mean that 5% of this machine's memory bandwidth is used, or that the bandwidth is 5 GB/s?

BTW, two minor concerns.

  1. This value is a float, but it is said to be an integer in 'trace_2018.md';
  2. In the name 'mem_gps' for memory bandwidth, what does 'gps' mean?

On preparing new version of data.

The purpose of this issue is to collect and discuss "what to include" in the next version of the released data.

As stated before, we have plans to release a version of the data that fulfills needs from both the academic and industry fields.

Currently we have heard of the following suggestions / requirements:

  • DAG info: #3
  • GPU and/or heterogeneous cluster data: #1

Please feel free to post your ideas here on what kind of information should be included in the next version of the data; we will evaluate these ideas and try to fulfill them depending on the difficulty of collecting the related information.

We are looking forward to hearing more from the community.

@gaocegege , @lioncruise , @uchuhimo , @CoffeeCandy , @xiandong79 , @lastweek, @allenbunny

Question about JobID and TaskID in batch_instance.csv

Hi, I found that 211305 records (1.31%) in batch_instance.csv have no JobID or TaskID.

Here are some examples:

['65351', '65517', '', '', '269', 'Running', '1', '1', '', '', '', '']
['65351', '65517', '', '', '160', 'Running', '1', '1', '', '', '', '']
['65351', '65517', '', '', '224', 'Running', '1', '1', '', '', '', '']
['65351', '65517', '', '', '656', 'Running', '1', '1', '', '', '', '']
['65351', '65517', '', '', '1061', 'Running', '1', '1', '', '', '', '']
['65351', '65517', '', '', '416', 'Running', '1', '1', '', '', '', '']
['65351', '65517', '', '', '1014', 'Running', '1', '1', '', '', '', '']
['65351', '65517', '', '', '963', 'Running', '1', '1', '', '', '', '']
['65351', '65517', '', '', '1067', 'Running', '1', '1', '', '', '', '']

Their CPU and memory info is missing, too. Is there any documentation about this kind of record?

I'd appreciate it if you could help me :)
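Rows like these can be counted with a short script. Below is a Python sketch over a few sample rows; the 12-column layout and the positions of JobID/TaskID are assumptions taken from the examples quoted above, and you would point it at the real batch_instance.csv to reproduce the 1.31% figure.

```python
import csv
import io

# A few sample rows from the issue above; assumed layout:
# start, end, job_id, task_id, instance, status, ... (12 columns).
sample = """65351,65517,,,269,Running,1,1,,,,
65351,65517,,,160,Running,1,1,,,,
6302,6339,10771,66551,427,Terminated,1,1,7.66,0.99,0.019,0.012
"""

def count_missing_ids(fileobj):
    """Count rows whose job_id (col 3) or task_id (col 4) is empty."""
    missing = total = 0
    for row in csv.reader(fileobj):
        total += 1
        if row[2] == "" or row[3] == "":
            missing += 1
    return missing, total

missing, total = count_missing_ids(io.StringIO(sample))
print(f"{missing}/{total} rows lack a JobID/TaskID")  # 2/3 for this sample
```

Replacing the in-memory sample with `open("batch_instance.csv")` gives the cluster-wide count.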

Max usage (CPU/mem) lower than average usage

I find records in batch_instance.csv for which the maximum-utilization attributes (real_cpu_max, real_mem_max) are lower than the average-utilization attributes (real_cpu_avg, real_mem_avg).

For instance, line 4475 of batch_instance.csv:

22556,22557,2632,15822,459,Terminated,1,1,0.92,0.93,0.002039706986255274,0.002153599254523335

Can you explain whether this is normal, and if not, why it happens? I would expect the maximum utilization values to be equal to or larger than the average.
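One way to find such records is to scan the file and compare the max and avg columns. A minimal Python sketch follows; the column indices are inferred from the single row quoted above, so treat them as assumptions.

```python
import csv
import io

# Assumed layout (from the row quoted in this issue):
# real_cpu_max at index 8, real_cpu_avg at index 9.
sample = """22556,22557,2632,15822,459,Terminated,1,1,0.92,0.93,0.002039706986255274,0.002153599254523335
6302,6339,10771,66551,427,Terminated,1,1,7.66,0.99,0.019,0.012
"""

def rows_with_max_below_avg(fileobj):
    """Return rows where the reported CPU max is below the CPU average."""
    bad = []
    for row in csv.reader(fileobj):
        try:
            cpu_max, cpu_avg = float(row[8]), float(row[9])
        except ValueError:
            continue  # skip rows with missing usage fields
        if cpu_max < cpu_avg:
            bad.append(row)
    return bad

bad = rows_with_max_below_avg(io.StringIO(sample))
print(len(bad))  # 1: the row with 0.92 < 0.93
```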

Adding Server Capacity and Attributes

I am wondering if machine attributes such as resource capacity and hardware information (number of cores, number of disks, number of CPUs, kernel version, clock speed, eth_speed, architecture, etc.) can be provided? The Google trace log includes similar descriptions, and they are quite important when investigating compaction effectiveness and scheduling quality. Many thanks.

Any plan for publishing job DAG information?

The scheduler may use DAG information in the job to achieve better resource utilization, which has been explored in Graphene. Publishing job DAG information will inspire further exploration to use DAG information in scheduling.

I cannot download the data and it shows "AccessDenied"

I open the v2018 download link and it shows the following information:
This XML file does not appear to have any style information associated with it. The document tree is shown below.

AccessDenied

You have no right to access this object because of bucket acl.

5C51C2AB307432057CAAF29B
clusterdata2018pubcn.oss-cn-beijing.aliyuncs.com

Unit of "mem_size" in container_meta.csv

The doc says mem_size is "normalized to the largest memory size of all machines", but the values are >1, while the normalized plan_mem in batch_task.csv is in [0, 1]. Are different normalization schemes used for the two datasets? Is it correct to assume that mem_size in container_meta.csv is the actual normalized value multiplied by 100? Thanks!

About dependences between tasks in the DAG of one job, and their start time.

"M5_3_4: means that task5 depends on both task3 and task4, that is, task5 cannot start before all instances of both task3 and task4 are completed." This is from "trace_2018.md (2.3 DAG of batch workloads)", which explains the DAG of a job and emphasizes that task5 cannot start before task3 and task4 have terminated.
However, many of the real records in batch_task.csv do not obey this dependence rule. For example, "R2_1,1,j_2,1,Terminated,87076,87086,50,0.2" and "M1,1,j_2,1,Terminated,87076,87083,50,0.2" are two tasks in the same job (j_2). We can see that task2 (R2_1) depends on task1 (M1), but the start time of task2 is the same as task1's (both are 87076), which confuses me.
Could you please clarify this?

A job contains multiple tasks

I don't quite understand "a job contains multiple tasks". Can anyone give an example of what a job is and what a task is? Thanks in advance.

[Confirmed] Unit Problem about CPU in batch_task.csv

Hi, I am trying to find ideas for my research paper via Alibaba's cluster data, and I have some trouble understanding the CPU figures for batch tasks in batch_task.csv.

I found that all machines have 64 cores in server_event.csv, but there are many batch tasks that need 100 cores. For example, the task with ID 79304:

['59581', '59592', '12702', '79304', '29', 'Terminated', '100', '0.005404705821447985']

It has only one record in batch_task.csv, whose state is Terminated and whose CPU is 100. But each of the task's 29 instances requires only 1 CPU, so I am not sure whether I have misunderstood something.

['59581', '59592', '12702', '79304', '446', 'Terminated', '1', '1', '1.01', '0.12', '0.035296249320529084', '0.03176558900421919']
['59581', '59592', '12702', '79304', '928', 'Terminated', '1', '1', '1.14', '0.13', '0.035296249320529084', '0.03171381979137007']
['59581', '59592', '12702', '79304', '724', 'Terminated', '1', '1', '0.87', '0.09', '0.026573136955452593', '0.022990707426293583']
['59581', '59592', '12702', '79304', '163', 'Terminated', '1', '1', '0.86', '0.14', '0.035296249320529084', '0.03171381979137007']
['59581', '59592', '12702', '79304', '43', 'Terminated', '1', '1', '0.85', '0.1', '0.026573136955452593', '0.022990707426293583']
['59581', '59592', '12702', '79304', '426', 'Terminated', '1', '1', '1.05', '0.21', '0.06459244687184532', '0.05919291797168224']
['59581', '59592', '12702', '79304', '1123', 'Terminated', '1', '1', '0.87', '0.16', '0.035296249320529084', '0.03144461988455466']
['59581', '59592', '12702', '79304', '1130', 'Terminated', '1', '1', '0.98', '0.14', '0.035296249320529084', '0.03144461988455466']
['59581', '59592', '12702', '79304', '1132', 'Terminated', '1', '1', '1', '0.13', '0.03310641161701136', '0.027753475008412496']
['59581', '59592', '12702', '79304', '1170', 'Terminated', '1', '1', '0.91', '0.19', '0.050402505629901895', '0.04456811534180623']
['59581', '59592', '12702', '79304', '485', 'Terminated', '1', '1', '1.05', '0.1', '0.03305464240416225', '0.029523982087852355']
['59581', '59592', '12702', '79304', '168', 'Terminated', '1', '1', '1', '0.14', '0.03638340279036057', '0.031030466181761706']
['59581', '59592', '12702', '79304', '175', 'Terminated', '1', '1', '1', '0.1', '0.026583490798022417', '0.023001061268863407']
['59581', '59592', '12702', '79304', '792', 'Terminated', '1', '1', '0.82', '0.14', '0.03638340279036057', '0.032531773354386145']
['59581', '59592', '12702', '79304', '1240', 'Terminated', '1', '1', '0.98', '0.22', '0.05653715735252245', '0.051137628452359384']
['59581', '59592', '12702', '79304', '180', 'Terminated', '1', '1', '0.9', '0.13', '0.035296249320529084', '0.03144461988455466']
['59581', '59592', '12702', '79304', '569', 'Terminated', '1', '1', '1.17', '0.25', '0.05863898739419667', '0.05146895141459374']
['59581', '59592', '12702', '79304', '781', 'Terminated', '1', '1', '0.72', '0.12', '0.030155566484611603', '0.026303937048637177']
['59581', '59592', '12702', '79304', '790', 'Terminated', '1', '1', '0.93', '0.12', '0.035301426241813996', '0.029948489633215128']
['59581', '59592', '12702', '79304', '1217', 'Terminated', '1', '1', '2.5', '0.09', '0.026303937048637177', '0.022219346154841717']
['59581', '59592', '12702', '79304', '366', 'Terminated', '1', '1', '1', '0.1', '0.026619729247016798', '0.02303729971785779']
['59581', '59592', '12702', '79304', '271', 'Terminated', '1', '1', '0.78', '0.15', '0.03594854140242797', '0.0317707659255041']
['59581', '59592', '12702', '79304', '534', 'Terminated', '1', '1', '0.86', '0.08', '0.026045090984391582', '0.02412445318768928']
['59581', '59592', '12702', '79304', '1280', 'Terminated', '1', '1', '0.98', '0.17', '0.04632826857867626', '0.0409753319700774']
['59581', '59592', '12702', '79304', '641', 'Terminated', '1', '1', '2', '0.1', '0.029197836046902906', '0.02520642973623586']
['59581', '59592', '12702', '79304', '412', 'Terminated', '1', '1', '0.93', '0.1', '0.02659384464059224', '0.022742215204617815']
['59581', '59592', '12702', '79304', '1030', 'Terminated', '1', '1', '0.92', '0.12', '0.035296249320529084', '0.03171381979137007']
['59581', '59592', '12702', '79304', '992', 'Terminated', '1', '1', '1.18', '0.12', '0.03642481816063987', '0.032852742474050685']
['59581', '59592', '12702', '79304', '815', 'Terminated', '1', '1', '0.85', '0.1', '0.02659384464059224', '0.022742215204617815']

I'd appreciate it if you could answer.

Ce Gao


Data Inconsistency in container_event and container_usage

Hi, I tried to analyze the data about long-running jobs in the Alibaba trace, but I found that there are some records in container_usage.csv whose instance_id does not exist in container_event.csv.

For instance, there is an entry in container_usage.csv:

['44700', '9088', '9.199999999999992', '35.70000076289956', '5.659999847414268', '0.3200000047682117', '0.3600000083444044', '0.4000000059600084', '0.2258861563823704', '0.2572154391695187', '2.60268044472', '2.83470249176']

The instance_id is 9088, but there is no entry in container_event.csv whose instance_id is 9088.

There are 1480906 entries in container_usage.csv in total, and 24570 of them are missing from container_event.csv. What does this mean?

I'd appreciate it if you could help me. @furykerry

Best,
Ce Gao

Question about CPU allocation on containerized online service

According to container_event.csv, there are only "Create" events during the 12 h, without any "Remove" events. I then used the following SQL to compute each machine's total CPU allocation for containerized online services (to facilitate analysis, I imported the csv file intact into a MySQL table).

SELECT machine_id, sum(cpu_requested) AS total_cpu_requested  
FROM container_event 
GROUP BY machine_id 
ORDER BY total_cpu_requested DESC 
LIMIT 10;

query result:

+------------+---------------------+
| machine_id | total_cpu_requested |
+------------+---------------------+
|        676 |                 124 |
|        673 |                 120 |
|        679 |                 116 |
|        813 |                  84 |
|        829 |                  72 |
|        797 |                  72 |
|       1134 |                  72 |
|        102 |                  68 |
|         69 |                  68 |
|        671 |                  68 |
+------------+---------------------+

According to server_event.csv, each machine has 64 CPU cores. So how can more than 64 CPUs be allocated on a single machine, as shown above?

In case a machine crashed during the 12 h period, which could cause anomalies in resource allocation for containerized online services, I specifically queried these machines in server_event.csv, but they seem to have no problems.

SELECT timestamp,machine_id,event_type, event_detail
FROM server_event
WHERE machine_id
IN ( SELECT machine_id
      FROM container_event
      GROUP BY machine_id
      HAVING sum(cpu_requested) > 64
)

query result:

+-----------+------------+------------+--------------+
| timestamp | machine_id | event_type | event_detail |
+-----------+------------+------------+--------------+
|         0 |         19 | add        | NULL         |
|         0 |         56 | add        | NULL         |
|         0 |         69 | add        | NULL         |
|         0 |        102 | add        | NULL         |
|         0 |        323 | add        | NULL         |
|         0 |        671 | add        | NULL         |
|         0 |        673 | add        | NULL         |
|         0 |        676 | add        | NULL         |
|         0 |        679 | add        | NULL         |
|         0 |        797 | add        | NULL         |
|         0 |        813 | add        | NULL         |
|         0 |        829 | add        | NULL         |
|         0 |       1241 | add        | NULL         |
|         0 |       1251 | add        | NULL         |
|         0 |       1295 | add        | NULL         |
|         0 |       1120 | add        | NULL         |
|         0 |       1134 | add        | NULL         |
+-----------+------------+------------+--------------+

I would be very grateful if anyone could give an explanation.

About batch workload

  1. The instances of the batch workload are VMs; may I ask what type of VM the Alibaba cluster uses (KVM, Xen, or others)?
  2. As mentioned in "Papers using Alibaba cluster data", the resource management and job scheduling system for batch workloads is Fuxi, and it can handle different applications (MapReduce, Spark, etc.). May I ask what types of applications are in trace_201708, and what batch processing system is mainly used in the Alibaba cluster?

How is the container CPU usage calculated?

Hi All

I have a question regarding the container CPU usage ("cpu_util_precent" in container_usage.csv). How is this value calculated?

I am a bit confused about this value. For example, the machine "m_644" at timestamp "90080" hosts 7 containers:

container_id  machine_id  timestamp  cpu_util_precent
c_31920       m_644       90080      13
c_5644        m_644       90080      8
c_28792       m_644       90080      7
c_21826       m_644       90080      7
c_35230       m_644       90080      7
c_58607       m_644       90080      19
c_3           m_644       90080      18

The total CPU usage of the containers at time 90080 is 79, while cpu_util_precent in machine_usage.csv at the same timestamp "90080" is 42. Can anyone explain how "cpu_util_precent" in container_usage.csv is calculated?
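The 79-vs-42 discrepancy described in this issue can be reproduced by summing the per-container figures. A small Python sketch using only the rows quoted above (the tuple layout here is an assumption for illustration):

```python
from collections import defaultdict

# Per-container samples quoted in the issue (machine m_644, timestamp 90080):
# (container_id, machine_id, timestamp, cpu_util_precent)
container_rows = [
    ("c_31920", "m_644", 90080, 13),
    ("c_5644",  "m_644", 90080, 8),
    ("c_28792", "m_644", 90080, 7),
    ("c_21826", "m_644", 90080, 7),
    ("c_35230", "m_644", 90080, 7),
    ("c_58607", "m_644", 90080, 19),
    ("c_3",     "m_644", 90080, 18),
]

def total_container_cpu(rows):
    """Sum per-container cpu_util_precent per (machine, timestamp) key."""
    totals = defaultdict(int)
    for _cid, machine, ts, cpu in rows:
        totals[(machine, ts)] += cpu
    return totals

totals = total_container_cpu(container_rows)
print(totals[("m_644", 90080)])  # 79, vs. 42 reported in machine_usage.csv
```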

Thank you

Description of different states

Hi,
Is it possible to give a short description about the following instance status?

  • Ready
  • Waiting
  • Running
  • Terminated
  • Failed
  • Cancelled
  • Interrupted

Add license

This dataset is very valuable, not only for academic purposes but also for further developing the online economy. Thus I think this dataset should have a license to govern its use. I suggest CC-BY-4.0, as recommended by GitHub's guide.

Unable to download

Why does nothing download when I click the download link? Or is there a specific method needed to download it?

Resource usage for task can be higher than resources requested

There are many tasks in the dataset that use more resources than were requested.

For instance, job_id:10771 task_id:66551 has plan_cpu:0.75 [1] from the following entry in batch_task.csv:

6301,6352,10771,66551,137,Terminated,75,0.01600704061294748

However, this task utilizes 7.66 (Max) and 0.99 (average) CPU as can be seen in batch_instance.csv:

6302,6339,10771,66551,427,Terminated,1,1,7.66,0.99,0.019309916392721248,0.012926772448424922

Can you clarify whether the amount of resources used by a task can be higher than the amount requested? If not, what explains these numbers?

Can I interpret the amount of resources requested as resources allocated by the scheduler?

[1] I divided the plan_cpu value by 100 as explained in this issue: #11

very slow to download

it will take me 12 days to download (from US). Is there a better way to host the data? e.g., via DropBox? of host a copy in US or so?

Missing column in Trace_2018 machine_usage.csv

Hi Alibaba experts,
I notice that each record in machine_usage.csv contains 9 columns, as shown below:

m_1932,388730,37,83,,,43.08,33.12,3
m_1932,388770,34,83,,,43.08,33.12,2

However, trace_2018.md says there should be 10 columns.
Is something missing?
Thanks
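A quick way to confirm the column count is to tally field counts per row. A Python sketch over the two rows quoted above (note that empty fields still count as columns, so a row that parses to fewer fields really is missing a column):

```python
import csv
import io
from collections import Counter

# The two machine_usage.csv rows quoted in this issue.
sample = """m_1932,388730,37,83,,,43.08,33.12,3
m_1932,388770,34,83,,,43.08,33.12,2
"""

def column_counts(fileobj):
    """Tally how many rows have each number of columns."""
    return Counter(len(row) for row in csv.reader(fileobj))

counts = column_counts(io.StringIO(sample))
print(counts)  # Counter({9: 2}) -> 9 columns, not the 10 the doc describes
```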

About the batch_task.csv

  1. About the 2nd field in batch_task.csv:

Task table (batch_task.csv)

create_timestamp: the create time of a task
modify_timestamp: latest state modification time

in schema.csv: (screenshot of schema.csv omitted)

So is the 2nd field the "latest state modification time" or the "task end time"?

  2. The task create time is a timestamp, and it seems to span longer than 24 hours.

Adding Network data

The current data set includes CPU, mem, and disk, would it be possible to add network information?
The NIC speeds and the network utilization?

Lifecycle of Batch Task

In the task table (batch_task.csv), each task can have a status of Ready | Waiting | Running | Terminated | Failed | Cancelled.

With some analysis, it seems that each job-id/task-id pair appears only once in batch_task.csv.

Shouldn't tasks go through different stages, for instance Waiting -> Ready -> Running -> Terminated, and thus appear multiple times in this file?

What do create_timestamp and modify_timestamp mean for an entry with Running status vs. one with Terminated status? Do tasks with 'Running' status ever finish? If so, when does that happen?

Server information

  1. In the server event table, all machines have 64 CPUs; are these actual physical cores or virtual CPUs?
  2. In the server event table, some machines have memory listed as 0.6. Is this the free available memory, with the remainder occupied by software?
  3. Can you tell me the type of machine (like Dell PowerEdge) used in Alibaba's servers?

Question about sigma.png

Hello guys, can anyone give a description of the architecture diagram below? I wonder whether the scheduling of the 1.3k-machine cluster is done by two separate schedulers, SigmaMaster and FuxiMaster, where SigmaMaster is responsible for containerized online service instances and FuxiMaster for traditional batch jobs, and the two schedulers have their own restrictions and allocation rules on the cluster's resources and do not affect each other. I'm not quite sure about my guess; I'd be very grateful if someone could resolve my doubt.

(sigma.png architecture diagram omitted)

machines

Do "Machines" and "Server" mean physical machines here?

Question about State machine of batch task and instance

As given in trace_201708.md, both tasks and instances can have the status "Waiting". The documentation says:
task -> Waiting: the task is not initialized yet
instance -> Waiting: the instance can't run because some of its dependencies have not finished
IIUC, if an instance's status is "Waiting", we can be sure that some dependency among tasks has not been satisfied, so once a task's status is no longer "Waiting", its instances' status can change.
So do "Waiting" tasks mean they are waiting for other tasks to finish?

In addition, I found that some instances restart after reaching "Failed" status, but others do not. Is there any mechanism for deciding whether an instance should restart?

I would appreciate it if someone could help.

Adding LLC, Memory Bandwidth and Network Usage

The current dataset only includes CPU, memory, and disk usage data. I'm wondering if more kinds of resource usage data, such as LLC, memory bandwidth, and network, can be provided. This is important for investigating how batch jobs influence online services across multiple resource dimensions. Many thanks.

Is there some problem with the time_stamp in v2018?

8 days is 691200 seconds, and in machine_usage.csv a machine's time_stamp runs exactly from 0 to 691190.
But strangely, most containers in container_usage.csv have time_stamps larger than 691200. For example, for c_5911 (container_id = 5911), the time_stamp runs from 86400 to 777590.
And 777590 - 86400 = 691190 = 8 days.
This means the information for this container covers day 2 to day 9.
Why is there no information about the container on day 1? And why is there an extra day (day 9) with only container information? So, should we shift the time_stamp of container_usage?
@HaiyangDING
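The day arithmetic in this issue can be checked directly. A Python sketch using the timestamp spans quoted above:

```python
# Timestamp spans quoted in the issue (seconds since trace start).
machine_span = (0, 691190)        # machine_usage.csv
container_span = (86400, 777590)  # container_usage.csv, e.g. c_5911

DAY = 86400  # seconds per day

def span_in_days(start, end):
    """Convert a (start, end) timestamp span to 1-based (first_day, last_day)."""
    return start // DAY + 1, end // DAY + 1

print(span_in_days(*machine_span))    # (1, 8)
print(span_in_days(*container_span))  # (2, 9) -> shifted by exactly one day
```

Both spans cover the same 691190-second window; the container span is simply offset by one day, which is the shift the question points out.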

Negative timestamps in batch_instance.csv

It seems there are records with negative timestamps (both start and end) in the batch_instance.csv.

For instance:

-792,-788,6850,40552,546,Terminated,1,1,2,0.58,,

I didn't see any mention of this in the documentation (only about timestamps of 0 if the events occur before the start of the trace).

Can you please clarify what these entries mean?

Repeated entries in container_event.csv and resources requests above machine capacity

In container_event.csv I find that 1) some online instances seem to be created twice at the same timestamp on the same machine, and 2) values of plan_mem can be higher than the machine's capacity.

For instance, in the file container_event.csv I find the following consecutive lines:

0,Create,10772,85,4,0.042409339165997983,0.0340851218221338,32|33|34|35,
0,Create,10772,85,4,0.9999633571040303,0.0340851218221338,32|33|34|35,

The lines are the same except for the plan_mem value. Also, when looking at the capacity of machine #85 in server_event.csv, I see:

0,85,add,,64,0.6899697150104833,1

It seems this machine doesn't have enough memory for the creation of the service (0.68 < 0.99).

My questions are:

  1. Can a service be created twice (like the example above)?
  2. Is this request for 0.99 of memory valid considering that the machine only has 0.68?

Can you please explain this?

Thank you in advance for the clarifications.
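Duplicate creations like the pair quoted above can be found by grouping Create events. A Python sketch over the two quoted lines; the exact column meanings are assumptions from context, with the third field read as the instance id and the fourth as the machine id.

```python
from collections import Counter

# The two consecutive container_event.csv lines quoted in this issue
# (trailing columns omitted): timestamp, event, instance?, machine?, cpus, plan_mem
rows = [
    (0, "Create", 10772, 85, 4, 0.042409339165997983),
    (0, "Create", 10772, 85, 4, 0.9999633571040303),
]

def duplicate_creates(rows):
    """Return (timestamp, instance, machine) keys with more than one Create."""
    keys = Counter((ts, inst, machine)
                   for ts, ev, inst, machine, *_ in rows
                   if ev == "Create")
    return [k for k, n in keys.items() if n > 1]

print(duplicate_creates(rows))  # [(0, 10772, 85)]
```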

Question Regarding Normalized Memory Usage

My question is about the plan_mem column in batch_task.csv. I am still a little confused about the normalization standard. The schema says this column, plan_mem, specifies the normalized memory requested by each instance of the task, and that it is "normalized to the largest memory size of all machines".

A few tasks have values even larger than 15 in the plan_mem column (e.g., 15.45, 17.17).

  1. task_NDg2ODM2NDIyMDczNDQ4NzMzOA==,50,j_3003670,11,Terminated,233037,234282,700,17.17
  2. task_NDg2ODM2NDIyMDczNDQ4NzMzOA==,50,j_3685110,11,Terminated,471423,493080,700,15.45

Does this mean each instance in the task takes around 17.17 times the largest memory of any physical machine? If I assume the biggest memory on a physical machine in the cluster is 128 GB, will each instance take about 2 TB of memory, and the whole task eventually about 100 TB?

Thank you very much.

CPU numbers

In the batch instance table, the 9th column is "real_cpu_max: maximum cpu numbers of actual instance running".

Most of the values in this column are less than 1, yet it is also mentioned that CPU is not normalized. Can anyone explain the reason for this?

question about the time of trace

I tested the trace and found that the time in the server usage csv runs from 39600 to 82500, almost 12 hours, but the README says there are 24 hours. I then tested batch_instance, and its time is also 39600 to 82500. May I ask whether I tested it wrongly?

cpu utilization vs linux cpu load

What is the difference between CPU utilization and the Linux CPU load average? Both fields are present in the machine usage table, named util:CPU and load1 (the Linux CPU load average over 1 minute).
