👋 Hi! I'm Marco!
I'm a Senior DevOps Engineer at Ada Health in Berlin, Germany.
I love open-source, cloud computing, string instruments, and Oxford commas.
☸ Kubernetes periodic benchmarking tool with Prometheus Pushgateway results exposer
License: MIT License
- `.pkb` log files and retrieve essential data
- `.csv` file with results of different experiments
- `rsync` scripts

The `dk8s-cronjob` image does not manage to resolve `PerfkitBenchmarker`'s dependencies:
Installing PerfKitBenchmarker dependencies...
Collecting absl-py (from -r requirements.txt (line 14))
Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd56d277990>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution',)': /simple/absl-py/
Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd56d277f50>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution',)': /simple/absl-py/
Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd56d367090>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution',)': /simple/absl-py/
Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd56d3671d0>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution',)': /simple/absl-py/
Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd56d367310>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution',)': /simple/absl-py/
Could not find a version that satisfies the requirement absl-py (from -r requirements.txt (line 14)) (from versions: )
No matching distribution found for absl-py (from -r requirements.txt (line 14))
When launching benchmarks in a CronJob, results will be stored inside `dk8s-cronjob` containers. The final results collector must be able to retrieve a container's results as soon as its job finishes.
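As a sketch, such a collector could shell out to `kubectl cp` once a job's pod completes. The pod name, in-container results path, and helper name below are assumptions for illustration, not the actual dk8s layout:

```python
import shlex

def build_copy_cmd(pod, results_path, dest_dir, namespace="default"):
    """Build a `kubectl cp` command that copies a finished pod's results
    directory to the local machine. `results_path` is wherever the
    dk8s-cronjob container writes PerfKitBenchmarker output (hypothetical)."""
    return shlex.join(["kubectl", "cp", f"{namespace}/{pod}:{results_path}", dest_dir])

print(build_copy_cmd("dk8s-cronjob-1234", "/tmp/perfkitbenchmarker", "./results"))
```

The collector would run one such command per completed job before the pod is garbage-collected.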
Wrong flags are being used.
For `cluster_boot`:
perfkitbenchmarker.errors.UnrecognizedOption: Unrecognized options were found in cluster_boot: cluster_boot_time_reboot.
For `redis`:
perfkitbenchmarker.errors.MissingOption: Required options were missing from redis.vm_groups.clients: vm_spec.
One of those:
- `benchetes`: BENCHmarks on kubernETES
- `kubemarks`: KUBErnetes benchMARKS

---
Benchmarks might be executed periodically using CronJobs.
Practically, this could be done by wrapping commands currently issued by PerfkitBenchmarker in the following way:
kubectl run hello --schedule="*/1 * * * *" --restart=OnFailure --image=busybox -- /bin/sh -c "date; echo Hello from the Kubernetes cluster"
Also, a ConfigMap should contain a list of benchmarks to execute, as well as their frequency.
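As an illustrative sketch, a small driver could read that benchmark-to-schedule mapping (here an inline dict standing in for the ConfigMap) and emit one `kubectl run` invocation per benchmark, mirroring the command above. The image name and `pkb.py` arguments are assumptions:

```python
import shlex

# Hypothetical mapping a ConfigMap could hold: benchmark name -> cron schedule.
BENCHMARKS = {
    "cluster_boot": "*/5 * * * *",
    "fio": "0 * * * *",
}

def cronjob_command(benchmark, schedule, image="marcomicera/dk8s-cronjob"):
    """Build the `kubectl run` invocation that schedules one benchmark."""
    return shlex.join([
        "kubectl", "run", benchmark,
        f"--schedule={schedule}",
        "--restart=OnFailure",
        f"--image={image}",
        "--", "./pkb.py", f"--benchmarks={benchmark}",
    ])

for name, schedule in BENCHMARKS.items():
    print(cronjob_command(name, schedule))
```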
List of best practices here.
This page in the Prometheus documentation shows the text format details.
metric_name [
"{" label_name "=" `"` label_value `"` { "," label_name "=" `"` label_value `"` } [ "," ] "}"
] value [ timestamp ]
An example would be:
# HELP http_requests_total The total number of HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="post",code="200"} 1027 1395066363000
http_requests_total{method="post",code="400"} 3 1395066363000
# Escaping in label values:
msdos_file_access_time_seconds{path="C:\\DIR\\FILE.TXT",error="Cannot find file:\n\"FILE.TXT\""} 1.458255915e9
# Minimalistic line:
metric_without_timestamp_and_labels 12.47
# A weird metric from before the epoch:
something_weird{problem="division by zero"} +Inf -3982045
# A histogram, which has a pretty complex representation in the text format:
# HELP http_request_duration_seconds A histogram of the request duration.
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.05"} 24054
http_request_duration_seconds_bucket{le="0.1"} 33444
http_request_duration_seconds_bucket{le="0.2"} 100392
http_request_duration_seconds_bucket{le="0.5"} 129389
http_request_duration_seconds_bucket{le="1"} 133988
http_request_duration_seconds_bucket{le="+Inf"} 144320
http_request_duration_seconds_sum 53423
http_request_duration_seconds_count 144320
# Finally a summary, which has a complex representation, too:
# HELP rpc_duration_seconds A summary of the RPC duration in seconds.
# TYPE rpc_duration_seconds summary
rpc_duration_seconds{quantile="0.01"} 3102
rpc_duration_seconds{quantile="0.05"} 3272
rpc_duration_seconds{quantile="0.5"} 4773
rpc_duration_seconds{quantile="0.9"} 9001
rpc_duration_seconds{quantile="0.99"} 76656
rpc_duration_seconds_sum 1.7560473e+07
rpc_duration_seconds_count 2693
Commit 1a5a5a7 in my PerfKitBenchmarker fork added the append mode to the CSV results writer.
Another writer could export this data into the OpenMetrics format.
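A sketch of what such a writer's core formatting step could look like, following the text format shown above. The function name and signature are illustrative, not part of PerfKitBenchmarker:

```python
def to_prometheus_line(metric, labels, value, timestamp_ms=None):
    """Render one sample in the Prometheus text exposition format:
    metric_name{label="value",...} value [timestamp]"""
    def escape(v):
        # Label values escape backslash, double quote, and newline.
        return v.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")

    label_str = ",".join(f'{k}="{escape(str(v))}"' for k, v in sorted(labels.items()))
    line = f"{metric}{{{label_str}}} {value}" if labels else f"{metric} {value}"
    if timestamp_ms is not None:
        line += f" {timestamp_ms}"
    return line

print(to_prometheus_line("boot_time_seconds", {"benchmark": "cluster_boot"}, 10.3))
```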
Upon finishing a benchmark involving more than one VM (pod), PerfKit runs into this error:
Traceback (most recent call last):
File "pkb/pkb.py", line 21, in <module>
sys.exit(Main())
File "/home/root/distributed-k8s/pkb/perfkitbenchmarker/pkb.py", line 1209, in Main
return RunBenchmarks()
File "/home/root/distributed-k8s/pkb/perfkitbenchmarker/pkb.py", line 1122, in RunBenchmarks
collector.PublishSamples()
File "/home/root/distributed-k8s/pkb/perfkitbenchmarker/publisher.py", line 1108, in PublishSamples
publisher.PublishSamples(self.samples)
File "/home/root/distributed-k8s/pkb/perfkitbenchmarker/publisher.py", line 582, in PublishSamples
registry=self.registry).labels(*(label_values + metadata_label_values))
File "/usr/local/lib/python2.7/dist-packages/prometheus_client/metrics.py", line 324, in __init__
labelvalues=labelvalues,
File "/usr/local/lib/python2.7/dist-packages/prometheus_client/metrics.py", line 107, in __init__
registry.register(self)
File "/usr/local/lib/python2.7/dist-packages/prometheus_client/registry.py", line 29, in register
duplicates))
ValueError: Duplicated timeseries in CollectorRegistry: set(['boot_time_seconds'])
`PushgatewayPublisher` needs to have a dictionary of `Gauge`s, so it can re-use the ones that have been previously used for exposing metrics.
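A minimal sketch of that caching pattern. The `Gauge` class below is a self-contained stand-in for `prometheus_client.Gauge` (which is what raises `ValueError` on duplicate registration in the same `CollectorRegistry`), and `GaugeCache` is an illustrative name, not the actual publisher code:

```python
class Gauge:
    """Stand-in for prometheus_client.Gauge, so this sketch is runnable
    without the library installed."""
    def __init__(self, name, documentation, labelnames=()):
        self.name = name
        self.labelnames = tuple(labelnames)
        self.values = {}

    def labels(self, *labelvalues):
        self.values.setdefault(labelvalues, 0.0)
        return self

class GaugeCache:
    """Re-use one Gauge per (metric name, label set) instead of
    re-registering it on every PublishSamples() call."""
    def __init__(self):
        self._gauges = {}

    def get(self, name, documentation, labelnames):
        key = (name, tuple(labelnames))
        if key not in self._gauges:
            self._gauges[key] = Gauge(name, documentation, labelnames)
        return self._gauges[key]

cache = GaugeCache()
g1 = cache.get("boot_time_seconds", "VM boot time", ["vm"])
g2 = cache.get("boot_time_seconds", "VM boot time", ["vm"])
assert g1 is g2  # same object re-used, so no duplicate registration
```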
CronJobs use the `dk8s-cronjob` image. They must provide the `kubectl` path to `PerfkitBenchmarker` in order for it to work:
https://github.com/marcomicera/distributed-k8s/blob/5e4c7a79b2b9712a16585891daba993da690eced/start.sh#L35
These Docker images do not have access to the same `kubectl` command of the machine on which `start_cron.sh` was launched in the first place.
2019-10-28 19:26:23,983 f786558f MainThread INFO Flag values:
...
--kubectl=
...
Exception: Please provide path to kubectl tool using --kubectl flag. Exiting.
More specifically, the `job` field must be the benchmark name:
push_to_gateway(gateway=FLAGS.pushgateway, job=sample['test'], registry=registry)
When the Docker image `dk8s-pkb` issues a `sudo` command in `start.sh`, the container log shows:
sudo: no tty present and no askpass program specified
To choose amongst this list:
- `unit`: measurement unit
- `run_uri`: benchmark execution ID
- `sample_uri`: sample ID
- `cloud`
- `data_disk_0_num_stripes`
- `data_disk_0_size`
- `data_disk_0_type`
- `data_disk_count`
- `direct`
- `directory`: `/scratch`
- `end_fsync`
- `filename`
- `filesize`
- `fio_job`
- `image`
- `invalidate`
- `iodepth`
- `ioengine`
- `kernel_release`
- `max`
- `mean`
- `min`
- `node_name`
- `num_cpus`: `CPU(s)` column for `lscpu` entries
- `numa_node_count`
- `os_info`
- `os_type`
- `overwrite`
- `p1`, `p5`, `p10`, `p20`, `p30`, `p40`, `p50`, `p60`, `p70`, `p80`, `p90`, `p95`, `p99`, `p99.5`, `p99.9`, `p99.95`, `p99.99`
- `perfkitbenchmarker_version`
- `randrepeat`
- `run_number`
- `rw`
- `size`
- `stddev`
- `tcp_congestion_control`: always `cubic`
- `vm_count`
- `workload_mode`
- `zone`
`lscpu` command entries:
- `Architecture`
- `BogoMIPS`
- `Byte Order`
- `CPU MHz`
- `CPU family`
- `CPU max MHz`
- `CPU min MHz`
- `CPU op-mode(s)`
- `CPU(s)`: `num_cpus` is not set for `lscpu` entries, but this is
- `Core(s) per socket`
- `Flags`: huge list like:
  `fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm cpuid_fault epb pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts flush_l1d`
- `L1d cache`
- `L1i cache`
- `L2 cache`
- `L3 cache`
- `Model`
- `Model name`
- `NUMA node(s)`
- `NUMA node0 CPU(s)`
- `On-line CPU(s) list`
- `Socket(s)`
- `Stepping`
- `Thread(s) per core`
- `Vendor ID`
- `Virtualization`

- `bw_agg`
- `bw_dev`
- `bw_max`
- `bw_mean`
- `bw_min`
Currently, the Pushgateway address is stored within this script:
https://github.com/marcomicera/distributed-k8s/blob/fa29b453d8697b5c4e280a3e92e217b88089514f/start.sh#L34
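One way to avoid the hard-coded address would be to read it from the environment instead; a sketch (the `PUSHGATEWAY_ADDRESS` variable name and the default are assumptions, not existing dk8s settings):

```python
import os

def pushgateway_address(default="localhost:9091"):
    """Return the Pushgateway address from the environment, falling back
    to a default, instead of hard-coding it in start.sh."""
    return os.environ.get("PUSHGATEWAY_ADDRESS", default)

print(pushgateway_address())
```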
I.e.,
- `experiment-conf.yaml` is too general
- `cronjob.yaml` is too general

`PerfKitBenchmarker` expects different benchmarks to write their results in different folders: it refuses to run an experiment if its results folder name is longer than 12 characters, among other constraints I didn't really go through.
Problem is, it's not possible to group different benchmark results of a single `PKB` run.
E.g., `cluster_boot` every 5 minutes, since it completes in ~10 seconds.
Using their Helm Charts should be enough.
A list of tested, fully-functioning benchmarks:
- `block_storage_workload`
- `cassandra_ycsb`
- `cassandra_stress`
- `cluster_boot`
- `fio`
- `iperf` (marcomicera/PerfKitBenchmarker@8488f24)
- `mesh_network` (marcomicera/PerfKitBenchmarker@8488f24)
- `mongodb_ycsb`
- `netperf` (marcomicera/PerfKitBenchmarker@8488f24)
- `redis`
A ConfigMap could store: