etcd-io / dbtester Goto Github PK

Distributed database benchmark tester

Home Page: https://github.com/etcd-io/dbtester/tree/master/test-results

License: Apache License 2.0

Go 92.81% Shell 7.19%

etcd zookeeper consul go benchmark database distributed-systems distributed-database performance-analysis performance-visualization database-benchmarking

dbtester's Introduction

dbtester

Distributed database benchmark tester: etcd, Zookeeper, Consul, zetcd, cetcd

It includes github.com/golang/freetype, which is based in part on the work of the FreeType Team.

Performance Analysis

Latest test results can be found at https://github.com/etcd-io/dbtester/tree/master/test-results
Exploring Performance of etcd, Zookeeper and Consul Consistent Key-value Datastores (February 17, 2017)
- https://coreos.com/blog/performance-of-etcd.html

Project

Database Agent
- https://github.com/etcd-io/dbtester/tree/master/agent
Database Client
- https://github.com/etcd-io/dbtester/tree/master/control
System Metrics
- https://github.com/gyuho/linux-inspect
Test Data Analysis

For etcd, we recommend etcd benchmark tool.

All logs and results can be found at https://github.com/etcd-io/dbtester/tree/master/test-results or https://console.cloud.google.com/storage/browser/dbtester-results/?authuser=0&project=etcd-development.

Noticeable Warnings: Zookeeper

Snapshot, when writing 1-million entries (256-byte key, 1KB value value), with 500 concurrent clients

# snapshot warnings
cd 2017Q1-00-etcd-zookeeper-consul/02-write-1M-keys-best-throughput
grep -r -i fsync-ing\ the zookeeper-r3.4.9-java8-* | less

2017-02-10 18:55:38,997 [myid:3] - WARN  [SyncThread:3:SyncRequestProcessor@148] - Too busy to snap, skipping
2017-02-10 18:55:38,998 [myid:3] - INFO  [SyncThread:3:FileTxnLog@203] - Creating new log file: log.1000c0c51
2017-02-10 18:55:40,855 [myid:3] - INFO  [SyncThread:3:FileTxnLog@203] - Creating new log file: log.1000cd2e6
2017-02-10 18:55:40,855 [myid:3] - INFO  [Snapshot Thread:FileTxnSnapLog@240] - Snapshotting: 0x1000cd1ca to /home/gyuho/zookeeper/zookeeper.data/version-2/snapshot.1000cd1ca
2017-02-10 18:55:46,382 [myid:3] - WARN  [SyncThread:3:FileTxnLog@338] - fsync-ing the write ahead log in SyncThread:3 took 1062ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide
2017-02-10 18:55:47,471 [myid:3] - WARN  [SyncThread:3:FileTxnLog@338] - fsync-ing the write ahead log in SyncThread:3 took 1084ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide
2017-02-10 18:55:49,425 [myid:3] - WARN  [SyncThread:3:FileTxnLog@338] - fsync-ing the write ahead log in SyncThread:3 took 1142ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide
2017-02-10 18:55:51,188 [myid:3] - WARN  [SyncThread:3:FileTxnLog@338] - fsync-ing the write ahead log in SyncThread:3 took 1201ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide
2017-02-10 18:55:52,292 [myid:3] - WARN  [SyncThread:3:FileTxnLog@338] - fsync-ing the write ahead log in SyncThread:3 took 1102ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide

When writing more than 2-million entries (256-byte key, 1KB value value) with 500 concurrent clients

# leader election
cd 2017Q1-00-etcd-zookeeper-consul/04-write-too-many-keys
grep -r -i election\ took  zookeeper-r3.4.9-java8-* | less

# leader election is taking more than 10 seconds...
zookeeper-r3.4.9-java8-2-database.log:2017-02-10 19:22:16,549 [myid:2] - INFO  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Follower@61] - FOLLOWING - LEADER ELECTION TOOK - 22978
zookeeper-r3.4.9-java8-2-database.log:2017-02-10 19:23:02,279 [myid:2] - INFO  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Leader@361] - LEADING - LEADER ELECTION TOOK - 10210
zookeeper-r3.4.9-java8-2-database.log:2017-02-10 19:23:14,498 [myid:2] - INFO  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Leader@361] - LEADING - LEADER ELECTION TOOK - 203
zookeeper-r3.4.9-java8-2-database.log:2017-02-10 19:23:36,303 [myid:2] - INFO  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Leader@361] - LEADING - LEADER ELECTION TOOK - 9791
zookeeper-r3.4.9-java8-2-database.log:2017-02-10 19:23:52,151 [myid:2] - INFO  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Leader@361] - LEADING - LEADER ELECTION TOOK - 3836
zookeeper-r3.4.9-java8-2-database.log:2017-02-10 19:24:13,849 [myid:2] - INFO  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Leader@361] - LEADING - LEADER ELECTION TOOK - 9686
zookeeper-r3.4.9-java8-2-database.log:2017-02-10 19:24:29,694 [myid:2] - INFO  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Leader@361] - LEADING - LEADER ELECTION TOOK - 3573
zookeeper-r3.4.9-java8-2-database.log:2017-02-10 19:24:51,392 [myid:2] - INFO  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Leader@361] - LEADING - LEADER ELECTION TOOK - 8686
zookeeper-r3.4.9-java8-2-database.log:2017-02-10 19:25:07,231 [myid:2] - INFO  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Leader@361] - LEADING - LEADER ELECTION TOOK - 3827
zookeeper-r3.4.9-java8-2-database.log:2017-02-10 19:25:28,940 [myid:2] - INFO  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Leader@361] - LEADING - LEADER ELECTION TOOK - 9697
zookeeper-r3.4.9-java8-2-database.log:2017-02-10 19:25:44,772 [myid:2] - INFO  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Leader@361] - LEADING - LEADER ELECTION TOOK - 3820

Noticeable Warnings: Consul

Snapshot, when writing 1-million entries (256-byte key, 1KB value value), with 500 concurrent clients

# snapshot warnings
cd 2017Q1-00-etcd-zookeeper-consul/02-write-1M-keys-best-throughput
grep -r -i installed\ remote consul-v0.7.4-go1.7.5-* | less

    2017/02/10 18:58:43 [INFO] snapshot: Creating new snapshot at /home/gyuho/consul.data/raft/snapshots/2-900345-1486753123478.tmp
    2017/02/10 18:58:45 [INFO] snapshot: reaping snapshot /home/gyuho/consul.data/raft/snapshots/2-849399-1486753096972
    2017/02/10 18:58:46 [INFO] raft: Copied 1223270573 bytes to local snapshot
    2017/02/10 18:58:55 [INFO] raft: Compacting logs from 868354 to 868801
    2017/02/10 18:58:56 [INFO] raft: Installed remote snapshot
    2017/02/10 18:58:57 [INFO] snapshot: Creating new snapshot at /home/gyuho/consul.data/raft/snapshots/2-911546-1486753137827.tmp
    2017/02/10 18:58:59 [INFO] consul.fsm: snapshot created in 32.255µs
    2017/02/10 18:59:01 [INFO] snapshot: reaping snapshot /home/gyuho/consul.data/raft/snapshots/2-873921-1486753116619
    2017/02/10 18:59:02 [INFO] raft: Copied 1238491373 bytes to local snapshot
    2017/02/10 18:59:11 [INFO] raft: Compacting logs from 868802 to 868801
    2017/02/10 18:59:11 [INFO] raft: Installed remote snapshot

Logs do not tell much but average latency spikes (e.g. from 70.27517 ms to 10407.900082 ms)

Write 1M keys, 256-byte key, 1KB value, Best Throughput (etcd 1K clients with 100 conns, Zookeeper 700, Consul 500 clients)

Google Cloud Compute Engine
4 machines of 16 vCPUs + 60 GB Memory + 300 GB SSD (1 for client)
Ubuntu 17.10 (GNU/Linux kernel 4.13.0-25-generic)
ulimit -n is 120000
etcd v3.3.0 (Go 1.9.2)
Zookeeper r3.5.3-beta
- Java 8
- javac 1.8.0_151
- Java(TM) SE Runtime Environment (build 1.8.0_151-b12)
- Java HotSpot(TM) 64-Bit Server VM (build 25.151-b12, mixed mode)
- /usr/bin/java -Djute.maxbuffer=33554432 -Xms50G -Xmx50G
Consul v1.0.2 (Go 1.9.2)

+---------------------------------------+---------------------+-----------------------------+-----------------------+
|                                       | etcd-v3.3.0-go1.9.2 | zookeeper-r3.5.3-beta-java8 | consul-v1.0.2-go1.9.2 |
+---------------------------------------+---------------------+-----------------------------+-----------------------+
|                         TOTAL-SECONDS |         28.3623 sec |                 59.2167 sec |          178.9443 sec |
|                  TOTAL-REQUEST-NUMBER |           1,000,000 |                   1,000,000 |             1,000,000 |
|                        MAX-THROUGHPUT |      37,330 req/sec |              25,124 req/sec |        15,865 req/sec |
|                        AVG-THROUGHPUT |      35,258 req/sec |              16,842 req/sec |         5,588 req/sec |
|                        MIN-THROUGHPUT |      13,505 req/sec |                  20 req/sec |             0 req/sec |
|                       FASTEST-LATENCY |           4.6073 ms |                   2.9094 ms |            11.6604 ms |
|                           AVG-LATENCY |          28.2625 ms |                  30.9499 ms |            89.4351 ms |
|                       SLOWEST-LATENCY |         117.4918 ms |                4564.6788 ms |          4616.2947 ms |
|                           Latency p10 |        13.508626 ms |                 9.068163 ms |          30.408863 ms |
|                           Latency p25 |        16.869586 ms |                 9.351597 ms |          34.224021 ms |
|                           Latency p50 |        22.167478 ms |                10.093377 ms |          39.881181 ms |
|                           Latency p75 |        34.855941 ms |                14.951189 ms |          52.644787 ms |
|                           Latency p90 |        54.613394 ms |                28.497256 ms |         118.340402 ms |
|                           Latency p95 |        59.785127 ms |                72.671788 ms |         229.129526 ms |
|                           Latency p99 |        74.139638 ms |               273.218523 ms |        1495.660763 ms |
|                         Latency p99.9 |        97.385495 ms |              2526.873285 ms |        3499.225138 ms |
|      SERVER-TOTAL-NETWORK-RX-DATA-SUM |              5.1 GB |                      4.6 GB |                5.6 GB |
|      SERVER-TOTAL-NETWORK-TX-DATA-SUM |              3.8 GB |                      3.6 GB |                4.4 GB |
|           CLIENT-TOTAL-NETWORK-RX-SUM |              252 MB |                      357 MB |                206 MB |
|           CLIENT-TOTAL-NETWORK-TX-SUM |              1.5 GB |                      1.4 GB |                1.5 GB |
|                  SERVER-MAX-CPU-USAGE |            446.83 % |                   1122.00 % |              426.33 % |
|               SERVER-MAX-MEMORY-USAGE |              1.1 GB |                       15 GB |                4.6 GB |
|                  CLIENT-MAX-CPU-USAGE |            606.00 % |                    314.00 % |              215.00 % |
|               CLIENT-MAX-MEMORY-USAGE |               96 MB |                      2.4 GB |                 86 MB |
|                    CLIENT-ERROR-COUNT |                   0 |                       2,652 |                     0 |
|  SERVER-AVG-READS-COMPLETED-DELTA-SUM |                   0 |                         237 |                     2 |
|    SERVER-AVG-SECTORS-READS-DELTA-SUM |                   0 |                           0 |                     0 |
| SERVER-AVG-WRITES-COMPLETED-DELTA-SUM |             108,067 |                     157,034 |               675,072 |
|  SERVER-AVG-SECTORS-WRITTEN-DELTA-SUM |          20,449,360 |                  16,480,488 |           106,836,768 |
|           SERVER-AVG-DISK-SPACE-USAGE |              2.6 GB |                      6.9 GB |                2.9 GB |
+---------------------------------------+---------------------+-----------------------------+-----------------------+


zookeeper__r3_5_3_beta errors:
"zk: connection closed" (count 2,264)
"zk: could not connect to a server" (count 388)

dbtester's People

Contributors

Stargazers

Watchers

dbtester's Issues

update Google Cloud API client import paths and more

The Google Cloud API client libraries for Go are making some breaking changes:

The import paths are changing from google.golang.org/cloud/... to
cloud.google.com/go/.... For example, if your code imports the BigQuery client
it currently reads
import "google.golang.org/cloud/bigquery"
It should be changed to
import "cloud.google.com/go/bigquery"
Client options are also moving, from google.golang.org/cloud to
google.golang.org/api/option. Two have also been renamed:
- WithBaseGRPC is now WithGRPCConn
- WithBaseHTTP is now WithHTTPClient
The cloud.WithContext and cloud.NewContext methods are gone, as are the
deprecated pubsub and container functions that required them. Use the Client
methods of these packages instead.

You should make these changes before September 12, 2016, when the packages at
google.golang.org/cloud will go away.

store metrics in memory and write to CSV

currently it appends to CSV every second.

double-check cumulative throughput column

pkg/report: fatal error: concurrent map read and map write

When a bunch of zk: node already exists error happens

2017-01-17 23:56:11.799714 I | control: sending message [index: 2 | operation: "Heartbeat" | database: "ZooKeeper" | endpoint: "10.240.0.28:3500"]
2017-01-17 23:56:11.802353 I | control: got response [index: 2 | endpoint: "10.240.0.28:3500" | response: success:true ]
fatal error: concurrent map read and map write

goroutine 1713 [running]:
runtime.throw(0xee0111, 0x21)
	/usr/local/go/src/runtime/panic.go:566 +0x95 fp=0xc437066510 sp=0xc4370664f0
runtime.mapaccess1_faststr(0xd85d40, 0xc427d3a5a0, 0xed7177, 0x17, 0xc4331c4d00)
	/usr/local/go/src/runtime/hashmap_fast.go:201 +0x4f3 fp=0xc437066570 sp=0xc437066510
github.com/coreos/dbtester/vendor/github.com/coreos/etcd/pkg/report.(*report).processResult(0xc42014d000, 0xc437066640)
	/home/gyuho/go/src/github.com/coreos/dbtester/vendor/github.com/coreos/etcd/pkg/report/report.go:180 +0xa0 fp=0xc437066600 sp=0xc437066570
github.com/coreos/dbtester/vendor/github.com/coreos/etcd/pkg/report.(*report).processResults(0xc42014d000)
	/home/gyuho/go/src/github.com/coreos/dbtester/vendor/github.com/coreos/etcd/pkg/report/report.go:194 +0x118 fp=0xc4370666d0 sp=0xc437066600
github.com/coreos/dbtester/vendor/github.com/coreos/etcd/pkg/report.(*report).Stats.func1(0xc4331c6000, 0xc42014d000)
	/home/gyuho/go/src/github.com/coreos/dbtester/vendor/github.com/coreos/etcd/pkg/report/report.go:114 +0x6d fp=0xc4370667b0 sp=0xc4370666d0
runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:2086 +0x1 fp=0xc4370667b8 sp=0xc4370667b0
created by github.com/coreos/dbtester/vendor/github.com/coreos/etcd/pkg/report.(*report).Stats
	/home/gyuho/go/src/github.com/coreos/dbtester/vendor/github.com/coreos/etcd/pkg/report/report.go:127 +0x67

goroutine 1 [semacquire]:
sync.runtime_Semacquire(0xc420250814)

analyze: output separate CSVs for graphs

Currently it aggregates 3 CSVs into one. Now that we have disk, network stats, the number of columns are over 50.

We need separate CSVs for graphs and make it easy for other viz tools.

csv: add interpolation

When vendor has this patch gyuho/linux-inspect#27

Raw data will be without interpolation.
Now dbtester needs some kind of interpolation to handle missing timestamps in system metrics.

analyze: aggregate by client-num

Currently we record client-number by second

CLIENT-NUM | LATENCY-MS
-----------------------
1          |  30.1
1          |  30.1
1          |  30.1
5          |  30.1
5          |  30.1

We need to aggregate by client-num to make it easy to graph

benchmark, writes: etcd, Zookeeper, Consul test metrics

Here's the new test cases to be implemented, or just to be configured, for better performance comparison.

At the very high level:

Throughput; scale the clients to find the best throughput T of each database ()
- For graph, x-axis is the number of clients
- y-axis is the average throughput for the corresponding client number k
  - raw data should be aggregated by client
- y-axis should also show the minimum and maximum throughput for that client number
- test case #1: 256-byte key, 1024-byte value, 1-million entries, total 1.3 GB, variable client numbers
  - e.g. client numbers could be 1, 3, 5, 10, 50, 100, 500, 700, 1000, ...
Latency; measure the latency with the best-possible throughput
- Choose the client number K with the best throughput T, for each database
- No rate limit
- For graph, just show the latency distribution with box plots
- test case #2: 256-byte key, 1024-byte value, 1-million entries, total 1.3 GB, client number K
Disk; measure sector written
- sector written should be greater than disk writes in /proc/diskstats
- make sure numbers match up, there could be race in metrics collector
- For graph, x-axis is the number of clients
- y-axis is the number of sectors written per clients, per database
  - this would be the rough estimates from 3 nodes' system metrics
  - interpolate each CSV, and get the average value by unix second
- use the test case #1
- need to improve data visualization (TODO: gnuplot)
Disk utilization; measure the total database size on filesystem
- du -sh $HOME/etcd.data
- just explain with plain table with sums
- use the test case #1
Network; measure the total network overhead between peers, clients
- use the netstat data in system metrics
- just explain with plain table with sums
- use the test case #1
CPU; measure the overall system load
- CPU numbers per PID might not be accurate
- use the overall system load numbers from top command
- For graph, x-axis is the number of clients
- y-axis is the average system load numbers for each database
  - this would be the rough estimates from 3 nodes' system metrics
  - interpolate each CSV, and get the average value by unix second
- use the test case #1
- need to improve data visualization (TODO: gnuplot)
Memory; measure the memory usage as number of keys grow
- For graph, x-axis is the cumulative number of keys (estimated)
  - if database A has average memory M and 'cumulative' throughput 32,000, at second x, the x-value would be 30,000 and y-value would be M
- y-axis is the memory being used, by the number of keys
- test case #3: 256-byte key, 1024-byte value, 1-million entries, total 1.3 GB, 1,000 clients
Latency with too many keys; keep writing until the last database breaks
- For graph, x-axis is the cumulative number of keys (estimated, same as case 7)
- y-axis is the average latency, by the number of keys
- use the best-possible throughput of each database from the case 1
- test case #4: 256-byte key, 1024-byte value, X-million entries, total X GB, client number K

fix travis CI

Need to re-enable travis CI.
glide has some issues with go test files Masterminds/glide#320.

add plotting to bench-analyze tool

test Kubernetes storage workloads

AVG-RECEIVE-BYTES-NUM-DELTA-BY-CLIENT-NUM doesn't add up?

AVG-RECEIVE-BYTES-NUM-DELTA-BY-CLIENT-NUM.csv numbers doesn't add up correct...

Investigate this.

control: record number of clients

There's some discrepancy between the number of rows in combined benchmark results
and the one in system metrics because system metrics is collected including pausing periods,
while benchmark results are combined from multiple reports without including pause times.

use pkg/report from etcd

Got some code duplication going on...

measure fsync, disk I/O

revendor 'github.com/gyuho/psn' to get diskstats

handle missing timestamps in monitor data

CPU and memory monitoring records usage for every-second, but
sometimes there could be some gaps of a few seconds where data
did not get collected. This causes a problem when aggregating with
benchmark data which also fills in missing timestamps.

We need a way to fill in these empty rows with estimates.

map cumulative number of keys to memory, latency

For #221.

automatic deployment

Need better way to spin up testing instances...

use cloud provider API
use cluster management tool (kubernetes)

Either way, it needs to be more automatic.

panic: Unexpected error from context packet: context deadline exceeded

2017-02-03 22:51:39.543127 I | control: step 3: stopping databases...
2017-02-03 22:51:39.543264 I | control: sending message [index: 0 | operation: "Stop" | database: "etcdv3" | endpoint: "10.240.0.20:3500"]
2017-02-03 22:51:40.543563 I | control: sending message [index: 1 | operation: "Stop" | database: "etcdv3" | endpoint: "10.240.0.21:3500"]
2017-02-03 22:51:41.543750 I | control: sending message [index: 2 | operation: "Stop" | database: "etcdv3" | endpoint: "10.240.0.22:3500"]
2017-02-03 22:51:50.337145 I | control: got response [index: 0 | endpoint: "10.240.0.20:3500" | response: success:true datasize:2840425696 ]
2017-02-03 22:51:50.885129 I | control: got response [index: 1 | endpoint: "10.240.0.21:3500" | response: success:true datasize:2852999600 ]
panic: Unexpected error from context packet: context deadline exceeded

goroutine 32325 [running]:
panic(0xd527e0, 0xc4283a2450)
	/usr/local/go/src/runtime/panic.go:500 +0x1a1
github.com/coreos/dbtester/vendor/google.golang.org/grpc/transport.ContextErr(0x13ad980, 0x1638c98, 0x1638c98, 0x1, 0x0)
	/home/gyuho/go/src/github.com/coreos/dbtester/vendor/google.golang.org/grpc/transport/transport.go:566 +0x213
github.com/coreos/dbtester/vendor/google.golang.org/grpc/transport.(*Stream).Header(0xc42025a4b0, 0xf73720, 0xc42e933a70, 0x13bb860)
	/home/gyuho/go/src/github.com/coreos/dbtester/vendor/google.golang.org/grpc/transport/transport.go:241 +0x11a
github.com/coreos/dbtester/vendor/google.golang.org/grpc.recvResponse(0x0, 0x0, 0x13b7c00, 0x1638c98, 0x0, 0x0, 0x0, 0x0, 0x13aee00, 0xc42b552640, ...)
	/home/gyuho/go/src/github.com/coreos/dbtester/vendor/google.golang.org/grpc/call.go:61 +0xaa
github.com/coreos/dbtester/vendor/google.golang.org/grpc.invoke(0x7f1dd3a03218, 0xc42be7d440, 0xf07f1c, 0x1d, 0xe45d20, 0xc42cc66000, 0xe45e00, 0xc4283a23d0, 0xc42033e900, 0x0, ...)
	/home/gyuho/go/src/github.com/coreos/dbtester/vendor/google.golang.org/grpc/call.go:208 +0x8a1
github.com/coreos/dbtester/vendor/google.golang.org/grpc.Invoke(0x7f1dd3a03218, 0xc42be7d440, 0xf07f1c, 0x1d, 0xe45d20, 0xc42cc66000, 0xe45e00, 0xc4283a23d0, 0xc42033e900, 0x0, ...)
	/home/gyuho/go/src/github.com/coreos/dbtester/vendor/google.golang.org/grpc/call.go:118 +0x19c
github.com/coreos/dbtester/agent/agentpb.(*transporterClient).Transfer(0xc42e1de0d8, 0x7f1dd3a03218, 0xc42be7d440, 0xc42cc66000, 0x0, 0x0, 0x0, 0x0, 0xc426e1c0f8, 0x9)
	/home/gyuho/go/src/github.com/coreos/dbtester/agent/agentpb/message.pb.go:183 +0xd2
github.com/coreos/dbtester/control.sendReq(0xc42013bd00, 0x10, 0x100000001, 0xc4201695c0, 0x27, 0x0, 0xc4201450e0, 0x11, 0x0, 0x186a0, ...)
	/home/gyuho/go/src/github.com/coreos/dbtester/control/step1_start_database.go:79 +0x55f
github.com/coreos/dbtester/control.bcastReq.func1(0xc420164200, 0xc4200ca000, 0xc42f5d6e40, 0xc42f5d6de0, 0x2)
	/home/gyuho/go/src/github.com/coreos/dbtester/control/step1_start_database.go:36 +0xa5
created by github.com/coreos/dbtester/control.bcastReq
	/home/gyuho/go/src/github.com/coreos/dbtester/control/step1_start_database.go:41 +0x1df

control: handle `zk: node already exists` with varying client numbers

export results to csv, and automate data-analysis

Currently test runner has to copy and paste results from multi-machines
and compare the timestamps to match metrics. This is too manual.

analyze: graph X by key number

3 databases are expected to have same row numbers, but not necessarily.

Graph these.

analyze: get average value of disk + network stats (in 3+ node cluster)

Remove unnecessary vendoring in 'pkg'

We have a bunch of copied packages from etcd inside 'pkg'
only because we do not want to update the etcd/clientv3 dependency
yet.

Once we finish the etcd v3.1 tests, just vendor directly from etcd.

add CPU context switches

Just get metrics from proc status by PID

support consul

Save separate CSVs for interpolated version

support number of requests as X-axis

We can write # of clients as a file and let psn track this file as the number of clients

Run benchmark in bare metal machines

auto-generate latency percentile summary in markdown

etcd v2, etcd v3, consul test

with 1000 clients, 8-byte key, 256-byte value

Standalone without comparison to other databases

pkg/report: panic in 'control'

2017-01-17 20:36:56.724117 I | control: step 1: starting databases...
2017-01-17 20:36:56.724274 I | control: sending message [index: 0 | operation: "Start" | database: "etcdv3" | endpoint: "10.240.0.20:3500"]
2017-01-17 20:36:56.729498 I | control: got response [index: 0 | endpoint: "10.240.0.20:3500" | response: success:true ]
2017-01-17 20:36:57.724372 I | control: sending message [index: 1 | operation: "Start" | database: "etcdv3" | endpoint: "10.240.0.21:3500"]
2017-01-17 20:36:57.730303 I | control: got response [index: 1 | endpoint: "10.240.0.21:3500" | response: success:true ]
2017-01-17 20:36:58.724626 I | control: sending message [index: 2 | operation: "Start" | database: "etcdv3" | endpoint: "10.240.0.22:3500"]
2017-01-17 20:36:58.730280 I | control: got response [index: 2 | endpoint: "10.240.0.22:3500" | response: success:true ]

2017-01-17 20:37:04.725011 I | control: step 2: starting tests...
2017-01-17 20:37:04.725154 I | control: write generateReport is started...
 2000000 / 2000000 Boooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo! 100.00% 50s
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x6559ae]

goroutine 2284 [running]:
panic(0xd90540, 0xc4200100e0)
	/usr/local/go/src/runtime/panic.go:500 +0x1a1
github.com/coreos/dbtester/vendor/github.com/coreos/etcd/pkg/report.(*secondPoints).getTimeSeries(0x0, 0x0, 0x0, 0x0)
	/home/gyuho/go/src/github.com/coreos/dbtester/vendor/github.com/coreos/etcd/pkg/report/timeseries.go:69 +0x5e
github.com/coreos/dbtester/vendor/github.com/coreos/etcd/pkg/report.(*report).Stats.func1(0xc420cea1e0, 0xc420ed8300)
	/home/gyuho/go/src/github.com/coreos/dbtester/vendor/github.com/coreos/etcd/pkg/report/report.go:125 +0x11d
created by github.com/coreos/dbtester/vendor/github.com/coreos/etcd/pkg/report.(*report).Stats
	/home/gyuho/go/src/github.com/coreos/dbtester/vendor/github.com/coreos/etcd/pkg/report/report.go:127 +0x67

2017-01-17 23:21:17.665187 I | control: sending message [index: 2 | operation: "Heartbeat" | database: "ZooKeeper" | endpoint: "10.240.0.28:3500"]
2017-01-17 23:21:17.666672 I | control: got response [index: 2 | endpoint: "10.240.0.28:3500" | response: success:true ]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x4a0499]

goroutine 11332 [running]:
panic(0xd91580, 0xc42000e0e0)
	/usr/local/go/src/runtime/panic.go:500 +0x1a1
github.com/coreos/dbtester/control.(*benchmark).startRequests.func1(0xc4201a2b00, 0x0)
	/home/gyuho/go/src/github.com/coreos/dbtester/control/step2_stress_database.go:108 +0x189
created by github.com/coreos/dbtester/control.(*benchmark).startRequests
	/home/gyuho/go/src/github.com/coreos/dbtester/control/step2_stress_database.go:112 +0x9e

investigate network usage numbers

the difference between rx and tx should be the client send traffic
but it's like 100mb less than expected minimum for zk

automatic stop RPC calls after benchmark is done

Currently manual stop

dbtester stop --agent-endpoints=$(echo $AGENT_ENDPOINTS)

improve aggregated.csv

just dump all summary data here

don't store benchmark results in repo

The repo is huge!

dbtester $ du -hc bench-results
164M total
dbtester $ du -hc
275M total

automate uploading test results, logs

Currently, I manually ssh into machines and run CLI to store test results and logs to cloud storage.
Automate this.

Currently the manual process as:

# each server
gsutil -m cp /mnt/ssd0/monitor.csv gs://dbtester/test-02-etcd-server-3.csv
gsutil -m cp /mnt/ssd0/database.log gs://dbtester/test-02-etcd-server-3.log
gsutil -m cp /mnt/ssd0/dbtester_agent.log gs://dbtester/test-02-etcd-agent-3.log

Tester should be able to set test name, and the agents use this to prefix result files.
Tester should be able to send cloud storage secret key to agents.
Tester should be able to set bucket name to upload logs to.
Server uses this key and bucket to upload logs and data to cloud storage.

measure network I/O

fix build script to not use `go get`

fix build script to not use go get because etcd recently changed the way it vendors dependencies