
cstar_perf's Introduction

cstar_perf

cstar_perf is a performance testing platform for Apache Cassandra which focuses on a high level of automation and test consistency.

It handles the following:

  • Download and build Cassandra source code.
  • Configure and bootstrap nodes on a real cluster.
  • Run stress workloads.
  • Capture performance metrics.
  • Create reports and charts comparing different configs/workloads.
  • Serve a web frontend for scheduling tests, viewing prior runs, and monitoring test clusters.

5 Minute Introduction

[Video: 5 minute introduction]

Documentation

The evolving documentation is available online here.

The source for these docs is contained in the gh-pages branch; please feel free to make pull requests for improvements.

License

Copyright 2014 DataStax

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

cstar_perf's People

Contributors

aboudreault, aweisberg, enigmacurry, knifewine, mambocab, mshuler, nutbunnies, ptnapoleon, spodkowinski, therealfalcon, tlasica, yukim


cstar_perf's Issues

Feature Request: Allow for consolidated reporting when a test set runs on two different clusters

Today, if I want to run a performance test using cstar_perf, my understanding is that I only get consolidated reporting if I run the tests against different configurations of Cassandra on the same cluster of compute nodes.

It would be fantastic to be able to have a fixed configuration of Cassandra run on two separate (physical or logical) clusters, and get the benefits of consolidated, side-by-side reporting that cstar_perf's dashboard provides today.

This would allow me to test performance across different cloud service providers, or even across different cluster configurations within the same cloud.

Have option to use existing clusters, without bootstrapping anything

It might be nice for the interface to list clusters that you can run operations on, but on which you can't install C* or bootstrap nodes: for instance, clusters maintained outside of cstar_perf.

stress_compare currently always tears down the cluster at the end of its run, so perhaps we need a no_bootstrap option. We already have initial_destroy and leave_data options, which may need to be rethought to handle this.
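For illustration only, a sketch of how such a job config might look; no_bootstrap is a hypothetical flag name, while initial_destroy and leave_data are the existing options mentioned above:

    # Sketch of a hypothetical stress_compare job config. "no_bootstrap" is an
    # assumed flag name; "initial_destroy" and "leave_data" already exist.
    job_config = {
        "cluster": "externally-managed-cluster",
        "no_bootstrap": True,      # hypothetical: skip C* install and bootstrap
        "initial_destroy": False,  # don't wipe the cluster before the run
        "leave_data": True,        # don't tear down data at the end of the run
    }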

Add template search

With #2 finished, we can create jobs from existing jobs; it would be nice to have a way to find commonly used tests.

The job schedule page could have a "Save as template" button and instead of scheduling the job, it could be marked as a template, which a search interface could key off of.

Search should include filtering by user at a minimum.

Ensure notification service is running

The zeromq notification server (cstar_perf_notifications) needs to be running for client streaming to function; it forwards events to the web console, and if it is not started, the client times out.

We need to add a check at the beginning of server startup to ensure it's running first.
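A minimal sketch of such a check, assuming the notification server listens on a local TCP port (the host and port values here are illustrative):

    import socket
    import sys

    def check_notification_server(host="127.0.0.1", port=5556, timeout=2):
        # Fail fast at server startup if cstar_perf_notifications isn't listening.
        try:
            socket.create_connection((host, port), timeout=timeout).close()
        except (socket.error, OSError):
            sys.exit("cstar_perf_notifications is not reachable on %s:%s; "
                     "start it before the server." % (host, port))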

Clone test from existing

Currently you have to create tests from scratch.

You should be able to clone a test from any prior test to pre-populate the form.

The /schedule form should accept a GET to serve a blank form, and if you POST test JSON, it should pre-populate the form. (The actual test submission is done asynchronously to /api/test/schedule.)
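A minimal sketch of that behavior, assuming a Flask-style frontend; the route logic, form field, and template name are illustrative:

    from flask import Flask, request, render_template

    app = Flask(__name__)

    @app.route('/schedule', methods=['GET', 'POST'])
    def schedule():
        # GET serves a blank form; POSTing test JSON pre-populates the form.
        # The actual submission still happens asynchronously via the API.
        test_json = request.form.get('test') if request.method == 'POST' else None
        return render_template('schedule.html', test=test_json)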

startup error "Undefined name date in selection clause"

InvalidRequest: code=2200 [Invalid query] message="Undefined name date in selection clause"
Traceback (most recent call last):
  File "/home/cstar-perf/automaton/cstar_perf/frontend/env/bin/cstar_perf_server", line 9, in <module>
    load_entry_point('cstar-perf-frontend==0.1', 'console_scripts', 'cstar_perf_server')()
  File "/home/cstar-perf/automaton/cstar_perf/frontend/cstar_perf_frontend/lib/server.py", line 40, in main
    run_server()
  File "/home/cstar-perf/automaton/cstar_perf/frontend/cstar_perf_frontend/lib/server.py", line 12, in run_server
    db = Model()
  File "/home/cstar-perf/automaton/cstar_perf/frontend/cstar_perf_frontend/server/model.py", line 112, in __init__
    self.__prepared_statements[name] = self.get_session().prepare(stmt)
  File "build/bdist.linux-x86_64/egg/cassandra/cluster.py", line 1227, in prepare
  File "build/bdist.linux-x86_64/egg/cassandra/cluster.py", line 2473, in result
cassandra.InvalidRequest: code=2200 [Invalid query] message="Undefined name date in selection clause"

Diff stress config (and results?)

It would be very nice to be able to see a quick diff of the inputs between two (or more) runs, and potentially the outputs as well (e.g. side-by-side graph outputs?).
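For the input side, a sketch of a config diff using Python's difflib; the test definitions are assumed to be plain dicts:

    import difflib
    import json

    def diff_test_configs(cfg_a, cfg_b):
        # Normalize both test definitions to sorted, pretty-printed JSON,
        # then produce a unified line diff between them.
        a = json.dumps(cfg_a, indent=2, sort_keys=True).splitlines()
        b = json.dumps(cfg_b, indent=2, sort_keys=True).splitlines()
        return "\n".join(difflib.unified_diff(a, b, "run A", "run B", lineterm=""))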

Fatal error: Needed to prompt for a connection or sudo password (host: cnode1), but input would be ambiguous in parallel mode

I'm trying to set up cstar_perf on EC2. My environment has only one Cassandra node, cnode1. All of the following commands work without a prompt:

ssh cnode1 hostname
ssh root@cnode1 hostname
ssh ec2-user@cnode1 hostname

but I'm still getting the "Needed to prompt.." error when trying to run cstar_perf_bootstrap apache/cassandra-2.1.
Any idea?
Thanks!

Full trace below.

$ cstar_perf_bootstrap apache/cassandra-2.1
INFO:bootstrap:Bringing up apache/cassandra-2.1 cluster...
INFO:benchmark:### Config: ###
{'ant_tarball': 'http://www.apache.org/dist/ant/binaries/apache-ant-1.8.4-bin.tar.bz2',
 'block_devices': [u'/dev/xvdb', u'/dev/xvdc', u'/dev/xvdd', u'/dev/xvde'],
 'blockdev_readahead': u'256',
 'cluster_name': 'cstar_perf Y56VVL9VHQ',
 'commitlog_directory': u'/mnt/d1/commitlog',
 'data_file_directories': [u'/mnt/d2/data', u'/mnt/d3/data', u'/mnt/d4/data'],
 'env': '',
 'flush_directory': '/var/lib/cassandra/flush',
 'git_repo': 'git://github.com/apache/cassandra.git',
 'hosts': {u'cnode1': {u'hostname': u'cnode1',
                       u'internal_ip': u'172.31.14.24',
                       u'seed': True}},
 'log_dir': '~/fab/cassandra/logs',
 u'name': u'example1',
 'num_tokens': 256,
 'override_version': None,
 'partitioner': 'murmur3',
 'revision': 'apache/cassandra-2.1',
 'saved_caches_directory': u'/mnt/d2/saved_caches',
 'seeds': [u'172.31.14.24'],
 'use_jna': True,
 'use_vnodes': True,
 'user': u'ec2_user'}
[cnode1] Executing task 'set_device_read_ahead'
[cnode1] run: blockdev --setra 256 /dev/xvdb
[cnode1] run: blockdev --setra 256 /dev/xvdc
[cnode1] run: blockdev --setra 256 /dev/xvdd
[cnode1] run: blockdev --setra 256 /dev/xvde
[cnode1] Executing task 'destroy'

Fatal error: Needed to prompt for a connection or sudo password (host: cnode1), but input would be ambiguous in parallel mode

Aborting.
Needed to prompt for a connection or sudo password (host: cnode1), but input would be ambiguous in parallel mode

Fatal error: One or more hosts failed while executing task 'destroy'

Aborting.
One or more hosts failed while executing task 'destroy'

Add CQLSH step

It would be nice to add a cqlsh step to execute arbitrary commands via cqlsh between other operations.

Improve error reporting on build steps

In bisecting the read regression, I bisected over a couple of versions that wouldn't compile. The error reported on the test's page said:

Traceback (most recent call last):
  File "/home/ryan/git/cstar_perf/frontend/cstar_perf/frontend/client/client.py", line 167, in run
    self.perform_job(job, ws)
  File "/home/ryan/git/cstar_perf/frontend/cstar_perf/frontend/client/client.py", line 255, in perform_job
    with open(stats_path) as stats:
IOError: [Errno 2] No such file or directory: u'/home/ryan/.cstar_perf/jobs/7fc2198c-19ae-11e5-af65-42010af0688f/stats.7fc2198c-19ae-11e5-af65-42010af0688f.json'

but it would have been more useful, and would have saved me some time, if it had reported the failure in the ant build step.

Capture errors before C* starts

There is a class of errors that Cassandra does not put into system.log, for instance environment errors. These should be captured and reported to the frontend.

For instance, if you set MAX_HEAP_SIZE without setting HEAP_NEWSIZE, you'll get this error:

please set or unset MAX_HEAP_SIZE and HEAP_NEWSIZE in pairs (see cassandra-env.sh)

but that doesn't bubble up to the frontend, and you don't know why it failed.
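A sketch of one way to capture these: watch the launch command's own stdout/stderr and report anything emitted if the process dies before startup completes. The command and timing here are illustrative:

    import subprocess
    import time

    def start_cassandra(cmd=("bin/cassandra", "-f")):
        # Launch C* and surface errors printed before system.log exists,
        # e.g. the MAX_HEAP_SIZE/HEAP_NEWSIZE check in cassandra-env.sh.
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        time.sleep(5)  # give the env scripts a moment to run
        if proc.poll() is not None:  # process died before startup finished
            out, err = proc.communicate()
            raise RuntimeError("Cassandra exited early:\n%s\n%s" % (out, err))
        return proc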

As a user I want to run shell scripts at various points during a test run

To facilitate anything arbitrary I want to include in a run, I would like an option to run shell scripts on each node and at the final aggregation step.

For instance, to support running a profiler, I would like to be able to run a script on each server node after the test is done, but before the server process is terminated.

I would like the same thing, but for every client instance that is started so I can profile the client as well.

I would like to run a script to parse and merge logs, generate graphs, and possibly upload results after all the artifacts from each node have been collected in one place.
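A sketch of what the configuration for this could look like; every hook name and path below is hypothetical:

    # Hypothetical per-job hook configuration; all names and paths are illustrative.
    hooks = {
        # runs on each server node after the test, before C* is terminated
        "post_test_server": "/opt/hooks/profile_server.sh",
        # runs on each stress client instance that is started
        "post_test_client": "/opt/hooks/profile_client.sh",
        # runs once, after artifacts from all nodes are collected in one place
        "post_aggregation": "/opt/hooks/merge_logs_and_graph.sh",
    }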

Update data model to facilitate retrieving data for a series of jobs

I want to make a couple of data model changes to support future graphing functionality.

Move interval performance data out of the current JSON blob and into a separate JSON blob in a different table. For a workload run daily with a 90-day graph, I don't want to retrieve all the interval data points just to graph the summary for the entire workload. Let's call this table interval_stats, with a primary key that is the job time UUID.

I want to be able to query jobs that are part of a series so I want a new table with a primary key that is the name of a series, and a column that is the time UUID of all the jobs in that series. It may be that we want to create this as a secondary index and let the database handle it.

For the form where a job is submitted, if the series is the empty string, the job will not belong to a series and no row will end up in the series table. The same applies to the API call.

The transition plan would be to have new jobs use the new data model. The existing graphing function will pull from both interval_stats and the existing stats table (it has to anyway to fill in all fields) and then pick up the interval data from wherever it happens to lie for a given job, so old jobs continue to graph normally.

Then I want to add a REST API call for retrieving the job ids of a series going back to some timestamp since the epoch, or within some range.
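A sketch of what the two tables could look like, following the prepared-statement style in model.py; the table and column names here are assumptions:

    # Hypothetical CQL for the proposed tables; names and types are assumptions.
    CREATE_INTERVAL_STATS = """
        CREATE TABLE IF NOT EXISTS interval_stats (
            job_id timeuuid PRIMARY KEY,  -- the job's time UUID
            stats text                    -- interval data points as a JSON blob
        )"""

    CREATE_JOB_SERIES = """
        CREATE TABLE IF NOT EXISTS job_series (
            series text,     -- name of the series
            job_id timeuuid, -- one row per job in the series
            PRIMARY KEY (series, job_id)
        )"""

Clustering on the job time UUID keeps jobs within a series time-ordered, so the proposed "since timestamp" REST call becomes a simple range query on the clustering column.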

Logs are not copied in all failure scenarios

http://cstar.datastax.com/tests/id/05524d18-fffd-11e4-8717-42010af0688f

That job obscured this error message, because it didn't copy the logs:

INFO  [main] 2015-05-21 14:05:22,588 CassandraDaemon.java:122 - JMX is enabled to receive remote connections on port: 7199
INFO  [main] 2015-05-21 14:05:22,613 CacheService.java:111 - Initializing key cache with capacity of 100 MBs.
INFO  [main] 2015-05-21 14:05:22,624 CacheService.java:133 - Initializing row cache with capacity of 0 MBs
INFO  [main] 2015-05-21 14:05:22,635 CacheService.java:150 - Initializing counter cache with capacity of 50 MBs
INFO  [main] 2015-05-21 14:05:22,638 CacheService.java:161 - Scheduling counter cache save to every 7200 seconds (going to save all keys).
INFO  [main] 2015-05-21 14:05:22,830 ColumnFamilyStore.java:314 - Initializing system.schema_triggers
ERROR [COMMIT-LOG-ALLOCATOR] 2015-05-21 14:05:24,462 CommitLog.java:397 - Failed managing commit log segments. Commit disk failure policy is stop; terminating thread
org.apache.cassandra.io.FSWriteError: java.io.IOException: Invalid argument
        at org.apache.cassandra.db.commitlog.CommitLogSegment.<init>(CommitLogSegment.java:177) ~[main/:na]
        at org.apache.cassandra.db.commitlog.CommitLogSegment.freshSegment(CommitLogSegment.java:119) ~[main/:na]
        at org.apache.cassandra.db.commitlog.CommitLogSegmentManager$1.runMayThrow(CommitLogSegmentManager.java:119) ~[main/:na]
        at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) [main/:na]
        at java.lang.Thread.run(Thread.java:745) [na:1.7.0_76]
Caused by: java.io.IOException: Invalid argument
        at java.io.RandomAccessFile.setLength(Native Method) ~[na:1.7.0_76]
        at org.apache.cassandra.db.commitlog.CommitLogSegment.<init>(CommitLogSegment.java:163) ~[main/:na]
        ... 4 common frames omitted

SSHException('EOF during negotiation') while running cstar_perf_bootstrap

I've been trying to get the "cstar_perf_bootstrap cassandra-2.0.10" command to work following this tutorial. At first I was getting errors with the script not being able to check out from Git, so I ran git config --global url."https://".insteadOf git:// on all Cassandra nodes, which got me past the problem.

This time I am getting a similar error while running "cstar_perf_bootstrap cassandra-2.0.10", but I am not sure how I can work around it or fix it:
[cnode3] run: ln -s /usr/share/java/jna.jar /fab/cassandra/lib/jna.jar
!!! Parallel execution exception under host u'cnode3':
Process cnode3:
Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python2.7/dist-packages/fabric/tasks.py", line 239, in inner
    submit(task.run(*args, **kwargs))
  File "/usr/local/lib/python2.7/dist-packages/fabric/tasks.py", line 174, in run
    return self.wrapped(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/fabric/decorators.py", line 181, in inner
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/cstar_perf/tool/fab_cassandra.py", line 327, in bootstrap
    fab.get('~/fab/cassandra/conf/cassandra.yaml', conf_file)
  File "/usr/local/lib/python2.7/dist-packages/fabric/network.py", line 647, in host_prompting_wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/fabric/operations.py", line 535, in get
    ftp = SFTP(env.host_string)
  File "/usr/local/lib/python2.7/dist-packages/fabric/sftp.py", line 30, in __init__
    self.ftp = connections[host_string].open_sftp()
  File "/usr/local/lib/python2.7/dist-packages/paramiko/client.py", line 379, in open_sftp
    return self._transport.open_sftp_client()
  File "/usr/local/lib/python2.7/dist-packages/paramiko/transport.py", line 811, in open_sftp_client
    return SFTPClient.from_transport(self)
  File "/usr/local/lib/python2.7/dist-packages/paramiko/sftp_client.py", line 132, in from_transport
    return cls(chan)
  File "/usr/local/lib/python2.7/dist-packages/paramiko/sftp_client.py", line 101, in __init__
    raise SSHException('EOF during negotiation')
SSHException: EOF during negotiation
[alebedev@m0053118] out: fatal: unable to connect to github.com:
[alebedev@m0053118] out: github.com[0: 192.30.252.128]: errno=Connection timed out

perf Test for C* 1.2.x

This is a great tool. Can I use it to test C* version 1.2.x? Is it supported?

Thanks

Ability to cc others on test updates

When configuring a test, you should be able to add others to notify of test updates.

Users should also be able to click a button on the test page to get updates of the test.

I'm also thinking it might be cool to have a global user config to get notified of all tests (or maybe based on some criteria, like a specific cluster).

Future Improvements

This issue tracks some possible improvements that could be made in the future.

  • Ability to compare 2+ operations for the same revision (useful when using stress user profiles). This might need a fresh cluster between operations. Another solution could be the ability to compare the results of 2 tests.
  • Ability to specify what metrics to record and graph.
  • Combine all operations in the same graph (as an option in the UI)?

Repeat test X times

Similar to #2, we should have an option at the bottom of the test creation page to repeat the test X times. Differing from the clone feature, though, the results should all appear on the same job page, so you'll see X charts and X logs.

We could get a bit fancier and rank the results by some consistency metrics, 'pick the most consistent run out of the X jobs' style.
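As a sketch of the ranking idea, assuming each repeat yields a list of interval op/s samples, the most consistent run could be the one with the lowest relative standard deviation:

    def most_consistent(runs):
        # runs: list of runs, each a list of op/s interval samples (assumed shape).
        def rel_stddev(samples):
            mean = sum(samples) / float(len(samples))
            var = sum((s - mean) ** 2 for s in samples) / len(samples)
            return (var ** 0.5) / mean if mean else float('inf')
        return min(runs, key=rel_stddev)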

Apply label to revision if git refspec is duplicated

If you have two revisions with the same git refspec and you forget to apply a label, they will show up as the same color on the chart. We could automatically apply a label in case the user forgets.

E.g., if revision='apache/cassandra-2.1' and there are multiple revisions with that version, apply the labels 'apache/cassandra-2.1 (1)', 'apache/cassandra-2.1 (2)', etc.
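A minimal sketch of the deduplication; the 'revision' and 'label' keys are assumed field names:

    from collections import Counter

    def dedupe_labels(revisions):
        # Append ' (1)', ' (2)', ... to unlabeled revisions whose refspec repeats.
        totals = Counter(r['revision'] for r in revisions)
        seen = Counter()
        for r in revisions:
            ref = r['revision']
            if totals[ref] > 1 and not r.get('label'):
                seen[ref] += 1
                r['label'] = '%s (%d)' % (ref, seen[ref])
        return revisions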

user profile is broken due to new output format of stress

I'm currently using an older version of stress from SHA 184bb65fca on the clusters, due to a change in how the output of user profiles is handled.

  • In the old version, it outputs one line of stats for the entire mixed operation:
Running [singlepost, timeline1, timeline2] with 500 threads for 19000000 iteration
total ops , adj row/s,    op/s,    pk/s,   row/s,    mean,     med,     .95,     .99,    .999,     max,   time,   stderr,  gc: #,  max ms,  sum ms,  sdv ms,      mb
218246    ,    218195,  218195,  218195,  218195,     2.3,     1.7,     4.9,     7.1,    42.1,    45.4,    1.0,  0.00000,      1,      22,      22,       0,    1636
471482    ,    246739,  246739,  246739,  246739,     2.0,     1.7,     4.0,     5.0,    17.4,    20.5,    2.0,  0.00000,      0,       0,       0,       0,       0
699553    ,    221421,  221421,  221421,  221421,     2.2,     1.6,     4.8,     6.4,    34.6,    40.6,    3.1,  0.04341,      2,      52,      52,       3,    3275
  • In the new version, it outputs a separate line for each operation:
Running [singlepost, timeline1, timeline2] with 500 threads for 19000000 iteration
type,      total ops,    op/s,    pk/s,   row/s,    mean,     med,     .95,     .99,    .999,     max,   time,   stderr, errors,  gc: #,  max ms,  sum ms,  sdv ms,      mb
singlepost,     73025,   72989,   72989,   72989,     2.3,     1.5,     4.1,    17.4,    85.6,    94.6,    1.0,  0.00000,      0,      3,      83,     107,      13,    3264
timeline1,     73389,   73335,   73335,   73335,     2.3,     1.6,     4.0,    17.5,    85.7,    94.3,    1.0,  0.00000,      0,      3,      83,     107,      13,    3264
timeline2,     72413,   72339,   72339,   72339,     2.3,     1.6,     4.1,    17.5,    85.8,    93.0,    1.0,  0.00000,      0,      3,      83,     107,      13,    3264
total,        218827,  218602,  218602,  218602,     2.3,     1.6,     3.9,    11.0,    85.0,    94.6,    1.0,  0.00000,      0,      3,      83,     107,      13,    3264

Looks like we just need to ignore the lines that don't say total at the front.
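A sketch of that filter: keep old-format data rows (which start with a number) and new-format aggregate rows (which start with "total"):

    def aggregate_lines(stress_output):
        # Yield only rows with whole-run stats, skipping per-operation rows
        # and header lines in either output format.
        for line in stress_output.splitlines():
            first = line.split(',')[0].strip()
            if first == 'total' or first.isdigit():
                yield line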

Smarter git fetch

Right now we fetch from all configured repositories on every job. We can inspect the revisions and infer which repositories they are from, as long as they are not a git SHA, and only update those.
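A sketch of the inference, assuming revisions take the refspec form repo/branch and that a bare hex string is a SHA whose repository can't be inferred:

    import re

    def repos_to_fetch(revisions, configured_repos):
        # Return only the repos actually referenced. A bare SHA forces a full
        # fetch because we can't tell which repository it belongs to.
        needed = set()
        for rev in revisions:
            if re.match(r'^[0-9a-f]{7,40}$', rev):
                return set(configured_repos)
            repo = rev.split('/', 1)[0]
            if repo in configured_repos:
                needed.add(repo)
        return needed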

Allow passing cassandra configuration options for cstar_perf_bootstrap

Right now cstar_perf_bootstrap only allows you to select the stock version you want; one should also be able to pass in yaml and env configuration options. Since multiple options would be messy on the command line, I agree that a JSON format similar to cstar_perf_stress is probably the way to go.

human review queue

Until the (inevitable) robot uprising, human intuition is required to find interesting patterns and potential performance regressions in test results.

Automatic test submission is going to be a great way to monitor performance characteristics over time, but it's all for nought if no one actually looks at them.

We should differentiate tests submitted programmatically and allow those tests to opt in to a review queue of some kind. This might find multiple uses, but an obvious one would be building a list of test results to be reviewed by a human, with a simple acknowledgement of the results.

I'd propose this feature be built dumb-simple, with tests required to opt in to be included, so everything works basically the same as before for anyone using cstar_perf in the wild who may not be interested in a review queue.

Admin interface

Right now, users and clusters have to be added manually to the database. We should have an admin interface for adding them via the frontend.

Permit uploading files to reference in the stress command

The new schema stress facility requires a profile file to be uploaded. This could be pasted into a special box, but perhaps more generally useful would be support for uploading an arbitrary file to reference in the stress command. Either works.

Allow inverting test so that a single graph compares multiple stress commands

...and the operation dropdown changes to become the version selection.

Currently the graphing capabilities are only suited to comparing versions, though I could possibly get around this by entering the same version multiple times and running a test.

Example: I create a test because I'm curious about stress write perf with differing thread counts. I add apache/cassandra-2.1 and apache/trunk. I add stress steps for writing with 50 threads and writing with 100 threads. I select "invert test" (or something like that). The test is then performed, but graphed so that the bottom-left legend shows the stress variations instead of versions, and the operation dropdown on the right side of the graph allows toggling between the two Cassandra versions tested.

As a user I want a place to put artifacts that will be captured after a cstar run

I want to be able to create log files, profiler output, graphs, etc. from within the database, stress client, and shell scripts that run as part of a test.

I want those artifacts to be collected from each node and stored as a tgz that is available after the run.

One way to make this path available would be to set an environment variable with a path whose contents will be collected at the end.
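A sketch of that approach; the environment variable name is an assumption:

    import os
    import tarfile

    ARTIFACT_ENV = "CSTAR_PERF_ARTIFACTS"  # hypothetical variable name

    def collect_artifacts(node_name, dest_pattern="artifacts-%s.tgz"):
        # Tar up whatever the test dropped into the advertised artifact dir.
        path = os.environ.get(ARTIFACT_ENV)
        if path and os.path.isdir(path):
            with tarfile.open(dest_pattern % node_name, "w:gz") as tar:
                tar.add(path, arcname=node_name)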

Support profilers

Would be great to be able to safely wire up a profiler for a subset of nodes to track the behaviour of the C* process. Support for YourKit and VTune would be great.

Probably the easiest way to do this is to install the necessary software and simply permit executing an arbitrary command with access to the C* PID, followed by a short sleep before starting the test. The command would then be killed after each run and restarted, so we get clearly delineated results. We just need a way of grabbing and delivering the result files.
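A sketch of that attach/sleep/kill cycle; the profiler command is whatever was installed on the node and is purely illustrative:

    import subprocess
    import time

    def attach_profiler(profiler_cmd, cassandra_pid, settle_secs=5):
        # Start an arbitrary profiler against the C* PID, wait briefly so it
        # settles, and hand back the process so it can be killed after the run.
        proc = subprocess.Popen(list(profiler_cmd) + [str(cassandra_pid)])
        time.sleep(settle_secs)
        return proc  # caller calls proc.kill() after each run for clean results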
