
cstar_perf's Introduction

cstar_perf

cstar_perf is a performance testing platform for Apache Cassandra which focuses on a high level of automation and test consistency.

It handles the following:

  • Download and build Cassandra source code.
  • Configure and bootstrap nodes on a real cluster.
  • Run stress workloads.
  • Capture performance metrics.
  • Create reports and charts comparing different configs/workloads.
  • Serve a web frontend for scheduling tests, viewing prior runs, and monitoring test clusters.

5 Minute Introduction

[Video: 5 minute introduction]

Documentation

The evolving documentation is available online here.

The source for these docs is contained in the gh-pages branch; please feel free to make pull requests for improvements.

License

Copyright 2014 DataStax

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

cstar_perf's People

Contributors

aboudreault, aweisberg, enigmacurry, knifewine, mambocab, mshuler, nutbunnies, ptnapoleon, spodkowinski, therealfalcon, tlasica, yukim


cstar_perf's Issues

Feature Request: Allow for consolidated reporting when a test set runs on two different clusters

Today, if I want to run a performance test using cstar_perf, my understanding is that I only get consolidated reporting if I run the tests against different configurations of Cassandra on the same cluster of compute nodes.

It would be fantastic to be able to have a fixed configuration of Cassandra run on two separate (physical or logical) clusters, and get the benefits of consolidated, side-by-side reporting that cstar_perf's dashboard provides today.

This would allow me to test performance across different cloud service providers, or even across different cluster configurations within the same cloud.

Have option to use existing clusters, without bootstrapping anything

It might be nice for the interface to list clusters that you can run operations on, but on which you can't install C* or bootstrap nodes: for instance, clusters maintained outside of cstar_perf.

stress_compare currently always tears down the cluster at the end of its run, so perhaps we need a no_bootstrap option. We already have initial_destroy and leave_data options, which may need to be rethought to handle this.
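For illustration only, a sketch of how such a job config might look; no_bootstrap is a hypothetical flag name, while initial_destroy and leave_data are the existing options mentioned above:

    # Sketch of a hypothetical stress_compare job config. "no_bootstrap" is an
    # assumed flag name; "initial_destroy" and "leave_data" already exist.
    job_config = {
        "cluster": "externally-managed-cluster",
        "no_bootstrap": True,      # hypothetical: skip C* install and bootstrap
        "initial_destroy": False,  # don't wipe the cluster before the run
        "leave_data": True,        # don't tear down data at the end of the run
    }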

Add template search

With #2 finished, we can create jobs from existing jobs; it would be nice to have a way to find commonly used tests.

The job schedule page could have a "Save as template" button and instead of scheduling the job, it could be marked as a template, which a search interface could key off of.

Search should include filtering by user at a minimum.

Ensure notification service is running

The zeromq notification server (cstar_perf_notifications) needs to be running for client streaming to function; it forwards events to the web console, and if it is not started, the client times out.

We need to add a check at the beginning of server startup to ensure it's running first.
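A minimal sketch of such a check, assuming the notification server listens on a local TCP port (the host and port values here are illustrative):

    import socket
    import sys

    def check_notification_server(host="127.0.0.1", port=5556, timeout=2):
        # Fail fast at server startup if cstar_perf_notifications isn't listening.
        try:
            socket.create_connection((host, port), timeout=timeout).close()
        except (socket.error, OSError):
            sys.exit("cstar_perf_notifications is not reachable on %s:%s; "
                     "start it before the server." % (host, port))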

Clone test from existing

Currently you have to create tests from scratch.

You should be able to clone a test from any prior test to pre-populate the form.

The /schedule form should accept a GET to serve a blank form, and if you POST test JSON, it should pre-populate the form. (The actual test submission is done asynchronously to /api/test/schedule.)
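A minimal sketch of that behavior, assuming a Flask-style frontend; the route logic, form field, and template name are illustrative:

    from flask import Flask, request, render_template

    app = Flask(__name__)

    @app.route('/schedule', methods=['GET', 'POST'])
    def schedule():
        # GET serves a blank form; POSTing test JSON pre-populates the form.
        # The actual submission still happens asynchronously via the API.
        test_json = request.form.get('test') if request.method == 'POST' else None
        return render_template('schedule.html', test=test_json)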

startup error "Undefined name date in selection clause"

InvalidRequest: code=2200 [Invalid query] message="Undefined name date in selection clause"
Traceback (most recent call last):
  File "/home/cstar-perf/automaton/cstar_perf/frontend/env/bin/cstar_perf_server", line 9, in <module>
    load_entry_point('cstar-perf-frontend==0.1', 'console_scripts', 'cstar_perf_server')()
  File "/home/cstar-perf/automaton/cstar_perf/frontend/cstar_perf_frontend/lib/server.py", line 40, in main
    run_server()
  File "/home/cstar-perf/automaton/cstar_perf/frontend/cstar_perf_frontend/lib/server.py", line 12, in run_server
    db = Model()
  File "/home/cstar-perf/automaton/cstar_perf/frontend/cstar_perf_frontend/server/model.py", line 112, in __init__
    self.__prepared_statements[name] = self.get_session().prepare(stmt)
  File "build/bdist.linux-x86_64/egg/cassandra/cluster.py", line 1227, in prepare
  File "build/bdist.linux-x86_64/egg/cassandra/cluster.py", line 2473, in result
cassandra.InvalidRequest: code=2200 [Invalid query] message="Undefined name date in selection clause"

Diff stress config (and results?)

It would be very nice to be able to see a quick diff of the inputs between two (or more) runs, and potentially the outputs as well (e.g. side-by-side graph outputs?).
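For the input side, a sketch of a config diff using Python's difflib; the test definitions are assumed to be plain dicts:

    import difflib
    import json

    def diff_test_configs(cfg_a, cfg_b):
        # Normalize both test definitions to sorted, pretty-printed JSON,
        # then produce a unified line diff between them.
        a = json.dumps(cfg_a, indent=2, sort_keys=True).splitlines()
        b = json.dumps(cfg_b, indent=2, sort_keys=True).splitlines()
        return "\n".join(difflib.unified_diff(a, b, "run A", "run B", lineterm=""))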

Fatal error: Needed to prompt for a connection or sudo password (host: cnode1), but input would be ambiguous in parallel mode

I'm trying to set up cstar_perf on EC2. My environment has only one Cassandra node, cnode1. All of the following commands work without a prompt:

ssh cnode1 hostname
ssh root@cnode1 hostname
ssh ec2-user@cnode1 hostname

but I'm still getting the "Needed to prompt.." error when trying to run cstar_perf_bootstrap apache/cassandra-2.1.
Any idea?
Thanks!

Full trace below.

$ cstar_perf_bootstrap apache/cassandra-2.1
INFO:bootstrap:Bringing up apache/cassandra-2.1 cluster...
INFO:benchmark:### Config: ###
{'ant_tarball': 'http://www.apache.org/dist/ant/binaries/apache-ant-1.8.4-bin.tar.bz2',
 'block_devices': [u'/dev/xvdb', u'/dev/xvdc', u'/dev/xvdd', u'/dev/xvde'],
 'blockdev_readahead': u'256',
 'cluster_name': 'cstar_perf Y56VVL9VHQ',
 'commitlog_directory': u'/mnt/d1/commitlog',
 'data_file_directories': [u'/mnt/d2/data', u'/mnt/d3/data', u'/mnt/d4/data'],
 'env': '',
 'flush_directory': '/var/lib/cassandra/flush',
 'git_repo': 'git://github.com/apache/cassandra.git',
 'hosts': {u'cnode1': {u'hostname': u'cnode1',
                       u'internal_ip': u'172.31.14.24',
                       u'seed': True}},
 'log_dir': '~/fab/cassandra/logs',
 u'name': u'example1',
 'num_tokens': 256,
 'override_version': None,
 'partitioner': 'murmur3',
 'revision': 'apache/cassandra-2.1',
 'saved_caches_directory': u'/mnt/d2/saved_caches',
 'seeds': [u'172.31.14.24'],
 'use_jna': True,
 'use_vnodes': True,
 'user': u'ec2_user'}
[cnode1] Executing task 'set_device_read_ahead'
[cnode1] run: blockdev --setra 256 /dev/xvdb
[cnode1] run: blockdev --setra 256 /dev/xvdc
[cnode1] run: blockdev --setra 256 /dev/xvdd
[cnode1] run: blockdev --setra 256 /dev/xvde
[cnode1] Executing task 'destroy'

Fatal error: Needed to prompt for a connection or sudo password (host: cnode1), but input would be ambiguous in parallel mode

Aborting.
Needed to prompt for a connection or sudo password (host: cnode1), but input would be ambiguous in parallel mode

Fatal error: One or more hosts failed while executing task 'destroy'

Aborting.
One or more hosts failed while executing task 'destroy'

Add CQLSH step

It would be nice to add a cqlsh step to execute arbitrary commands via cqlsh between other operations.

Improve error reporting on build steps

In bisecting the read regression, I bisected over a couple of versions that wouldn't compile. The error reported on the test's page said:

Traceback (most recent call last):
  File "/home/ryan/git/cstar_perf/frontend/cstar_perf/frontend/client/client.py", line 167, in run
    self.perform_job(job, ws)
  File "/home/ryan/git/cstar_perf/frontend/cstar_perf/frontend/client/client.py", line 255, in perform_job
    with open(stats_path) as stats:
IOError: [Errno 2] No such file or directory: u'/home/ryan/.cstar_perf/jobs/7fc2198c-19ae-11e5-af65-42010af0688f/stats.7fc2198c-19ae-11e5-af65-42010af0688f.json'

but it would have been more useful, and would have saved me some time, if it had reported the failure in the ant build step.

Capture errors before C* starts

There is a class of errors that Cassandra does not put into system.log, for instance environment errors. These should be captured and reported to the frontend.

For instance, if you set MAX_HEAP_SIZE without setting HEAP_NEWSIZE, you'll get this error:

please set or unset MAX_HEAP_SIZE and HEAP_NEWSIZE in pairs (see cassandra-env.sh)

but that doesn't bubble up to the frontend, and you don't know why it failed.
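A sketch of one way to capture these: watch the launch command's own stdout/stderr and report anything emitted if the process dies before startup completes. The command and timing here are illustrative:

    import subprocess
    import time

    def start_cassandra(cmd=("bin/cassandra", "-f")):
        # Launch C* and surface errors printed before system.log exists,
        # e.g. the MAX_HEAP_SIZE/HEAP_NEWSIZE check in cassandra-env.sh.
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        time.sleep(5)  # give the env scripts a moment to run
        if proc.poll() is not None:  # process died before startup finished
            out, err = proc.communicate()
            raise RuntimeError("Cassandra exited early:\n%s\n%s" % (out, err))
        return proc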

As a user I want to run shell scripts at various points during a test run

To facilitate anything arbitrary I want to include in a run, I would like an option to run shell scripts on each node and at the final aggregation step.

For instance, to support running a profiler, I would like to be able to run a script on each server node after the test is done, but before the server process is terminated.

I would like the same thing, but for every client instance that is started so I can profile the client as well.

I would like to run a script to parse and merge logs, generate graphs, and possibly upload results after all the artifacts from each node have been collected in one place.
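A sketch of what the configuration for this could look like; every hook name and path below is hypothetical:

    # Hypothetical per-job hook configuration; all names and paths are illustrative.
    hooks = {
        # runs on each server node after the test, before C* is terminated
        "post_test_server": "/opt/hooks/profile_server.sh",
        # runs on each stress client instance that is started
        "post_test_client": "/opt/hooks/profile_client.sh",
        # runs once, after artifacts from all nodes are collected in one place
        "post_aggregation": "/opt/hooks/merge_logs_and_graph.sh",
    }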

Update data model to facilitate retrieving data for a series of jobs

I want to make a couple of data model changes to support future graphing functionality.

Move interval performance data out of the current JSON blob and into a separate JSON blob in a different table. For a workload run daily with a 90-day graph, I don't want to retrieve all the interval data points just to graph the summary for the entire workload. Let's call this table interval_stats, with a primary key that is the job time UUID.

I want to be able to query jobs that are part of a series so I want a new table with a primary key that is the name of a series, and a column that is the time UUID of all the jobs in that series. It may be that we want to create this as a secondary index and let the database handle it.

For the form where a job is submitted, if the series is the empty string, the job will not belong to a series and no row will end up in the series table. The same applies to the API call.

The transition plan would be to have new jobs use the new data model. The existing graphing function will pull from both interval_stats and the existing stats table (it has to anyway to fill in all fields) and then pick up the interval data from wherever it happens to lie for a given job, so old jobs continue to graph normally.

Then I want to add a REST API call for retrieving the job ids of a series going back to some timestamp since the epoch, or within some range.
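A sketch of what the two tables could look like, following the prepared-statement style in model.py; the table and column names here are assumptions:

    # Hypothetical CQL for the proposed tables; names and types are assumptions.
    CREATE_INTERVAL_STATS = """
        CREATE TABLE IF NOT EXISTS interval_stats (
            job_id timeuuid PRIMARY KEY,  -- the job's time UUID
            stats text                    -- interval data points as a JSON blob
        )"""

    CREATE_JOB_SERIES = """
        CREATE TABLE IF NOT EXISTS job_series (
            series text,     -- name of the series
            job_id timeuuid, -- one row per job in the series
            PRIMARY KEY (series, job_id)
        )"""

Clustering on the job time UUID keeps jobs within a series time-ordered, so the proposed "since timestamp" REST call becomes a simple range query on the clustering column.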

Logs are not copied in all failure scenarios

http://cstar.datastax.com/tests/id/05524d18-fffd-11e4-8717-42010af0688f

That job obscured this error message, because it didn't copy the logs:

INFO  [main] 2015-05-21 14:05:22,588 CassandraDaemon.java:122 - JMX is enabled to receive remote connections on port: 7199
INFO  [main] 2015-05-21 14:05:22,613 CacheService.java:111 - Initializing key cache with capacity of 100 MBs.
INFO  [main] 2015-05-21 14:05:22,624 CacheService.java:133 - Initializing row cache with capacity of 0 MBs
INFO  [main] 2015-05-21 14:05:22,635 CacheService.java:150 - Initializing counter cache with capacity of 50 MBs
INFO  [main] 2015-05-21 14:05:22,638 CacheService.java:161 - Scheduling counter cache save to every 7200 seconds (going to save all keys).
INFO  [main] 2015-05-21 14:05:22,830 ColumnFamilyStore.java:314 - Initializing system.schema_triggers
ERROR [COMMIT-LOG-ALLOCATOR] 2015-05-21 14:05:24,462 CommitLog.java:397 - Failed managing commit log segments. Commit disk failure policy is stop; terminating thread
org.apache.cassandra.io.FSWriteError: java.io.IOException: Invalid argument
        at org.apache.cassandra.db.commitlog.CommitLogSegment.<init>(CommitLogSegment.java:177) ~[main/:na]
        at org.apache.cassandra.db.commitlog.CommitLogSegment.freshSegment(CommitLogSegment.java:119) ~[main/:na]
        at org.apache.cassandra.db.commitlog.CommitLogSegmentManager$1.runMayThrow(CommitLogSegmentManager.java:119) ~[main/:na]
        at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) [main/:na]
        at java.lang.Thread.run(Thread.java:745) [na:1.7.0_76]
Caused by: java.io.IOException: Invalid argument
        at java.io.RandomAccessFile.setLength(Native Method) ~[na:1.7.0_76]
        at org.apache.cassandra.db.commitlog.CommitLogSegment.<init>(CommitLogSegment.java:163) ~[main/:na]
        ... 4 common frames omitted

SSHException('EOF during negotiation') while running cstar_perf_bootstrap

I've been trying to get the "cstar_perf_bootstrap cassandra-2.0.10" command to work following this tutorial. At first I was getting errors with the script not being able to check out from Git, so I ran git config --global url."https://".insteadOf git:// on all Cassandra nodes, which got me past the problem.

This time I am getting a similar error while running "cstar_perf_bootstrap cassandra-2.0.10", but I am not sure how I can work around it or fix it:
[cnode3] run: ln -s /usr/share/java/jna.jar /fab/cassandra/lib/jna.jar
!!! Parallel execution exception under host u'cnode3':
Process cnode3:
Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python2.7/dist-packages/fabric/tasks.py", line 239, in inner
    submit(task.run(*args, **kwargs))
  File "/usr/local/lib/python2.7/dist-packages/fabric/tasks.py", line 174, in run
    return self.wrapped(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/fabric/decorators.py", line 181, in inner
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/cstar_perf/tool/fab_cassandra.py", line 327, in bootstrap
    fab.get('~/fab/cassandra/conf/cassandra.yaml', conf_file)
  File "/usr/local/lib/python2.7/dist-packages/fabric/network.py", line 647, in host_prompting_wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/fabric/operations.py", line 535, in get
    ftp = SFTP(env.host_string)
  File "/usr/local/lib/python2.7/dist-packages/fabric/sftp.py", line 30, in __init__
    self.ftp = connections[host_string].open_sftp()
  File "/usr/local/lib/python2.7/dist-packages/paramiko/client.py", line 379, in open_sftp
    return self._transport.open_sftp_client()
  File "/usr/local/lib/python2.7/dist-packages/paramiko/transport.py", line 811, in open_sftp_client
    return SFTPClient.from_transport(self)
  File "/usr/local/lib/python2.7/dist-packages/paramiko/sftp_client.py", line 132, in from_transport
    return cls(chan)
  File "/usr/local/lib/python2.7/dist-packages/paramiko/sftp_client.py", line 101, in __init__
    raise SSHException('EOF during negotiation')
SSHException: EOF during negotiation
[alebedev@m0053118] out: fatal: unable to connect to github.com:
[alebedev@m0053118] out: github.com[0: 192.30.252.128]: errno=Connection timed out

perf Test for C* 1.2.x

This is a great tool. Can I use it to test C* version 1.2.x? Is it supported?

Thanks

Ability to cc others on test updates

When configuring a test, you should be able to add others to notify of test updates.

Users should also be able to click a button on the test page to get updates of the test.

I'm also thinking it might be cool to have a global user config to get notified of all tests (or maybe based on some criteria, like a specific cluster).

Future Improvements

This issue tracks some possible improvements that could be made in the future.

  • Ability to compare 2+ operations for the same revision (useful when using stress user profiles). This might need a fresh cluster between operations. Another solution could be the ability to compare the results of 2 tests.
  • Ability to specify what metrics to record and graph.
  • Combine all operations in the same graph (as an option in the UI)?

Repeat test X times

Similar to #2, we should have an option at the bottom of the test creation page to repeat the test X times. Differing from the clone feature, though, the results should all appear on the same job page, so you'll see X charts and X logs.

We could get a bit fancier and rank the results by some consistency metrics, 'pick the most consistent run out of the X jobs' style.
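As a sketch of the ranking idea, assuming each repeat yields a list of interval op/s samples, the most consistent run could be the one with the lowest relative standard deviation:

    def most_consistent(runs):
        # runs: list of runs, each a list of op/s interval samples (assumed shape).
        def rel_stddev(samples):
            mean = sum(samples) / float(len(samples))
            var = sum((s - mean) ** 2 for s in samples) / len(samples)
            return (var ** 0.5) / mean if mean else float('inf')
        return min(runs, key=rel_stddev)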

Apply label to revision if git refspec is duplicated

If you have two revisions with the same git refspec and you forget to apply a label, they will show up as the same color on the chart. We could automatically apply a label in case the user forgets.

E.g., if revision='apache/cassandra-2.1' and there are multiple revisions with that version, apply the labels 'apache/cassandra-2.1 (1)', 'apache/cassandra-2.1 (2)', etc.
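A minimal sketch of the deduplication; the 'revision' and 'label' keys are assumed field names:

    from collections import Counter

    def dedupe_labels(revisions):
        # Append ' (1)', ' (2)', ... to unlabeled revisions whose refspec repeats.
        totals = Counter(r['revision'] for r in revisions)
        seen = Counter()
        for r in revisions:
            ref = r['revision']
            if totals[ref] > 1 and not r.get('label'):
                seen[ref] += 1
                r['label'] = '%s (%d)' % (ref, seen[ref])
        return revisions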

user profile is broken due to new output format of stress

I'm currently using an older version of stress from SHA 184bb65fca on the clusters, due to a change in how the output of user profiles is handled.

  • In the old version, it outputs one line of stats for the entire mixed operation:
Running [singlepost, timeline1, timeline2] with 500 threads for 19000000 iteration
total ops , adj row/s,    op/s,    pk/s,   row/s,    mean,     med,     .95,     .99,    .999,     max,   time,   stderr,  gc: #,  max ms,  sum ms,  sdv ms,      mb
218246    ,    218195,  218195,  218195,  218195,     2.3,     1.7,     4.9,     7.1,    42.1,    45.4,    1.0,  0.00000,      1,      22,      22,       0,    1636
471482    ,    246739,  246739,  246739,  246739,     2.0,     1.7,     4.0,     5.0,    17.4,    20.5,    2.0,  0.00000,      0,       0,       0,       0,       0
699553    ,    221421,  221421,  221421,  221421,     2.2,     1.6,     4.8,     6.4,    34.6,    40.6,    3.1,  0.04341,      2,      52,      52,       3,    3275
  • In the new version, it outputs a separate line for each operation:
Running [singlepost, timeline1, timeline2] with 500 threads for 19000000 iteration
type,      total ops,    op/s,    pk/s,   row/s,    mean,     med,     .95,     .99,    .999,     max,   time,   stderr, errors,  gc: #,  max ms,  sum ms,  sdv ms,      mb
singlepost,     73025,   72989,   72989,   72989,     2.3,     1.5,     4.1,    17.4,    85.6,    94.6,    1.0,  0.00000,      0,      3,      83,     107,      13,    3264
timeline1,     73389,   73335,   73335,   73335,     2.3,     1.6,     4.0,    17.5,    85.7,    94.3,    1.0,  0.00000,      0,      3,      83,     107,      13,    3264
timeline2,     72413,   72339,   72339,   72339,     2.3,     1.6,     4.1,    17.5,    85.8,    93.0,    1.0,  0.00000,      0,      3,      83,     107,      13,    3264
total,        218827,  218602,  218602,  218602,     2.3,     1.6,     3.9,    11.0,    85.0,    94.6,    1.0,  0.00000,      0,      3,      83,     107,      13,    3264

Looks like we just need to ignore the lines that don't say total at the front.
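A sketch of that filter: keep old-format data rows (which start with a number) and new-format aggregate rows (which start with "total"):

    def aggregate_lines(stress_output):
        # Yield only rows with whole-run stats, skipping per-operation rows
        # and header lines in either output format.
        for line in stress_output.splitlines():
            first = line.split(',')[0].strip()
            if first == 'total' or first.isdigit():
                yield line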

Smarter git fetch

Right now we fetch from all configured repositories on every job. We can inspect the revisions and infer which repositories they are from, as long as they are not a git SHA, and only update those.
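A sketch of the inference, assuming revisions take the refspec form repo/branch and that a bare hex string is a SHA whose repository can't be inferred:

    import re

    def repos_to_fetch(revisions, configured_repos):
        # Return only the repos actually referenced. A bare SHA forces a full
        # fetch because we can't tell which repository it belongs to.
        needed = set()
        for rev in revisions:
            if re.match(r'^[0-9a-f]{7,40}$', rev):
                return set(configured_repos)
            repo = rev.split('/', 1)[0]
            if repo in configured_repos:
                needed.add(repo)
        return needed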

Allow passing cassandra configuration options for cstar_perf_bootstrap

Right now cstar_perf_bootstrap only allows you to select the stock version you want; one should also be able to pass in yaml and env configuration options. Since multiple options would be messy on the command line, I agree that a JSON format similar to cstar_perf_stress is probably the way to go.

human review queue

Until the (inevitable) robot uprising, human intuition is required to find interesting patterns and potential performance regressions in test results.

Automatic test submission is going to be a great way to monitor performance characteristics over time, but it's all for nought if no one actually looks at them.

We should differentiate tests submitted programmatically and allow those tests to opt in to a review queue of some kind. This might find multiple uses, but an obvious one would be building a list of test results to be reviewed by a human, with a simple acknowledgement of the results.

I'd propose this feature be built dumb-simple, with tests required to opt in to be included, so everything works basically the same as before for anyone using cstar_perf in the wild who may not be interested in a review queue.

Admin interface

Right now, users and clusters have to be added manually to the database. We should have an admin interface for adding them via the frontend.

Permit uploading files to reference in the stress command

The new schema stress facility requires a profile file to be uploaded. This could be pasted into a special box, but perhaps more generally useful would be support for uploading an arbitrary file to reference in the stress command. Either works.

Allow inverting test so that a single graph compares multiple stress commands

...and the operation dropdown changes to become the version selection.

Currently the graphing capabilities are only suited to comparing versions, though I could possibly get around this by entering the same version multiple times and running a test.

Example: I create a test because I'm curious about stress write perf with differing thread counts. I add apache/cassandra-2.1 and apache/trunk. I add stress steps for writing with 50 threads and writing with 100 threads. I select "invert test" (or something like that). The test is then performed, but graphed so that the bottom-left legend shows the stress variations instead of versions, and the operation dropdown on the right side of the graph allows toggling between the two Cassandra versions tested.

As a user I want a place to put artifacts that will be captured after a cstar run

I want to be able to create log files, profiler output, graphs, etc. from within the database, stress client, and shell scripts that run as part of a test.

I want those artifacts to be collected from each node and stored as a tgz that is available after the run.

One way to make this path available would be to set an environment variable with a path whose contents will be collected at the end.
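A sketch of that approach; the environment variable name is an assumption:

    import os
    import tarfile

    ARTIFACT_ENV = "CSTAR_PERF_ARTIFACTS"  # hypothetical variable name

    def collect_artifacts(node_name, dest_pattern="artifacts-%s.tgz"):
        # Tar up whatever the test dropped into the advertised artifact dir.
        path = os.environ.get(ARTIFACT_ENV)
        if path and os.path.isdir(path):
            with tarfile.open(dest_pattern % node_name, "w:gz") as tar:
                tar.add(path, arcname=node_name)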

Support profilers

Would be great to be able to safely wire up a profiler for a subset of nodes to track the behaviour of the C* process. Support for YourKit and VTune would be great.

Probably the easiest way to do this is to install the necessary software and simply permit executing an arbitrary command with access to the C* PID, followed by a short sleep before starting the test. The command would then be killed after each run and restarted, so we get clearly delineated results. We just need a way of grabbing and delivering the result files.
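A sketch of that attach/sleep/kill cycle; the profiler command is whatever was installed on the node and is purely illustrative:

    import subprocess
    import time

    def attach_profiler(profiler_cmd, cassandra_pid, settle_secs=5):
        # Start an arbitrary profiler against the C* PID, wait briefly so it
        # settles, and hand back the process so it can be killed after the run.
        proc = subprocess.Popen(list(profiler_cmd) + [str(cassandra_pid)])
        time.sleep(settle_secs)
        return proc  # caller calls proc.kill() after each run for clean results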
