cloud-bulldozer / benchmark-wrapper

Python Library to run benchmarks

Home Page: https://benchmark-wrapper.readthedocs.io

License: Apache License 2.0

Python 93.47% Shell 2.34% Dockerfile 4.19%

benchmark-wrapper's Issues

stressng wrapper crashes - stressng output file not written to read-only file system

On an OpenShift 4.5 cluster the stressng benchmark never completes successfully: its working directory is read-only, so stress-ng cannot write the YAML result file. The stress-ng process itself completes, but snafu exits unsuccessfully when it cannot load that YAML result file:

stress-ng: error: [15] Cannot open log file stressng.log
stress-ng: debug: [15] 4 processors online, 4 processors configured
stress-ng: info:  [15] Working directory / is not read/writeable, some I/O tests may fail
(..)
stress-ng: info:  [15] successful run completed in 30.00s
stress-ng: error: [15] Cannot output YAML data to stressng.yml
(..)
2020-10-28T16:36:26Z - INFO     - MainProcess - trigger_stressng: Starting output parsing
Traceback (most recent call last):
  File "/usr/local/bin/run_snafu", line 33, in <module>
    sys.exit(load_entry_point('snafu', 'console_scripts', 'run_snafu')())
  File "/opt/snafu/snafu/run_snafu.py", line 122, in main
    for i in process_generator(index_args, parser):
  File "/opt/snafu/snafu/run_snafu.py", line 141, in process_generator
    for action, index in data_object.emit_actions():
  File "/opt/snafu/snafu/stressng_wrapper/trigger_stressng.py", line 90, in emit_actions
    data = self._parse_outfile()
  File "/opt/snafu/snafu/stressng_wrapper/trigger_stressng.py", line 27, in _parse_outfile
    stream = open('stressng.yml', 'r')
FileNotFoundError: [Errno 2] No such file or directory: 'stressng.yml'

The benchmark is left in a Running state while the pod started from the job exits in an Error state until the job reaches its backoff limit:

Events:
  Type     Reason                Age    From            Message
  ----     ------                ----   ----            -------
  Normal   SuccessfulCreate      12m    job-controller  Created pod: stressng-workload-4c8ac04d-w8jlr
  Normal   SuccessfulDelete      3m15s  job-controller  Deleted pod: stressng-workload-4c8ac04d-w8jlr
  Warning  BackoffLimitExceeded  3m15s  job-controller  Job has reached the specified backoff limit
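
One possible mitigation is to run the tool from a directory that is known to be writable and to fail with a clear message when the result file is missing. The following is only a sketch of that idea: the helper names pick_writable_dir and read_result_file are hypothetical and are not part of snafu, and the sketch does not invoke stress-ng itself.

```python
import os
import tempfile


def pick_writable_dir(preferred="."):
    """Return `preferred` if it is writable, otherwise a fresh temp directory."""
    if os.access(preferred, os.W_OK | os.X_OK):
        return preferred
    # Assumption: the container has a writable temp location.
    return tempfile.mkdtemp(prefix="stressng-")


def read_result_file(path):
    """Read a result file, raising a descriptive error if it was never written."""
    if not os.path.isfile(path):
        raise RuntimeError(
            f"result file {path!r} was not written; "
            "is the working directory read-only?"
        )
    with open(path, "r") as stream:
        return stream.read()
```

A wrapper could change into the directory returned by pick_writable_dir before launching the benchmark, so that relative output paths such as stressng.yml resolve to a writable location.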

No comments in the code

I started working on the wrapper scripts for smallfile by reading through the code of the currently available benchmarks. It was rough going, because the code has no comments. I strongly recommend adopting the practice of commenting code, both for what already lives in snafu and for what is added in future.

Labels for CI

@aakarshg Can you please add the ok to test, etc labels for CI?

Improve documentation for supported wrappers

We have documentation on the supported workload wrappers and on how to add a new workload wrapper. It might be useful to also include a brief description of each benchmark and its wrapper, the metrics it collects and indexes, and instructions on how to run the wrapper locally using podman.

Add CLI Option to Purge Empty Fields in ES

Currently we don't check whether a field is non-empty before shipping it off to ES. This can lead to a lot of empty fields being sent out, depending on the use case and the benchmark. For instance, here is a document from a Uperf CI test run by ripsaw, where most of the fields are populated through environment variables:

{
    "_index" : "ripsaw-uperf-results-000002",
    "_type" : "_doc",
    "_id" : "297e5bb466870ad7f90916e68f60c440a43be6dbe0707ab6f3c4d7e30007e807",
    "_score" : 7.654378,
    "_source" : {
        "workload" : "uperf",
        "uuid" : "79690e49-479e-593a-8a51-0a1ef032de88",
        "user" : "ripsaw",
        "cluster_name" : "myk8scluster",
        "hostnetwork" : "True",
        "iteration" : 2,
        "remote_ip" : "10.0.133.30",
        "client_ips" : "10.0.173.62 10.130.0.1 ",
        "uperf_ts" : "2021-06-01T22:32:39.594000",
        "service_ip" : "False",
        "bytes" : 519864320,
        "norm_byte" : 258605056,
        "ops" : 1015360,
        "norm_ops" : 505088,
        "norm_ltcy" : 2.3778479412250935,
        "kind" : "pod",
        "client_node" : "ip-10-0-173-62.us-west-2.compute.internal",
        "server_node" : "unknown",
        "num_pairs" : "1",
        "multus_client" : "",
        "networkpolicy" : "",
        "density" : "1",
        "nodes_in_iter" : "1",
        "step_size" : "",
        "colocate" : "False",
        "density_range" : [ ],
        "node_range" : [ ],
        "pod_id" : "0",
        "test_type" : "stream",
        "protocol" : "udp",
        "message_size" : 512,
        "read_message_size" : 512,
        "num_threads" : 2,
        "duration" : 3,
        "run_id" : "NA"
    }
}

And here is a document exported from just running the command run_snafu --tool uperf --user ryan --uuid 1234 --proto tcp --remoteip localhost -w iperf.xml --resourcetype container -s 1 --verbose:

{
    "_index": "snafu-uperf-results",
    "_op_type": "create",
    "_source": {
        "test_type": "",
        "protocol": "tcp",
        "message_size": null,
        "read_message_size": null,
        "num_threads": 1,
        "duration": 31,
        "kind": "container",
        "hostnetwork": "False",
        "remote_ip": "localhost",
        "client_ips": "",
        "service_ip": "False",
        "client_node": "",
        "server_node": "",
        "num_pairs": "",
        "multus_client": "",
        "networkpolicy": "",
        "density": "",
        "nodes_in_iter": "",
        "step_size": "",
        "colocate": "",
        "density_range": "",
        "node_range": "",
        "pod_id": null,
        "uperf_ts": "2021-06-28T14:42:37.066000",
        "bytes": 48083435520,
        "norm_byte": 1505771520,
        "ops": 5869560,
        "norm_ops": 183810,
        "norm_ltcy": 6.534098878427453,
        "iteration": 1,
        "user": "ryan",
        "uuid": "1234",
        "workload": "uperf",
        "run_id": "NA"
    },
    "_id": "ae6d9dfc7083e94c569d1999c2eb2ae1dce4a77fc5c7052c128103783f9acc70",
    "run_id": "NA"
}

I think it would be cool to add in a CLI option called --no-empty-fields or something, that would remove any field from exported documents which is null or an empty string. This way teams only get the fields and the data that they care about, rather than also getting the extra fields which we use as a team.
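
A minimal sketch of the filtering such an option could apply is shown here. The function name drop_empty_fields is hypothetical; it removes only null values and empty strings, as described above, and deliberately keeps other "falsy" values such as 0 or an empty list.

```python
def drop_empty_fields(document):
    """Return a copy of `document` without None or empty-string values."""
    return {
        key: value
        for key, value in document.items()
        if value is not None and value != ""
    }


doc = {
    "protocol": "tcp",
    "message_size": None,
    "client_ips": "",
    "num_threads": 1,
    "hostnetwork": "False",
}
print(drop_empty_fields(doc))
```

Whether empty lists (such as density_range in the first document) should also be dropped is a separate design decision that this sketch leaves open.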

why modify contents of argparse output object?

@acalhounRH in the development branch at this line of code I see this:

self.args.cluster_name = "mycluster"
if "clustername" in os.environ:
    self.args.cluster_name = os.environ["clustername"]

The args structure is generated by the argparse module. Everywhere else you read data from this structure without modifying it, so why modify it here? I think you meant to do:

self.cluster_name = 'mycluster'
if "clustername" in os.environ:
    self.cluster_name = os.environ["clustername"]
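
The same intent can be written more compactly with a default lookup, leaving the argparse namespace untouched. This is only a sketch: the class name Wrapper and the environ parameter are hypothetical and exist here to keep the example self-contained.

```python
import os


class Wrapper:
    """Hypothetical wrapper class illustrating the suggestion above."""

    def __init__(self, args, environ=None):
        environ = os.environ if environ is None else environ
        self.args = args  # parsed arguments are read, never modified
        # Fall back to "mycluster" when the variable is not set.
        self.cluster_name = environ.get("clustername", "mycluster")
```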

Add test case for upgrade wrapper

Benchmark-wrapper needs to test the upgrade wrapper similar to other wrappers in the CI to make sure the code changes are not breaking the functionality. The version to upgrade to can be set to the same version as the CI version to avoid upgrading the OpenShift cluster on which CI is running.

Checking ES in SNAFU CI is redundant

Since checking ES was added into ripsaw checking it again here is redundant. This can simplify the snafu CI process by just checking the return code of the test.

BUG: Snafu CI only triggers a single benchmark test even if PR affects more

I have been noticing an oddity with snafu CI: even if a PR changes more than one benchmark, only the first benchmark is run.

Example PR: #151. This only ran fio, despite changes affecting the Dockerfiles of other benchmarks.

[Upgrade workload] Desired version takes a couple of seconds to switch

The desired version field of the clusterversion takes 2-3 seconds to change after an upgrade has been triggered. This causes problems when capturing the current and desired versions. Introducing a delay (sleep 10) before grabbing the desired version for comparison should fix it.
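
As an alternative to a fixed sleep (this is a different technique from the one proposed above), the wrapper could poll until the value changes, with a timeout. The sketch below assumes a caller-supplied read_value function that returns the current desired version; that function and the name wait_for_change are hypothetical.

```python
import time


def wait_for_change(read_value, initial, timeout=30.0, interval=1.0):
    """Poll read_value() until it differs from `initial` or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while True:
        value = read_value()
        if value != initial:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError(f"value still {initial!r} after {timeout}s")
        time.sleep(interval)
```

Polling returns as soon as the field changes and fails loudly if it never does, whereas a fixed sleep always waits the full duration and silently proceeds.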

Include requirements.txt

SNAFU requires specific packages to be installed to run. We should include what we expect.

fs-drift with cephfs ReadWriteMany : limit of total fields exceeded

When I use fs-drift with cephfs and access_mode: ReadWriteMany, snafu fails to write results for at least 1 pod because of:

            "reason": "Limit of total fields [1000] in index [ripsaw-fs-drift-results-000001] has been exceeded",

I think this happens because all the pods share the same directory tree (intentionally), so each pod sees the others' output JSON and does not filter it out. It can be fixed by selecting only the response time and JSON files that have a matching hostname (i.e. pod name).
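
The proposed filter could look roughly like the sketch below. It assumes, without confirmation from the fs-drift source, that each output file name contains the pod's hostname and that result files end in .json or .csv; the function name own_result_files is hypothetical.

```python
import os
import socket


def own_result_files(filenames, hostname=None):
    """Keep only result files whose base name contains this pod's hostname."""
    hostname = hostname or socket.gethostname()
    return [
        name
        for name in filenames
        # Assumption: the hostname appears somewhere in the file name.
        if hostname in os.path.basename(name)
        and name.endswith((".json", ".csv"))
    ]
```

A plain substring match would also accept a pod whose name merely starts with this hostname, so a real implementation would probably match on a delimiter-bounded name instead.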

urgent: fio_wrapper ci fails on minikube 3.17 with client: host x.x.x.x disconnected

see this pastebin, the key indicator is:

client: host=172.17.0.6 disconnected
client: host=172.17.0.5 disconnected

This is with fio 3.19. I am looking into whether this happens with earlier fio versions.

SNAFU fails with fips enabled

https://github.com/cloud-bulldozer/snafu/blob/08284205cb010f8c6973809e24aac04b50ce1897/src/run_snafu.py#L139

generates error

Traceback (most recent call last):
  File "/usr/local/bin/run_snafu", line 11, in <module>
    load_entry_point('snafu', 'console_scripts', 'run_snafu')()
  File "/opt/snafu/src/run_snafu.py", line 99, in main
    parser))
  File "/opt/snafu/src/utils/py_es_bulk.py", line 156, in streaming_bulk
    for ok, resp_payload in streaming_bulk_generator:
  File "/usr/local/lib/python3.6/site-packages/elasticsearch/helpers/actions.py", line 212, in streaming_bulk
    actions, chunk_size, max_chunk_bytes, client.transport.serializer
  File "/usr/local/lib/python3.6/site-packages/elasticsearch/helpers/actions.py", line 63, in _chunk_actions
    for action, data in actions:
  File "/opt/snafu/src/utils/py_es_bulk.py", line 117, in actions_tracking_closure
    for cl_action in cl_actions:
  File "/opt/snafu/src/run_snafu.py", line 139, in process_generator
    es_valid_document["_id"] = hashlib.md5(str(action).encode()).hexdigest()
ValueError: [digital envelope routines: EVP_DigestInit_ex] disabled for FIPS
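
MD5 is not an approved algorithm in FIPS mode, which is why the call above is rejected. One possible fix, sketched here under the assumption that the _id only needs to be a stable digest of the action, is to switch to SHA-256 from the same hashlib module. The function name document_id is hypothetical.

```python
import hashlib


def document_id(action):
    """Derive a deterministic document ID using SHA-256 instead of MD5."""
    return hashlib.sha256(str(action).encode()).hexdigest()


print(document_id({"workload": "uperf", "iteration": 1}))
```

Changing the hash changes every generated ID, so documents indexed before and after the switch would not deduplicate against each other.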

[uperf] Indexing issue causes result output not to be displayed.

Here is an example :

<?xml version=1.0?>
<profile name="stream-udp-16384-8">
<group nthreads="8">
              <transaction iterations="1">
        <flowop type="connect" options="remotehost=$h protocol=udp"/>
      </transaction>
      <transaction duration="60">
        <flowop type=write options="count=16 size=16384"/>
      </transaction>
      <transaction iterations="1">
        <flowop type=disconnect />
      </transaction>
          </group>
</profile>
Traceback (most recent call last):
  File "/opt/snafu/uperf-wrapper/uperf-wrapper.py", line 227, in <module>
    sys.exit(main())
  File "/opt/snafu/uperf-wrapper/uperf-wrapper.py", line 221, in main
    _index_result("ripsaw-uperf-results",server,port,documents)
  File "/opt/snafu/uperf-wrapper/uperf-wrapper.py", line 27, in _index_result
    es.index(index=index, body=result)
  File "/usr/lib/python2.7/site-packages/elasticsearch/client/utils.py", line 84, in _wrapped
    return func(*args, params=params, **kwargs)
  File "/usr/lib/python2.7/site-packages/elasticsearch/client/__init__.py", line 364, in index
    "POST", _make_path(index, doc_type, id), params=params, body=body
  File "/usr/lib/python2.7/site-packages/elasticsearch/transport.py", line 353, in perform_request
    timeout=timeout,
  File "/usr/lib/python2.7/site-packages/elasticsearch/connection/http_urllib3.py", line 244, in perform_request
    raise ConnectionError("N/A", str(e), e)
elasticsearch.exceptions.ConnectionError: ConnectionError(<urllib3.connection.HTTPConnection object at 0x7fc7b1822e90>: Failed to establish a new connection: [Errno -2] Name or service not known) caused by: NewConnectionError(<urllib3.connection.HTTPConnection object at 0x7fc7b1822e90>: Failed to establish a new connection: [Errno -2] Name or service not known)

We should catch this error, report a connection failure to the user, but still present the results.
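
A sketch of that behaviour follows. To stay self-contained it takes the indexing call as a parameter (index_fn) and the exception types to treat as connection failures as a tuple; both names, and index_or_report itself, are hypothetical. With the Elasticsearch client, elasticsearch.exceptions.ConnectionError (the class in the traceback above) would be passed in that tuple.

```python
import json
import sys


def index_or_report(documents, index_fn, connection_errors=(ConnectionError, OSError)):
    """Try to index every document; always print the results afterwards."""
    try:
        for doc in documents:
            index_fn(doc)
        return True
    except connection_errors as err:
        # Report the failure instead of crashing with a traceback.
        print(f"Could not index results: {err}", file=sys.stderr)
        return False
    finally:
        # Results are shown to the user whether or not indexing worked.
        for doc in documents:
            print(json.dumps(doc))
```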

cyclictest container fails to build

Looks like one of the RPMs cyclictest uses has been purged

dnf -y install https://www.rpmfind.net/linux/centos/8-stream/AppStream/x86_64/os/Packages/rt-tests-1.8-11.el8.x86_64.rpm
08/01/2021, 17:38:47
08/01/2021, 17:38:47 ---> Running in e7d6a1619497
08/01/2021, 17:38:49 Extra Packages for Enterprise Linux Modular 8 - 1.8 MB/s | 527 kB 00:00
08/01/2021, 17:38:49 Extra Packages for Enterprise Linux 8 - x86_64 29 MB/s | 8.7 MB 00:00
08/01/2021, 17:38:53 [MIRROR] rt-tests-1.8-11.el8.x86_64.rpm: Status code: 404 for https://www.rpmfind.net/linux/centos/8-stream/AppStream/x86_64/os/Packages/rt-tests-1.8-11.el8.x86_64.rpm (IP: 195.220.108.108)
08/01/2021, 17:38:53 [MIRROR] rt-tests-1.8-11.el8.x86_64.rpm: Status code: 404 for https://www.rpmfind.net/linux/centos/8-stream/AppStream/x86_64/os/Packages/rt-tests-1.8-11.el8.x86_64.rpm (IP: 195.220.108.108)
08/01/2021, 17:38:53 [MIRROR] rt-tests-1.8-11.el8.x86_64.rpm: Status code: 404 for https://www.rpmfind.net/linux/centos/8-stream/AppStream/x86_64/os/Packages/rt-tests-1.8-11.el8.x86_64.rpm (IP: 195.220.108.108)
08/01/2021, 17:38:53 [MIRROR] rt-tests-1.8-11.el8.x86_64.rpm: Status code: 404 for https://www.rpmfind.net/linux/centos/8-stream/AppStream/x86_64/os/Packages/rt-tests-1.8-11.el8.x86_64.rpm (IP: 195.220.108.108) [FAILED] rt-tests-1.8-11.el8.x86_64.rpm: Status code: 404 for https://www.rpmfind.net/linux/centos/8-stream/AppStream/x86_64/os/Packages/rt-tests-1.8-11.el8.x86_64.rpm (IP: 195.220.108.108)
08/01/2021, 17:38:53 Status code: 404 for https://www.rpmfind.net/linux/centos/8-stream/AppStream/x86_64/os/Packages/rt-tests-1.8-11.el8.x86_64.rpm (IP: 195.220.108.108)

uperf Docker Image creation fails with dependency issues on ppc64le

Hi Team, I am trying to build the uperf image from benchmark-wrapper repository for ppc64le but it keeps failing at the last step - RUN pip3 install -e /opt/snafu/ with errors pointing to missing packages/dependencies as below -

ModuleNotFoundError: No module named 'numpy'
ModuleNotFoundError: No module named 'cython'
RuntimeError: Broken toolchain: cannot link a simple C program
gcc: error trying to exec 'cc1plus': execvp: No such file or directory

Adding these 'missing' packages is not straightforward, as a new one shows up with every image build failure. The expectation was that all dependencies mentioned in setup.cfg would be sufficient, but that isn't the case on ppc64le. The main issue seems to be the scipy and numpy packages.

As per a workaround suggested on perf-scale slack, I tried removing scipy & numpy in setup.cfg but it didn't help and the build failed with error as numpy is again required by prometheus_api_client -

Collecting numpy (from prometheus_api_client->snafu==0.0.1)
  Downloading https://files.pythonhosted.org/packages/c5/63/a48648ebc57711348420670bb074998f79828291f68aebfff1642be212ec/numpy-1.19.4.zip (7.3MB)
    100% |████████████████████████████████| 7.3MB 175kB/s 
Collecting pandas>=1.0.0 (from prometheus_api_client->snafu==0.0.1)
  Downloading https://files.pythonhosted.org/packages/09/39/fb93ed98962d032963418cd1ea5927b9e11c4c80cb1e0b45dea769d8f9a5/pandas-1.1.4.tar.gz (5.2MB)
    100% |████████████████████████████████| 5.2MB 243kB/s 
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 346, in get_provider
        module = sys.modules[moduleOrReq]
    KeyError: 'numpy'
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-build-6u08iba0/pandas/setup.py", line 792, in <module>
        setup_package()
      File "/tmp/pip-build-6u08iba0/pandas/setup.py", line 762, in setup_package
        ext_modules=maybe_cythonize(extensions, compiler_directives=directives),
      File "/tmp/pip-build-6u08iba0/pandas/setup.py", line 521, in maybe_cythonize
        numpy_incl = pkg_resources.resource_filename("numpy", "core/include")
      File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 1132, in resource_filename
        return get_provider(package_or_requirement).get_resource_filename(
      File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 348, in get_provider
        __import__(moduleOrReq)
    ModuleNotFoundError: No module named 'numpy'
    
    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-6u08iba0/pandas/
Error: error building at STEP "RUN pip3 install -e /opt/snafu/": error while running runtime: exit status 1

Please help suggest a fix or a workaround for the same.

Add human readable description to ES documents

With the increasing amount of tests, it is getting difficult to locate/identify a given test. By adding a human readable description to the ES Documents users would be able to describe each test, and this would assist with future analysis/look up.

[FIO]: Elasticsearch index 'ripsaw-fio-analyzed-result' doesn't record the std-deviation for randomread and randomwrite

We depend on the std-dev value in ripsaw-fio-analyzed-result in ocs-ci to evaluate whether the data sample is valid before proceeding with regression validation.

[RFE] Snafu ci rebuilds ripsaw image for every benchmark

Currently the snafu CI or err ripsaw CI rebuilds the ripsaw image for every benchmark. We can just build the image for benchmark-operator once and use it for snafu testing.

perf-lta-es not the clustername

When I run the smallfile wrapper, run_snafu.py logs the messages below and the cluster_name is wrong. I am not sure where perf-lta-es comes from as a cluster name; I did not specify it anywhere, either in the wrapper or in environment variables.

2019-11-04T15:33:33Z - INFO     - MainProcess - 
run_snafu: Connected to the elasticsearch cluster with info as follows:
{
u'cluster_name': u'perf-lta-es', 
u'cluster_uuid': u'fLlQtf5cTiqtRoc02RaQdQ', 
u'version': {u'build_date': u'2019-09-06T14:40:30.409026Z', 
u'minimum_wire_compatibility_version': u'6.8.0', 
u'build_hash': u'1c1faf1', 
u'number': u'7.3.2', 
u'lucene_version': u'8.1.0', 
u'minimum_index_compatibility_version': u'6.0.0-beta1', 
u'build_flavor': u'default', 
u'build_snapshot': False, 
u'build_type': u'rpm'}, 
u'name': u'perf-es-1', 
u'tagline': u'You Know, for Search'
}
2019-11-04T15:33:33Z - INFO     - MainProcess - 
wrapper_factory: identified smallfile as the benchmark wrapper

Can we not print stuff that's overridden in the wrapper? Or make more common variables inherited by all wrappers (i.e. not implemented in per-wrapper parsing)? I could help with this.

Invalid name(s) for python module

All wrapper (script) names and directory names need to be changed to be compliant with Python module syntax standards. If this is not resolved, Python will not import those modules.

Change dash "-" to underscore "_" in all directory names and script names.

https://docs.python.org/3/reference/lexical_analysis.html#identifiers
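
The rename rule can be checked mechanically with str.isidentifier() and the keyword module from the standard library. The helper name to_module_name below is hypothetical; it is a sketch of the check, not a migration script.

```python
import keyword


def to_module_name(name):
    """Convert a dashed name to an importable module name, or raise."""
    candidate = name.replace("-", "_")
    if not candidate.isidentifier() or keyword.iskeyword(candidate):
        raise ValueError(f"{name!r} cannot be made a valid module name")
    return candidate


for old in ("uperf-wrapper", "fs-drift", "ycsb-wrapper"):
    print(old, "->", to_module_name(old))
```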

YCSB Wrapper issue

Need more checking for the output of YCSB -- for example running with YCSB.

[OVERALL], RunTime(ms), 2471
[OVERALL], Throughput(ops/sec), 40.46944556859571
[TOTAL_GCS_Copy], Count, 1
[TOTAL_GC_TIME_Copy], Time(ms), 17
[TOTAL_GC_TIME_%Copy], Time(%), 0.6879805746661272
[TOTAL_GCS_MarkSweepCompact], Count, 0
[TOTAL_GC_TIME_MarkSweepCompact], Time(ms), 0
[TOTAL_GC_TIME%MarkSweepCompact], Time(%), 0.0
[TOTAL_GCs], Count, 1
[TOTAL_GC_TIME], Time(ms), 17
[TOTAL_GC_TIME%], Time(%), 0.6879805746661272
[CLEANUP], Operations, 1
[CLEANUP], AverageLatency(us), 3.0
[CLEANUP], MinLatency(us), 3
[CLEANUP], MaxLatency(us), 3
[CLEANUP], 95thPercentileLatency(us), 3
[CLEANUP], 99thPercentileLatency(us), 3
[INSERT], Operations, 100
[INSERT], AverageLatency(us), 9752.74
[INSERT], MinLatency(us), 1739
[INSERT], MaxLatency(us), 41407
[INSERT], 95thPercentileLatency(us), 22831
[INSERT], 99thPercentileLatency(us), 39167
[INSERT], Return=OK, 100
java -cp /ycsb/couchbase2-binding/conf:/ycsb/conf:/ycsb/lib/HdrHistogram-2.1.4.jar:/ycsb/lib/core-0.15.0.jar:/ycsb/lib/htrace-core4-4.1.0-incubating.jar:/ycsb/lib/jackson-core-asl-1.9.4.jar:/ycsb/lib/jackson-mapper-asl-1.9.4.jar:/ycsb/couchbase2-binding/lib/core-io-1.3.1.jar:/ycsb/couchbase2-binding/lib/couchbase2-binding-0.15.0.jar:/ycsb/couchbase2-binding/lib/java-client-2.3.1.jar:/ycsb/couchbase2-binding/lib/rxjava-1.1.5.jar com.yahoo.ycsb.Client -db com.yahoo.ycsb.db.couchbase2.Couchbase2Client -s -P /tmp/ycsb/workloada -p couchbase.host=cb-benchmark-0000.cb-benchmark.builder-infra.svc -p couchbase.password=password -load
Command line: -db com.yahoo.ycsb.db.couchbase2.Couchbase2Client -s -P /tmp/ycsb/workloada -p couchbase.host=cb-benchmark-0000.cb-benchmark.builder-infra.svc -p couchbase.password=password -load
YCSB Client 0.15.0

Loading workload...
Starting test.
2019-08-03 13:52:22:084 0 sec: 0 operations; est completion in 0 second
Aug 03, 2019 1:52:22 PM com.couchbase.client.core.env.DefaultCoreEnvironment
INFO: ioPoolSize is less than 3 (1), setting to: 3
Aug 03, 2019 1:52:22 PM com.couchbase.client.core.env.DefaultCoreEnvironment
INFO: computationPoolSize is less than 3 (1), setting to: 3
Aug 03, 2019 1:52:22 PM com.yahoo.ycsb.db.couchbase2.Couchbase2Client logParams
INFO: ===> Using Params: host=cb-benchmark-0000.cb-benchmark.builder-infra.svc, bucket=default, upsert=false, persistTo=NONE, replicateTo=NONE, syncMutResponse=true, adhoc=false, kv=true, maxParallelism=1, queryEndpoints=1, kvEndpoints=1, queryEndpoints=1, epoll=false, boost=3, networkMetricsInterval=0, runtimeMetricsInterval=0
Aug 03, 2019 1:52:22 PM com.couchbase.client.core.CouchbaseCore
INFO: CouchbaseEnvironment: {sslEnabled=false, sslKeystoreFile='null', sslKeystorePassword=false, sslKeystore=null, bootstrapHttpEnabled=true, bootstrapCarrierEnabled=true, bootstrapHttpDirectPort=8091, bootstrapHttpSslPort=18091, bootstrapCarrierDirectPort=11210, bootstrapCarrierSslPort=11207, ioPoolSize=3, computationPoolSize=3, responseBufferSize=16384, requestBufferSize=16384, kvServiceEndpoints=1, viewServiceEndpoints=1, queryServiceEndpoints=1, searchServiceEndpoints=1, ioPool=NioEventLoopGroup, coreScheduler=CoreScheduler, eventBus=DefaultEventBus, packageNameAndVersion=couchbase-java-client/2.3.1 (git: 2.3.1, core: 1.3.1), dcpEnabled=false, retryStrategy=BestEffort, maxRequestLifetime=75000, retryDelay=ExponentialDelay{growBy 1.0 MICROSECONDS, powers of 2; lower=100, upper=100000}, reconnectDelay=ExponentialDelay{growBy 1.0 MILLISECONDS, powers of 2; lower=32, upper=4096}, observeIntervalDelay=ExponentialDelay{growBy 1.0 MICROSECONDS, powers of 2; lower=10, upper=100000}, keepAliveInterval=30000, autoreleaseAfter=2000, bufferPoolingEnabled=true, tcpNodelayEnabled=true, mutationTokensEnabled=false, socketConnectTimeout=10000, dcpConnectionBufferSize=20971520, dcpConnectionBufferAckThreshold=0.2, dcpConnectionName=dcp/core-io, callbacksOnIoPool=true, disconnectTimeout=25000, requestBufferWaitStrategy=com.couchbase.client.core.env.DefaultCoreEnvironment$2@419a6765, queryTimeout=75000, viewTimeout=75000, kvTimeout=10000, connectTimeout=30000, dnsSrvEnabled=false}
Aug 03, 2019 1:52:22 PM com.couchbase.client.core.node.CouchbaseNode signalConnected
INFO: Connected to Node cb-benchmark-0000.cb-benchmark.builder-infra.svc
Aug 03, 2019 1:52:23 PM com.couchbase.client.core.node.CouchbaseNode signalConnected
INFO: Connected to Node cb-benchmark-0001.cb-benchmark.builder-infra.svc
Aug 03, 2019 1:52:23 PM com.couchbase.client.core.config.DefaultConfigurationProvider$8 call
INFO: Opened bucket default
DBWrapper: report latency for each error is false and specific error codes to track for latency are: []
Aug 03, 2019 1:52:23 PM com.couchbase.client.core.node.CouchbaseNode signalConnected
INFO: Connected to Node cb-benchmark-0002.cb-benchmark.builder-infra.svc
2019-08-03 13:52:24:475 2 sec: 100 operations; 40.58 current ops/sec; [CLEANUP: Count=1, Max=3, Min=3, Avg=3, 90=3, 99=3, 99.9=3, 99.99=3] [INSERT: Count=100, Max=41407, Min=1739, Avg=9752.74, 90=18271, 99=39167, 99.9=41407, 99.99=41407]
Traceback (most recent call last):
  File "/opt/snafu/ycsb-wrapper/ycsb-wrapper.py", line 191, in <module>
    sys.exit(main())
  File "/opt/snafu/ycsb-wrapper/ycsb-wrapper.py", line 180, in main
    documents,summary = _json_payload(data,args.run[0],uuid,user,phase,workload,args.driver[0],recordcount,operationcount)
  File "/opt/snafu/ycsb-wrapper/ycsb-wrapper.py", line 83, in _json_payload
    summary_dict[summ[0].strip('[').strip(']')][summ[1]] = float(summ[2])
ValueError: could not convert string to float: runtimeMetricsInterval=0
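
A more defensive parser would accept only lines that look like YCSB summary rows ("[SECTION], metric, value") and skip everything else, including the client's INFO logging. The sketch below is an assumption about how that check could be written; parse_summary is a hypothetical name and is not the wrapper's actual _json_payload.

```python
def parse_summary(lines):
    """Collect '[SECTION], metric, value' rows into a nested dict."""
    summary = {}
    for line in lines:
        if not line.startswith("["):
            continue  # log output, command lines, progress reports
        parts = [part.strip() for part in line.split(",")]
        if len(parts) != 3:
            continue  # not a three-column summary row
        section, metric, raw = parts
        try:
            value = float(raw)
        except ValueError:
            continue  # third column is not numeric
        summary.setdefault(section.strip("[]"), {})[metric] = value
    return summary


sample = [
    "[OVERALL], RunTime(ms), 2471",
    "INFO: ===> Using Params: host=example, bucket=default, runtimeMetricsInterval=0",
    "[INSERT], Return=OK, 100",
]
print(parse_summary(sample))
```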

Add indexing for pgbench progress report output

Snafu should be capable of indexing the progress report that is generated by pgbench. A detailed description of the progress report is below.

"...The report includes the time since the beginning of the run, the tps since the last report, and the transaction latency average and standard deviation since the last report. Under throttling (-R), the latency is computed with respect to the transaction scheduled start time, not the actual transaction beginning time, thus it also includes the average schedule lag time."

[RFE] Add a linter job

The project needs a linting job.

Write data to file if no ES is provided or fails

It would be useful to write the data to be indexed to a file when ES fails or is not provided. This would allow us to retrieve the data separately and not have to do a full re-run if an issue is hit.
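
One simple form this could take is a newline-delimited JSON file, which can be re-read and indexed later one document at a time. The sketch below is an assumption about the format, not a decided design; dump_documents is a hypothetical name.

```python
import json


def dump_documents(documents, path):
    """Append each document to `path` as one JSON object per line."""
    count = 0
    with open(path, "a", encoding="utf-8") as out:
        for doc in documents:
            out.write(json.dumps(doc, sort_keys=True) + "\n")
            count += 1
    return count
```

Because the function consumes any iterable, it could be fed the same stream of documents that would otherwise go to the indexer.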

[RFE] Uperf missing number of pairs in documents

Uperf docs are missing number of pairs being used for a particular test

[RFE] include ycsb arguments with the summary docs

Today the ycsb summary docs don't include the ycsb arguments. Including them would help ensure an apples-to-apples comparison.

Inefficient indexing

Currently, indexing opens a connection to Elasticsearch for each test and document type. The recommended method is to open a single connection and, as documents are created and parsed, emit them to the indexer. This can be achieved with py_es_bulk and Python generators.

pseudo code follows:

def main():
    # handle args
    # initialize es
    status = py_es_bulk.streaming_bulk(es, process_generator(args))

def process_generator(args):
    object_generator = process_data(args)

    for obj in object_generator:
        for action in obj.emit_actions():
            yield action

def process_data(args):

    for i in args.sample:
        # _trigger_fio would become an object containing the emit_actions method
        trigger_fio_generator = _trigger_fio(args)
        yield trigger_fio_generator
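
The generator chain above can be exercised without Elasticsearch. The runnable sketch below uses stand-ins throughout: TriggerFio and fake_streaming_bulk are hypothetical, the latter only counts the actions it is handed, and the sample count is passed directly instead of through parsed arguments.

```python
class TriggerFio:
    """Stand-in for a benchmark object that exposes emit_actions()."""

    def __init__(self, sample):
        self.sample = sample

    def emit_actions(self):
        # A real wrapper would run the benchmark and parse its output here.
        for job in ("read", "write"):
            yield {"sample": self.sample, "job": job}


def process_data(samples):
    """Lazily create one benchmark object per sample."""
    for sample in range(1, samples + 1):
        yield TriggerFio(sample)


def process_generator(samples):
    """Flatten every object's actions into a single stream."""
    for obj in process_data(samples):
        yield from obj.emit_actions()


def fake_streaming_bulk(actions):
    """Stand-in for the bulk indexer: consume the stream and count it."""
    return sum(1 for _ in actions)


print(fake_streaming_bulk(process_generator(3)))
```

Nothing runs until the consumer pulls from the stream, so a single indexer connection can receive documents as each sample finishes.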

Determine and Implement Versioning Schema

Goal here is to discuss and then determine how we want to version snafu. When we make a determination, we can update the docs and make the necessary config changes as needed.

CI testing for VM/CNV

Spawning this issue from #261 which was caused because we did not test on anything outside of openshift (ie VMs, etc).

We currently do not have a method of CI testing CNV/VMs for benchmark-wrapper. As we continue to move forward, this is something we should look into, as it will likely become more prevalent.

cc: @sarahbx @jtaleric @amitsagtani97

Write raw data to ES

Previously we would write the raw data to ES base64-encoded. However, we have gotten away from that. We should re-enable this in our workflow.

Failed to account for ramp time when processing logs...

Failing to account for ramp time when processing logs. As specified by FIO ramp time is:

If set, fio will run the specified workload for this amount of time before logging any performance numbers. Useful for letting performance settle before logging results, thus minimizing the runtime required for stable results. Note that the ramp_time is considered lead in time for a job, thus it will increase the total runtime if a special timeout or runtime is specified.

Because of this, whenever ramp time is used, the logged start time must be adjusted by it before processing log results. If this is not fixed and ramp time is used, results will be indexed with incorrect timing.

if ramp_time:
    start_time = ramp_time + fio_timestamp
else:
    start_time = fio_timestamp

For example, see the graph: a ramp time of 300s and a run time of 600s were used. Total IOPS should be delayed by 300s, but it is recorded as starting at the beginning of the testing period.
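
With datetime objects the adjustment is a single addition. The sketch below assumes the fio timestamp has already been parsed into a datetime and that ramp_time is given in seconds; adjusted_start is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone


def adjusted_start(fio_timestamp, ramp_time=0):
    """Shift the logged start time forward by the ramp time, if any."""
    return fio_timestamp + timedelta(seconds=ramp_time or 0)


start = datetime(2020, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
print(adjusted_start(start, ramp_time=300).isoformat())
```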

Consolidate top level directory

With more benchmark scripts getting added the top level directory is getting cluttered, suggest creating a benchmark_wrapper directory and moving all wrapper directories into it.

python2.7 is being deprecated

see this message from pip:

Step 11/14 : RUN pip install "elasticsearch>=6.0.0,<=7.0.2"
 ---> Running in 31c46c82647f
DEPRECATION: Python 2.7 will reach the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 won't be maintained after that date. A future version of pip will drop support for Python 2.7. More details about Python 2 support in pip, can be found at https://pip.pypa.io/en/latest/development/release-process/#python-2-support

It's time, right? There are no platforms that we support that don't have python3 at this point, along with packages that we need, right? for example, python3-elasticsearch RPM.

no cluster_loader/ci_test.sh

causes CI to fail. For example, see this log error

Add indexing of PGbench report latency output

Snafu should be capable of indexing the report latency output that is generated by pgbench. A detailed description of the report latency option is below.

"Report the average per-statement latency (execution time from the perspective of the client) of each command after the benchmark finishes. See below for details."

Lack of usage documentation

When looking to use run_snafu for the first time, there is no documentation on how to actually run any of the existing benchmarks. There is some information in the "building your own" section, but there is no "Running Snafu" section. Similarly, there is no documentation for each of the wrappers. Lastly, there is no "help" functionality in run_snafu, so even running it blindly does not produce a failure message that describes usage.

[RFE] Publish a pip package

Snafu needs to be published as a pip-installable Python package, to make it easier to install, upgrade and run. This will also put more emphasis on snafu being used independently of ripsaw.

CI reliability

The following #109 had syntax errors which weren't detected by CI (fixed by #110). CI needs some love as it doesn't look very reliable in certain scenarios.

[rfe] define shared message bus keys

If we have a message bus available, we should have a defined interface to interact with the bus.

Redis - As message bus

Each workload is a hash
hset uperf status "0|1|2"

0 - Complete
1 - Failed
2 - Running

By default have a status with each workload.
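
On the Python side the status codes could be given names, with a small helper that issues the hset shown above. In this sketch the client is passed in and only needs an hset(name, key, value) method, which matches the redis-py client call of the same name; the Status enum and set_status helper are hypothetical.

```python
from enum import IntEnum


class Status(IntEnum):
    """Status codes as listed above."""

    COMPLETE = 0
    FAILED = 1
    RUNNING = 2


def set_status(client, workload, status):
    """Equivalent of: hset <workload> status <code>."""
    client.hset(workload, "status", int(Status(status)))
```

For example, set_status(client, "uperf", Status.RUNNING) would mark the uperf workload as running.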

smallfile pods not synchronized

I've been reminded (by using elastic search and kibana!) that smallfile ripsaw pods are not synchronizing with each other, so if you request a sequence of operations, such as create,read,append,delete and multiple samples of each, pod 1 may start reads before pod 2 has finished creates, etc. and over time this becomes worse and worse, preventing us from really measuring what throughput and response time is generated by each operation type. The solution is probably to use redis to synchronize the different pods so that pod 1 doesn't start its read until pods other than pod 1 have finished their creates, and so on. smallfile is still usable to some extent but it would be better and more scalable if this was fixed.

RFE: add version to indexed document

We should look at adding snafu version to the indexed document.

uperf wrapper bug

If UPerf fails to execute, it drops a Python error into stdout.

Instead we should consider one or a combination of the below:

1. Retry running UPerf
2. Check the stderr and notify the user the workload is failing
3. Both 1 and 2.
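The combined option could look roughly like the following. It is a minimal sketch rather than the wrapper's actual code: `run_with_retries` is a hypothetical helper, and the command passed in would be whatever uperf invocation the wrapper already builds.

```python
import subprocess
import sys


def run_with_retries(cmd, retries=3):
    """Run `cmd`; on a non-zero exit, report stderr and retry up to `retries` attempts.

    Returns the CompletedProcess of the last attempt.
    """
    result = None
    for attempt in range(1, retries + 1):
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
        )
        if result.returncode == 0:
            return result
        # Tell the user the workload is failing instead of raising a traceback.
        print(
            f"attempt {attempt}/{retries} failed (rc={result.returncode}): "
            f"{result.stderr.strip()}",
            file=sys.stderr,
        )
    return result
```

The caller can then check `returncode` on the returned object and exit cleanly with a readable message if every attempt failed.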

redis synchronization timeout in smallfile

I love the new redis sync feature in smallfile, very important. But when I run a big test (more below) with smallfile, I get the timeout below from all the smallfile pods. But if I cut the number of files to 1/10 of that, then the test succeeds. I think the redis_timeout environment variable (default 60 sec) may need to be adjusted in cases like this. This can be done by editing resources/operator.yaml env: section, but it requires restarting the operator. Is there a better way?

2020-03-24T00:14:16Z - INFO     - MainProcess - trigger_smallfile: Complete message from channel: {'type': 'subscribe', 'pattern': None, 'channel': b'smallfile-22b7124d-c3c5-5849-adb1-76056dabaa0f', 'data': 1}
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/redis/connection.py", line 190, in _read_from_socket
    data = recv(self._sock, socket_read_size)
  File "/usr/local/lib/python3.6/site-packages/redis/_compat.py", line 71, in recv
    return sock.recv(*args, **kwargs)
socket.timeout: timed out

This is followed by:

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/snafu/run_snafu.py", line 137, in <module>
    sys.exit(main())
  File "/opt/snafu/run_snafu.py", line 86, in main
    parser))
  File "/opt/snafu/utils/py_es_bulk.py", line 156, in streaming_bulk
    for ok, resp_payload in streaming_bulk_generator:
  File "/usr/local/lib/python3.6/site-packages/elasticsearch/helpers/actions.py", line 212, in streaming_bulk
    actions, chunk_size, max_chunk_bytes, client.transport.serializer
  File "/usr/local/lib/python3.6/site-packages/elasticsearch/helpers/actions.py", line 63, in _chunk_actions
    for action, data in actions:
  File "/opt/snafu/utils/py_es_bulk.py", line 117, in actions_tracking_closure
    for cl_action in cl_actions:
  File "/opt/snafu/run_snafu.py", line 119, in process_generator
    for action, index in data_object.emit_actions():
  File "/opt/snafu/smallfile_wrapper/trigger_smallfile.py", line 165, in emit_actions
    for msg in p.listen():
  File "/usr/local/lib/python3.6/site-packages/redis/client.py", line 3553, in listen
    response = self.handle_message(self.parse_response(block=True))
  File "/usr/local/lib/python3.6/site-packages/redis/client.py", line 3453, in parse_response
    response = self._execute(conn, conn.read_response)
  File "/usr/local/lib/python3.6/site-packages/redis/client.py", line 3427, in _execute
    return command(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/redis/connection.py", line 734, in read_response
    response = self._parser.read_response()
  File "/usr/local/lib/python3.6/site-packages/redis/connection.py", line 316, in read_response
    response = self._buffer.readline()
  File "/usr/local/lib/python3.6/site-packages/redis/connection.py", line 248, in readline
    self._read_from_socket()
  File "/usr/local/lib/python3.6/site-packages/redis/connection.py", line 204, in _read_from_socket
    raise TimeoutError("Timeout reading from socket")
redis.exceptions.TimeoutError: Timeout reading from socket

The CR is:

[root@e24-h17-740xd ripsaw]# more ../smf.yaml 
apiVersion: ripsaw.cloudbulldozer.io/v1alpha1
kind: Benchmark
metadata:
  name: smf-benchmark-big
  namespace: my-ripsaw
spec:
  test_user: BenE
  clustername: bm-alias-cloud02-2020-03-23
  elasticsearch:
    server: snafu:[email protected]
    port: 9200
  es_index: ripsaw-smallfile
  workload:
    name: smallfile
    args:
      clients: 35
      samples: 3
      operation: ["create", "read"]
      threads: 3
      file_size: 1024
      files: 20000
      storageclass: example-storagecluster-cephfs
      storagesize: 500Gi

Update PGbench to use run_snafu

In order to reduce duplicate indexing functionality PGbench-wrapper should be updated to use run_snafu.

[uperf][enhancement] Migrate to bulk indexer

Current indexing code is extremely trivial and can drop results. We need to migrate to the bulk indexer with retries ASAP.
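To illustrate what "bulk with retries" means here, the sketch below sends documents in chunks and retries a failed chunk with exponential backoff. It is a generic stand-in, not snafu's bulk indexer: `index_in_bulk` and the `send_batch` callable are hypothetical, and in practice `send_batch` would wrap the real Elasticsearch bulk call.

```python
import time


def index_in_bulk(docs, send_batch, chunk_size=500, max_retries=3, backoff=1.0):
    """Send docs in chunks, retrying each failed chunk with exponential backoff.

    Returns (indexed, failed) document counts so nothing is dropped silently.
    """

    def send_with_retries(batch):
        for attempt in range(max_retries + 1):
            try:
                send_batch(batch)
                return True
            except Exception:
                if attempt < max_retries:
                    time.sleep(backoff * 2 ** attempt)
        return False

    indexed = failed = 0
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) >= chunk_size:
            if send_with_retries(batch):
                indexed += len(batch)
            else:
                failed += len(batch)
            batch = []
    if batch:  # flush the final partial chunk
        if send_with_retries(batch):
            indexed += len(batch)
        else:
            failed += len(batch)
    return indexed, failed
```

Returning the failed count makes lost results visible to the caller, which is the main gap the issue describes.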

add a "concurrent databases" or "number of databases" counter to pgbench summary document

Can we add a "concurrent databases" or "number of databases" counter to the pgbench summary document? I am trying to iterate over a set of tests using the add1 feature for the number of DBs. The problem is associating the number of DBs with the results in Grafana. The way we currently index the results, we only indicate which DB the summary results belong to and include nothing about the number of DBs under test for that iteration.
The table at the link below should show 3 sets of tests with 1, 2, and 3 databases. Unfortunately, there isn't any field I can use to identify and group them to get the desired result, so it just shows a single set with a unique count of DBs at 3.

http://marquez.perf.lab.eng.rdu2.redhat.com:3000/d/5ni0mp0Zz/aws-pgbench-summary-with-rook-ceph?orgId=1&from=1572545911231&to=1572551015024&var-user=acalhoun&var-clustername=acalhoun-10-31-2019-3db-test&var-UUID=e9db825c-14ad-534d-b237-15a99ea27e94&var-interval=10s&var-datasource=Prometheus%20-%20Public%20cloud&var-net_device=All&var-block_device=All&var-instances=All

@aakarshg @dustinblack

[trigger_cluster_loader.py] Handle case when clusterloader fails due to API issues etc

Currently, if clusterloader fails in the middle of the run due to an API issue or similar, we see an error like:

  File "/tmp/snafu/run_snafu.py", line 154, in <module>
    sys.exit(main())
  File "/tmp/snafu/run_snafu.py", line 115, in main
    for i in process_generator(index_args, parser):
  File "/tmp/snafu/run_snafu.py", line 133, in process_generator
    for action, index in data_object.emit_actions():
  File "/tmp/snafu/cluster_loader/trigger_cluster_loader.py", line 56, in emit_actions
    cl_output_json = list(filter(pattern.match, output_file_content))[0].strip()
IndexError: list index out of range

We need a check that verifies that the list produced by filtering the output with the regex is non-empty, and exits cleanly if it is empty.
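The guard could be as small as the sketch below. `first_match`, the regex, and the sample output line are all made up for illustration; the real pattern lives in trigger_cluster_loader.py and is not shown in this issue.

```python
import re


def first_match(lines, pattern):
    """Return the first stripped line matching `pattern`, or None if none match."""
    matches = [line for line in lines if pattern.match(line)]
    if not matches:
        return None
    return matches[0].strip()


# Hypothetical pattern and made-up sample output for demonstration only.
pattern = re.compile(r"^\{.*\}$")
output_file_content = ["connection to API server lost\n"]

cl_output_json = first_match(output_file_content, pattern)
if cl_output_json is None:
    print("clusterloader produced no parsable result line; nothing to index")
```

In the wrapper, the `None` branch is where a logged error and a clean exit would replace the current `IndexError`.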

Error due to the output file not being found for trigger_fio

I have seen multiple occurrences of the following error in fio client pods:

ERROR - MainProcess - trigger_fio: Fio failed to execute
ERROR - MainProcess - trigger_fio: Output file: <fio-server-1-benchmark-f003-drkjc> fio: output file open error: No such file or directory

The errors seem to be random. Has anyone else seen this?
