cloud-bulldozer / benchmark-wrapper
Python Library to run benchmarks
Home Page: https://benchmark-wrapper.readthedocs.io
License: Apache License 2.0
When running the stressng benchmark on an OpenShift 4.5 cluster, the benchmark never completes successfully because its working directory is read-only and stressng cannot write the YAML result file. The stressng process completes, but snafu exits unsuccessfully when it cannot load the YAML result file from stressng:
stress-ng: error: [15] Cannot open log file stressng.log
stress-ng: debug: [15] 4 processors online, 4 processors configured
stress-ng: info: [15] Working directory / is not read/writeable, some I/O tests may fail
(..)
stress-ng: info: [15] successful run completed in 30.00s
stress-ng: error: [15] Cannot output YAML data to stressng.yml
(..)
2020-10-28T16:36:26Z - INFO - MainProcess - trigger_stressng: Starting output parsing
Traceback (most recent call last):
File "/usr/local/bin/run_snafu", line 33, in <module>
sys.exit(load_entry_point('snafu', 'console_scripts', 'run_snafu')())
File "/opt/snafu/snafu/run_snafu.py", line 122, in main
for i in process_generator(index_args, parser):
File "/opt/snafu/snafu/run_snafu.py", line 141, in process_generator
for action, index in data_object.emit_actions():
File "/opt/snafu/snafu/stressng_wrapper/trigger_stressng.py", line 90, in emit_actions
data = self._parse_outfile()
File "/opt/snafu/snafu/stressng_wrapper/trigger_stressng.py", line 27, in _parse_outfile
stream = open('stressng.yml', 'r')
FileNotFoundError: [Errno 2] No such file or directory: 'stressng.yml'
The benchmark is left in a Running state while the pod started from the job exits in an Error state until the job reaches its backoff limit:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulCreate 12m job-controller Created pod: stressng-workload-4c8ac04d-w8jlr
Normal SuccessfulDelete 3m15s job-controller Deleted pod: stressng-workload-4c8ac04d-w8jlr
Warning BackoffLimitExceeded 3m15s job-controller Job has reached the specified backoff limit
To begin work on the wrapper scripts for smallfile, I tried to go through the code for the currently available benchmarks. It was quite rough to get through, since there were no comments in the code. I highly recommend following the good practice of adding comments, both to the code currently in snafu and to upcoming code.
@aakarshg Can you please add the "ok to test" and related labels for CI?
We have documentation around the supported workload wrappers and how to add a new workload wrapper. It might be useful to also add a brief description of each benchmark and its wrapper, the metrics it collects/indexes, and instructions on how to run the wrapper locally using podman.
Currently we don't do any checks to see whether a field is non-empty before shipping it off to ES. This can lead to a lot of empty fields being sent out, depending on the use case and the benchmark. For instance, here is a document from a Uperf CI test run by ripsaw, where most of the fields are populated through environment variables:
{
"_index" : "ripsaw-uperf-results-000002",
"_type" : "_doc",
"_id" : "297e5bb466870ad7f90916e68f60c440a43be6dbe0707ab6f3c4d7e30007e807",
"_score" : 7.654378,
"_source" : {
"workload" : "uperf",
"uuid" : "79690e49-479e-593a-8a51-0a1ef032de88",
"user" : "ripsaw",
"cluster_name" : "myk8scluster",
"hostnetwork" : "True",
"iteration" : 2,
"remote_ip" : "10.0.133.30",
"client_ips" : "10.0.173.62 10.130.0.1 ",
"uperf_ts" : "2021-06-01T22:32:39.594000",
"service_ip" : "False",
"bytes" : 519864320,
"norm_byte" : 258605056,
"ops" : 1015360,
"norm_ops" : 505088,
"norm_ltcy" : 2.3778479412250935,
"kind" : "pod",
"client_node" : "ip-10-0-173-62.us-west-2.compute.internal",
"server_node" : "unknown",
"num_pairs" : "1",
"multus_client" : "",
"networkpolicy" : "",
"density" : "1",
"nodes_in_iter" : "1",
"step_size" : "",
"colocate" : "False",
"density_range" : [ ],
"node_range" : [ ],
"pod_id" : "0",
"test_type" : "stream",
"protocol" : "udp",
"message_size" : 512,
"read_message_size" : 512,
"num_threads" : 2,
"duration" : 3,
"run_id" : "NA"
}
}
And here is a document exported from just running the command run_snafu --tool uperf --user ryan --uuid 1234 --proto tcp --remoteip localhost -w iperf.xml --resourcetype container -s 1 --verbose:
{
"_index": "snafu-uperf-results",
"_op_type": "create",
"_source": {
"test_type": "",
"protocol": "tcp",
"message_size": null,
"read_message_size": null,
"num_threads": 1,
"duration": 31,
"kind": "container",
"hostnetwork": "False",
"remote_ip": "localhost",
"client_ips": "",
"service_ip": "False",
"client_node": "",
"server_node": "",
"num_pairs": "",
"multus_client": "",
"networkpolicy": "",
"density": "",
"nodes_in_iter": "",
"step_size": "",
"colocate": "",
"density_range": "",
"node_range": "",
"pod_id": null,
"uperf_ts": "2021-06-28T14:42:37.066000",
"bytes": 48083435520,
"norm_byte": 1505771520,
"ops": 5869560,
"norm_ops": 183810,
"norm_ltcy": 6.534098878427453,
"iteration": 1,
"user": "ryan",
"uuid": "1234",
"workload": "uperf",
"run_id": "NA"
},
"_id": "ae6d9dfc7083e94c569d1999c2eb2ae1dce4a77fc5c7052c128103783f9acc70",
"run_id": "NA"
}
I think it would be cool to add a CLI option called --no-empty-fields or something similar, which would remove any field that is null or an empty string from exported documents. This way teams only get the fields and data they care about, rather than also getting the extra fields we use as a team.
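A minimal sketch of what such a filter could look like (the --no-empty-fields name comes from this proposal; the helper name and the commented-out wiring are illustrative):

def drop_empty_fields(document):
    # Return a copy of the document with null and empty-string fields
    # removed; nested dicts are filtered recursively (sketch only).
    cleaned = {}
    for key, value in document.items():
        if value is None or value == "":
            continue
        if isinstance(value, dict):
            value = drop_empty_fields(value)
        cleaned[key] = value
    return cleaned

# Hypothetical wiring: only filter when the proposed flag is set.
# if args.no_empty_fields:
#     es_valid_document["_source"] = drop_empty_fields(es_valid_document["_source"])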
@acalhounRH in the development branch at this line of code I see this:
self.args.cluster_name = "mycluster"
if "clustername" in os.environ:
    self.args.cluster_name = os.environ["clustername"]
The args structure is generated by the argparse module. For all other fields you extract data from this structure but don't modify it; why are you modifying it here? I think you meant to do:
self.cluster_name = 'mycluster'
if "clustername" in os.environ:
    self.cluster_name = os.environ["clustername"]
Benchmark-wrapper needs to test the upgrade wrapper in CI, as it does for the other wrappers, to make sure code changes do not break its functionality. The version to upgrade to can be set to the same version the CI cluster is already running, to avoid actually upgrading the OpenShift cluster on which CI runs.
Since checking ES was added into ripsaw, checking it again here is redundant. The snafu CI process can be simplified to just checking the return code of the test.
I have been noticing this oddity with snafu CI: even if a PR changes more than one benchmark, only the first benchmark is run.
Example PR: #151 only ran fio, despite changes affecting the Dockerfiles of other benchmarks.
The desired version field of the clusterversion takes 2-3 seconds to change after an upgrade has been triggered. This leads to issues with capturing the current/desired versions; introducing a delay (sleep 10) before grabbing the desired version for comparison should fix it.
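A minimal sketch of that workaround, assuming the wrapper shells out to oc to read the clusterversion (the exact command, resource name, and jsonpath are assumptions):

import subprocess
import time

def get_desired_version():
    # Sketch only: read the desired version from the clusterversion object.
    cmd = [
        "oc", "get", "clusterversion", "version",
        "-o", "jsonpath={.status.desired.version}",
    ]
    return subprocess.check_output(cmd).decode().strip()

# Give the clusterversion object a few seconds to reflect the newly
# triggered upgrade before capturing the desired version for comparison.
time.sleep(10)
desired_version = get_desired_version()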
SNAFU requires specific packages to be installed in order to run. We should document which ones we expect.
When I use fs-drift with cephfs and access_mode: ReadWriteMany, snafu fails to write results for at least 1 pod because of:
"reason": "Limit of total fields [1000] in index [ripsaw-fs-drift-results-000001] has been exceeded",
I think this is caused by all the pods sharing the same directory tree (intentionally), so they see each other's output JSON without filtering it. This can be fixed by selecting only response-time and JSON files that have a matching hostname (i.e. pod name).
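A minimal sketch of that filtering, assuming each result file name contains the hostname of the pod that wrote it (the naming pattern is an assumption, not the actual fs-drift output layout):

import os
import socket

def my_result_files(result_dir):
    # Yield only the JSON result files written by this pod, so pods sharing
    # one ReadWriteMany volume ignore each other's output.
    hostname = socket.gethostname()
    for name in os.listdir(result_dir):
        if name.endswith(".json") and hostname in name:
            yield os.path.join(result_dir, name)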
See this pastebin; the key indicator is:
client: host=172.17.0.6 disconnected
client: host=172.17.0.5 disconnected
This is with fio 3.19. Looking into whether this happens with earlier fio versions.
This generates the error:
Traceback (most recent call last):
File "/usr/local/bin/run_snafu", line 11, in <module>
load_entry_point('snafu', 'console_scripts', 'run_snafu')()
File "/opt/snafu/src/run_snafu.py", line 99, in main
parser))
File "/opt/snafu/src/utils/py_es_bulk.py", line 156, in streaming_bulk
for ok, resp_payload in streaming_bulk_generator:
File "/usr/local/lib/python3.6/site-packages/elasticsearch/helpers/actions.py", line 212, in streaming_bulk
actions, chunk_size, max_chunk_bytes, client.transport.serializer
File "/usr/local/lib/python3.6/site-packages/elasticsearch/helpers/actions.py", line 63, in _chunk_actions
for action, data in actions:
File "/opt/snafu/src/utils/py_es_bulk.py", line 117, in actions_tracking_closure
for cl_action in cl_actions:
File "/opt/snafu/src/run_snafu.py", line 139, in process_generator
es_valid_document["_id"] = hashlib.md5(str(action).encode()).hexdigest()
ValueError: [digital envelope routines: EVP_DigestInit_ex] disabled for FIPS
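MD5 is not a FIPS-approved digest, which is why hashlib.md5() raises here on a FIPS-enabled host. A minimal sketch of two possible fixes (switching to SHA-256 changes the generated document IDs; the usedforsecurity keyword only exists on Python 3.9+):

import hashlib

# Option 1: use a FIPS-approved digest for the document ID.
doc_id = hashlib.sha256(str(action).encode()).hexdigest()

# Option 2 (Python 3.9+): keep MD5 but declare it a non-security use,
# which FIPS-enabled OpenSSL builds generally permit.
doc_id = hashlib.md5(str(action).encode(), usedforsecurity=False).hexdigest()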
Here is an example :
<?xml version="1.0"?>
<profile name="stream-udp-16384-8">
<group nthreads="8">
<transaction iterations="1">
<flowop type="connect" options="remotehost=$h protocol=udp"/>
</transaction>
<transaction duration="60">
<flowop type="write" options="count=16 size=16384"/>
</transaction>
<transaction iterations="1">
<flowop type=disconnect />
</transaction>
</group>
</profile>
Traceback (most recent call last):
File "/opt/snafu/uperf-wrapper/uperf-wrapper.py", line 227, in <module>
sys.exit(main())
File "/opt/snafu/uperf-wrapper/uperf-wrapper.py", line 221, in main
_index_result("ripsaw-uperf-results",server,port,documents)
File "/opt/snafu/uperf-wrapper/uperf-wrapper.py", line 27, in _index_result
es.index(index=index, body=result)
File "/usr/lib/python2.7/site-packages/elasticsearch/client/utils.py", line 84, in _wrapped
return func(*args, params=params, **kwargs)
File "/usr/lib/python2.7/site-packages/elasticsearch/client/__init__.py", line 364, in index
"POST", _make_path(index, doc_type, id), params=params, body=body
File "/usr/lib/python2.7/site-packages/elasticsearch/transport.py", line 353, in perform_request
timeout=timeout,
File "/usr/lib/python2.7/site-packages/elasticsearch/connection/http_urllib3.py", line 244, in perform_request
raise ConnectionError("N/A", str(e), e)
elasticsearch.exceptions.ConnectionError: ConnectionError(<urllib3.connection.HTTPConnection object at 0x7fc7b1822e90>: Failed to establish a new connection: [Errno -2] Name or service not known) caused by: NewConnectionError(<urllib3.connection.HTTPConnection object at 0x7fc7b1822e90>: Failed to establish a new connection: [Errno -2] Name or service not known)
We should catch this error, report a connection failure to the user, but still present the results.
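A minimal sketch of that behavior, assuming the wrapper still builds the result documents before indexing (the logging and fallback output are illustrative):

import elasticsearch

def index_result(es, index, result):
    # Index one result document, but don't die if ES is unreachable:
    # warn the user and fall through so results can still be shown locally.
    try:
        es.index(index=index, body=result)
        return True
    except elasticsearch.exceptions.ConnectionError as err:
        print("Could not connect to Elasticsearch: %s" % err)
        print("Results were not indexed; printing them instead:")
        print(result)
        return False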
Looks like one of the RPMs cyclictest uses has been purged
dnf -y install https://www.rpmfind.net/linux/centos/8-stream/AppStream/x86_64/os/Packages/rt-tests-1.8-11.el8.x86_64.rpm
08/01/2021, 17:38:47
08/01/2021, 17:38:47 ---> Running in e7d6a1619497
08/01/2021, 17:38:49 Extra Packages for Enterprise Linux Modular 8 - 1.8 MB/s | 527 kB 00:00
08/01/2021, 17:38:49 Extra Packages for Enterprise Linux 8 - x86_64 29 MB/s | 8.7 MB 00:00
08/01/2021, 17:38:53 [MIRROR] rt-tests-1.8-11.el8.x86_64.rpm: Status code: 404 for https://www.rpmfind.net/linux/centos/8-stream/AppStream/x86_64/os/Packages/rt-tests-1.8-11.el8.x86_64.rpm (IP: 195.220.108.108)
08/01/2021, 17:38:53 [MIRROR] rt-tests-1.8-11.el8.x86_64.rpm: Status code: 404 for https://www.rpmfind.net/linux/centos/8-stream/AppStream/x86_64/os/Packages/rt-tests-1.8-11.el8.x86_64.rpm (IP: 195.220.108.108)
08/01/2021, 17:38:53 [MIRROR] rt-tests-1.8-11.el8.x86_64.rpm: Status code: 404 for https://www.rpmfind.net/linux/centos/8-stream/AppStream/x86_64/os/Packages/rt-tests-1.8-11.el8.x86_64.rpm (IP: 195.220.108.108)
08/01/2021, 17:38:53 [MIRROR] rt-tests-1.8-11.el8.x86_64.rpm: Status code: 404 for https://www.rpmfind.net/linux/centos/8-stream/AppStream/x86_64/os/Packages/rt-tests-1.8-11.el8.x86_64.rpm (IP: 195.220.108.108) [FAILED] rt-tests-1.8-11.el8.x86_64.rpm: Status code: 404 for https://www.rpmfind.net/linux/centos/8-stream/AppStream/x86_64/os/Packages/rt-tests-1.8-11.el8.x86_64.rpm (IP: 195.220.108.108)
08/01/2021, 17:38:53 Status code: 404 for https://www.rpmfind.net/linux/centos/8-stream/AppStream/x86_64/os/Packages/rt-tests-1.8-11.el8.x86_64.rpm (IP: 195.220.108.108)
Hi team, I am trying to build the uperf image from the benchmark-wrapper repository for ppc64le, but it keeps failing at the last step - RUN pip3 install -e /opt/snafu/ -
with errors pointing to missing packages/dependencies, as below:
ModuleNotFoundError: No module named 'numpy'
ModuleNotFoundError: No module named 'cython'
RuntimeError: Broken toolchain: cannot link a simple C program
gcc: error trying to exec 'cc1plus': execvp: No such file or directory
Adding these 'missing' packages is not straightforward, as a new one shows up with every image build failure. The expectation was that all dependencies mentioned in setup.cfg would be sufficient, but that isn't the case on ppc64le. The main issue seems to be the scipy and numpy packages.
As per a workaround suggested on the perf-scale Slack, I tried removing scipy and numpy from setup.cfg, but it didn't help; the build failed because numpy is again required by prometheus_api_client:
Collecting numpy (from prometheus_api_client->snafu==0.0.1)
Downloading https://files.pythonhosted.org/packages/c5/63/a48648ebc57711348420670bb074998f79828291f68aebfff1642be212ec/numpy-1.19.4.zip (7.3MB)
100% |████████████████████████████████| 7.3MB 175kB/s
Collecting pandas>=1.0.0 (from prometheus_api_client->snafu==0.0.1)
Downloading https://files.pythonhosted.org/packages/09/39/fb93ed98962d032963418cd1ea5927b9e11c4c80cb1e0b45dea769d8f9a5/pandas-1.1.4.tar.gz (5.2MB)
100% |████████████████████████████████| 5.2MB 243kB/s
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 346, in get_provider
module = sys.modules[moduleOrReq]
KeyError: 'numpy'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/tmp/pip-build-6u08iba0/pandas/setup.py", line 792, in <module>
setup_package()
File "/tmp/pip-build-6u08iba0/pandas/setup.py", line 762, in setup_package
ext_modules=maybe_cythonize(extensions, compiler_directives=directives),
File "/tmp/pip-build-6u08iba0/pandas/setup.py", line 521, in maybe_cythonize
numpy_incl = pkg_resources.resource_filename("numpy", "core/include")
File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 1132, in resource_filename
return get_provider(package_or_requirement).get_resource_filename(
File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 348, in get_provider
__import__(moduleOrReq)
ModuleNotFoundError: No module named 'numpy'
----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-6u08iba0/pandas/
Error: error building at STEP "RUN pip3 install -e /opt/snafu/": error while running runtime: exit status 1
Please suggest a fix or a workaround.
With the increasing number of tests, it is getting difficult to locate/identify a given test. By adding a human-readable description to the ES documents, users would be able to describe each test, which would assist with future analysis and lookup.
We depend on the std-dev value in ripsaw-fio-analyzed-result in ocs-ci to evaluate whether the data sample is valid before proceeding with regression validation.
Currently the snafu CI (or rather the ripsaw CI) rebuilds the ripsaw image for every benchmark. We can just build the benchmark-operator image once and use it for snafu testing.
When I run the smallfile wrapper, run_snafu.py logs the messages below, and the cluster_name is wrong. I am not sure where perf-lta-es comes from as a cluster name; I did not specify it at all, either in the wrapper or in environment variables.
2019-11-04T15:33:33Z - INFO - MainProcess -
run_snafu: Connected to the elasticsearch cluster with info as follows:
{
u'cluster_name': u'perf-lta-es',
u'cluster_uuid': u'fLlQtf5cTiqtRoc02RaQdQ',
u'version': {u'build_date': u'2019-09-06T14:40:30.409026Z',
u'minimum_wire_compatibility_version': u'6.8.0',
u'build_hash': u'1c1faf1',
u'number': u'7.3.2',
u'lucene_version': u'8.1.0',
u'minimum_index_compatibility_version': u'6.0.0-beta1',
u'build_flavor': u'default',
u'build_snapshot': False,
u'build_type': u'rpm'},
u'name': u'perf-es-1',
u'tagline': u'You Know, for Search'
}
2019-11-04T15:33:33Z - INFO - MainProcess -
wrapper_factory: identified smallfile as the benchmark wrapper
Can we not print stuff that's overridden in the wrapper? Or make more common variables inherited by all wrappers (i.e. not implemented in per-wrapper parsing)? I could help with this.
All wrapper (script) names and directory names need to be changed to be compliant with Python module naming standards; if this is not resolved, Python will not import those modules.
Change the dash "-" to an underscore "_" in all directory names and script names.
https://docs.python.org/3/reference/lexical_analysis.html#identifiers
Need more checking of YCSB output -- for example, from the run below.
[OVERALL], RunTime(ms), 2471
[OVERALL], Throughput(ops/sec), 40.46944556859571
[TOTAL_GCS_Copy], Count, 1
[TOTAL_GC_TIME_Copy], Time(ms), 17
[TOTAL_GC_TIME_%Copy], Time(%), 0.6879805746661272
[TOTAL_GCS_MarkSweepCompact], Count, 0
[TOTAL_GC_TIME_MarkSweepCompact], Time(ms), 0
[TOTAL_GC_TIME%MarkSweepCompact], Time(%), 0.0
[TOTAL_GCs], Count, 1
[TOTAL_GC_TIME], Time(ms), 17
[TOTAL_GC_TIME%], Time(%), 0.6879805746661272
[CLEANUP], Operations, 1
[CLEANUP], AverageLatency(us), 3.0
[CLEANUP], MinLatency(us), 3
[CLEANUP], MaxLatency(us), 3
[CLEANUP], 95thPercentileLatency(us), 3
[CLEANUP], 99thPercentileLatency(us), 3
[INSERT], Operations, 100
[INSERT], AverageLatency(us), 9752.74
[INSERT], MinLatency(us), 1739
[INSERT], MaxLatency(us), 41407
[INSERT], 95thPercentileLatency(us), 22831
[INSERT], 99thPercentileLatency(us), 39167
[INSERT], Return=OK, 100
java -cp /ycsb/couchbase2-binding/conf:/ycsb/conf:/ycsb/lib/HdrHistogram-2.1.4.jar:/ycsb/lib/core-0.15.0.jar:/ycsb/lib/htrace-core4-4.1.0-incubating.jar:/ycsb/lib/jackson-core-asl-1.9.4.jar:/ycsb/lib/jackson-mapper-asl-1.9.4.jar:/ycsb/couchbase2-binding/lib/core-io-1.3.1.jar:/ycsb/couchbase2-binding/lib/couchbase2-binding-0.15.0.jar:/ycsb/couchbase2-binding/lib/java-client-2.3.1.jar:/ycsb/couchbase2-binding/lib/rxjava-1.1.5.jar com.yahoo.ycsb.Client -db com.yahoo.ycsb.db.couchbase2.Couchbase2Client -s -P /tmp/ycsb/workloada -p couchbase.host=cb-benchmark-0000.cb-benchmark.builder-infra.svc -p couchbase.password=password -load
Command line: -db com.yahoo.ycsb.db.couchbase2.Couchbase2Client -s -P /tmp/ycsb/workloada -p couchbase.host=cb-benchmark-0000.cb-benchmark.builder-infra.svc -p couchbase.password=password -load
YCSB Client 0.15.0
Loading workload...
Starting test.
2019-08-03 13:52:22:084 0 sec: 0 operations; est completion in 0 second
Aug 03, 2019 1:52:22 PM com.couchbase.client.core.env.DefaultCoreEnvironment
INFO: ioPoolSize is less than 3 (1), setting to: 3
Aug 03, 2019 1:52:22 PM com.couchbase.client.core.env.DefaultCoreEnvironment
INFO: computationPoolSize is less than 3 (1), setting to: 3
Aug 03, 2019 1:52:22 PM com.yahoo.ycsb.db.couchbase2.Couchbase2Client logParams
INFO: ===> Using Params: host=cb-benchmark-0000.cb-benchmark.builder-infra.svc, bucket=default, upsert=false, persistTo=NONE, replicateTo=NONE, syncMutResponse=true, adhoc=false, kv=true, maxParallelism=1, queryEndpoints=1, kvEndpoints=1, queryEndpoints=1, epoll=false, boost=3, networkMetricsInterval=0, runtimeMetricsInterval=0
Aug 03, 2019 1:52:22 PM com.couchbase.client.core.CouchbaseCore
INFO: CouchbaseEnvironment: {sslEnabled=false, sslKeystoreFile='null', sslKeystorePassword=false, sslKeystore=null, bootstrapHttpEnabled=true, bootstrapCarrierEnabled=true, bootstrapHttpDirectPort=8091, bootstrapHttpSslPort=18091, bootstrapCarrierDirectPort=11210, bootstrapCarrierSslPort=11207, ioPoolSize=3, computationPoolSize=3, responseBufferSize=16384, requestBufferSize=16384, kvServiceEndpoints=1, viewServiceEndpoints=1, queryServiceEndpoints=1, searchServiceEndpoints=1, ioPool=NioEventLoopGroup, coreScheduler=CoreScheduler, eventBus=DefaultEventBus, packageNameAndVersion=couchbase-java-client/2.3.1 (git: 2.3.1, core: 1.3.1), dcpEnabled=false, retryStrategy=BestEffort, maxRequestLifetime=75000, retryDelay=ExponentialDelay{growBy 1.0 MICROSECONDS, powers of 2; lower=100, upper=100000}, reconnectDelay=ExponentialDelay{growBy 1.0 MILLISECONDS, powers of 2; lower=32, upper=4096}, observeIntervalDelay=ExponentialDelay{growBy 1.0 MICROSECONDS, powers of 2; lower=10, upper=100000}, keepAliveInterval=30000, autoreleaseAfter=2000, bufferPoolingEnabled=true, tcpNodelayEnabled=true, mutationTokensEnabled=false, socketConnectTimeout=10000, dcpConnectionBufferSize=20971520, dcpConnectionBufferAckThreshold=0.2, dcpConnectionName=dcp/core-io, callbacksOnIoPool=true, disconnectTimeout=25000, requestBufferWaitStrategy=com.couchbase.client.core.env.DefaultCoreEnvironment$2@419a6765, queryTimeout=75000, viewTimeout=75000, kvTimeout=10000, connectTimeout=30000, dnsSrvEnabled=false}
Aug 03, 2019 1:52:22 PM com.couchbase.client.core.node.CouchbaseNode signalConnected
INFO: Connected to Node cb-benchmark-0000.cb-benchmark.builder-infra.svc
Aug 03, 2019 1:52:23 PM com.couchbase.client.core.node.CouchbaseNode signalConnected
INFO: Connected to Node cb-benchmark-0001.cb-benchmark.builder-infra.svc
Aug 03, 2019 1:52:23 PM com.couchbase.client.core.config.DefaultConfigurationProvider$8 call
INFO: Opened bucket default
DBWrapper: report latency for each error is false and specific error codes to track for latency are: []
Aug 03, 2019 1:52:23 PM com.couchbase.client.core.node.CouchbaseNode signalConnected
INFO: Connected to Node cb-benchmark-0002.cb-benchmark.builder-infra.svc
2019-08-03 13:52:24:475 2 sec: 100 operations; 40.58 current ops/sec; [CLEANUP: Count=1, Max=3, Min=3, Avg=3, 90=3, 99=3, 99.9=3, 99.99=3] [INSERT: Count=100, Max=41407, Min=1739, Avg=9752.74, 90=18271, 99=39167, 99.9=41407, 99.99=41407]
Traceback (most recent call last):
File "/opt/snafu/ycsb-wrapper/ycsb-wrapper.py", line 191, in
sys.exit(main())
File "/opt/snafu/ycsb-wrapper/ycsb-wrapper.py", line 180, in main
documents,summary = _json_payload(data,args.run[0],uuid,user,phase,workload,args.driver[0],recordcount,operationcount)
File "/opt/snafu/ycsb-wrapper/ycsb-wrapper.py", line 83, in _json_payload
summary_dict[summ[0].strip('[').strip(']')][summ[1]] = float(summ[2])
ValueError: could not convert string to float: runtimeMetricsInterval=0
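The value in the error ("runtimeMetricsInterval=0") comes from a Couchbase client INFO line rather than a YCSB summary line, so the wrapper is parsing log chatter as results. A minimal sketch of a guard, assuming the summary lines we care about always start with a bracketed section name (output_lines is a stand-in for however the wrapper reads the YCSB output):

summary_dict = {}
for line in output_lines:
    # Only summary lines such as "[INSERT], AverageLatency(us), 9752.74"
    # start with a bracketed section name; skip client log chatter.
    if not line.startswith("["):
        continue
    fields = [f.strip() for f in line.split(",")]
    if len(fields) != 3:
        continue
    section = fields[0].strip("[]")
    try:
        value = float(fields[2])
    except ValueError:
        # Non-numeric third field (e.g. "Return=OK"); skip or keep as a string.
        continue
    summary_dict.setdefault(section, {})[fields[1]] = value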
Snafu should be capable of indexing the progress report that is generated by pgbench. A detailed description of the progress report is below:
"...The report includes the time since the beginning of the run, the tps since the last report, and the transaction latency average and standard deviation since the last report. Under throttling (-R), the latency is computed with respect to the transaction scheduled start time, not the actual transaction beginning time, thus it also includes the average schedule lag time."
The project needs a linting job.
It would be useful to write the data to be indexed to a file in the case where ES fails or is not provided. This would allow us to retrieve the data separately and not have to do a full re-run if an issue is hit.
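A minimal sketch of that fallback, assuming the documents are available as an iterable of dicts at indexing time (the file name and trigger condition are illustrative):

import json

def dump_documents_locally(documents, path="unindexed_results.json"):
    # Write the would-be-indexed documents to a local file, one JSON
    # document per line, so a run can still be indexed later without
    # a full re-run when ES is unreachable or not provided.
    with open(path, "w") as out:
        for doc in documents:
            out.write(json.dumps(doc) + "\n")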
The Uperf docs are missing the number of pairs being used for a particular test.
Today the YCSB summary docs don't include the YCSB arguments; adding them would be helpful when comparing runs, to ensure an apples-to-apples comparison is happening.
Currently indexing is performed by opening a connection to Elasticsearch for each test and document type. The recommended method would be to open a single connection and, as documents are created/parsed, emit them to the indexer; this can be achieved with py_es_bulk and Python generators.
pseudo code follows:
def main():
    # handle args
    args = parse_args()
    # initialize es
    es = init_es(args)
    status = py_es_bulk.streaming_bulk(es, process_generator(args))

def process_generator(args):
    object_generator = process_data(args)
    for obj in object_generator:
        for action in obj.emit_actions():
            yield action

def process_data(args):
    for i in range(args.sample):
        trigger_fio_generator = _trigger_fio(args)  # _trigger_fio would become an object containing the emit_actions method
        yield trigger_fio_generator
Goal here is to discuss and then determine how we want to version snafu. When we make a determination, we can update the docs and make the necessary config changes as needed.
Spawning this issue from #261, which was caused because we did not test on anything outside of OpenShift (i.e. VMs, etc.).
We currently do not have a method of CI testing CNV/VMs for benchmark-wrapper. As we continue to move forward, this is something we should look into, as it will likely become more prevalent.
Previously we would write the raw data to ES base64-encoded. However, we have gotten away from that. We should re-enable this in our workflow.
We are failing to account for ramp time when processing logs. As specified by the fio documentation, ramp_time is:
If set, fio will run the specified workload for this amount of time before logging any performance numbers. Useful for letting performance settle before logging results, thus minimizing the runtime required for stable results. Note that the ramp_time is considered lead in time for a job, thus it will increase the total runtime if a special timeout or runtime is specified.
Because of this, whenever ramp time is used, the logged start time must be adjusted by it before processing log results. If this is not fixed and ramp time is used, results will be indexed with incorrect timing.
if ramp_time:
    start_time = ramp_time + fio_timestamp
else:
    start_time = fio_timestamp
For example, see the graph: a ramp time of 300s and a run time of 600s were used. Total IOPS should be delayed by 300s, although they are recorded as starting at the beginning of the testing period.
With more benchmark scripts getting added, the top-level directory is getting cluttered. I suggest creating a benchmark_wrapper directory and moving all wrapper directories into it.
see this message from pip:
Step 11/14 : RUN pip install "elasticsearch>=6.0.0,<=7.0.2"
---> Running in 31c46c82647f
DEPRECATION: Python 2.7 will reach the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 won't be maintained after that date. A future version of pip will drop support for Python 2.7. More details about Python 2 support in pip, can be found at https://pip.pypa.io/en/latest/development/release-process/#python-2-support
It's time, right? There are no platforms that we support that don't have Python 3 at this point, along with the packages we need -- for example, the python3-elasticsearch RPM.
This causes CI to fail. For example, see this log error.
Snafu should be capable of indexing the per-statement latency report that is generated by pgbench. A detailed description of the report option is below:
"Report the average per-statement latency (execution time from the perspective of the client) of each command after the benchmark finishes. See below for details."
When looking to use run_snafu for the first time, there is no documentation on how to actually run any of the existing benchmarks. There is some information in the "building your own" section, but there is no "Running Snafu" section. Similarly, there is no documentation for each of the wrappers. Lastly, there is no help functionality in run_snafu, so even running it blindly doesn't give any helpful information or a failure message describing usage.
Snafu needs to be published as a Python package on PyPI, to make it easier to install, upgrade, and run. This will also put more emphasis on snafu being used independently of ripsaw.
If we have a message bus available, we should have a defined interface to interact with the bus.
Each workload is a hash:
hset uperf status "0|1|2"
0 - Complete
1 - Failed
2 - Running
By default, each workload has a status.
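A minimal sketch of that interface with redis-py (the key and field names follow the proposal above; the connection details are illustrative):

import redis

# Status values, as proposed above.
COMPLETE, FAILED, RUNNING = "0", "1", "2"

r = redis.Redis(host="redis", port=6379)

def set_status(workload, status):
    # Each workload is a hash; its "status" field holds the current state.
    r.hset(workload, "status", status)

def get_status(workload):
    value = r.hget(workload, "status")
    return value.decode() if value is not None else None

# Example: mark uperf as running, then complete.
set_status("uperf", RUNNING)
set_status("uperf", COMPLETE)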
I've been reminded (by using Elasticsearch and Kibana!) that smallfile ripsaw pods are not synchronizing with each other. If you request a sequence of operations, such as create, read, append, delete, with multiple samples of each, pod 1 may start reads before pod 2 has finished creates, etc., and over time this becomes worse and worse, preventing us from really measuring the throughput and response time generated by each operation type. The solution is probably to use redis to synchronize the different pods, so that pod 1 doesn't start its reads until the other pods have finished their creates, and so on. smallfile is still usable to some extent, but it would be better and more scalable if this were fixed.
We should look at adding snafu version to the indexed document.
If UPerf fails to execute, it drops a Python error into stdout.
Instead we should consider one or a combination of the below:
I love the new redis sync feature in smallfile, very important. But when I run a big test (more below) with smallfile, I get the timeout below from all the smallfile pods. If I cut the number of files to 1/10 of that, then the test succeeds. I think the redis_timeout environment variable (default 60 sec) may need to be adjusted in cases like this. This can be done by editing the env: section of resources/operator.yaml, but it requires restarting the operator. Is there a better way? (One possible approach is sketched after the CR below.)
2020-03-24T00:14:16Z - INFO - MainProcess - trigger_smallfile: Complete message from channel: {'type': 'subscribe', 'pattern': None, 'channel': b'smallfile-22b7124d-c3c5-5849-adb1-76056dabaa0f', 'data': 1}
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/redis/connection.py", line 190, in _read_from_socket
data = recv(self._sock, socket_read_size)
File "/usr/local/lib/python3.6/site-packages/redis/_compat.py", line 71, in recv
return sock.recv(*args, **kwargs)
socket.timeout: timed out
This is followed by:
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/snafu/run_snafu.py", line 137, in <module>
sys.exit(main())
File "/opt/snafu/run_snafu.py", line 86, in main
parser))
File "/opt/snafu/utils/py_es_bulk.py", line 156, in streaming_bulk
for ok, resp_payload in streaming_bulk_generator:
File "/usr/local/lib/python3.6/site-packages/elasticsearch/helpers/actions.py", line 212, in streaming_bulk
actions, chunk_size, max_chunk_bytes, client.transport.serializer
File "/usr/local/lib/python3.6/site-packages/elasticsearch/helpers/actions.py", line 63, in _chunk_actions
for action, data in actions:
File "/opt/snafu/utils/py_es_bulk.py", line 117, in actions_tracking_closure
for cl_action in cl_actions:
File "/opt/snafu/run_snafu.py", line 119, in process_generator
for action, index in data_object.emit_actions():
File "/opt/snafu/smallfile_wrapper/trigger_smallfile.py", line 165, in emit_actions
for msg in p.listen():
File "/usr/local/lib/python3.6/site-packages/redis/client.py", line 3553, in listen
response = self.handle_message(self.parse_response(block=True))
File "/usr/local/lib/python3.6/site-packages/redis/client.py", line 3453, in parse_response
response = self._execute(conn, conn.read_response)
File "/usr/local/lib/python3.6/site-packages/redis/client.py", line 3427, in _execute
return command(*args, **kwargs)
File "/usr/local/lib/python3.6/site-packages/redis/connection.py", line 734, in read_response
response = self._parser.read_response()
File "/usr/local/lib/python3.6/site-packages/redis/connection.py", line 316, in read_response
response = self._buffer.readline()
File "/usr/local/lib/python3.6/site-packages/redis/connection.py", line 248, in readline
self._read_from_socket()
File "/usr/local/lib/python3.6/site-packages/redis/connection.py", line 204, in _read_from_socket
raise TimeoutError("Timeout reading from socket")
redis.exceptions.TimeoutError: Timeout reading from socket
The CR is:
[root@e24-h17-740xd ripsaw]# more ../smf.yaml
apiVersion: ripsaw.cloudbulldozer.io/v1alpha1
kind: Benchmark
metadata:
  name: smf-benchmark-big
  namespace: my-ripsaw
spec:
  test_user: BenE
  clustername: bm-alias-cloud02-2020-03-23
  elasticsearch:
    server: snafu:[email protected]
    port: 9200
  es_index: ripsaw-smallfile
  workload:
    name: smallfile
    args:
      clients: 35
      samples: 3
      operation: ["create", "read"]
      threads: 3
      file_size: 1024
      files: 20000
      storageclass: example-storagecluster-cephfs
      storagesize: 500Gi
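Regarding the redis_timeout question above, a minimal sketch of letting the wrapper read the timeout from its environment instead of a hard-coded default (redis_timeout comes from the issue; the other names and the way the connection is created are assumptions):

import os
import redis

# Fall back to the current 60 s default if redis_timeout is not set.
redis_timeout = int(os.environ.get("redis_timeout", "60"))

conn = redis.Redis(host=os.environ.get("redis_host", "localhost"),
                   port=6379,
                   socket_timeout=redis_timeout)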
In order to reduce duplicated indexing functionality, pgbench-wrapper should be updated to use run_snafu.
Current indexing code is extremely trivial and can drop results. We need to migrate to the bulk indexer with retries ASAP.
add a "concurrent databases" or "number of databases" counter to pgbench summary document? I am trying to iterate over a set tests using the add1 feature for # of DBs. The problem is associating the X number of DBs on grafana. the way we are currently indexing the results we are just indicating which DB the summary results are associated to and includes nothing about the number of DBs currently under test for that iteration.
the table at the link below should show 3 sets of test with 1, 2 , and 3 data bases, unfortunately there isn't any field I can use to identify and group to accomplish the desired result so it just shows a single set with a unique count of DBs at 3.
Currently, if clusterloader fails due to an API issue etc. in the middle of the run, we see an error like:
File "/tmp/snafu/run_snafu.py", line 154, in <module>
sys.exit(main())
File "/tmp/snafu/run_snafu.py", line 115, in main
for i in process_generator(index_args, parser):
File "/tmp/snafu/run_snafu.py", line 133, in process_generator
for action, index in data_object.emit_actions():
File "/tmp/snafu/cluster_loader/trigger_cluster_loader.py", line 56, in emit_actions
cl_output_json = list(filter(pattern.match, output_file_content))[0].strip()
IndexError: list index out of range
We need a check that makes sure the list generated from the regex filter is non-empty, and exit if it is empty.
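A minimal sketch of that check around the failing line (pattern and output_file_content come from the existing code in the traceback; the logger call is illustrative):

import sys

matches = list(filter(pattern.match, output_file_content))
if not matches:
    logger.error("clusterloader output did not contain the expected JSON line; "
                 "the run probably failed mid-way")
    sys.exit(1)
cl_output_json = matches[0].strip()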
I have seen multiple occurrences of the following error in fio client pods:
ERROR - MainProcess - trigger_fio: Fio failed to execute
ERROR - MainProcess - trigger_fio: Output file: <fio-server-1-benchmark-f003-drkjc> fio: output file open error: No such file or directory
The errors seem to be random. Has anyone else seen this?