
vector-test-harness's Introduction

Vector Test Harness


Full end-to-end test harness for the Vector log & metrics router. This is the test framework used to generate the performance and correctness results displayed in the Vector docs. You can learn more about how this test harness works in the How It Works section, and you can begin using this test harness via the Usage section.


Contributions for additional benchmarks and tools are welcome! As required by the MPL 2.0 License, changes to this code base, including additional benchmarks and tools, must be made in the open. Please be skeptical of tools that make performance claims without doing so in public. The purpose of this repository is to create transparency around benchmarks and the resulting performance.

TOC

Performance Tests

Correctness Tests


Directories

  • /ansible - global ansible resources and tasks
  • /bin - contains all scripts
  • /cases - contains all test cases
  • /packer - packer script to build the AMIs necessary for tests
  • /terraform - global terraform state, resources, and modules

Setup

  1. Ensure you have Ansible (2.7+) and Terraform (0.12.20+) installed.

  2. This step is optional, but highly recommended. Set up a Vector-specific AWS profile in your ~/.aws/credentials file (a sketch follows this list). We highly recommend running the Vector test harness in a separate AWS sandbox account if possible.

  3. Create an Amazon compatible key pair. This will be used for SSH access to test instances.

  4. Run cp .envrc.example .envrc. Read through the file, update as necessary.

  5. Run source .envrc to prepare the environment. Alternatively, install direnv to do this automatically. Note that the .env file, if it exists, will be automatically sourced into the scripts' environment, so it's another option for setting the environment variables used by the bin/* commands in this repo.

  6. Run:

    ./bin/test -t [tcp_to_tcp_performance]

    This script will take care of running the necessary Terraform and Ansible scripts.
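
A minimal sketch of steps 2 and 5, assuming a profile named vector and the SSH key path used later in this README (.envrc.example remains the authoritative reference):

    # ~/.aws/credentials - a Vector-specific profile (the profile name is an example)
    [vector]
    aws_access_key_id     = <your-access-key-id>
    aws_secret_access_key = <your-secret-access-key>

    # .envrc - sketch only; see .envrc.example for the real variable list
    export AWS_PROFILE=vector                                    # assumes the tooling honors the standard AWS profile variable
    export VECTOR_TEST_SSH_PRIVATE_KEY=~/.ssh/vector_management  # the key pair created in step 3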

Usage

Results

  • High-level results can be found in the Vector performance and correctness documentation sections.
  • Detailed results can be found within each test case's README.
  • Raw performance result data can be found in our public S3 bucket.
  • You can run your own queries against the raw data. See the Usage section.

Development

Adding a test

We recommend cloning a similar test since it removes a lot of the boilerplate (a shell sketch follows the list below). If you prefer to start from scratch:

  1. Create a new folder in the /cases directory. The name should end with _performance or _correctness to clarify the type of test.
  2. Add a README.md providing an overview of the test. See the tcp_to_tcp_performance test for an example.
  3. Add a terraform/main.tf file for provisioning test resources.
  4. Add an ansible/bootstrap.yml to bootstrap the environment.
  5. Add an ansible/run.yml to run the test against each subject.
  6. Add any additional files as you see fit for each test.
  7. Run bin/test -t <name_of_test>.
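
A shell sketch of the clone-and-adapt approach (the source case and new name are just examples):

    # Copy an existing case and adapt it
    cp -r cases/tcp_to_tcp_performance cases/my_new_performance
    # Edit the copied README.md, terraform/main.tf, ansible/bootstrap.yml and
    # ansible/run.yml to describe and drive the new test, then run it:
    ./bin/test -t my_new_performance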

Changing a test

You should not change tests that have historical test data. You can change test subject versions, since test data is partitioned by version, but you cannot change a test's execution strategy, as this would corrupt historical test data. If you need to change a test in a way that would invalidate its historical data, we recommend creating an entirely new test.

Deleting a test

Simply delete the folder and any data in the S3 bucket.

Debugging

On a VM end

If you encounter an error it's likely you'll need to SSH onto the server to investigate.

SSHing

ssh -o 'IdentityFile="~/.ssh/vector_management"' ubuntu@51.5.210.84

Where:

  • ~/.ssh/vector_management = the VECTOR_TEST_SSH_PRIVATE_KEY value provided in your .envrc file.
  • ubuntu = the default root username for the instance.
  • 51.5.210.84 = the public IP address of the instance.

We provide a command that wraps the system ssh and provides the same credentials that ansible uses when connecting to the VM:

./bin/ssh 51.5.210.84

Viewing logs

All services are configured with systemd where their logs can be accessed with journalctl:

sudo journalctl -fu <service>

Failed services

If a service fails to start, it can be helpful to attempt to start it manually by inspecting the command in its .service file:

cat /etc/systemd/system/<name>.service

Then copy the command specified in ExecStart and run it manually. Ex:

/usr/bin/vector

On your end

Things can go wrong on your end (i.e., on the local system you're running the test harness from) too.

Ansible Task Debugger

export ANSIBLE_ENABLE_TASK_DEBUGGER=True

Set the environment variable above, and Ansible will drop you in a debug mode on any task failure.

See Ansible documentation on Playbook Debugger to learn more.

Some useful commands:

pprint task_vars['hostvars'][str(host)]['last_message']
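
Other commands documented for the Ansible playbook debugger include:

    p task
    p task.args
    p task_vars
    redo
    continue
    quit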

Verbose Ansible Execution

export ANSIBLE_EXTRA_ARGS=-vvv

Set the environment variable above, and Ansible will print verbose debug information for every task it executes.

How It Works

Design

The Vector test harness is a mix of bash, Terraform, and Ansible scripts. Each test case lives in the /cases directory and has full rein over its bootstrap and test process via its own Terraform and Ansible scripts. The location of these scripts is dictated by the test script and is outlined in more detail in the Adding a test section. Each test falls into one of two categories: performance tests and correctness tests:

Performance tests

Performance tests measure performance and MUST capture detailed performance data as outlined in the Performance Data and Rules sections.

In addition to the test script, there is a compare script. This script analyzes the performance data captured when executing a test. More information on this data and how it's captured and analyzed can be found in the Performance Data section. Finally, each script includes a usage overview that you can access with the --help flag.
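
For example, to analyze the results of the test case used earlier in this README (check --help for the full set of flags):

    ./bin/compare --help
    ./bin/compare -t tcp_to_tcp_performance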

Performance data

Performance test data is captured via dstat, which is a lightweight utility that captures a variety of system statistics in 1-second snapshot intervals. The final result is a CSV where each row represents a snapshot. You can see the dstat command used in the ansible/roles/profiling/start.yml file.
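
Purely as an illustrative sketch of that kind of invocation - the flag selection and output path here are assumptions based on the columns listed below, not the actual command:

    # Illustrative only - the real command lives in ansible/roles/profiling/start.yml
    dstat --epoch --cpu --disk --io --load --mem --net --proc --sys \
          --socket --tcp --output /tmp/dstat.csv 1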

Performance data schema

The performance data schema is reflected in the Athena table definition as well as the CSV itself. The following is an ordered list of columns:

Name Type
epoch double
cpu_usr double
cpu_sys double
cpu_idl double
cpu_wai double
cpu_hiq double
cpu_siq double
disk_read double
disk_writ double
io_read double
io_writ double
load_avg_1m double
load_avg_5m double
load_avg_15m double
mem_used double
mem_buff double
mem_cach double
mem_free double
net_recv double
net_send double
procs_run double
procs_bulk double
procs_new double
procs_total double
sys_init double
sys_csw double
sock_total double
sock_tcp double
sock_udp double
sock_raw double
sock_frg double
tcp_lis double
tcp_act double
tcp_syn double
tcp_tim double
tcp_clo double

Performance data location

All performance data is made public via the vector-tests S3 bucket in the us-east-1 region. The layout follows the Hive partitioning structure, with variable names in the path. For example:

name=tcp_to_tcp_performance/configuration=default/subject=vector/version=v0.2.0-dev.1-20-gae8eba2/timestamp=1559073720

And the same in a tree form:

name=tcp_to_tcp_performance/
  configuration=default/
    subject=vector/
      version=v0.2.0-dev.1-20-gae8eba2/
        timestamp=1559073720

  • name = the test name.
  • configuration = the test's specific configuration (tests can have multiple configurations if necessary).
  • subject = the test subject, such as vector.
  • version = the version of the test subject.
  • timestamp = when the test was executed.
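
You can browse this layout directly with the AWS CLI, using the bucket and region noted above. For example:

    # List results for one test case (add --no-sign-request if you have no AWS credentials configured)
    aws s3 ls --region us-east-1 \
      "s3://vector-tests/name=tcp_to_tcp_performance/configuration=default/subject=vector/"
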
Performance data analysis

Analysis of this data is performed through the AWS Athena service. This allows us to execute complex queries on the performance data stored in S3. You can see the queries run in the compare script.

Correctness tests

Correctness tests simply verify behavior. These tests are not required to capture or to persist any data. The results can be manually verified and placed in the test's README.

Correctness data

Since correctness tests are pass/fail there is no data to capture other than the successful running of the test.

Correctness output

Generally, correctness tests verify the output. Because of the various test subjects, we use a variety of output methods to capture output (TCP, HTTP, and file). This is highly dependent on the test subject and the methods available. For example, the Splunk Forwarders only support TCP and Splunk-specific outputs.

To make capturing this data easy, we created a test_server Ansible role that spins up various test servers and provides a simple way to capture summary output.

Environments

Tests must operate in isolated, reproducible environments; they must never run locally. The obvious benefit is that this removes variables across tests, but it also improves collaboration, since remote environments are easily accessible and reproducible by other engineers.

Rules

  1. ALWAYS filter to resources specific to your test_name, test_configuration, and user_id (e.g. via Ansible host targeting).
  2. ALWAYS make sure the initial instance state is identical across test subjects. We recommend explicitly stopping all test subjects to properly handle the case of a preceding failure or a subject that was not cleanly shut down.
  3. ALWAYS use the profile ansible role to capture data. This ensures a consistent data structure across tests.
  4. ALWAYS run performance tests for at least 1 minute to calculate a 1m CPU load average.
  5. Use ansible roles whenever possible.
  6. If you are not testing local data collection, we recommend using TCP as a data source, since it is a lightweight source that is more likely to be consistent, performance-wise, across subjects.

vector-test-harness's People

Contributors

binarylogic, blt, camerondavison, jszwedko, luciofranco, lukesteensen, mozgiii


vector-test-harness's Issues

Decide on integration benchmarking framework

The Vector test harness was created alongside Vector by myself to compare Vector against alternatives. I am by no means a benchmarking expert and I am certain there are better ways to implement black box benchmarking. Before we begin work on vectordotdev/vector#6510 we should obtain consensus that improving the existing framework is the best path forward. If not, we should consider writing an RFC that proposes a new approach.

Soak Tests but for Configs We Can't Share

Today in Vector we have a notion of a 'soak test'. The soak test is an all-up, integrated benchmark for vector. It runs Vector with known configurations, against stable load generation in a repeatable environment and measures the throughput from the load generation side, comparing two SHAs in the process in a statistically significant way. We use this to control regressions and detect if optimizations actually have an impact when Vector is fully assembled and the most susceptible to Amdahl's law. The underlying mechanism is simplistic: minikube, some quiet EC2 machines, a bit of terraform, custom load generation, analysis scripts and a Github Action workflow.

At a high level we have a need for both performance and reliability tests in the Vector project. They are:

a. short duration performance soak tests
b. long duration performance soak tests
c. long duration memory reliability soak tests
d. short/long private customer configuration replication soak tests
e. statistically viable comparisons with competitor setups in a soak testing environment

The soak tests today cover point 'a' well. Covering 'b' is a matter of extending the runtime of the existing soaks. Covering 'c' is a matter of capturing memory data from the vector pod in-kube. Point 'd' is not approachable with our current soak notion, in that we require configurations to be public and checked into the vector repository. Theoretically point 'e' is approachable with our current setup, though no work has been done to achieve this.

Measure TCP performance with a specially crafted receiver

Currently, we're using dstat for performance tests, but we can improve the accuracy if we implement a custom TCP receiver component that would measure the results more precisely. This solution will still be portable across subjects and can be used together with dstat data gathering.

Provide docker images

Since we want to include this in CI, we probably want to dockerize this repo to be able to easily run it off Circle CI for the main vector repo.

Socat send task is broken

The new changes to the socat/send.yml task have broken it for the real_world_1_performance test. First, I received this error:

timeout: invalid time interval ‘timeout’\nTry 'timeout --help' for more information.

So I changed the arguments to use interpolation. Ex:

...
"{{ timeout }}",
...

Then I received this error:

bash: line 2: file: No such file or directory

I'm not sure what is going on, but reverting it back to the original bash loop fixes it.

Automatically add Athena partition when new test results are available

In order for Athena to query new benchmarking data, it must know about the partition. This is achieved by running a repair query:

MSCK REPAIR TABLE vector_tests

This query is expensive and takes a few minutes to execute. This is because it's listing every file in the bucket in order to discover new partitions. A much more efficient method is to add the partition directly:

ALTER TABLE vector_tests ADD PARTITION (...) location 's3://test-results.vector.dev/../../../'

For example, this could be done automatically by listening for new S3 files being placed in the bucket and firing a lambda to run this query.

Make/find a stable HTTP sink for #69

We need a sink that will not participate in coordinated omission with vector but will reliably report its byte throughput. Depending on our measurement needs this may already exist. Relates to #69.

Logstash - JVM Config

Hi there.

I'm quite sceptical about the Logstash performance results.

I'm assuming it's a JVM config issue. Logstash can become very slow without enough memory.

Since it seems you are using the default config from apt, I think the memory will be limited to 1GB.

While the high memory usage is certainly a drawback of Logstash, no one runs it in production with such limited capacity.

For a fairer comparison, you'll probably want to give it at least 4GB of memory.

Some info here:
https://www.elastic.co/guide/en/elasticsearch/reference/current/heap-size.html
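
For reference, a minimal sketch of that change, assuming the default apt layout where the heap is configured in /etc/logstash/jvm.options:

    # /etc/logstash/jvm.options - raise the default 1g heap to the suggested 4g
    -Xms4g
    -Xmx4g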

Run all tests when a new major or minor release is made

When a major or minor release is made we should run all tests with the appropriate Vector version:

  • This should not use a custom build.
  • It should set $VECTOR_VERSION appropriately.
  • It should probably be triggered by a Github release (not tag) since a release happens after all release artifacts have been uploaded and distributed.
  • It should run bin/export to upload the latest and greatest results. The website is powered by this data. (A rough sketch of the whole flow follows this list.)
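
Put together, a release-triggered run might look roughly like this sketch (the version and test selection are placeholders, and how bin/export picks up results is not shown):

    # Sketch only - run on a Github release
    export VECTOR_VERSION=v0.12.0         # hypothetical released version, not a custom build
    ./bin/test -t tcp_to_tcp_performance  # repeat (or loop) for every test case
    ./bin/export                          # upload the results that power the website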

Github Actions integration

Design plan so far:

  1. Trigger a Github Action on issue comment.

  2. Check the context:

    • comment body must be in the form of a /test command
    • comment must be authored by a user with write access to the repo
    • comment must be on a Pull Request (not on a regular Issue)

    Abort if any of the above are not fulfilled (a rough shell sketch of this check follows the list below).

  3. Clone the PR branch and build vector from it.

    This needs verification since there's a concern that a standard Github Action runner might not have sufficient resources to build vector.
    Note we can get the Pull Request info from the event file at $GITHUB_EVENT_PATH.

    An alternative approach is building vector on the test harness VM. It might be easier too, as we'll be able to perform a standard cargo install from git:

    cargo install --git https://github.com/timberio/vector --rev "<PR_COMMIT_SHA>"

    Note that to determine the commit hash, we'll have to perform some Github API calls.
    We can avoid doing that if we just use the PR branch: pull/<PR_ID>/head.

    cargo install --git https://github.com/timberio/vector --branch "pull/<PR_ID>/head"
  4. Invoke the vector-test-harness docker image.

    We'll have to add support for using the vector executable built on a previous step - probably
    add a way to upload it into the test VM and replace the preinstalled vector instance.

    The command sequence we'll need to run is something like this:

    • bin/test -t some_test
    • bin/compare -t some_test > "$GITHUB_WORKSPACE/output"

    We place whatever output we want to respond with at $GITHUB_WORKSPACE/output.

  5. Finally, we take the contents of $GITHUB_WORKSPACE/output and post it as a comment to the PR. We should also provide a link to the Github Action execution in the said comment, so additional context is easily available at hand.
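
As a rough shell sketch of the context check in step 2 and the PR lookup (field names follow the issue_comment event payload; the write-access check needs a Github API call and is not shown):

    # Read the triggering comment and PR info from the event payload
    COMMENT_BODY="$(jq -r '.comment.body' "$GITHUB_EVENT_PATH")"
    PR_NUMBER="$(jq -r '.issue.number' "$GITHUB_EVENT_PATH")"
    IS_PR="$(jq -r '.issue.pull_request != null' "$GITHUB_EVENT_PATH")"

    # Terminate quietly (without failing the action) if the context checks don't pass
    if [[ "$COMMENT_BODY" != /test* || "$IS_PR" != "true" ]]; then
      exit 0
    fi

    echo "Comment targets PR #${PR_NUMBER} - proceed with pull/${PR_NUMBER}/head"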

Failure handling:

In general, we should never fail the action, because it'll send a failure email to whoever made the latest commit to master.

  • If the command is not recognized in the comment - terminate the action without failure so that people don't get an email about failed action on every comment. Ideally, do not even run the action. Or remove the execution log so it doesn't clutter the Actions tab.
  • If the user is not authorized to run the action - post a comment about it in response to the command invocation comment. Do not fail the action job, just terminate it with success or other neutral state (if any).
  • If the deb build or test harness failed - post a comment about it in response to the user comment with the link to the action execution log - ideally to the exact location where the failure occurred. Do not fail the task (otherwise, the user will get two emails - one for a comment we post in response, and one for the action failure).

`Your query returned no results` error

When running ./bin/test it throws the same error three times:

Error: Your query returned no results. Please change your search criteria and try again.

Tried mixing up the flags to see if I'm passing some invalid argument to them but nothing changed.

Maybe it has something to do with the DynamoDB table in AWS that I've created, as it seemed to be needed for storing a lock, but I've not seen it documented in the setup.
output.txt

Version of Fluent Bit used

Is the Fluent Bit version used in tests described in preinstalled_test_subject_versions.sh? 1.1.0 is quite old, so I suggest upgrading it to 1.8.11.

Graph test-harness results somewhere

It would be useful to plot the test-harness performance results somewhere.

Since these results are in Athena, AWS Quicksight could be an option. Alternatively, we execute a GA job to pull the results and publish them somewhere, maybe as a gh-pages branch on this repo?

Fix failing test-harness tests

While attempting to run the test harness to investigate vectordotdev/vector#5374 I found some tests failed to run. I need to investigate to see if these are real failures that require changes to the test-harness or if compatibility issues were introduced with vector.

Execute reset.yml playbooks in parallel

Before every test run we ensure all subjects are stopped and that their data directories are cleaned. This reset process takes a couple of minutes, which isn't long, but could be avoided if we ran the cleanup steps in parallel. Ansible does not make this easy with the way we load in roles:

  become: true
  roles:
    - role: filebeat
      action: stop
    - role: fluentbit
      action: reset
    - role: fluentd
      action: stop
    - role: logstash
      action: reset
    - role: splunk_heavy_forwarder
      action: reset
    - role: splunk_universal_forwarder
      action: reset
    - role: vector
      action: reset

- hosts: '{{ test_namespace }}:&tag_TestRole_consumer'
  become: true
  roles:
    - role: logstash
      action: reset
    - role: tcp_test_server
      action: stop

Therefore we might be better off switching these to include_role steps, which I believe allows for more flexibility around parallelization of steps.

Establish `file_gen` + http + vector rig

As a part of #69 we know we need the ability to assemble the pieces mentioned in the title into a repeatable rig. This should be a configured VM and instructions for setup on a personal machine, for both rigorous work and ad hoc experimentation. Non-goal: nightly tests from this issue.

Test comparisons are noisy

While investigating a recent suspected regression, I discovered there wasn't actually one, but just a nightly that had an odd difference in throughput that went away when I reran a specific test case.

Some noise is to be expected given we are running on commodity AWS EC2 instances, but I am wondering if it would be worthwhile to attempt to reduce this to be able to more clearly identify regressions.

Some ideas:

  • Run each benchmark case N times, on different hardware (i.e. running teardown after each one) and using the samples collected by all runs for comparison, rather than just the last one. This should reduce noisy neighbors.
  • Run on dedicated hardware instead of virtual instances
  • For comparisons, actually run the test cases once for each version being compared on the same EC2 instances

Use statistical analysis to identify regressions

Currently we just compare the means and output +/-, but it would be good to be able to compare two runs for a statistically significant difference using a t-test.

We should enhance bin/compare (or maybe it is a separate script) to be able to perform this and exit non-zero if regressions occurred. We can then use this with vectordotdev/vector#5396 to send a message to discord if a regression is detected between two nightly runs.

Prior art: https://bheisler.github.io/criterion.rs/book/analysis.html
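
As a sketch of how a nightly job could gate on that once bin/compare gains such an exit code (the flag, test name, and notification hook are placeholders):

    # Fail the nightly comparison job if bin/compare reports a significant regression
    if ! ./bin/compare -t tcp_to_tcp_performance; then
      echo "possible performance regression detected" >&2
      exit 1  # a non-zero exit here could then trigger the Discord notification
    fi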

Ability to tear down test resources

See #44 (comment).

Right now we do not have a way to manually tear down resources after running tests. Currently, we use CloudWatch alarms to automatically shut down idle instances, which works well for our current tests. But this starts to break down when we expand our types of tests. Such as long-running tests (#44), tests that spin up Kafka clusters, etc. We should think about a simple cleanup mechanism that can tear down all test state after an idle period.

Do not pass versions to bin/test, detect them or rely on packer provided data instead

Just writing this down so that I don't forget.

One of the potential causes of inconsistencies is how we pass subject versions around.
Since we pre-install them in the packer-built AMI, we should probably write the installed versions as host facts for Ansible to read later, rather than passing them as inputs to bin/test. Or just detect them at runtime in the profiling role.

Identify trends in test-harness regressions

Broken off from: #61

We should also cover off trending regressions where a cumulative change between a series of nightly tests indicates a pattern or trend of regression in some element of Vector.

Custom $VECTOR_VERSION when running tests

When running tests for pull requests, we should be setting $VECTOR_VERSION to the appropriate version so that test results are properly namespaced. You can see here that we use the Ansible variable vector_version which is set by $VECTOR_VERSION.

I would prefer that we use a version that clearly represents the change in a pull request:

[branch-name]-[commit-sha]

Which would be something like:

my-branch-g816041c
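
A rough sketch of deriving that value (the g-prefixed short SHA matches the example above; the exact naming is still open):

    # Build a [branch-name]-[commit-sha] style version string
    branch="$(git rev-parse --abbrev-ref HEAD)"
    short_sha="g$(git rev-parse --short HEAD)"
    export VECTOR_VERSION="${branch}-${short_sha}"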

Compare Telegraf

I would love to see how this behaves compared to Telegraf, also a big player in collecting/transforming/sending metrics.

Edit: It seems there is some support for Telegraf added, but I can't find any results for it.

Create a stable file generator

As a part of the completion criteria for #69 we need a program that can generate logs in a reproducible, stable fashion. There is ongoing work in file_gen. Once done we'll have a program that we can configure to produce logs at a given hertz in a well-known place, stable across runs and sparing of resources.

Ability to run all tests in parallel

WARNING: Github has asked us explicitly not to spin up a lot of parallel jobs on large instances. This task should use regular/small instances.

It looks like Github Actions has the ability to run jobs as a matrix (do not use large instances!). I would like to run every test case in parallel unless otherwise specified. For example:

  • PR comment /test runs all tests in parallel.
  • PR comment /test -t docker_partial_events_merging_correctness runs a single test.
  • PR comment /test -t docker_partial_events_merging_correctness -t disk_buffer_performance runs only the tests specified.
  • PR comment /test -t *_performance runs all tests ending with _performance.

All tests are designed to be namespaced, so there should be no problem running them all in parallel. I am also not concerned about resource usage or cost. If we bump into AWS limits we can request an increase.

Focus on a singular harness performance test

In the performance testing RFC we lay out the changes we want to see done to this repository, specifically here. The first phase of that work "Focus our Efforts" envisions a clear break with our existing performance comparison tests. The completion criteria for this phase are:

Completion criteria: A new performance test file -> json_parser -> http will be runnable in a reproducible VM environment for 1 hour. The data from the file source and HTTP sink will be collected and shipped off-system. The test-harness will be rigged to run this test on-demand for a given nightly vector.
