influxdb-comparisons

This repo contains code for benchmarking InfluxDB against other databases and time series solutions. You can access the detailed technical writeups for each here.

Current databases supported include InfluxDB, Elasticsearch, Cassandra, MongoDB, OpenTSDB, and TimescaleDB (see the cmd directory for the full list).

Testing Methodology

In an attempt to make our performance comparison both realistic and relatable, we decided to build our benchmark suite according to real-world use cases. Micro-benchmarks are useful for database engineers, but using realistic data helps us better understand how our software performs under practical workloads.

Currently, the benchmarking tools focus on the DevOps use case. We create data and queries that mimic what a system administrator would see when operating a fleet of hundreds or thousands of virtual machines. We create and query values like CPU load; RAM usage; number of active, sleeping, or stalled processes; and disk used. Future benchmarks will expand to include the IoT and application monitoring use cases.

We benchmark bulk load performance and synchronous query execution performance. The benchmark suite is written in Go, and attempts to be as fair to each database as possible by removing test-related computational overhead (by pre-generating our datasets and queries, and using database-specific drivers where possible).

Although the data is randomly generated, our data and queries are entirely deterministic. By supplying the same PRNG (pseudo-random number generator) seed to the test generation code, each database is loaded with identical data and queried using identical queries.

(Note: The use of more than one worker thread does lead to a non-deterministic ordering of events when writing and/or querying the databases.)
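
To illustrate how seeding drives determinism, here is a minimal Go sketch (illustrative, not the generator's actual code):

package main

import (
    "fmt"
    "math/rand"
)

func main() {
    // Two PRNGs built from the same seed yield identical value streams,
    // which is why two generator runs with the same seed emit identical data.
    a := rand.New(rand.NewSource(42))
    b := rand.New(rand.NewSource(42))
    fmt.Println(a.Float64() == b.Float64()) // prints: true
}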

There are five phases when using the benchmark suite: data generation, data loading, query generation, query execution, and query validation.

Phase 1: Data generation

Each benchmark begins with data generation.

The DevOps data generator creates time series points that correspond to server telemetry, similar to what a server fleet would send at regular intervals to a metrics collection service (like Telegraf or collectd). Our DevOps data generator runs a simulation for a pre-specified number of hosts, and emits serialized points to stdout. For each simulated machine, nine different measurements are written at 10-second intervals.

The intended usage of the DevOps data generator is to create distinct datasets that simulate larger and larger server fleets over increasing amounts of time. As the host count or the time interval go up, the point count increases. This approach lets us examine how the databases scale on a real-world workload in the dimensions our DevOps users care about.

Each simulated host is initialized with a RAM size and a set of stateful probability distributions (Gaussian random walks with clamping), corresponding to nine statistics as reported by Telegraf. Here are the Telegraf collectors for CPU and memory:

https://github.com/influxdata/telegraf/blob/master/plugins/inputs/system/cpu.go
https://github.com/influxdata/telegraf/blob/master/plugins/inputs/system/memory.go
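
To sketch the simulation idea in Go (using math/rand; this is illustrative, not the repository's exact code), a clamped Gaussian random walk might look like:

// clampedWalk advances a value by one Gaussian step and clamps the
// result to [min, max], keeping the simulated statistic in a valid range.
func clampedWalk(rng *rand.Rand, value, stddev, min, max float64) float64 {
    value += rng.NormFloat64() * stddev
    if value < min {
        value = min
    }
    if value > max {
        value = max
    }
    return value
}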

For example, here is a graph of the simulated CPU usage through time for 10 hosts, when using the data generator:

(TODO screenshot of graph from Chronograf)

And, here is a graph of the simulated memory from the same simulation:

(TODO screenshot of graph from Chronograf)

Note that the generator shares its simulation logic between databases. This is not just for code quality; we did this to ensure that the generated data is, within floating point tolerances, exactly the same for each database.

A DevOps dataset is fully specified by the following parameters:

  • Number of hosts to simulate (default 1)
  • Start time (default January 1st 2016 at midnight, inclusive)
  • End time (default January 2nd 2016 at midnight, exclusive)
  • PRNG seed (default uses the current time)

The ‘scaling variable’ for the DevOps generator is the number of hosts to simulate. By default, the data is generated over a simulated period of one day. Each simulated host produces nine measurements per 10-second epoch, one each of:

  • cpu
  • diskio
  • disk
  • kernel
  • mem
  • net
  • nginx
  • postgresl
  • redis

Each measurement stores a different set of field values. In total, the nine measurements store 100 field values.

The following equations describe how many points are generated for a 24 hour period:

seconds_in_day = (24 hours in a day) * (60 minutes in an hour) * (60 seconds in a minute) = 86,400 seconds
epochs = seconds_in_day / 10 = 8,640
point_count = epochs * host_count * 9

So, for one host we get 8,640 * 1 * 9 = 77,760 points, and for 1,000 hosts we get 8,640 * 1000 * 9 = 77,760,000 points.
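
The same arithmetic as a small Go helper (the function name is ours, for illustration):

// pointCount computes the number of generated points, given 10-second
// epochs and nine measurements per host per epoch.
func pointCount(hostCount, days int) int {
    epochs := days * 86400 / 10
    return epochs * hostCount * 9
}

// pointCount(1, 1)    == 77760
// pointCount(1000, 1) == 77760000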

For these benchmarks, we generated a dataset we call DevOps-100: 100 simulated hosts over various time periods (1-4 days).

Generated data is written in a database-specific format that directly equates to the bulk write protocol of each database. This helps make the following benchmark, bulk loading, as straightforward as possible.

For InfluxDB, the bulk load protocol is described at: https://docs.influxdata.com/influxdb/v0.12/guides/writing_data/#writing-multiple-points

For Elasticsearch, the bulk load protocol is described at: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html

For Cassandra, we use the native protocol version 4, described at: https://github.com/apache/cassandra/blob/trunk/doc/native_protocol_v4.spec

For MongoDB, we use standard BSON with the mgo client: http://labix.org/mgo

For OpenTSDB, we use the standard HTTP query interface (not the batch input tool) described at: http://opentsdb.net/docs/build/html/api_http/put.html

Phase 2: Data loading

After data generation comes data loading.

The data loading programs stream data from stdin; typically, this is from a file created by the data generator. As data is read, the loader performs a minimum of deserialization and queues up writes into a batch. As batches become ready, the points are loaded into the destination database as fast as possible.

(Each database currently has its own bulk loader program. In the future, we want to merge the programs together to minimize the amount of special-case code.)
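
Conceptually, each loader follows a read-batch-write loop like the Go sketch below (illustrative; batchSize and writeBatch are placeholders, bufio and os are assumed imports, and the real loaders use database-specific drivers and parallel workers):

// loadFromStdin reads pre-generated points line by line and submits them
// to the database in fixed-size batches.
func loadFromStdin(batchSize int, writeBatch func([]string)) {
    scanner := bufio.NewScanner(os.Stdin)
    batch := make([]string, 0, batchSize)
    for scanner.Scan() {
        batch = append(batch, scanner.Text())
        if len(batch) == batchSize {
            writeBatch(batch) // one bulk write to the destination database
            batch = batch[:0]
        }
    }
    if len(batch) > 0 {
        writeBatch(batch) // flush the final, partially-filled batch
    }
}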

Configuration

Each bulk loader takes a handful of parameters that affect performance:

Elasticsearch:

  • Number of workers to use to make bulk load writes in parallel
  • Which index template to use (more on this later)
  • Whether to force an index refresh after each write
  • How many items to include in each write batch

InfluxDB:

  • Number of workers to use to make bulk load writes in parallel
  • How many points to include in each write batch

Loader programs for the other databases take similar parameters.

(For calibration, there is also an option to disable writing to the database; this mode is used to check the speed of data deserialization.)

Note that the bulk loaders will not start writing data if there is already data in the destination database at the beginning of a test. This helps ensure that the database is empty, as if it were newly-installed. It also prevents users from clobbering existing data.

Elasticsearch-specific configuration

Both Elasticsearch and InfluxDB are ready out-of-the-tarball for storing time series data. However, after meeting with Elasticsearch experts, we decided to make some reasonable configuration tweaks to Elasticsearch to try to optimize its performance.

First, the configuration for the Elasticsearch daemon was changed to set the ES_HEAP_SIZE environment variable to half of the server machine’s available RAM. For example, on a 32GB machine, ES_HEAP_SIZE is 16g. This is standard practice when administering Elasticsearch.

Second, the configuration file was also changed to increase the threadpool.bulk.queue_size parameter to 100000. When we tried bulk loading without this tweak, the server replied with errors indicating it had run out of buffer space for receiving bulk writes. This config change is standard practice for bulk write workloads.
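
In elasticsearch.yml, that tweak would look something like the following line (the exact key name varies across Elasticsearch versions, so treat this as illustrative):

threadpool.bulk.queue_size: 100000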

Third, we developed two Elasticsearch index templates, each of which represents a way we think people use Elasticsearch to store time-series data:

The first template, called ‘default’, stores time-series data in a way that enables fast querying, while also storing the original document data. This is closest to Elasticsearch’s default behavior and is a reasonable starting point for most users, although its on-disk size may become large.

The second template, called ‘aggregation’, indexes time-series data in a way that saves disk space by discarding the original point data. All data is stored in a compressed form inside the Lucene indexes, therefore all queries are completely accurate. But, due to an implementation detail of Elastic, the underlying point data is no longer independently addressable. For users who only conduct aggregation queries, this saves quite a bit of disk space (and improves bulk load speed) without any downsides.

Fourth, after each bulk load in Elasticsearch, we trigger a forced compaction of all index data. This is not included in the speed measurements; we give this to Elasticsearch ‘for free’. We’ve chosen to do this because compactions occur continuously over the lifetime of a long-running Elasticsearch process, so this helps us obtain numbers that are representative of steady-state operation of Elasticsearch in production environments.

(Note that Elasticsearch does not immediately index data written with the bulk endpoint. To make written data immediately available for querying, users can set the URL query parameter ‘refresh’ to ‘true’. We didn’t do this because performance dropped considerably, and most users would not need this when performing a bulk load. InfluxDB performs an fsync after each bulk write, and makes data immediately available for querying.)
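
For reference, a bulk write that requests immediate indexing would target a URL of roughly this shape (illustrative):

http://localhost:9200/_bulk?refresh=true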

InfluxDB-specific configuration

The only change we made to a default InfluxDB install is to, like Elastic, cause a full database compaction after a bulk load benchmark is complete. This forces all eventual compaction to happen at once, simulating steady-state operation of the data store.

Measurements

For bulk loading, we care about two numerical outcomes: the total wall clock time taken to write the given dataset, and how much disk space is used by the database after all writes are complete.

When finished, the bulk load program prints out how long it took to load data, and what the average ingestion rate was.

Combining the following parameters gives a hypothetical ‘performance matrix’ for a given dataset:

Client parallelism: 1, 2, 4, 8, 16
Database: InfluxDB, Elasticsearch (with default template), Elasticsearch (with aggregation template)

This gives a possible set of 15 bulk write benchmarks. Running all of these tests is excessive, but doing so is possible and lets us confidently determine how both write throughput and disk usage scale.

Phase 3: Query generation

The third phase makes serialized queries and saves them to a file.

We pre-generate all queries before benchmarking them, so that the query benchmarker can be as lightweight as possible. This allows us to reuse code between the database drivers. It also lets us prove that the runtime overhead of query generation does not impact the benchmarks.

Many benchmark suites generate and serialize queries at the same time as running benchmarks; this is typically a mistake. For example, Elasticsearch takes queries in JSON format, yet InfluxDB has a simpler wire format. If we included query generation in the query benchmarker, then the JSON serialization overhead would negatively, and unfairly, affect the Elasticsearch benchmark.

(In the case of JSON this effect is especially acute: the JSON encoder in Go’s standard library makes many heap allocations and uses reflection.)
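
A minimal Go sketch of the pre-generation idea, using encoding/gob as a stand-in for the suite's actual on-disk format (the Query type and its fields are illustrative; io and encoding/gob are assumed imports):

// Query is an illustrative stand-in for a serialized benchmark query.
type Query struct {
    HumanLabel string
    Method     string
    Path       string
    Body       []byte
}

// writeQueries encodes all queries up front, so the benchmarker pays
// no generation or serialization cost at measurement time.
func writeQueries(w io.Writer, queries []Query) error {
    enc := gob.NewEncoder(w)
    for _, q := range queries {
        if err := enc.Encode(q); err != nil {
            return err
        }
    }
    return nil
}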

The DevOps use case is focused on the needs of system administrators. As we saw above, the data for our benchmark is telemetry from a simulated server fleet.

The queries that administrators tend to run are focused on: 1) visualizing information on dashboards, 2) identifying trends in system utilization, and 3) drilling down into a particular server’s behavior.

To that end, we have identified the following query types as being representative of a sysadmin’s needs:

Maximum CPU usage for 1 host, over the course of an hour, in 1 minute intervals
Maximum CPU usage for 2 hosts, over the course of an hour, in 1 minute intervals
Maximum CPU usage for 4 hosts, over the course of an hour, in 1 minute intervals
Maximum CPU usage for 8 hosts, over the course of an hour, in 1 minute intervals
Maximum CPU usage for 16 hosts, over the course of an hour, in 1 minute intervals
Maximum CPU usage for 32 hosts, over the course of an hour, in 1 minute intervals

Each of these six abstract query types is parameterized to create millions of concrete queries, which are then serialized to files. (For example, the max CPU query for one host is parameterized on 1) a random host id, and 2) a random 60-minute interval.) These requests are read by the query benchmarker and then sent to the database.

Our query generator program uses a deterministic random number generator to fill in the parameters for each concrete query.

For example, here are two queries for InfluxDB that aggregate maximum CPU information for 2 hosts during a random 1-hour period, in 1 minute buckets. Each hostname was chosen from a set of 100 hosts, because in this example the Scaling Variable is 100:

SELECT max(usage_user) FROM cpu WHERE (hostname = 'host_73' OR hostname = 'host_24') AND time >= '2016-01-01T19:24:45Z' AND time < '2016-01-01T20:24:45Z' GROUP BY time(1m)
SELECT max(usage_user) FROM cpu WHERE (hostname = 'host_60' OR hostname = 'host_79') AND time >= '2016-01-01T11:14:49Z' AND time < '2016-01-01T12:14:49Z' GROUP BY time(1m)

Notice that the time range is always 60 minutes long, and that the start of the time range is randomly chosen.
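
A Go sketch of how those parameters can be drawn with a seeded PRNG (illustrative; the real implementation lives in bulk_query_gen, and time and math/rand are assumed imports):

// randWindow picks a random 1-hour window inside [start, end), assuming
// the overall interval is longer than one hour.
func randWindow(rng *rand.Rand, start, end time.Time) (time.Time, time.Time) {
    span := end.Add(-time.Hour).Sub(start)
    offset := time.Duration(rng.Int63n(int64(span)))
    windowStart := start.Add(offset)
    return windowStart, windowStart.Add(time.Hour)
}

// A hostname is drawn similarly, e.g. fmt.Sprintf("host_%d", rng.Intn(100)).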

The result of the query generation step is two files of serialized queries, one for each database.

Phase 4: Query execution

The fourth phase is benchmarking query performance.

So far we have covered data generation, data loading, and query generation. Now, all of that culminates in a benchmark for each database that measures how fast they can satisfy queries.

Our query benchmarker is a small program that executes HTTP requests in parallel. It reads pre-generated requests from stdin, performs a minimum of deserialization, then executes those queries against the chosen endpoint. It supports making requests in parallel, and collects basic summary statistics during its execution.

The query benchmarker has zero knowledge of the database it is testing; it just executes HTTP requests and measures the outcome.

We use the fasthttp library for the HTTP client, because it minimizes heap allocations and can be up to 10x faster than Go’s default client.
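
A minimal sketch of timing one request with fasthttp (illustrative; assumes the github.com/valyala/fasthttp and time packages, and is not the benchmarker's actual code):

// timeOneQuery executes a single pre-generated query and returns its latency.
func timeOneQuery(uri string) (time.Duration, error) {
    req := fasthttp.AcquireRequest()
    resp := fasthttp.AcquireResponse()
    defer fasthttp.ReleaseRequest(req)
    defer fasthttp.ReleaseResponse(resp)

    req.SetRequestURI(uri)
    start := time.Now()
    if err := fasthttp.Do(req, resp); err != nil {
        return 0, err
    }
    // The elapsed time becomes one latency sample in the summary statistics.
    return time.Since(start), nil
}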

Before every execution of the query benchmarker, we restart the given database daemon in order to flush any query caches.

Phase 5: Query validation

The final step is to validate the benchmark by sampling the query results for both databases.

The benchmark suite was engineered to be fully deterministic. However, that does not guard against possible semantic mistakes in the data or query set. For example, queries for one database could be valid, yet wrong, if they compute an undesired result.

To show the parity of both data and queries between the databases, we can compare the query responses themselves.

Our query benchmarker tool has a mode for pretty-printing the query responses it receives. By running it in this mode, we can inspect query results and compare the results for each database.

For example, here is a side-by-side comparison of the responses for the same query (a list of maximums, in 1-minute buckets):

InfluxDB query response:

{
  "results": [
    {
      "series": [
        {
          "name": "cpu",
          "columns": [
            "time",
            "max"
          ],
          "values": [
            [
              "2016-01-01T18:29:00Z",
              90.92765387779365
            ],
            [
              "2016-01-01T18:30:00Z",
              89.58087379178397
            ],
            [
              "2016-01-01T18:31:00Z",
              88.39341429374308
            ],
            [
              "2016-01-01T18:32:00Z",
              84.27665178871197
            ],
            [
              "2016-01-01T18:33:00Z",
              84.95048030509422
            ],
            ...

Elasticsearch query response:

{
  "took": 133,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1728000,
    "max_score": 0.0,
    "hits": []
  },
  "aggregations": {
    "result": {
      "doc_count": 360,
      "result2": {
        "buckets": [
          {
            "key_as_string": "2016-01-01-18",
            "key": 1451672940000,
            "doc_count": 4,
            "max_of_field": {
              "value": 90.92765387779365
            }
          },
          {
            "key_as_string": "2016-01-01-18",
            "key": 1451673000000,
            "doc_count": 6,
            "max_of_field": {
              "value": 89.58087379178397
            }
          },
          {
            "key_as_string": "2016-01-01-18",
            "key": 1451673060000,
            "doc_count": 6,
            "max_of_field": {
              "value": 88.39341429374308
            }
          },
          {
            "key_as_string": "2016-01-01-18",
            "key": 1451673120000,
            "doc_count": 6,
            "max_of_field": {
              "value": 84.27665178871197
            }
          },
          {
            "key_as_string": "2016-01-01-18",
            "key": 1451673180000,
            "doc_count": 6,
            "max_of_field": {
              "value": 84.95048030509422
            }
          },
          ...

By inspection, we can see that the results are (within floating point tolerance) identical. We have done this by hand for a representative selection of queries for each benchmark run.

Successful query validation implies that the benchmarking suite has end-to-end reproducibility, and is correct between both databases.
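
The hand comparison could equally be automated with a tolerance check like this Go sketch (illustrative; math is an assumed import):

// withinTolerance reports whether two result series agree element-wise
// within eps, e.g. eps = 1e-9 for the values shown above.
func withinTolerance(a, b []float64, eps float64) bool {
    if len(a) != len(b) {
        return false
    }
    for i := range a {
        if math.Abs(a[i]-b[i]) > eps {
            return false
        }
    }
    return true
}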

Quickstart

Executing the benchmarks requires the Go compiler and tools to be installed on your system. See https://golang.org/doc/install for package downloads and installation. Once Go is configured you can proceed to installing and running the benchmark.

Install

Running benchmarks requires installing the data and query generators along with loaders and benchmarkers for the platforms you wish to test. For example, to install and run load tests for InfluxDB, execute:

go install github.com/influxdata/influxdb-comparisons/cmd/bulk_data_gen@latest github.com/influxdata/influxdb-comparisons/cmd/bulk_load_influx@latest

This will download and install the latest code from GitHub (including dependencies). Check the cmd directory for additional database implementations to download and install. For query benchmarking, install the query generator and benchmark executor for your platform. E.g. for InfluxDB:

go install github.com/influxdata/influxdb-comparisons/cmd/bulk_query_gen@latest github.com/influxdata/influxdb-comparisons/cmd/query_benchmarker_influxdb@latest

Help

For any module, you can run the executable with the -h flag and it will print a list of command line parameters. E.g.

-bash-4.1$ $GOPATH/bin/bulk_data_gen -h
Usage of /home/clarsen/go/bin/bulk_data_gen:
  -debug int
    	Debug printing (choices: 0, 1, 2) (default 0).
  -format string
    	Format to emit. (choices: influx-bulk, es-bulk, cassandra, mongo, opentsdb) (default "influx-bulk")
  -interleaved-generation-group-id uint
    	Group (0-indexed) to perform round-robin serialization within. Use this to scale up data generation to multiple processes.
  -interleaved-generation-groups uint
    	The number of round-robin serialization groups. Use this to scale up data generation to multiple processes. (default 1)
  -scale-var int
    	Scaling variable specific to the use case. (default 1)
  -seed int
    	PRNG seed (default, or 0, uses the current timestamp).
  -timestamp-end string
    	Ending timestamp (RFC3339). (default "2016-01-01T06:00:00Z")
  -timestamp-start string
    	Beginning timestamp (RFC3339). (default "2016-01-01T00:00:00Z")
  -use-case string
    	Use case to model. (choices: devops, iot) (default "devops")

Loading Data

To generate and write data to a database, execute the bulk data generator with any optional command line parameters and pipe the output to a bulk loader. For example, to load data into an InfluxDB instance, run:

$GOPATH/bin/bulk_data_gen | $GOPATH/bin/bulk_load_influx -urls http://localhost:8086

This will automatically create a database and load about 19,440 data points. For additional data, set the start and end times. Also note that the default generation format is influx-bulk; if you want to test another database, use the -format parameter together with the proper loader. E.g. for OpenTSDB:

$GOPATH/bin/bulk_data_gen -format opentsdb | $GOPATH/bin/bulk_load_opentsdb -urls http://localhost:4242

A successful run will print the number of items generated and stored, along with the total time and the mean rate per second.

-bash-4.1$ $GOPATH/bin/bulk_data_gen | $GOPATH/bin/bulk_load_influx  -urls http://druidzoo-1.yms.gq1.yahoo.com:8086
using random seed 329234002
daemon URLs: [http://druidzoo-1.yms.gq1.yahoo.com:8086]
[worker 0] backoffs took a total of 0.000000sec of runtime
loaded 19440 items in 0.751433sec with 1 workers (mean rate 25870.568346/sec, 8.60MB/sec from stdin)

Querying Data

Querying the database is similar to loading data. Execute the bulk query generator and pipe its output to the benchmark tool for the database under test. Each run requires a -query-type argument to determine what type of query to execute. These are meant to mimic real queries, such as searching for data on a single host out of many, on multiple hosts out of many, or grouping by various tags. To find out what query types are available, execute $GOPATH/bin/bulk_query_gen -h and look for the use case matrix at the bottom of the output. An example run looks like:

$GOPATH/bin/bulk_query_gen -query-type "1-host-1-hr" | $GOPATH/bin/query_benchmarker_influxdb -urls http://druidzoo-1.yms.gq1.yahoo.com:8086

A successful run will execute multiple queries and periodically print status information to standard out.

-bash-4.1$ $GOPATH/bin/bulk_query_gen -query-type "1-host-1-hr" | $GOPATH/bin/query_benchmarker_influxdb -urls http://druidzoo-1.yms.gq1.yahoo.com:8086
using random seed 684941023
after 100 queries with 1 workers:
Influx max cpu, rand    1 hosts, rand 1h0m0s by 1m : min:     1.50ms ( 668.55/sec), mean:     1.98ms ( 506.32/sec), max:    3.10ms (322.34/sec), count:      100, sum:   0.2sec
all queries                                        : min:     1.50ms ( 668.55/sec), mean:     1.98ms ( 506.32/sec), max:    3.10ms (322.34/sec), count:      100, sum:   0.2sec

...

run complete after 1000 queries with 1 workers:
Influx max cpu, rand    1 hosts, rand 1h0m0s by 1m : min:     1.45ms ( 689.62/sec), mean:     2.07ms ( 482.67/sec), max:   12.21ms ( 81.92/sec), count:     1000, sum:   2.1sec
all queries                                        : min:     1.45ms ( 689.62/sec), mean:     2.07ms ( 482.67/sec), max:   12.21ms ( 81.92/sec), count:     1000, sum:   2.1sec
wall clock time: 2.084896sec

influxdb-comparisons's People

Contributors

alespour, cmd-influx, codyshepherd, dandv, danxmoran, dependabot[bot], ivankudibal, jdstrand, madumitha-ravi, manolama, mkm-influx, pauldix, rw, serenibyss, tomklapka, vlastahajek, williamhbaker, zealzhangz


influxdb-comparisons's Issues

invalid character '<' looking for beginning of value

I am trying to benchmark InfluxDB. After running the bulk_data_gen executable successfully, I run the bulk_load_influx executable against InfluxDB v2.0.0 OSS (Beta) and keep getting the error mentioned in the issue title, again and again.

Configurations of my machine are:

OS: Ubuntu 18.04 LTS
Ram: 24GB
Go-Language version: 1.15.2

Command I am trying to execute,

./bulk_load_influx --report-auth-token "<Auth_Token_is_supplied_here>" --report-database "InfluxdbBenchmark" --report-bucket-id "06516b985d6a6000" --report-org-id "<Organization_name_here>" --urls "http://localhost:9999"

I have been stuck on this issue for a long time; any kind of help would be appreciated.

Thank you in advance

Commit elasticsearch default and aggregation index templates

Within the project readme, two Elasticsearch index templates are mentioned for setup and configuration; however, they are not found within the repository.

Third, we developed two Elasticsearch index templates, each of which represents a way we think people use Elasticsearch to store time-series data:

The first template, called ‘default’, stores time-series data in a way that enables fast querying, while also storing the original document data. This is closest to Elasticsearch’s default behavior and is a reasonable starting point for most users, although its on-disk size may become large.

The second template, called ‘aggregation’, indexes time-series data in a way that saves disk space by discarding the original point data. All data is stored in a compressed form inside the Lucene indexes, therefore all queries are completely accurate. But, due to an implementation detail of Elastic, the underlying point data is no longer independently addressable. For users who only conduct aggregation queries, this saves quite a bit of disk space (and improves bulk load speed) without any downsides.

In order to reproduce these benchmarks, it would be helpful to have the actual index templates checked in.

The following databases already exist in the data store: _monitoring, _tasks

When I use bulk_data_gen | bulk_load_influx -urls http://localhost:8086, I get the following output:

using random seed 40576053
2022/07/22 03:46:02 Using sampling interval 10s
2022/07/22 03:46:02 Using cardinality of 9
2022/07/22 03:46:02 daemon URLs: [http://localhost:8086]
2022/07/22 03:46:02 Ingestion rate control is off
2022/07/22 03:46:02 Using InfluxDB API version 2
2022/07/22 03:46:02 SysInfo:
2022/07/22 03:46:02 Current GOMAXPROCS: 8
2022/07/22 03:46:02 Num CPUs: 8
2022/07/22 03:46:02 The following databases already exist in the data store: _monitoring, _tasks. If you know what you are doing, run the command:
curl 'http://localhost:8086/query?q=drop%20database%20benchmark_db'

and it fails.

What can I do next? _monitoring and _tasks are the default buckets and I cannot remove them. Thanks!

Bare aggregate use cases return more data in Flux than InfluxQL

Currently the bare aggregate queries for Flux look something like this:

from(bucket:"benchmark_db") 
  |> range(start:2018-01-01T01:00:00Z, stop:2018-01-01T03:00:00Z) 
  |> filter(fn:(r) => r._measurement == "air_condition_room" and r._field == "temperature") 
  |> last() 
  |> yield()

and for InfluxQL look like this:

SELECT last(temperature) FROM air_condition_room WHERE time > '2018-01-01T01:00:00Z' AND time < '2018-01-01T03:00:00Z'

This kind of Flux query triggers the PushDownBareAggregateRule, so testing it is a good way to make sure that the pushdown keeps the query fast.

The output from the Flux query includes the aggregated point (first, last, count, etc.) for each separate series covered by the query, returning a (possibly large) list of tables, one per series. The current InfluxQL query returns a single point, since its default behavior is to group all series together as one. This could be skewing the timing results, especially for the first and last aggregates, since as written Flux must return a much larger set of data.

To make these two queries output an equivalent amount of data and still activate the PushDownBareAggregateRule, the InfluxQL query could be updated to be something like this:

SELECT last(temperature) FROM air_condition_room WHERE time >= '2018-01-01T01:00:00Z' AND time < '2018-01-01T03:00:00Z' GROUP BY home_id,room_id,sensor_id

CentOS 7.8 2003 install query_benchmarker_influxdb with error

I installed bulk_data_gen, bulk_load_influx, and bulk_query_gen successfully, but I get something unusual with query_benchmarker_influxdb.

# go version
go version go1.20.12 linux/amd64

# pwd
/root/go/bin
# ls
bulk_data_gen  bulk_load_influx  bulk_query_gen  influx-stress

# go install github.com/influxdata/influxdb-comparisons/cmd/query_benchmarker_influxdb@latest
# github.com/influxdata/influxdb-comparisons/bulk_query
../pkg/mod/github.com/influxdata/[email protected]/bulk_query/query.go:431:2: waitFinished declared and not used

Benchmarker tool is not able to read queries from a file containing 1M queries

Use case: "Identifying a database which supports 1.08 million events per hour"

Steps I am following :

  1. I am populating data in InfluxDB using the Java API. (Not using the bulk_data_gen and bulk_load tools.)
  2. In my data model, I have one measurement with 1 million rows. All tags and keys are String fields.
  3. Now I am using the bulk_query_gen tool to generate queries and write them to a file.
    I have modified the tool to create queries like "select from measurement where time < X and Time < y"

bulk_query_gen tool

./bulk_query_gen --debug=0 --seed=321 --format=influx-http --timestamp-end="2016-01-18T14:44:00Z" -queries=1000000 --query-type="1-host-1-hr" --db="benchmark" --debug 1 | gzip > EventQueries1M.gz

query_benchmarker tool

bash-3.2$ cat EventQueries1M.gz | gunzip | ./query_benchmarker_influxdb --url=http://localhost:8086 --print-interval=0 --limit=1000 --workers=2 --print-responses=true --db="benchmark" --debug 1

2017/01/27 21:45:51 Error during request: dialing to the given TCP address timed out

(If the number of queries is > 10k, I run into this issue.)

Please note, I could reproduce this issue without making any changes to the benchmarking tool.

mongo_serialization

Cannot find Tag in mongo_serialization module.

go build ./bulk_data_gen/
# github.com/influxdata/influxdb-comparisons/mongo_serialization
../../golang/src/github.com/influxdata/influxdb-comparisons/mongo_serialization/Item.go:75: undefined: Tag
../../golang/src/github.com/influxdata/influxdb-comparisons/mongo_serialization/Item.go:82: undefined: Tag

go test -v ./... failures

Trying to run the tests, it fails:

$ go test -v ./...
...
cmd/bulk_load_timescale/main.go:20:2: module github.com/jackc/pgx@latest found (v3.6.2+incompatible), but does not contain package github.com/jackc/pgx/pgxpool

Googling, this should be github.com/jackc/pgx/v4/pgxpool now.

So apply the following diff:

$ git diff
diff --git a/cmd/bulk_load_timescale/main.go b/cmd/bulk_load_timescale/main.go
index e366584..6303b71 100644
--- a/cmd/bulk_load_timescale/main.go
+++ b/cmd/bulk_load_timescale/main.go
@@ -16,8 +16,8 @@ import (
 
        "github.com/influxdata/influxdb-comparisons/bulk_load"
        "github.com/influxdata/influxdb-comparisons/util/report"
-       "github.com/jackc/pgx"
-       "github.com/jackc/pgx/pgxpool"
+       "github.com/jackc/pgx/v4"
+       "github.com/jackc/pgx/v4/pgxpool"
 
        "bytes"
        "context"
diff --git a/cmd/query_benchmarker_timescale/main.go b/cmd/query_benchmarker_timescale/main.go
index 3c79f4c..74f0ee6 100644
--- a/cmd/query_benchmarker_timescale/main.go
+++ b/cmd/query_benchmarker_timescale/main.go
@@ -22,8 +22,8 @@ import (
        "strings"
 
        "github.com/influxdata/influxdb-comparisons/util/report"
-       "github.com/jackc/pgx"
-       "github.com/jackc/pgx/pgxpool"
+       "github.com/jackc/pgx/v4"
+       "github.com/jackc/pgx/v4/pgxpool"
 )
 
 type TimescaleQueryBenchmarker struct {

Then run go mod tidy. Now we can get some test results, but still see failures (apparently these need a V1 and a V2 InfluxDB running?):

# github.com/gocql/gocql
../../go/pkg/mod/github.com/gocql/[email protected]/dial.go:79:18: tconn.HandshakeContext undefined (type *tls.Conn has no field or method HandshakeContext)
?   	github.com/influxdata/influxdb-comparisons/bulk_data_gen/common	[no test files]
?   	github.com/influxdata/influxdb-comparisons/bulk_data_gen/dashboard	[no test files]
?   	github.com/influxdata/influxdb-comparisons/bulk_data_gen/devops	[no test files]
?   	github.com/influxdata/influxdb-comparisons/bulk_data_gen/iot	[no test files]
?   	github.com/influxdata/influxdb-comparisons/bulk_data_gen/metaqueries	[no test files]
?   	github.com/influxdata/influxdb-comparisons/bulk_data_gen/multi_measurement	[no test files]
# github.com/influxdata/influxdb-comparisons/bulk_query/http
bulk_query/http/http_client.go:124:14: Fprintf format %s has arg pretty of wrong type bytes.Buffer
?   	github.com/influxdata/influxdb-comparisons/bulk_load	[no test files]
# github.com/influxdata/influxdb-comparisons/bulk_query
vet: bulk_query/query.go:431:2: waitFinished declared but not used
=== RUN   TestResultsInfluxDbV1
    result_test.go:29: 
        	Error Trace:	/home/ubuntu/code/influxdb-comparisons.git/util/report/result_test.go:29
        	Error:      	Received unexpected error:
        	            	dial tcp4 127.0.0.1:8086: connect: connection refused
        	Test:       	TestResultsInfluxDbV1
--- FAIL: TestResultsInfluxDbV1 (0.00s)
=== RUN   TestResultsInfluxDbV2
    result_test.go:52: 
        	Error Trace:	/home/ubuntu/code/influxdb-comparisons.git/util/report/result_test.go:52
        	Error:      	Received unexpected error:
        	            	dial tcp4 127.0.0.1:9999: connect: connection refused
        	Test:       	TestResultsInfluxDbV2
--- FAIL: TestResultsInfluxDbV2 (0.00s)
FAIL
FAIL	github.com/influxdata/influxdb-comparisons/util/report	0.003s

Add comparison to RiakTS

Recently I've seen an uptick in the number of potential users asking for a formal comparison to RiakTS.

Would be great if we could get a comparison.

Unable to execute queries using Cassandra

I'm trying to do a benchmark between Cassandra and InfluxDB; everything is fine with InfluxDB. I was able to generate and load data into Cassandra, but I can't execute the queries.
This is what I'm getting:

[mccstan@bmgs-soat bench-soat]$ $GOBIN/bulk_query_gen -query-type "1-host-1-hr" | $GOBIN/query_benchmarker_cassandra --url=127.0.0.1:9042
using random seed 24982743
2017/06/07 17:17:04 invalid aggregation plan

Can someone help ?
Thank you !

panic when creating RandWindow()

in https://github.com/influxdata/influxdb-comparisons/blob/master/bulk_query_gen/time_interval.go#L30:

upper := ti.End.Add(-window).UnixNano()
if upper <= lower {
    panic("logic error: bad time bounds")
}

This causes a problem when the start and end times are 2016-01-01 00:00:00 +0000 UTC and 2016-01-01 06:00:00 +0000 UTC but the window is 12 hours.

./bulk_query_gen --debug=0 --seed=321 --format=influx-http --query-type=single-host

using random seed 321
panic: logic error: bad time bounds

goroutine 1 [running]:
panic(0x5590a0, 0xc820078920)
    /usr/local/go/src/runtime/panic.go:481 +0x3e6
main.(*TimeInterval).RandWindow(0xc8200734d0, 0x274a48a78000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
    /home/pao/caicloud/influxdb-comparisons/bulk_query_gen/time_interval.go:33 +0x1df
main.(*InfluxDevops).maxCPUUsageHourByMinuteNHosts(0xc8200734c0, 0x7f6417b41560, 0xc8200b6400, 0x1, 0x1)
    /home/pao/caicloud/influxdb-comparisons/bulk_query_gen/influx_devops_common.go:66 +0x6d
main.(*InfluxDevops).MaxCPUUsageHourByMinuteOneHost(0xc8200734c0, 0x7f6417b41560, 0xc8200b6400, 0x1)
    /home/pao/caicloud/influxdb-comparisons/bulk_query_gen/influx_devops_common.go:40 +0x89
main.(*InfluxDevopsSingleHost).Dispatch(0xc8200734c0, 0x0, 0x1, 0x0, 0x0)
    /home/pao/caicloud/influxdb-comparisons/bulk_query_gen/influx_devops_singlehost.go:19 +0x84
main.main()
    /home/pao/caicloud/influxdb-comparisons/bulk_query_gen/main.go:156 +0x3e4

Test environment question

Hi.
I have seen the “Benchmarking InfluxDB vs. Elasticsearch for Time-Series Data & Metrics Management” technical paper, and I have a few questions:

  1. What is the test environment? (number of machines, machine configuration)
  2. Which template does the write and query performance depend on? (default template or aggregation template)

Any help would be appreciated!

Some question after reading the comparison with Cassandra tech paper

Hi, after reading the Benchmarking InfluxDB vs. Cassandra for Time-Series Data, Metrics & Management paper, I have some questions about the benchmark.

The paper says that a lot of application code is needed to make Cassandra work as a TSDB. However, KairosDB and Heroic both support Cassandra; have you considered comparing against them?

The test is run on two dedicated machines, but Cassandra has many features for multi-machine deployment, like built-in support for replication and consistency. It would be more convincing if InfluxDB used its HA solutions to compete with a Cassandra cluster.

Also, some system and database configuration can affect performance a lot. Though it is too verbose to discuss in the paper, I'd like to know how you chose those parameters; by running some smaller benchmarks first?

Thanks

dead lock of TimescaleDB loading test with 5 workers

My postgresql version is 16 and timescaleDB version is 2.14.2.
When I execute the command "bulk_data_gen -format timescaledb-sql -scale-var 10 | bulk_load_timescale -format timescaledb-sql -workers 5",it throws four same errors:"Error writing:Error: Detecting dead locks (SQLSTATE 40P01)", and it finishs with "load finished prematurely: Worker error"
Can you provide some suggestions so that I can complete the bulk_load_timescale program?I will grateful if you can provide some suggestions.


unknown field 'MaxConnsPerHost' in struct literal of type http.Transport

When I run ‘go get github.com/influxdata/influxdb-comparisons/cmd/bulk_query_gen github.com/influxdata/influxdb-comparisons/cmd/query_benchmarker_influxdb’, the console shows:

# github.com/influxdata/influxdb-comparisons/bulk_query/http
golang/gopath/src/github.com/influxdata/influxdb-comparis

Document how to use/execute tools within project README

Not really being a Golang developer, but doing a bit of work with InfluxDB, I wanted to leverage these tools for benchmarking and for generating test data. I will eventually figure it out, but it would really be helpful to spell it out for those who need help.

Thanks!

Error when using TimescaleDB copyFrom format

When executing the following lines, I encounter multiple errors:

$GOPATH/bin/bulk_data_gen -format timescaledb-copyFrom | $GOPATH/bin/bulk_load_timescale 
$GOPATH/bin/bulk_data_gen -format timescaledb-copyFrom | $GOPATH/bin/bulk_load_timescale -format timescaledb-copyFrom

Two types of such errors are as follows:

cannot unmarshall 0 item: proto: FlatPoint: wiretype end group for non-group

cannot unmarshall 0 item: proto: wrong wireType = 1 for field MeasurementName

Compilation error

I'm trying to build bulk_query_gen using go 1.6.

go build
# influxdb-comparisons/cmd/bulk_query_gen
./main.go:38: undefined: NewMongoDevops8Hosts1Hr

build bulk_load_timescale error!

When I build influxdb-comparisons/cmd/bulk_load_timescale, the following error occurs:

go: finding module for package github.com/jackc/pgx/pgxpool
main.go:20:2: module github.com/jackc/pgx@latest found (v3.6.2+incompatible), but does not contain package github.com/jackc/pgx/pgxpool

General complex benchmark support

Extend the complex benchmark support implemented for InfluxDB to the remaining databases. Introduce a common API, and let the general algorithms be shared among the DB-specific load and benchmark tools.

InfluxQL queries fail for 2.x

InfluxQL queries generated by influxdb-comparisons currently fail during the benchmarking phase of performance tests against 2.x.

The queries are rejected, seemingly with an error stating that GETs should be POSTs. However, after some further debugging, this seems likely to be due to queries not being well-formed / updated to 2.x standards.

See comments in the following PR for more context and initial debugging attempts: #181

Elasticsearch: groupby query error in case of default template is used

2019/06/06 16:37:38 Error during request: Invalid write response (status 400): {"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"Fielddata is disabled on text fields by default. Set fielddata=true on [hostname] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead."}],"type":"search_phase_execution_exception","reason":"all shards failed","phase":"query","grouped":true,"failed_shards":[{"shard":0,"index":"cpu","node":"a-7hFd9DRaaejFX2MNRMjg","reason":{"type":"illegal_argument_exception","reason":"Fielddata is disabled on text fields by default. Set fielddata=true on [hostname] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead."}}],"caused_by":{"type":"illegal_argument_exception","reason":"Fielddata is disabled on text fields by default. Set fielddata=true on [hostname] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead.","caused_by":{"type":"illegal_argument_exception","reason":"Fielddata is disabled on text fields by default. Set fielddata=true on [hostname] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead."}}},"status":400}

go get bulk_data_gen failed

go get bulk_data_gen failed as shown below:

root@mgt01# go get github.com/influxdata/influxdb-comparisons/cmd/bulk_data_gen
go: found github.com/influxdata/influxdb-comparisons/cmd/bulk_data_gen in github.com/influxdata/influxdb-comparisons v0.0.0-20200702145229-f3a7a1e11bb4
go: finding module for package github.com/pelletier/go-toml
go: finding module for package github.com/google/flatbuffers/go
go: finding module for package github.com/golang/protobuf/proto
go: downloading github.com/google/flatbuffers v1.12.0
go: found github.com/pelletier/go-toml in github.com/pelletier/go-toml v1.8.1
go: found github.com/golang/protobuf/proto in github.com/golang/protobuf v1.4.2
go: finding module for package github.com/google/flatbuffers/go
go: finding module for package github.com/google/flatbuffers/go
go: finding module for package github.com/google/flatbuffers/go
../../pkg/mod/github.com/influxdata/[email protected]/bulk_data_gen/common/pools.go:6:2: github.com/google/[email protected]: verifying module: checksum mismatch
	downloaded: h1:N8EguYFm2wwdpoNcpchQY0tPs85vOJkboFb2dPxmixo=
	sum.golang.org: h1:/PtAHvnBY4Kqnx/xCQ3OIV9uYcSFGScBsWI3Oogeh6w=

SECURITY ERROR
This download does NOT match the one reported by the checksum server.
The bits may have been replaced on the origin server, or an attacker may
have intercepted the download attempt.

For more information, see 'go help module-auth'.

Clustered setup

Maybe I have overlooked this topic in the README, but how does the benchmark look when you compare a MongoDB cluster with 3 or 5 instances against an InfluxDB cluster with 3 or 5 instances?

Normally in a cloud you have more than one instance.
What happens when one of the instances goes away at run time (read, write, query)?

ES7 support

I'm using ES7 for testing, but I found that indexTemplateChoices does not include version 7. Is there any support for it?

Should be benchmarking against 2.3.2

The 5 release is alpha and, while it contains new features, may also be less stable or have performance issues that have not been addressed. You should benchmark against 2.3.2, which is GA and in production with many people.
