
dsq's Introduction

Not under active development

While development may continue in the future with a different architecture, for the moment you should probably instead use DuckDB, ClickHouse-local, or GlareDB (based on DataFusion).

These are built on stronger analytics foundations than SQLite-based projects like dsq. For example, they offer column-oriented storage and vectorized execution (and in some cases JIT-compiled expression evaluation), none of which is possible with SQLite.

More here.

Commandline tool for running SQL queries against JSON, CSV, Excel, Parquet, and more

Since GitHub doesn't provide a great way for you to learn about new releases and features, don't just star the repo: join the mailing list.

About

This is a CLI companion to DataStation (a GUI) for running SQL queries against data files. So if you want the GUI version of this, check out DataStation.

Install

Binaries for amd64 (x86_64) are provided for each release.

macOS Homebrew

dsq is available on macOS Homebrew:

$ brew install dsq

Binaries on macOS, Linux, WSL

On macOS, Linux, and WSL you can run the following:

$ VERSION="v0.23.0"
$ FILE="dsq-$(uname -s | awk '{ print tolower($0) }')-x64-$VERSION.zip"
$ curl -LO "https://github.com/multiprocessio/dsq/releases/download/$VERSION/$FILE"
$ unzip $FILE
$ sudo mv ./dsq /usr/local/bin/dsq

Or install manually from the releases page, unzip and add dsq to your $PATH.

Binaries on Windows (not WSL)

Download the latest Windows release, unzip it, and add dsq to your $PATH.

Build and install from source

If you are on another platform or architecture or want to grab the latest release, you can do so with Go 1.18+:

$ go install github.com/multiprocessio/dsq@latest

dsq will likely work on other platforms that Go is ported to, such as AArch64 and OpenBSD, but tests and builds only run against x86_64 Windows/Linux/macOS.

Usage

You can either pipe data to dsq or you can pass a file name to it. NOTE: piping data doesn't work on Windows.

If you are passing a file, it must have the usual extension for its content type.

For example:

$ dsq testdata.json "SELECT * FROM {} WHERE x > 10"

Or:

$ dsq testdata.ndjson "SELECT name, AVG(time) FROM {} GROUP BY name ORDER BY AVG(time) DESC"

Pretty print

By default dsq prints ugly JSON. This is the most efficient mode.

$ dsq testdata/userdata.parquet 'select count(*) from {}'
[{"count(*)":1000}
]

If you want prettier JSON you can pipe dsq to jq.

$ dsq testdata/userdata.parquet 'select count(*) from {}' | jq
[
  {
    "count(*)": 1000
  }
]

Or you can enable pretty printing with -p or --pretty in dsq which will display your results in an ASCII table.

$ dsq --pretty testdata/userdata.parquet 'select count(*) from {}'
+----------+
| count(*) |
+----------+
|     1000 |
+----------+

Piping data to dsq

When piping data to dsq you need to set the -s flag and specify the file extension or MIME type.

For example:

$ cat testdata.csv | dsq -s csv "SELECT * FROM {} LIMIT 1"

Or:

$ cat testdata.parquet | dsq -s parquet "SELECT COUNT(1) FROM {}"

Multiple files and joins

You can pass multiple files to dsq. As long as they are supported data files in a valid format, you can run SQL against all of them as tables. Each table can be accessed by the string {N} where N is the 0-based index of the file in the list of files passed on the commandline.

For example this joins two datasets of differing origin types (CSV and JSON).

$ dsq testdata/join/users.csv testdata/join/ages.json \
  "select {0}.name, {1}.age from {0} join {1} on {0}.id = {1}.id"
[{"age":88,"name":"Ted"},
{"age":56,"name":"Marjory"},
{"age":33,"name":"Micah"}]

You can also give the file tables aliases, since dsq uses standard SQL:

$ dsq testdata/join/users.csv testdata/join/ages.json \
  "select u.name, a.age from {0} u join {1} a on u.id = a.id"
[{"age":88,"name":"Ted"},
{"age":56,"name":"Marjory"},
{"age":33,"name":"Micah"}]

SQL query from file

As your query becomes more complex, it might be useful to store it in a file rather than specify it on the command line. To do so replace the query argument with --file or -f and the path to the file.

$ dsq data1.csv data2.csv -f query.sql

Transforming data to JSON without querying

As a shorthand for dsq testdata.csv "SELECT * FROM {}", you can omit the query entirely to convert supported file types to JSON; the converted JSON is dumped to stdout.

For example:

$ dsq testdata.csv
[{...some csv data...},{...some csv data...},...]

Array of objects nested within an object

DataStation and dsq's SQL integration operates on an array of objects. If your array of objects happens to be at the top-level, you don't need to do anything. But if your array data is nested within an object you can add a "path" parameter to the table reference.

For example if you have this data:

$ cat api-results.json
{
  "data": {
    "data": [
      {"id": 1, "name": "Corah"},
      {"id": 3, "name": "Minh"}
    ]
  },
  "total": 2
}

You need to tell dsq that the path to the array data is "data.data":

$ dsq --pretty api-results.json 'SELECT * FROM {0, "data.data"} ORDER BY id DESC'
+----+-------+
| id | name  |
+----+-------+
|  3 | Minh  |
|  1 | Corah |
+----+-------+

You can also use the shorthand {"path"} or {'path'} if you only have one table:

$ dsq --pretty api-results.json 'SELECT * FROM {"data.data"} ORDER BY id DESC'
+----+-------+
| id | name  |
+----+-------+
|  3 | Minh  |
|  1 | Corah |
+----+-------+

You can use either single or double quotes for the path.
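Conceptually, the path argument just walks down through nested objects until it reaches the array of objects to treat as the table. A minimal Python sketch of that traversal (illustrative only, not dsq's code):

```python
import json

def resolve_path(doc, path):
    """Walk a dot-separated path (e.g. "data.data") through
    nested objects to reach the array of objects to query."""
    for part in path.split("."):
        doc = doc[part]
    return doc

doc = json.loads('{"data": {"data": [{"id": 1}, {"id": 3}]}, "total": 2}')
print(resolve_path(doc, "data.data"))
# → [{'id': 1}, {'id': 3}]
```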

Multiple Excel sheets

Excel files with multiple sheets are stored as an object with key being the sheet name and value being the sheet data as an array of objects.

If you have an Excel file with two sheets called Sheet1 and Sheet2 you can run dsq on the second sheet by specifying the sheet name as the path:

$ dsq data.xlsx 'SELECT COUNT(1) FROM {"Sheet2"}'

Limitation: nested arrays

You cannot specify a path through an array, only objects.

Nested object values

It's easiest to show an example. Let's say you have the following JSON file called user_addresses.json:

$ cat user_addresses.json
[
  {"name": "Agarrah", "location": {"city": "Toronto", "address": { "number": 1002 }}},
  {"name": "Minoara", "location": {"city": "Mexico City", "address": { "number": 19 }}},
  {"name": "Fontoon", "location": {"city": "New London", "address": { "number": 12 }}}
]

You can query the nested fields like so:

$ dsq user_addresses.json 'SELECT name, "location.city" FROM {}'

And if you need to disambiguate the table:

$ dsq user_addresses.json 'SELECT name, {}."location.city" FROM {}'

Caveat: PowerShell, CMD.exe

On PowerShell and CMD.exe you must escape inner double quotes with backslashes:

> dsq user_addresses.json 'select name, \"location.city\" from {}'
[{"location.city":"Toronto","name":"Agarrah"},
{"location.city":"Mexico City","name":"Minoara"},
{"location.city":"New London","name":"Fontoon"}]

Nested objects explained

Nested objects are collapsed and their new column name becomes the JSON path to the value, joined by ".". Actual dots in the path must be escaped with a backslash. Since "." is a special character in SQL, you must quote the whole new column name.
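The collapsing behavior can be illustrated with a short Python sketch (this is not dsq's actual implementation; the function name is invented for illustration):

```python
def flatten(obj, prefix=""):
    """Collapse nested objects into dot-joined column names.
    Literal dots inside a key are escaped with a backslash."""
    out = {}
    for key, value in obj.items():
        key = key.replace(".", "\\.")  # escape real dots in the key itself
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            out.update(flatten(value, path))  # recurse into nested objects
        else:
            out[path] = value  # scalar (or array): stop here
    return out

row = {"name": "Agarrah", "location": {"city": "Toronto", "address": {"number": 1002}}}
print(flatten(row))
# → {'name': 'Agarrah', 'location.city': 'Toronto', 'location.address.number': 1002}
```

This is why the earlier example selects "location.city" as a single quoted column name rather than location.city as a table-qualified reference.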

Limitation: whole object retrieval

You cannot query whole objects, you must ask for a specific path that results in a scalar value.

For example in the user_addresses.json example above you CANNOT do this:

$ dsq user_addresses.json 'SELECT name, {}."location" FROM {}'

Because location is not a scalar value. It is an object.

Nested arrays

Nested arrays are converted to a JSON string when stored in SQLite. Since SQLite supports querying JSON strings you can access that data as structured data even though it is a string.

So if you have data like this in fields.json:

[
  {"field1": [1]},
  {"field1": [2]}
]

You can request the entire field:

$ dsq fields.json "SELECT field1 FROM {}" | jq
[
  {
    "field1": "[1]"
  },
  {
    "field1": "[2]"
  }
]

JSON operators

You can get the first value in the array using SQL JSON operators.

$ dsq fields.json "SELECT field1->0 FROM {}" | jq
[
  {
    "field1->0": "1"
  },
  {
    "field1->0": "2"
  }
]

REGEXP

Since DataStation and dsq are built on SQLite, you can filter using x REGEXP 'y' where x is some column or value and y is a REGEXP string. SQLite doesn't pick a regexp implementation. DataStation and dsq use Go's regexp implementation which is more limited than PCRE2 because Go support for PCRE2 is not yet very mature.

$ dsq user_addresses.json "SELECT * FROM {} WHERE name REGEXP 'A.*'"
[{"location.address.number":1002,"location.city":"Toronto","name":"Agarrah"}]

Standard Library

dsq registers go-sqlite3-stdlib so you get access to numerous statistics, url, math, string, and regexp functions that aren't part of the SQLite base.

View that project's docs for all available extended functions.

Output column order

When emitting JSON (i.e. without the --pretty flag) keys within an object are unordered.

If order is important to you, you can filter with jq: dsq x.csv 'SELECT a, b FROM {}' | jq --sort-keys.

With the --pretty flag, column order is purely alphabetical. It is not possible at the moment for the order to depend on the SQL query order.

Dumping inferred schema

For any supported file you can dump the inferred schema rather than dumping the data or running a SQL query. Set the --schema flag to do this.

The inferred schema is very simple; only JSON types are supported. If the underlying format (like Parquet) supports finer-grained data types (like int64), this will not show up in the inferred schema; it will show up simply as number.

For example:

$ dsq testdata/avro/test_data.avro --schema --pretty
Array of
  Object of
    birthdate of
      string
    cc of
      Varied of
        Object of
          long of
            number or
        Unknown
    comments of
      string
    country of
      string
    email of
      string
    first_name of
      string
    gender of
      string
    id of
      number
    ip_address of
      string
    last_name of
      string
    registration_dttm of
      string
    salary of
      Varied of
        Object of
          double of
            number or
        Unknown
    title of
      string

You can print this as a structured JSON string by omitting the --pretty flag when setting the --schema flag.
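The idea of JSON-type-only schema inference can be sketched in a few lines of Python. This is a naive illustration, not dsq's actual algorithm; in particular, the "Varied" merging here is much cruder than what the real output above shows:

```python
import json

def infer(value):
    """Infer a simple JSON-typed schema: only JSON types
    (string, number, boolean, null, Object, Array) are distinguished."""
    if isinstance(value, dict):
        return {"kind": "Object", "children": {k: infer(v) for k, v in value.items()}}
    if isinstance(value, list):
        # Merge element schemas naively: distinct shapes collapse to "Varied".
        shapes = [infer(v) for v in value]
        uniq = {json.dumps(s, sort_keys=True): s for s in shapes}
        elem = next(iter(uniq.values())) if len(uniq) == 1 else {"kind": "Varied"}
        return {"kind": "Array", "of": elem}
    if isinstance(value, bool):  # bool before number: bool is an int in Python
        return {"kind": "boolean"}
    if isinstance(value, (int, float)):
        return {"kind": "number"}
    if value is None:
        return {"kind": "null"}
    return {"kind": "string"}

print(infer([{"id": 1, "name": "Corah"}, {"id": 3, "name": "Minh"}]))
```

Note that an int64 Parquet column and a double Parquet column would both come out as "number" here, matching the behavior described above.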

Caching

Sometimes you want to do some exploration on a dataset that isn't changing frequently. By turning on the --cache or -C flag, dsq will store the imported data on disk and not delete it when the run is over.

With caching on, dsq calculates a SHA1 sum of all the files you specified. If the sum ever changes, it will reimport all the files. Otherwise, when you run additional queries with the cache flag on, it will reuse the existing database and not reimport the files.

Since dsq uses an in-memory database when caching is off, the initial query with caching on may take slightly longer than with caching off. Subsequent queries will be substantially faster, though (for large datasets).

For example, in the first run with caching on this query might take 30s:

$ dsq some-large-file.json --cache 'SELECT COUNT(1) FROM {}'

But when you run another query it might only take 1s.

$ dsq some-large-file.json --cache 'SELECT SUM(age) FROM {}'

Not because we cache any result but because we cache importing the file into SQLite.

So even if you change the query, as long as the file doesn't change, the cache is effective.

To make this permanent you can export DSQ_CACHE=true in your environment.
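The cache-key idea described above can be sketched in Python. This is only an illustration of the scheme (hash all input file contents, reuse the database while the hash is unchanged), not dsq's actual code:

```python
import hashlib

def cache_sum(file_contents):
    """SHA1 over the contents of all input files, in order.
    If any byte of any file changes, the sum changes and the
    files must be reimported; otherwise the on-disk SQLite
    database from the previous run can be reused."""
    h = hashlib.sha1()
    for blob in file_contents:
        h.update(blob)
    return h.hexdigest()

# Same inputs -> same sum -> reuse cached database, even for a new query.
print(cache_sum([b"users.csv contents", b"ages.json contents"]))
```

This is why changing the query is cheap but touching the file triggers a full reimport.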

Interactive REPL

Use the -i or --interactive flag to enter an interactive REPL where you can run multiple SQL queries.

$ dsq some-large-file.json -i
dsq> SELECT COUNT(1) FROM {};
+----------+
| COUNT(1) |
+----------+
|     1000 |
+----------+
(1 row)
dsq> SELECT * FROM {} WHERE NAME = 'Kevin';
(0 rows)

Converting numbers in CSV and TSV files

CSV and TSV files provide no way to specify the types of the individual values they contain. All values are treated as strings by default.

This can lead to unexpected results in queries. Consider the following example:

$ cat scores.csv
name,score
Fritz,90
Rainer,95.2
Fountainer,100

$ dsq scores.csv "SELECT * FROM {} ORDER BY score"
[{"name":"Fountainer","score":"100"},
{"name":"Fritz","score":"90"},
{"name":"Rainer","score":"95.2"}]

Note how the score column contains numerical values only. Still, sorting by that column yields unexpected results because the values are treated as strings, and sorted lexically. (You can tell that the individual scores were imported as strings because they're quoted in the JSON result.)

Use the -n or --convert-numbers flag to auto-detect and convert numerical values (integers and floats) in imported files:

$ dsq scores.csv --convert-numbers "SELECT * FROM {} ORDER BY score"
[{"name":"Fritz","score":90},
{"name":"Rainer","score":95.2},
{"name":"Fountainer","score":100}]

Note how the scores are imported as numbers now and how the records in the result set are sorted by their numerical value. Also note that the individual scores are no longer quoted in the JSON result.

To make this permanent you can export DSQ_CONVERT_NUMBERS=true in your environment. Turning this on disables some optimizations.
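The detection that --convert-numbers performs can be sketched like this in Python (a hedged illustration of the idea, not dsq's implementation):

```python
def convert(value):
    """Best-effort numeric conversion: values that parse as
    integers or floats are converted, everything else stays a string."""
    try:
        return int(value)
    except ValueError:
        pass
    try:
        return float(value)
    except ValueError:
        return value

rows = [{"name": "Fountainer", "score": "100"},
        {"name": "Fritz", "score": "90"},
        {"name": "Rainer", "score": "95.2"}]
converted = [{k: convert(v) for k, v in r.items()} for r in rows]
# Numeric sort now works: 90 < 95.2 < 100, not "100" < "90" < "95.2".
print(sorted(converted, key=lambda r: r["score"]))
```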

Supported Data Types

Name File Extension(s) Mime Type Notes
CSV csv text/csv
TSV tsv, tab text/tab-separated-values
JSON json application/json Must be an array of objects or a path to an array of objects.
Newline-delimited JSON ndjson, jsonl application/jsonlines
Concatenated JSON cjson application/jsonconcat
ORC orc orc
Parquet parquet parquet
Avro avro application/avro
YAML yaml, yml application/yaml
Excel xlsx, xls application/vnd.ms-excel If you have multiple sheets, you must specify a sheet path.
ODS ods application/vnd.oasis.opendocument.spreadsheet If you have multiple sheets, you must specify a sheet path.
Apache Error Logs NA text/apache2error Currently only works if being piped in.
Apache Access Logs NA text/apache2access Currently only works if being piped in.
Nginx Access Logs NA text/nginxaccess Currently only works if being piped in.
LogFmt Logs logfmt text/logfmt

Engine

Under the hood, dsq uses DataStation as a library, and DataStation in turn uses SQLite to power these kinds of SQL queries on arbitrary (structured) data.

Comparisons

Name Link Caching Engine Supported File Types Binary Size
dsq Here Yes SQLite CSV, TSV, a few variations of JSON, Parquet, Excel, ODS (OpenOffice Calc), ORC, Avro, YAML, Logs 49M
q http://harelba.github.io/q/ Yes SQLite CSV, TSV 82M
textql https://github.com/dinedal/textql No SQLite CSV, TSV 7.3M
octosql https://github.com/cube2222/octosql No Custom engine JSON, CSV, Excel, Parquet 18M
csvq https://github.com/mithrandie/csvq No Custom engine CSV 15M
sqlite-utils https://github.com/simonw/sqlite-utils No SQLite CSV, TSV N/A, Not a single binary
trdsql https://github.com/noborus/trdsql No SQLite, MySQL or PostgreSQL Few variations of JSON, TSV, LTSV, TBLN, CSV 14M
spysql https://github.com/dcmoura/spyql No Custom engine CSV, JSON, TEXT N/A, Not a single binary
duckdb https://github.com/duckdb/duckdb ? Custom engine CSV, Parquet 35M

Not included:

  • clickhouse-local: fastest of anything listed here but so gigantic (over 2GB) that it can't reasonably be considered a good tool for any environment
  • sqlite3: requires multiple commands to ingest CSV, not great for one-liners
  • datafusion-cli: very fast (slower only than clickhouse-local) but requires multiple commands to ingest CSV, so not great for one-liners

Benchmark

This benchmark was run June 19, 2022. It is run on a dedicated bare metal instance on OVH with:

  • 64 GB DDR4 ECC 2,133 MHz
  • 2x450 GB SSD NVMe in Soft RAID
  • Intel Xeon E3-1230v6 - 4c/8t - 3.5 GHz/3.9 GHz

It runs a SELECT passenger_count, COUNT(*), AVG(total_amount) FROM taxi.csv GROUP BY passenger_count query against the well-known NYC Yellow Taxi Trip Dataset. Specifically, the CSV file from April 2021 is used. It's a 200MB CSV file with ~2 million rows, 18 columns, and mostly numerical values.

The script is here. It is an adaptation of the benchmark that the octosql devs run.

Program Version Mean [s] Min [s] Max [s] Relative
dsq 0.20.1 (caching on) 1.151 ± 0.010 1.131 1.159 1.00
duckdb 0.3.4 1.723 ± 0.023 1.708 1.757 1.50 ± 0.02
octosql 0.7.3 2.005 ± 0.008 1.991 2.015 1.74 ± 0.02
q 3.1.6 (caching on) 2.028 ± 0.010 2.021 2.055 1.76 ± 0.02
sqlite3 * 3.36.0 4.204 ± 0.018 4.177 4.229 3.64 ± 0.04
trdsql 0.10.0 12.972 ± 0.225 12.554 13.392 11.27 ± 0.22
dsq 0.20.1 (default) 15.030 ± 0.086 14.895 15.149 13.06 ± 0.13
textql fca00ec 19.148 ± 0.183 18.865 19.500 16.63 ± 0.21
spyql 0.6.0 16.985 ± 0.105 16.854 17.161 14.75 ± 0.16
q 3.1.6 (default) 24.061 ± 0.095 23.954 24.220 20.90 ± 0.20

* While dsq and q are built on top of sqlite3 there is not a builtin way in sqlite3 to cache ingested files without a bit of scripting

Not included:

  • clickhouse-local: faster than any of these but over 2GB so not a reasonable general-purpose CLI
  • datafusion-cli: slower only than clickhouse-local but requires multiple commands to ingest CSV, can't do one-liners
  • sqlite-utils: takes minutes to finish

Notes

OctoSQL, duckdb, and SpyQL implement their own SQL engines. dsq, q, trdsql, and textql copy data into SQLite and depend on the SQLite engine for query execution.

Tools that implement their own SQL engines can do better on 1) ingestion and 2) queries that act on a subset of data (such as limited columns or limited rows). These tools implement ad-hoc subsets of SQL that may be missing features or differ from your favorite syntax. On the other hand, tools that depend on SQLite have the benefit of a well-tested and well-documented SQL engine. DuckDB is an exception, since there is a dedicated company behind it.

dsq also comes with numerous useful functions (e.g. best-effort date parsing, URL parsing/extraction, statistics functions, etc.) on top of SQLite builtins.

Third-party integrations

Community

Join us at #dsq on the Multiprocess Discord.

How can I help?

Download dsq and use it! Report bugs on Discord.

If you're a developer with some Go experience looking to hack on open source, check out GOOD_FIRST_PROJECTS.md in the DataStation repo.

License

This software is licensed under an Apache 2.0 license.

dsq's People

Contributors

0michalsokolowski0, 0xflotus, chrismytton, eatonphil, fritzgrabo, ghibranalj, gl28, jberger, johnmanjiro13, laureano, lifesign, posrabi, sajuno, vegarsti, xwjdsh


dsq's Issues

Error querying empty json file

Describe the bug and expected behavior
Querying an empty JSON file results in an (unexpected) error. I was expecting an empty result set, not an error.

Reproduction steps
Input file empty.json:

[ ]

Running query:

$ dsq empty.json  "select * from {}"
Input is not an array of objects: empty.json.

The same happens with an empty (no lines) input file empty.jsonl.

Versions

  • OS: Linux 5.10.0-18-amd64 SMP Debian 5.10.140-1 (2022-09-02) x86_64 GNU/Linux
  • Shell: bash
  • dsq version: latest (should be 0.22.0)

Additional context, screenshots

Add support for dumping result as CSV (and eventually other formats)

I'm trying to put together a simple data migration pipeline that reads from Excel sheets into a PostgreSQL database, and I'd like to avoid writing new code if possible. My current thought is to use dsq to write the queries that transform Excel data into something ingestible, then hand the data over to pgloader for robust loading into Pg.

But it seems that dsq only outputs json results, whereas pgloader does not support json as input type - it does support csv and sqlite. My question is: does dsq support csv output? Or if I understood correctly that DataStation uses sqlite under the hood, can I get access to that database from dsq?

Or am I barking up the wrong tree altogether? Thanks for your advice!

suggestion to use v prefix for tags

Hi! Based on this approach https://github.com/golang/go/wiki/Modules#how-can-i-track-tool-dependencies-for-a-module to track tool dependencies for a module, as an example, I found a slight inconvenience when working with dsq, because dsq uses a different style to name tags. For go mod, it's recommended by this https://go.dev/ref/mod#versions

Each version starts with the letter v, followed by a semantic version

As a result, I have a pseudo-version like this instead of v0.21.0. It's challenging to understand which version is under the hood, actually. Is it possible to change the naming convention for tags?

Filtering on nested object data

Maybe I simply missed this in the documentation and its really straightforward, but is there a way to filter on nested data (maybe some kind of expansion) of objects in tables?

For example, I have the following representative data from parquet:

[{"PARGO_PREFIX___violations":{},"Role":"master","Foo":{"Id":"bar1"},"Created":"\ufffd-\ufffd\u0019\ufffd\u001a\u0000\u0000\ufffd\ufffd%\u0000","Version":1}
,
{"PARGO_PREFIX___violations":{},"Role":"master","Foo":{"Id":"bar2"},"Created":"\ufffd\ufffd\ufffd\u0011\ufffd\u001a\u0000\u0000\ufffd\ufffd%\u0000","Version":1}
,
{"PARGO_PREFIX___violations":{},"Role":"master","Foo":{"Id":"bar3"},"Created":"\ufffd6ګ\ufffd\u001a\u0000\u0000\ufffd\ufffd%\u0000","Version":1}
]

I can use dsq to select everything with Version == 1, but how would I select all items where Foo.Id == "bar2" for example?

I've tried what I thought would be the most logical:

dsq ~/Downloads/part-00030.snappy.parquet "SELECT * FROM {} WHERE Foo.Id == 'bar2'"

But I get a response back:

no such column: Foo.Id

"go install @latest" not working

I tried installing latest (c6daddf) with
go install github.com/multiprocessio/dsq@latest
but it failed with

go: downloading github.com/multiprocessio/dsq v0.0.0-20220515195645-c6daddfd43b6
go: github.com/multiprocessio/dsq@latest (in github.com/multiprocessio/dsq@v0.0.0-20220515195645-c6daddfd43b6):
	The go.mod file for the module providing named packages contains one or
	more replace directives. It must not contain directives that would cause
	it to be interpreted differently than if it were the main module.

however, this older version's install command worked
go install github.com/multiprocessio/dsq@4ca4a314a0380c55ab9fc12bb03bc2c998d5580c

I think the issue is related to this line 4140b05#diff-33ef32bf6c23acb95f5902d7097b7a1d5128ca061167ec0716715b0b9eeaa5f6R5

Need `--version` flag

I would like to suggest adding --version / -v flag. It is helpful in scripts for determining if an update is needed etc, and is a common practice.

(I am aware of the fact that the version is at the top of the help).

Can't run SQL on log files.

Describe the bug and expected behavior

I get an error while running SQL queries on log files like auth.log. I also tried with the file testdata/logfmt/log.logfmt (https://github.com/multiprocessio/dsq/blob/main/testdata/logfmt/log.logfmt). Is it possible to run SQL queries on log files?

Reproduction steps

# cat log.logfmt | dsq -s text/logfmt 'SELECT level FROM {}'
Input is not an array of objects: /tmp/dsq-stdin3752862510.
# cat /var/log/auth.log | dsq -s text/logfmt 'SELECT * FROM {}'
Input is not an array of objects: /tmp/dsq-stdin1533956686.

Versions

  • OS: Ubuntu 22.04 LTS (Jammy Jellyfish)
  • Shell: bash
  • dsq version: dsq 0.20.2

Bug: duplicate column name in CSV input file

The input file:

"a","a b"
"line1","1"

Executing a query over this file:

$ dsq input.csv  "select * from {}"
duplicate column name: a

Not sure if there is any definition of whether CSV headers support whitespace, but I just stumbled over an input file that does.

Adding Miller / mlr to the comparison

Could you also include miller/mlr to the list of tools?

Here is the command line to get the same result with pretty formatting:
mlr --icsv --opprint --barred stats1 -a count,mean -f total_amount -g passenger_count then sort -f passenger_count taxi.csv

table "t_0" already exists on Windows 10

OS: Windows 10
dsq Version: dsq-win32-x64-v0.23.0

Error:

C:\Users\Administrator\Downloads\dsq-win32-x64-v0.23.0>dsq.exe eleme_shops_ningbo_20221125.csv -i
table "t_0" already exists

Parquet is missing rows

I have a Parquet file that should have 30,000+ rows, but SELECT COUNT(*) FROM {} returns 7000. Another one with more than 40,000 rows returns exactly 8000. Converting the same data to JSON works fine.

panic: runtime error: index out of range [2] with length 2

When I run the following command
dsq --pretty ~/Downloads/test\ 202208.xlsx ~/Downloads/test\ 202207.xlsx ~/Downloads/test\ 202207.xlsx "select count(*) from (select * from {0} union select * from {1} union select * from {2})"

Got the following error
panic: runtime error: index out of range [2] with length 2

goroutine 1 [running]:
main.runQuery({0x7ffeefbffc37?, 0x0?}, 0xc001525b58, 0xc001525ae8, 0xc000040b90, {0xc0001cbe40, 0x2, 0x5cb98d6?}, 0x0?)
github.com/multiprocessio/dsq/main.go:256 +0x565
main._main()
github.com/multiprocessio/dsq/main.go:681 +0x1aa5
main.main()
github.com/multiprocessio/dsq/main.go:685 +0x19

Versions

  • OS: [Mac]
  • Shell: [bash]
  • dsq version: [e.g. 0.22]

Additional context, screenshots
If I just pass 2 files it works fine. Could we extend support to more files / bigger data sizes?

Cleanup files after run

This leaves a bunch of tmp files it should clean up afterward. Easiest might be to override the datastation project root, make a tmp directory, then remove the entire temp directory afterward.

Note to self: make tests run on Mac and windows too.

Regression between old and new version for loading parquet

Previously with an older release of dsq I could do a basic SQL select on a parquet file.

With the latest release (0.2.0), I get this error:

panic: Missing type equality condition for unknown merge.

Command:

./dsq ~/Downloads/part-00030.snappy.parquet "SELECT * FROM {}"

If you can't reproduce this easily I can see if I can get a sample parquet file together and attached to this.

Stacktrace:

goroutine 1 [running]:
github.com/multiprocessio/datastation/runner.shapeMerge({{0x4c4111c, 0x7}, 0x0, 0x0, 0x0, 0x0}, {{0x4c4111c, 0x7}, 0x0, 0x0, ...})
	/Users/runner/go/pkg/mod/github.com/multiprocessio/datastation/[email protected]/shape.go:248 +0x648
github.com/multiprocessio/datastation/runner.objectMerge({0x4bd1da0}, {0x4bd1da0})
	/Users/runner/go/pkg/mod/github.com/multiprocessio/datastation/[email protected]/shape.go:209 +0x1b9
github.com/multiprocessio/datastation/runner.shapeMerge({{0x4c3f476, 0x6}, 0x0, 0xc000010168, 0x0, 0x0}, {{0x4c3f476, 0x6}, 0x0, 0xc000010198, ...})
	/Users/runner/go/pkg/mod/github.com/multiprocessio/datastation/[email protected]/shape.go:232 +0x115
github.com/multiprocessio/datastation/runner.getArrayShape({0x7ffeefbff919, 0x30}, {0xc0005183c0, 0x3, 0x6}, 0x16)
	/Users/runner/go/pkg/mod/github.com/multiprocessio/datastation/[email protected]/shape.go:268 +0x3e5
github.com/multiprocessio/datastation/runner.GetShape({0x7ffeefbff919, 0x4ac43e0}, {0x4ad1700, 0xc00051c4f8}, 0x0)
	/Users/runner/go/pkg/mod/github.com/multiprocessio/datastation/[email protected]/shape.go:277 +0x245
github.com/multiprocessio/datastation/runner.ShapeFromFile({0xc0005bb570, 0x4adfe60}, {0x7ffeefbff919, 0x30}, 0x2710, 0x7ffeefbff919)
	/Users/runner/go/pkg/mod/github.com/multiprocessio/datastation/[email protected]/shape.go:328 +0x16c
main.getShape({0xc0005bb570, 0x30}, {0x7ffeefbff919, 0x30})
	/Users/runner/work/dsq/dsq/main.go:46 +0x4e
main.main()
	/Users/runner/work/dsq/dsq/main.go:202 +0xac7

The result is wrong when loading a JSON file with a path specified

Describe the bug and expected behavior
The result is wrong when I load a JSON file with a path specified.

Reproduction steps

~ cat test.json
{
  "data": {
    "data1": [
      {"id": 1, "name": "Corah"},
      {"id": 3, "name": "Minh"}
    ],
    "data2": [
      {"id": 2, "name": "Corah"},
      {"id": 4, "name": "Minh"}
    ]
  },
  "total": 2
}

~ ./dsq --pretty test.json 'SELECT * FROM {"data.data2"} ORDER BY id DESC'
+----+-------+
| id | name  |
+----+-------+
|  3 | Minh  |
|  1 | Corah |
+----+-------+
(2 rows)

The expected result should be,

+----+-------+
| id | name  |
+----+-------+
|  4 | Minh  |
|  2 | Corah |
+----+-------+
(2 rows)

Versions

  • OS: macOS 12.3.1
  • Shell: zsh
  • dsq version: 0.22.0

Opening more than 256 files or file descriptors at once?

The default top limit of file descriptors as of OSX Monterey 12.4 seems to be 256 file descriptors (a bit low when compared with the linux one, which is 1024, IIRC), but regardless:

(base) rvalls@m1 out % dsq SBJ02239_PRJ221187.parquet
open SBJ02239_PRJ221187.parquet: too many open files
(base) rvalls@m1 out % dsq SBJ02239_PRJ221187.parquet "SELECT * FROM {} LIMIT 10"
open SBJ02239_PRJ221187.parquet: too many open files
(base) rvalls@m1 out % ulimit -a
-t: cpu time (seconds)              unlimited
-f: file size (blocks)              unlimited
-d: data seg size (kbytes)          unlimited
-s: stack size (kbytes)             8176
-c: core file size (blocks)         0
-v: address space (kbytes)          unlimited
-l: locked-in-memory size (kbytes)  unlimited
-u: processes                       2666
-n: file descriptors                256
(base) rvalls@m1 out % ulimit -n 8912
(base) rvalls@m1 out % dsq SBJ02239_PRJ221187.parquet "SELECT * FROM {} LIMIT 10"
(cannot show the actual contents of the file, but it works after `ulimit`)

How come so many (intermediate?) files are required to open a regular .parquet file around the ~200KB filesize mark?

test.py fails on caching duration

I have installed dsq on my manjaro (arch linux like) raspberry pi 4 with this file: PKGBUILD

Everything has worked fine until the check part which runs the test.py script.

I have one failure (48 of 49 succeeded) related to this line:

test("Caching from pipe (second time so import not required)", to_run, want, sort=True, winSkip=True, within_seconds=5)

cat taxi.csv | ./dsq --cache -s csv 'SELECT passenger_count, COUNT(*), AVG(total_amount) FROM {} GROUP BY passenger_count ORDER BY COUNT(*) DESC'

FAILURE: completed in 9 seconds. Wanted <5s

I have bypassed manually this test.

So, my questions are:
Why do you want a less-than-5-seconds result?
Why is it a failure?
How can we compute the "correct" time duration for different devices (RAM size, CPU clock, thread count, etc.)?

`GLIBC_2.28' not found on Ubuntu 16.04

OS: Ubuntu 16.04.5 64bit
dsq Version: dsq-linux-x64-v0.22.0.zip

Error:
./dsq: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.28' not found (required by ./dsq)

Running dsq in interactive mode more than once on the same file results in an error

I want to use dsq in the interactive mode -i, however, when I open the same file more than once with interactive mode dsq always fails to start interactive mode. It fails with the following error message: table "t_0" already exists.

  • I'm able to produce this consistently with any valid csv file I pass to it.
  • I have not tried other file types.
  • I'm using whatever version of dsq is available on brew
  • I've also validated that this behavior exists on main by building the source locally

Here's an example file I created and saved as test.csv

$ cat test.csv 
column1, column2, column3
a,b,c
d,e,f

The first invocation of dsq -i is usually successful; however, it emits the log line Cache invalid, re-import required. even if I've never run -i on this file before.

$ dsq -i test.csv       
Cache invalid, re-import required.
dsq> exit
EOF

All subsequent invocations of dsq -i test.csv fail with:

$ dsq -i test.csv
table "t_0" already exists

$ echo $?        
1

Removing the SQLite DB for that file lets me run -i again, though with the same Cache invalid... log line as before:

$ rm $(dsq -D test.csv)

Order of columns in `--pretty` does not always match the order in `SELECT`

Steps to reproduce:

$ cat <<EOF > test.csv
first,second,third
1,a,X
2,b,Y
EOF

$ dsq --pretty test.csv 'select first, second, third from {}'
+-------+--------+-------+
| first | second | third |
+-------+--------+-------+
|     1 | a      | X     |
|     2 | b      | Y     |
+-------+--------+-------+

# run the command again and again until you see something like:

$ dsq --pretty test.csv 'select first, second, third from {}'
+--------+-------+-------+
| second | third | first |
+--------+-------+-------+
| a      | X     |     1 |
| b      | Y     |     2 |
+--------+-------+-------+

Happens on Linux and M1 Mac.

Add support for repl

Grab https://github.com/chzyer/readline and wrap main.go's query parts (likely lines 469-503) in a loop. Keep the loop running when the user passes an -i or --interactive flag.

When this flag is set, caching must (I think) also be turned on so that the SQLite database is not ephemeral per run.

Terminate output with newline?

What a cool and great tool! Not sure about you, but I always like command-line tools to end their output with a final newline. Maybe just because my terminal complains when one is missing: when I run this, I get a % sign at the end.

$ dsq imdb.csv 'select title from {}'
[{"Title":"Shawshank Redemption"}
,
{"Title":"The Godfather"}
,
{"Title":"The Godfather: Part II"}
,
{"Title":"The Dark Knight"}
,
{"Title":"12 Angry Men"}
]%
$

I looked around in the code but wasn't immediately sure where to suggest adding this, though.

Suggestion for Comparisons section

This section here: https://github.com/multiprocessio/dsq/tree/dedd036b10c5e2b2991bc27ef94204eed8c0703f#comparisons

sqlite-utils memory is another tool (that I built) which can do this.

Documentation: https://sqlite-utils.datasette.io/en/stable/cli.html#querying-data-directly-using-an-in-memory-database

Blog entry with more examples: https://simonwillison.net/2021/Jun/19/sqlite-utils-memory/ - plus a video demo at https://www.youtube.com/watch?v=OUjd0rkc678

Supported file types are JSON, newline-JSON, CSV and TSV. I haven't done any benchmarks on it - it's using SQLite under the hood but the tool itself is written in Python.

Performance ideas

Catchall for now for potential improvements to datastation/dsq.

  • SQL pre-processing
    • Import only used fields (see #71)
    • Do pre-filtering of data in SQLiteWriter, only insert things that match the WHERE clause
  • Support more input types using SQLiteWriter, basically requires supporting expanded nested objects in (see notes in #67 )
  • Maybe handle JSONL in parallel, since newlines must not appear within individual JSON lines
  • Get rid of map[string]any inside datastation
    • At the very least put WriteRecord into the ResultWriter interface so SQLiteWriter can avoid map[string]any which it converts from anyway
  • CSV parser improvements
  • Add benchmarks for every file format, not just CSV. Basically every file format needs to be worked on individually

Ingest limited columns in basic SQL queries

There may not be an existing SQLite parser we can use from Go but for simple queries we can use a PostgreSQL parser, see here for a good one and example of use.

The way this would work: attempt to parse the query. If it parses and consists only of syntax we support, return all fields referenced in the query. We then pass this list of fields to the SQLiteWriter; when the list is set, the SQLiteWriter writes only those fields to SQLite.

For a first pass I'd suggest supporting:

  • SELECT x FROM {} WHERE y = 1 where this returns ['x', 'y']

Additional ones that won't be too bad:

  • SELECT COUNT(x) FROM {} WHERE y = 2 returns ['x', 'y']
  • SELECT x FROM {} GROUP BY z returns ['x', 'z']

Harder but reasonable examples:

  • SELECT a.x FROM {0} a JOIN {1} b ON a.id = b.json_id returns {'a': ['x', 'id'], 'b': ['json_id']}

Examples this must fail on (this is not a comprehensive list):

  • SELECT x, * FROM {} (because of the star operator)
  • SELECT x FROM {0} JOIN {1} ON id=json_id (ambiguous where x, id, and json_id come from; also requires supporting different columns for different tables)
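For the very first pass, even a toy recognizer over the simplest shape demonstrates the contract: return the referenced columns, or nil to mean "ingest all fields". This regex-based sketch is only a stand-in for the PostgreSQL parser the issue suggests, and handles only `SELECT col[, col...] FROM {} [WHERE col = literal]`:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var selectRe = regexp.MustCompile(`(?i)^SELECT\s+(.+?)\s+FROM\s+\{\}(?:\s+WHERE\s+(\w+)\s*=)?`)
var identRe = regexp.MustCompile(`^\w+$`)

// extractColumns returns the columns a simple query reads, or nil to
// signal "ingest all fields" for anything it can't prove safe
// (star, expressions, joins, ...). A toy stand-in for a real parser.
func extractColumns(q string) []string {
	m := selectRe.FindStringSubmatch(strings.TrimSpace(q))
	if m == nil {
		return nil
	}
	var cols []string
	for _, c := range strings.Split(m[1], ",") {
		c = strings.TrimSpace(c)
		if !identRe.MatchString(c) {
			return nil // star, expressions, etc.: fall back to all fields
		}
		cols = append(cols, c)
	}
	if m[2] != "" {
		cols = append(cols, m[2]) // the WHERE column is read too
	}
	return cols
}

func main() {
	fmt.Println(extractColumns("SELECT x FROM {} WHERE y = 1")) // [x y]
	fmt.Println(extractColumns("SELECT x, * FROM {}"))          // [] -> all fields
}
```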

Support for caching SQL-ingested data

This should be relatively easy. When a -c, --cache flag is on, instead of generating a random project name, generate one based on the hashed contents of the files being queried.

Don't use an in-memory :memory: SQLite database; instead, put the SQLite file on disk with the same name as the project (based on the hash).

But when actually evaluating the SQL/Program panel, a new flag might need to be added to DataStation (set in the evalContext, maybe) that says "don't re-ingest" and just runs the SQL query with the DM_getPanel() calls substituted as usual for table names in SQL.

Finally, just don't delete the generated directory at the end of the script.

Newline in xlsx header causes crash

Using the attached xlsx sheet, running dsq newline.xlsx --schema generates the following error:

json: error calling MarshalJSON for type *runner.Shape: invalid character '\n' in string literal

Running dsq newline.xlsx does not crash:

[{"Header with \nNewline":"1"},
{"Header with \nNewline":"2"},
{"Header with \nNewline":"3"},
{"Header with \nNewline":"4"}]

Removing the newline from the header cell stops the --schema command from crashing.

deprecated INT96 timestamp not supported for older Parquet files

Describe the bug and expected behavior
I have a large number of customer Parquet files with timestamps in the deprecated INT96 format. The timestamp is interpreted as STR rather than a timestamp type; it should display in ISO 8601 format.
Reproduction steps
file/metadata version:
############ file meta data ############
created_by: parquet-mr version 1.10.1-databricks6 (build bd2ebc87e42b3936ac673e1556fa10fb8358307a)
num_columns: 3
num_rows: 376504
num_row_groups: 1
format_version: 1.0
serialized_size: 863

############ Columns ############
event_time
key
value

############ Column(event_time) ############
name: event_time
path: event_time
max_definition_level: 1
max_repetition_level: 0
physical_type: INT96
logical_type: None
converted_type (legacy): NONE

command:
dsq c:/tmp/easy-path.parquet "select event_time from {} limit 4"

Versions
windows 10: cmd.exe

  • dsq version: [e.g. 0.20.1]

Additional context, screenshots
c:\tmp>dsq --pretty c:/tmp/easy-path.parquet "select event_time from {} limit 2"
+------------+
| Event_time |
+------------+
♣�% |
| ��►►)6 �% |
+------------+
(2 rows)

easy-path.parquet.txt

note: change extension to ".txt" strictly for attaching to this issue. Remove .txt prior to testing with it.

Regression between v0.19.0 and v0.20.0 around processing arrays in JSONL files?

Hi! Thanks for the new version and the new SQLite Writer feature. I'm really looking forward to the speed-up! 🎉

I gave v0.20.0 a spin and noticed what seems to be a regression around processing arrays in JSONL files. It's entirely possible that what I saw is just a symptom of another, underlying issue, but hopefully it provides a good lead.

Here's a minimal test case to replicate.

(1) Processing JSONL files with arrays in them works in v0.19.0

$ dsq --version
dsq 0.19.0

$ cat ~/test.jsonl
{"foo":[]}

$ dsq ~/test.jsonl "select count(1) from {}"
[{"count(1)":1}]

(2) It no longer works in v0.20.0

$ git log -1 --oneline
ba33348 Bump version in readme

$ ./dsq --version
dsq latest

$ ./dsq ~/test.jsonl "select count(1) from {}"
sql: converting argument $1 type: unsupported type []interface {}, a slice of interface

(3) Turning off the new SQLite Writer feature fixes the issue in v0.20.0

./dsq --no-sqlite-writer ~/test.jsonl "select count(1) from {}"
[{"count(1)":1}]

(4) Parsing JSON files (vs. JSONL) with arrays in them still works in v0.20.0, even with the new SQLite Writer feature turned on

$ cat ~/test.json
[{"foo":[]}]

$ ./dsq ~/test.json "select count(1) from {}"
[{"count(1)":1}]

Hope this helps. Thank you!
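The error in (2) points at binding a Go slice directly as a SQL parameter. One plausible fix (what the actual fix did may differ) is to serialize non-scalar values to JSON text before handing them to the SQLite driver:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// flattenValue converts values SQLite drivers can't bind (slices,
// nested maps) into their JSON text, leaving scalars untouched.
// A sketch of one possible fix for the error above.
func flattenValue(v any) (any, error) {
	switch v.(type) {
	case nil, bool, float64, int64, string:
		return v, nil
	default:
		b, err := json.Marshal(v)
		if err != nil {
			return nil, err
		}
		return string(b), nil
	}
}

func main() {
	v, _ := flattenValue([]any{}) // the {"foo":[]} case from the report
	fmt.Println(v)                // []
	s, _ := flattenValue("hello")
	fmt.Println(s) // hello
}
```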

Support of an input SQL file

Like jq or some other CLIs, there is a flag for an external file -f.

For example, with jq, we can use jq -f script.jq

That will be really useful in some cases, for example, when you want to define a lot of queries.

dsq -f query.sql file1.json file2.json file3.json

Thank you

support csv files with no header

dsq expects the first line of a CSV file to be a header. Would be nice to be able to work with csv files that don't have a header in the first row.

Versions
dsq 0.21.0

Issue with pretty print

Describe the bug and expected behavior
Long field values are broken across two lines in the results. This is a problem because it means the data can't be exported to Excel or any similar spreadsheet.

./dsq Summary.csv 'select workflow,requested_test,count(*) from {} group by workflow,requested_test' -p

(screenshot attachment: Selection_441)

Reproduction steps
Just run the sample command against the attached CSV file (Summary.csv).

Versions

  • OS: Linux mk-cgm 5.14.0-1048-oem #55-Ubuntu SMP Mon Aug 8 14:58:10 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
  • Shell: bash
  • dsq version: v0.22.0

add https://github.com/liquidaty/zsv to the benchmark comparison

Hello and thank you for the dsq project, and this great benchmark analysis!

Would you mind adding https://github.com/liquidaty/zsv to this list? Usage:

zsv file1.csv "select * from data"
zsv file1.csv file2.csv "select * from data inner join data2 on data.a = data2.a"

Other info:
Caching: no
Engine: customized sqlite3
Supported file types: csv, tsv, txt with arbitrary delimiter, single- or multi-row header
Binary size: 2.2MB (bundle), 1.6MB (standalone sql utility)

(sorry, I did not mean to open this in a way that automatically labels it a bug, but now I can't remove that label)

dsq --schema missing array in 11GB file

Describe the bug and expected behavior

In my testing with large datasets, at least one array of objects is not reported by --schema; the array begins on line 1,326,612,715 of the 11GB file's 1,495,055,188 lines.

Is it possible that schema only reviews the first X lines or bytes of a file? If so, is there any way that I can override that?

Reproduction steps
With an 11GB (or larger) file:
dsq --schema --pretty LARGE_FILE.json

Versions

  • OS: Ubuntu 22.04 LTS, AMD EPYC 7R32
  • Shell: bash
  • dsq version: dsq 0.20.2 from apt

Building with -buildmode=pie exposes crash in parquet test

See #15 for the original report.

This crash shows up when you go build -buildmode=pie && ./scripts/test.py. This crash does not happen without -buildmode=pie.

panic: runtime error: index out of range [576457833716731764] with length 117670

goroutine 1 [running]:
github.com/goccy/go-json/internal/encoder.CompileToGetCodeSet(0xc000f70f90?, 0x55b1294306cc?)
      github.com/goccy/[email protected]/internal/encoder/compiler_norace.go:11 +0x1df
github.com/goccy/go-json.encode(0xc001161ba0, {0xc0009540c0, 0xc00112a750})
      github.com/goccy/[email protected]/encode.go:224 +0xd0
github.com/goccy/go-json.marshal({0xc0009540c0, 0xc00112a750}, {0x0, 0x0, 0x1?})
      github.com/goccy/[email protected]/encode.go:148 +0xba
github.com/goccy/go-json.MarshalWithOption(...)
      github.com/goccy/[email protected]/json.go:186
github.com/goccy/go-json.Marshal({0xc0009540c0?, 0xc00112a750?})
      github.com/goccy/[email protected]/json.go:171 +0x2a
github.com/multiprocessio/go-json.(*StreamEncoder).EncodeRow(0xc000958060, {0xc0009540c0?, 0xc00112a750})
      github.com/multiprocessio/[email protected]/encoder.go:57 +0x1dd
github.com/multiprocessio/datastation/runner.transformParquet.func1(0x0?)
      github.com/multiprocessio/datastation/[email protected]/file.go:121 +0xc6
github.com/multiprocessio/datastation/runner.withJSONArrayOutWriter({0x55b12b25b338?, 0xc000011218}, 0xc000f71288)
      github.com/multiprocessio/datastation/[email protected]/json.go:36 +0xf6
github.com/multiprocessio/datastation/runner.withJSONArrayOutWriterFile(...)
      github.com/multiprocessio/datastation/[email protected]/json.go:51
github.com/multiprocessio/datastation/runner.transformParquet({0x55b12b26a2c0?, 0xc000c35788?}, {0x55b12b25b338, 0xc000011218})
      github.com/multiprocessio/datastation/[email protected]/file.go:106 +0xd8
github.com/multiprocessio/datastation/runner.transformParquetFile({0x7ffddb498a31?, 0x1b?}, {0x55b12b25b338, 0xc000011218})
      github.com/multiprocessio/datastation/[email protected]/file.go:143 +0xec
github.com/multiprocessio/datastation/runner.TransformFile({0x7ffddb498a31, 0x1b}, {{0x0?, 0x1ff?}, {0x0?, 0xc000aff440?}}, {0x55b12b25b338, 0xc000011218})
      github.com/multiprocessio/datastation/[email protected]/file.go:594 +0x1e5
main.evalFileInto({0x7ffddb498a31, 0x1b}, 0x0?)
      github.com/multiprocessio/dsq/main.go:47 +0xc5
main._main()
      github.com/multiprocessio/dsq/main.go:241 +0xaec
main.main()
      github.com/multiprocessio/dsq/main.go:381 +0x19
