
pqrs's Introduction


  • pqrs is a command line tool for inspecting Parquet files
  • This is a Rust-based replacement for the parquet-tools utility
  • Built using the Rust implementation of Parquet and Arrow (a minimal read sketch follows this list)
  • pqrs roughly means "parquet-tools in rust"
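
For background, the JSON-like records shown later on this page are essentially the Display form of the parquet crate's Row type. A minimal, hypothetical sketch of that read path (assuming a parquet crate from the 12.x era referenced in the issues below, where the row iterator yields Row values directly; newer releases yield Result<Row> instead):

use std::fs::File;
use parquet::file::reader::{FileReader, SerializedFileReader};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Open the file and build a serialized (non-Arrow) reader.
    let file = File::open("data/cities.parquet")?;
    let reader = SerializedFileReader::new(file)?;
    // Iterate all rows with no column projection (None).
    for row in reader.get_row_iter(None)? {
        // Row's Display impl produces the {key: value, ...} format
        // seen in the `pqrs cat` examples below.
        println!("{}", row);
    }
    Ok(())
}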

Installation

Recommended Method

You can download release binaries from the project's GitHub releases page.

Alternative methods

Using Homebrew

For macOS users, pqrs is available as a Homebrew tap.

brew install manojkarthick/tap/pqrs

NOTE: For users upgrading from v0.2 or prior, the location of the pqrs Homebrew tap has changed. To upgrade to v0.2.1+, first uninstall with brew uninstall pqrs, then re-install using the command above.
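
For example:

brew uninstall pqrs
brew install manojkarthick/tap/pqrs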

Using cargo

pqrs is also available from crates.io and can be installed with cargo, the Rust package manager.

cargo install pqrs

Building and running from source

Make sure you have rustc and cargo installed on your machine.

git clone https://github.com/manojkarthick/pqrs.git
cargo build --release
./target/release/pqrs

Running

The snippet below shows the available subcommands:

❯ pqrs --help
pqrs 0.2.1
Manoj Karthick
Apache Parquet command-line utility

USAGE:
    pqrs [FLAGS] [SUBCOMMAND]

FLAGS:
    -d, --debug      Show debug output
    -h, --help       Prints help information
    -V, --version    Prints version information

SUBCOMMANDS:
    cat         Prints the contents of Parquet file(s)
    head        Prints the first n records of the Parquet file
    help        Prints this message or the help of the given subcommand(s)
    merge       Merge file(s) into another parquet file
    rowcount    Prints the count of rows in Parquet file(s)
    sample      Prints a random sample of records from the Parquet file
    schema      Prints the schema of Parquet file(s)
    size        Prints the size of Parquet file(s)

Subcommand: cat

Prints the contents of the given files and folders; if the input is a directory, it is traversed recursively and every file inside is printed. Supports a JSON-like debug format (the default), JSON, and CSV. Use --json for JSON output, --csv for CSV output with column names in the first row, and --csv combined with --no-header for CSV output without the column-names row, as in the examples below.

❯ pqrs cat data/cities.parquet
{continent: "Europe", country: {name: "France", city: ["Paris", "Nice", "Marseilles", "Cannes"]}}
{continent: "Europe", country: {name: "Greece", city: ["Athens", "Piraeus", "Hania", "Heraklion", "Rethymnon", "Fira"]}}
{continent: "North America", country: {name: "Canada", city: ["Toronto", "Vancouver", "St. John's", "Saint John", "Montreal", "Halifax", "Winnipeg", "Calgary", "Saskatoon", "Ottawa", "Yellowknife"]}}
❯ pqrs cat data/cities.parquet --json
{"continent":"Europe","country":{"name":"France","city":["Paris","Nice","Marseilles","Cannes"]}}
{"continent":"Europe","country":{"name":"Greece","city":["Athens","Piraeus","Hania","Heraklion","Rethymnon","Fira"]}}
{"continent":"North America","country":{"name":"Canada","city":["Toronto","Vancouver","St. John's","Saint John","Montreal","Halifax","Winnipeg","Calgary","Saskatoon","Ottawa","Yellowknife"]}}
❯ pqrs cat data/simple.parquet --csv
foo,bar
1,2
10,20
❯ pqrs cat data/simple.parquet --csv --no-header
1,2
10,20

NOTE: CSV format is not supported for files that contain Struct or Byte fields.

Subcommand: head

Prints the first N records of the Parquet file. Use the --records flag to set the number of records.

❯ pqrs head data/cities.parquet --json --records 2
{"continent":"Europe","country":{"name":"France","city":["Paris","Nice","Marseilles","Cannes"]}}
{"continent":"Europe","country":{"name":"Greece","city":["Athens","Piraeus","Hania","Heraklion","Rethymnon","Fira"]}}

Subcommand: merge

Merge two Parquet files by placing row groups (or blocks) from the two files one after the other.

Disclaimer: this does not rewrite the data into optimized row groups; do not use it in production!

❯ pqrs merge --input data/pems-1.snappy.parquet data/pems-2.snappy.parquet --output data/pems-merged.snappy.parquet

❯ ls -al data
total 408
drwxr-xr-x   6 manojkarthick  staff     192 Feb 14 08:53 .
drwxr-xr-x  20 manojkarthick  staff     640 Feb 14 08:52 ..
-rw-r--r--   1 manojkarthick  staff     866 Feb  8 19:50 cities.parquet
-rw-r--r--   1 manojkarthick  staff   16468 Feb  8 19:50 pems-1.snappy.parquet
-rw-r--r--   1 manojkarthick  staff   17342 Feb  8 19:50 pems-2.snappy.parquet
-rw-r--r--   1 manojkarthick  staff  160950 Feb 14 08:53 pems-merged.snappy.parquet
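
One way to sanity-check the merge is to compare row counts: given the per-file counts shown under the rowcount subcommand below (2693 and 2880 rows), the merged file should report their sum. Illustrative output:

❯ pqrs rowcount data/pems-merged.snappy.parquet
File Name: data/pems-merged.snappy.parquet: 5573 rows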

Subcommand: rowcount

Prints the number of rows in the given Parquet file(s).

❯ pqrs rowcount data/pems-1.snappy.parquet data/pems-2.snappy.parquet
File Name: data/pems-1.snappy.parquet: 2693 rows
File Name: data/pems-2.snappy.parquet: 2880 rows

Subcommand: sample

Prints a random sample of records from the given Parquet file. Use the --records flag to set the sample size.

❯ pqrs sample data/pems-1.snappy.parquet --records 3
{timeperiod: "01/17/2016 07:01:27", flow1: 0, occupancy1: 0E0, speed1: 0E0, flow2: 0, occupancy2: 0E0, speed2: 0E0, flow3: 0, occupancy3: 0E0, speed3: 0E0, flow4: null, occupancy4: null, speed4: null, flow5: null, occupancy5: null, speed5: null, flow6: null, occupancy6: null, speed6: null, flow7: null, occupancy7: null, speed7: null, flow8: null, occupancy8: null, speed8: null}
{timeperiod: "01/17/2016 07:47:27", flow1: 0, occupancy1: 0E0, speed1: 0E0, flow2: 0, occupancy2: 0E0, speed2: 0E0, flow3: 0, occupancy3: 0E0, speed3: 0E0, flow4: null, occupancy4: null, speed4: null, flow5: null, occupancy5: null, speed5: null, flow6: null, occupancy6: null, speed6: null, flow7: null, occupancy7: null, speed7: null, flow8: null, occupancy8: null, speed8: null}
{timeperiod: "01/17/2016 09:44:27", flow1: 0, occupancy1: 0E0, speed1: 0E0, flow2: 0, occupancy2: 0E0, speed2: 0E0, flow3: 0, occupancy3: 0E0, speed3: 0E0, flow4: null, occupancy4: null, speed4: null, flow5: null, occupancy5: null, speed5: null, flow6: null, occupancy6: null, speed6: null, flow7: null, occupancy7: null, speed7: null, flow8: null, occupancy8: null, speed8: null}

Subcommand: schema

Prints the schema of the given Parquet file. Use the --detailed flag for per-row-group and per-column statistics, or --json for structured output.

❯ pqrs schema data/cities.parquet
Metadata for file: data/cities.parquet

version: 1
num of rows: 3
created by: parquet-mr version 1.5.0-cdh5.7.0 (build ${buildNumber})
message hive_schema {
  OPTIONAL BYTE_ARRAY continent (UTF8);
  OPTIONAL group country {
    OPTIONAL BYTE_ARRAY name (UTF8);
    OPTIONAL group city (LIST) {
      REPEATED group bag {
        OPTIONAL BYTE_ARRAY array_element (UTF8);
      }
    }
  }
}
❯ pqrs schema data/cities.parquet --detailed

num of row groups: 1
row groups:

row group 0:
--------------------------------------------------------------------------------
total byte size: 466
num of rows: 3

num of columns: 3
columns:

column 0:
--------------------------------------------------------------------------------
column type: BYTE_ARRAY
column path: "continent"
encodings: BIT_PACKED PLAIN_DICTIONARY RLE
file path: N/A
file offset: 4
num of values: 3
total compressed size (in bytes): 93
total uncompressed size (in bytes): 93
data page offset: 4
index page offset: N/A
dictionary page offset: N/A
statistics: {min: [69, 117, 114, 111, 112, 101], max: [78, 111, 114, 116, 104, 32, 65, 109, 101, 114, 105, 99, 97], distinct_count: N/A, null_count: 0, min_max_deprecated: true}

<....output clipped>
❯ pqrs schema --json data/cities.parquet
{"version":1,"num_rows":3,"created_by":"parquet-mr version 1.5.0-cdh5.7.0 (build ${buildNumber})","metadata":null,"columns":[{"optional":"true","physical_type":"BYTE_ARRAY","name":"continent","path":"continent","converted_type":"UTF8"},{"name":"name","converted_type":"UTF8","path":"country.name","physical_type":"BYTE_ARRAY","optional":"true"},{"optional":"true","name":"array_element","physical_type":"BYTE_ARRAY","path":"country.city.bag.array_element","converted_type":"UTF8"}],"message":"message hive_schema {\n  OPTIONAL BYTE_ARRAY continent (UTF8);\n  OPTIONAL group country {\n    OPTIONAL BYTE_ARRAY name (UTF8);\n    OPTIONAL group city (LIST) {\n      REPEATED group bag {\n        OPTIONAL BYTE_ARRAY array_element (UTF8);\n      }\n    }\n  }\n}\n"}

Subcommand: size

Prints the compressed or uncompressed size of the Parquet file; the uncompressed size is shown by default. Use --compressed for the compressed size and --pretty for human-readable units.

❯ pqrs size data/pems-1.snappy.parquet --pretty
Size in Bytes:

File Name: data/pems-1.snappy.parquet
Uncompressed Size: 61 KiB
❯ pqrs size data/pems-1.snappy.parquet --pretty --compressed
Size in Bytes:

File Name: data/pems-1.snappy.parquet
Compressed Size: 12 KiB

TODO

  • Test on Windows

pqrs's People

Contributors

jeffbski-rga, juan-riveros, manojkarthick, mateuszkj, mattrighetti, stevelauc, txdv, wynnw


pqrs's Issues

Add support for SQL schemas

It would be handy if something like this were available:

$ pqrs schema --sql-dialect=clickhouse --name=ticker /path/to/file.parquet
CREATE TABLE ticker (
  isin String NOT NULL,
  O Int32 NOT NULL,
  H Int32 NOT NULL,
  L Int32 NOT NULL,
  C Int32 NOT NULL
) ENGINE = Parquet('/path/to/file.parquet');

Silence non-row output in `pqrs cat`

I have to convert formats from time to time and being able to just pipe the outputs is super helpful.

Unfortunately, having the extra 6 lines means a (slightly) more complicated command that is harder to share with my team.

pqrs cat -j foo.parquet | tail -n +6 | jq -c . - | gzip -c > foo.ndjson.gz

Adding a --quiet/-q flag to mute any non-data output would be neat, or possibly just moving that type of output to stderr instead (i.e. changing lines 110-112 from println! to eprintln!)?
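
For reference, a minimal sketch of the second suggestion (hypothetical code, not the actual pqrs source): print the banner via eprintln! so it lands on stderr and stdout carries only data rows:

fn print_file_banner(path: &str) {
    let banner = "#".repeat(path.len() + 6); // "File: " is 6 chars wide
    // stderr output is not captured by `| jq` or `> out.json`,
    // so only data rows reach the downstream pipeline.
    eprintln!("{}", banner);
    eprintln!("File: {}", path);
    eprintln!("{}", banner);
}

fn main() {
    print_file_banner("foo.parquet");
    println!("{{\"some\":\"row\"}}"); // data goes to stdout as before
}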

Add support for nested fields in CSV by encoding them as JSON

Great tool, glad I found it, because I almost started writing something like this myself!

One thing I ran into, though... nested types can't be exported to CSV.

Error: ArrowReadWriteError(CsvError("Nested type List(Field { name: \"item\", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) is not supported in CSV"))

Sometimes we just don't care about the exact format, or don't even care about this particular column, and just want to load the dang thing into a spreadsheet. Encoding non-primitive column types as JSON accomplishes exactly that, and it also happens to be unambiguous and therefore possibly even useful.
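
To illustrate with the cities.parquet file from the README examples above, the proposed behaviour might render the country struct column as a JSON string inside the CSV (hypothetical output, not a current pqrs feature):

❯ pqrs cat data/cities.parquet --csv
continent,country
Europe,"{""name"":""France"",""city"":[""Paris"",""Nice"",""Marseilles"",""Cannes""]}"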

Support for parquet directories

Hi, I just tried your tool and I find it super useful 👍
However, I noticed that one cannot simply cat whole directories:

# pqrs cat vcf.parquet 
Error: ParquetError(General("underlying IO error: Is a directory (os error 21)"))

However, this works:
pqrs cat vcf.parquet/**/*.parquet

It would be a nice convenience function if pqrs could handle this case.
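
As a side note, where ** globbing (globstar) is unavailable, a find-based variant of the same workaround should behave equivalently (assuming filenames without spaces):

pqrs cat $(find vcf.parquet -name '*.parquet')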

Directory support in `head` / pipe support

Hi, we're very happily using pqrs now and found two small issues with it:

  1. head does not support directories:
#> pqhead data.parquet 
Error: ParquetError(General("underlying IO error: Is a directory (os error 21)"))
  2. It panics when used in a pipe:
#> pqcat data.parquet | head

###########################################################################################################################################################################################################
File: data.parquet/d66ac6554cc44c3cbfaa56b75fa446e4.parquet
###########################################################################################################################################################################################################

[...]
thread 'main' panicked at 'failed printing to stdout: Broken pipe (os error 32)', library/std/src/io/stdio.rs:935:9
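
For reference, a minimal sketch of how the panic could be avoided (hypothetical code, not the actual pqrs source): check write results and treat BrokenPipe as a normal exit:

use std::io::{self, Write};

fn main() {
    let stdout = io::stdout();
    let mut out = stdout.lock();
    for i in 0..1_000_000 {
        // When the downstream process (e.g. `head`) closes the pipe,
        // writeln! returns an Err with ErrorKind::BrokenPipe rather
        // than panicking, so it can be handled gracefully.
        if let Err(e) = writeln!(out, "record {}", i) {
            if e.kind() == io::ErrorKind::BrokenPipe {
                std::process::exit(0); // expected when piping into `head`
            }
            eprintln!("write error: {}", e);
            std::process::exit(1);
        }
    }
}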

Zstd compression level always reported as 1?

I created a bunch of Parquet files with Zstd compression and tried different levels; the time taken and the file sizes changed, but pqrs always reported Zstd(level 1) when I ran schema --detailed.

This is on macOS (Apple M3).

consider removing first few lines of output from pqrs cat --json?

I would love to pipe the contents of pqrs cat --json <filename> into jq or other tools. Currently the header that pqrs cat prints prevents me from doing this.


######################################################
File: ../myfolder/myfile.parquet
######################################################

Is there a chance we can either remove this entirely or remove it specifically when used with the --json flag?

My current version:

❯ pqrs --version
pqrs 0.2.0

Could not parse metadata

Hi, with pqrs v0.2.1 I get the following error when trying to read a parquet file written with Polars:

Error: ParquetError(General("Could not parse metadata: bad data"))

Pandas can read it without issues.
Could it be that there is some missing feature flag for the parquet reader?
E.g. some missing compression library?

Timestamp CSV conversion issue

There is an issue when running pqrs cat --csv [infile] on files with timestamp columns: for me the timestamps all come out as 1970-01-01. When I cat with --json, the values are fine, and parquet-tools converts the same file to CSV correctly.

I suspect an integer overflow, or perhaps a timestamp-unit mismatch; for example, a value of 1,600,000,000,000 milliseconds since the epoch (September 2020) read as nanoseconds lands less than half an hour after 1970-01-01.

Support for parquet metadata

Hi, I'm using your tool and I find it great !

Would it be possible to add Parquet metadata support in the same way that parquet-tools does?

For example, this is what the parquet-tools meta command outputs:

parquet-tools meta part-00000-fc34f237-c985-4ebc-822b-87fa446f6f70.c000.snappy.parquet 
file:                  file:/Users/foobar/part-00000-fc34f237-c985-4ebc-822b-87fa446f6f70.c000.snappy.parquet 
creator:               parquet-mr version 1.10.0 (build 031a6654009e3b82020012a18434c582bd74c73a) 
extra:                 org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":":created_at","type":"string","nullable":true,"metadata":{}},{"name":":id","type":"string","nullable":true,"metadata":{}},{"name":":updated_at","type":"string","nullable":true,"metadata":{}},{"name":"agency","type":"integer","nullable":true,"metadata":{}},{"name":"body_style","type":"string","nullable":true,"metadata":{}},{"name":"color","type":"string","nullable":true,"metadata":{}},{"name":"fine_amount","type":"integer","nullable":true,"metadata":{}},{"name":"issue_date","type":"date","nullable":true,"metadata":{}},{"name":"issue_time","type":"integer","nullable":true,"metadata":{}},{"name":"latitude","type":"decimal(8,1)","nullable":true,"metadata":{}},{"name":"location","type":"string","nullable":true,"metadata":{}},{"name":"longitude","type":"decimal(8,1)","nullable":true,"metadata":{}},{"name":"make","type":"string","nullable":true,"metadata":{}},{"name":"marked_time","type":"string","nullable":true,"metadata":{}},{"name":"meter_id","type":"string","nullable":true,"metadata":{}},{"name":"plate_expiry_date","type":"date","nullable":true,"metadata":{}},{"name":"route","type":"string","nullable":true,"metadata":{}},{"name":"rp_state_plate","type":"string","nullable":true,"metadata":{}},{"name":"ticket_number","type":"string","nullable":false,"metadata":{}},{"name":"vin","type":"string","nullable":true,"metadata":{}},{"name":"violation_code","type":"string","nullable":true,"metadata":{}},{"name":"violation_description","type":"string","nullable":true,"metadata":{}}]} 

file schema:           spark_schema 
--------------------------------------------------------------------------------
:                      created_at: OPTIONAL BINARY O:UTF8 R:0 D:1
:                      id: OPTIONAL BINARY O:UTF8 R:0 D:1
:                      updated_at: OPTIONAL BINARY O:UTF8 R:0 D:1
agency:                OPTIONAL INT32 R:0 D:1
body_style:            OPTIONAL BINARY O:UTF8 R:0 D:1
color:                 OPTIONAL BINARY O:UTF8 R:0 D:1
fine_amount:           OPTIONAL INT32 R:0 D:1
issue_date:            OPTIONAL INT32 O:DATE R:0 D:1
issue_time:            OPTIONAL INT32 R:0 D:1
latitude:              OPTIONAL INT32 O:DECIMAL R:0 D:1
location:              OPTIONAL BINARY O:UTF8 R:0 D:1
longitude:             OPTIONAL INT32 O:DECIMAL R:0 D:1
make:                  OPTIONAL BINARY O:UTF8 R:0 D:1
marked_time:           OPTIONAL BINARY O:UTF8 R:0 D:1
meter_id:              OPTIONAL BINARY O:UTF8 R:0 D:1
plate_expiry_date:     OPTIONAL INT32 O:DATE R:0 D:1
route:                 OPTIONAL BINARY O:UTF8 R:0 D:1
rp_state_plate:        OPTIONAL BINARY O:UTF8 R:0 D:1
ticket_number:         REQUIRED BINARY O:UTF8 R:0 D:0
vin:                   OPTIONAL BINARY O:UTF8 R:0 D:1
violation_code:        OPTIONAL BINARY O:UTF8 R:0 D:1
violation_description: OPTIONAL BINARY O:UTF8 R:0 D:1

row group 1:           RC:148192 TS:10503944 OFFSET:4 
--------------------------------------------------------------------------------
:                      created_at:  BINARY SNAPPY DO:0 FPO:4 SZ:607/616/1.01 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: 2019-02-28T00:16:06.329Z, max: 2019-03-02T00:20:00.249Z, num_nulls: 0]
:                      id:  BINARY SNAPPY DO:0 FPO:611 SZ:2365472/3260525/1.38 VC:148192 ENC:BIT_PACKED,PLAIN,RLE ST:[min: row-2229_y75z.ftdu, max: row-zzzs_4hta.8fub, num_nulls: 0]
:                      updated_at:  BINARY SNAPPY DO:0 FPO:2366083 SZ:602/611/1.01 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: 2019-02-28T00:16:06.329Z, max: 2019-03-02T00:20:00.249Z, num_nulls: 0]
agency:                 INT32 SNAPPY DO:0 FPO:2366685 SZ:4871/5267/1.08 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: 1, max: 58, num_nulls: 0]
body_style:             BINARY SNAPPY DO:0 FPO:2371556 SZ:36244/61827/1.71 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: , max: WR, num_nulls: 0]
color:                  BINARY SNAPPY DO:0 FPO:2407800 SZ:111267/111708/1.00 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: , max: YL, num_nulls: 0]
fine_amount:            INT32 SNAPPY DO:0 FPO:2519067 SZ:71989/82138/1.14 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: 25, max: 363, num_nulls: 63]
issue_date:             INT32 SNAPPY DO:0 FPO:2591056 SZ:20872/23185/1.11 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: 2019-02-01, max: 2019-02-27, num_nulls: 0]
issue_time:             INT32 SNAPPY DO:0 FPO:2611928 SZ:210026/210013/1.00 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: 1, max: 2359, num_nulls: 41]
latitude:               INT32 SNAPPY DO:0 FPO:2821954 SZ:508049/512228/1.01 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: 99999.0, max: 6513161.2, num_nulls: 0]
location:               BINARY SNAPPY DO:0 FPO:3330003 SZ:1251364/2693435/2.15 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,PLAIN,RLE ST:[min: , max: ZOMBAR/VALERIO, num_nulls: 0]
longitude:              INT32 SNAPPY DO:0 FPO:4581367 SZ:516233/520692/1.01 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: 99999.0, max: 1941557.4, num_nulls: 0]
make:                   BINARY SNAPPY DO:0 FPO:5097600 SZ:147034/150364/1.02 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: , max: YAMA, num_nulls: 0]
marked_time:            BINARY SNAPPY DO:0 FPO:5244634 SZ:11675/17658/1.51 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: , max: 959.0, num_nulls: 0]
meter_id:               BINARY SNAPPY DO:0 FPO:5256309 SZ:172432/256692/1.49 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: , max: YO97, num_nulls: 0]
plate_expiry_date:      INT32 SNAPPY DO:0 FPO:5428741 SZ:149849/152288/1.02 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: 2000-02-01, max: 2099-12-01, num_nulls: 18624]
route:                  BINARY SNAPPY DO:0 FPO:5578590 SZ:38377/45948/1.20 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: , max: WTD, num_nulls: 0]
rp_state_plate:         BINARY SNAPPY DO:0 FPO:5616967 SZ:33281/60186/1.81 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: AB, max: XX, num_nulls: 0]
ticket_number:          BINARY SNAPPY DO:0 FPO:5650248 SZ:801039/2074791/2.59 VC:148192 ENC:BIT_PACKED,PLAIN ST:[min: 1020798376, max: 4350802142, num_nulls: 0]
vin:                    BINARY SNAPPY DO:0 FPO:6451287 SZ:64/60/0.94 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: , max: , num_nulls: 0]
violation_code:         BINARY SNAPPY DO:0 FPO:6451351 SZ:94784/131071/1.38 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: 000, max: 8942, num_nulls: 0]
violation_description:  BINARY SNAPPY DO:0 FPO:6546135 SZ:95937/132641/1.38 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: , max: YELLOW ZONE, num_nulls: 0]

Cannot show parquet file

Hi @manojkarthick, I am trying to open the attached file but it fails with the following error:

# pqrs --version
pqrs 0.2.2
# pqrs cat test.parquet

##################
File: test.parquet
##################

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: General("insufficient values read from column - expected: 1024, got: 0")', /home/user/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-12.0.0/src/record/reader.rs:578:36
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

The same works with pyarrow:

import pyarrow as pa
import pyarrow.parquet as pq

pq.read_table("test.parquet")
Out[3]: 
pyarrow.Table
chromosome: string not null
position: int32 not null
identifier: list<id: string not null> not null
  child 0, id: string not null
reference: string not null
alternate: list<alternate: string not null> not null
  child 0, alternate: string not null
quality: float
filter: list<filter: string not null> not null
  child 0, filter: string not null
info_END: int32 not null
info_SVTYPE: string not null
----
chromosome: [["chr1","chr1","chr1","chr1","chr1","chr1","chr1","chr1","chr1","chr1",...,"chr1","chr1","chr1","chr1","chr1","chr1","chr1","chr1","chr1","chr1"]]
position: [[10427,10427,10439,10440,13459,14397,15219,16766,16871,29231,...,6094404,6094410,6094858,6095109,6095224,6095265,6095278,6095299,6095300,6095491]]
identifier: [[["chr1:10426:10429:ACC>A"],["chr1:10426:10429:ACC>*"],["chr1:10438:10440:AC>*"],["chr1:10439:10440:C>*"],["chr1:13458:13462:CAGA>C"],["chr1:14396:14399:CTG>C"],["chr1:15218:15230:GAGCCACCTCCC>G"],["chr1:16765:16766:C>CT"],["chr1:16870:16872:GC>G"],["chr1:29230:29231:G>T"],...,["chr1:6094403:6094404:C>A"],["chr1:6094409:6094410:C>T"],["chr1:6094857:6094858:C>T"],["chr1:6095108:6095109:G>A"],["chr1:6095223:6095224:C>T"],["chr1:6095264:6095265:G>A"],["chr1:6095277:6095278:C>T"],["chr1:6095298:6095299:C>T"],["chr1:6095299:6095300:G>A"],["chr1:6095490:6095491:C>G"]]]
reference: [["ACC","ACC","AC","C","CAGA","CTG","GAGCCACCTCCC","C","GC","G",...,"C","C","C","G","C","G","C","C","G","C"]]
alternate: [[["A"],["*"],["*"],["*"],["C"],["C"],["G"],["CT"],["G"],["T"],...,["A"],["T"],["T"],["A"],["T"],["A"],["T"],["T"],["A"],["G"]]]
quality: [[null,null,null,null,null,null,null,null,null,null,...,null,null,null,null,null,null,null,null,null,null]]
filter: [[[""],[""],[""],[""],[""],[""],[""],[""],[""],[""],...,[""],[""],[""],[""],[""],[""],[""],[""],[""],[""]]]
info_END: [[10429,10429,10440,10440,13462,14399,15230,16766,16872,29231,...,6094404,6094410,6094858,6095109,6095224,6095265,6095278,6095299,6095300,6095491]]
info_SVTYPE: [["","","","","","","","","","",...,"","","","","","","","","",""]]

Would you mind having a look to find out why?

test.parquet.zip

ParquetError(ArrowError("creating ListArrayReader with type FixedSizeList ... should be unreachable

Hi there, I ran into a new error today. I'm guessing it might have to do with the fact that the inner list field is called item instead of element?

❯ pqrs head --json output.parquet
Error: ParquetError(ArrowError("creating ListArrayReader with type FixedSizeList(Field { name: \"item\", data_type: Float32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }, 256) should be unreachable"))

❯ pqrs schema output.parquet 
Metadata for file: output.parquet

version: 1
num of rows: 30
created by: parquet-cpp-arrow version 6.0.1
metadata:
  ARROW:schema: /////4gBAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAUAAAAoAQAA7AAAAMAAAACAAAAABAAAAPz+//8AAAAQFAAAACQAAAAEAAAAAQAAADAAAAAJAAAAZW1iZWRkaW5nAAYACAAEAAYAAAAAAQAAEAAUAAgABgAHAAwAAAAQABAAAAAAAAEDEAAAABwAAAAEAAAAAAAAAAQAAABpdGVtAAAGAAgABgAGAAAAAAABAHT///8AAAACEAAAACQAAAAEAAAAAAAAAAgAAABzaGFyZF9pZAAAAAAIAAwACAAHAAgAAAAAAAABIAAAALD///8AAAAFEAAAABgAAAAEAAAAAAAAAAQAAAB0ZXh0AAAAAJz////Y////AAAABRAAAAAYAAAABAAAAAAAAAAFAAAAdGl0bGUAAADE////EAAUAAgAAAAHAAwAAAAQABAAAAAAAAAFEAAAACAAAAAEAAAAAAAAAAoAAABwYXNzYWdlX2lkAAAEAAQABAAAAA==
message schema {
  REQUIRED BYTE_ARRAY passage_id (STRING);
  REQUIRED BYTE_ARRAY title (STRING);
  REQUIRED BYTE_ARRAY text (STRING);
  REQUIRED INT32 shard_id;
  REQUIRED group embedding (LIST) {
    REPEATED group list {
      OPTIONAL FLOAT item;
    }
  }
}

EDIT: I tried changing the name from item to element via pyarrow's use_compliant_nested_types=True, but that didn't fix the issue, so it might be something else.

pqrs fails to read valid parquet file

Reading the schema works:

#> RUST_BACKTRACE=full pqrs schema example/output/vcf.parquet/clinvar_chr1_pathogenic.vcf.gz.parquet
Metadata for file: example/output/vcf.parquet/clinvar_chr1_pathogenic.vcf.gz.parquet

version: 2
num of rows: 4770
created by: Arrow2 - Native Rust implementation of Arrow
metadata:
  ARROW:schema: /////+8DAAAEAAAA8v///xQAAAAEAAEAAAAKAAsACAAKAAQA+P///wwAAAAIAAgAAAAEAAoAAACAAwAAMAMAALACAABsAgAA5AEAAKABAAAgAQAA0AAAAEgAAAAEAAAA7P///ywAAAAgAAAAGAAAAAUAAAAQABEABAAAABAACAAAAAwAAAAAAPz///8EAAQACwAAAGluZm9fU1ZUWVBFAOz///9wAAAAZAAAABgAAAAMAAAAEAARAAQAAAAQAAgAAAAMAAEAAAAEAAAA7P///ywAAAAgAAAAGAAAAAUAAAAQABEABAAAABAACAAAAAwAAAAAAPz///8EAAQACQAAAGluZm9fVFlQRQAAAPz///8EAAQACQAAAGluZm9fVFlQRQAAAOz///84AAAAIAAAABgAAAACAAAAEAARAAQAAAAQAAgAAAAMAAAAAAD0////IAAAAAEAAAAIAAkABAAIAAgAAABpbmZvX0VORAAAAADs////bAAAAGAAAAAYAAAADAAAABAAEQAEAAAAEAAIAAAADAABAAAABAAAAOz///8sAAAAIAAAABgAAAAFAAAAEAARAAQAAAAQAAgAAAAMAAAAAAD8////BAAEAAYAAABmaWx0ZXIAAPz///8EAAQABgAAAGZpbHRlcgAA7P///zAAAAAgAAAAGAAAAAEDAAAQABIABAAQABEACAAAAAwAAAAAAPr///8BAAYABgAEAAcAAABxdWFsaXR5AOz///9wAAAAZAAAABgAAAAMAAAAEAARAAQAAAAQAAgAAAAMAAEAAAAEAAAA7P///ywAAAAgAAAAGAAAAAUAAAAQABEABAAAABAACAAAAAwAAAAAAPz///8EAAQACQAAAGFsdGVybmF0ZQAAAPz///8EAAQACQAAAGFsdGVybmF0ZQAAAOz///8sAAAAIAAAABgAAAAFAAAAEAARAAQAAAAQAAgAAAAMAAAAAAD8////BAAEAAkAAAByZWZlcmVuY2UAAADs////aAAAAFwAAAAYAAAADAAAABAAEQAEAAAAEAAIAAAADAABAAAABAAAAOz///8sAAAAIAAAABgAAAAFAAAAEAARAAQAAAAQAAgAAAAMAAAAAAD8////BAAEAAIAAABpZAAA/P///wQABAAKAAAAaWRlbnRpZmllcgAA7P///zgAAAAgAAAAGAAAAAIAAAAQABEABAAAABAACAAAAAwAAAAAAPT///8gAAAAAQAAAAgACQAEAAgACAAAAHBvc2l0aW9uAAAAAOz///8sAAAAIAAAABgAAAAFAAAAEAARAAQAAAAQAAgAAAAMAAAAAAD8////BAAEAAoAAABjaHJvbW9zb21lAA==
message root {
  REQUIRED BYTE_ARRAY chromosome (STRING);
  REQUIRED INT32 position;
  REQUIRED group identifier (LIST) {
    REPEATED group list {
      REQUIRED BYTE_ARRAY id (STRING);
    }
  }
  REQUIRED BYTE_ARRAY reference (STRING);
  REQUIRED group alternate (LIST) {
    REPEATED group list {
      REQUIRED BYTE_ARRAY alternate (STRING);
    }
  }
  OPTIONAL FLOAT quality;
  REQUIRED group filter (LIST) {
    REPEATED group list {
      REQUIRED BYTE_ARRAY filter (STRING);
    }
  }
  REQUIRED INT32 info_END;
  REQUIRED group info_TYPE (LIST) {
    REPEATED group list {
      REQUIRED BYTE_ARRAY info_TYPE (STRING);
    }
  }
  REQUIRED BYTE_ARRAY info_SVTYPE (STRING);
}

cat'ting it does not:

#> RUST_BACKTRACE=full pqrs head example/output/vcf.parquet/clinvar_chr1_pathogenic.vcf.gz.parquet
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: General("insufficient values read from column - expected: 1024, got: 0")', /data/ouga/home/ag_gagneur/hoelzlwi/.cargo/registry/src/index.crates.io-6f17d22bba15001f/parquet-40.0.0/src/record/reader.rs:577:36
stack backtrace:
   0:     0x55c1eab8c3a1 - std::backtrace_rs::backtrace::libunwind::trace::h6aeaf83abc038fe6
                               at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/../../backtrace/src/backtrace/libunwind.rs:93:5
   1:     0x55c1eab8c3a1 - std::backtrace_rs::backtrace::trace_unsynchronized::h4f9875212db0ad97
                               at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2:     0x55c1eab8c3a1 - std::sys_common::backtrace::_print_fmt::h3f820027e9c39d3b
                               at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/sys_common/backtrace.rs:65:5
   3:     0x55c1eab8c3a1 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::hded4932df41373b3
                               at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/sys_common/backtrace.rs:44:22
   4:     0x55c1eabb114f - core::fmt::rt::Argument::fmt::hc8ead7746b2406d6
                               at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/fmt/rt.rs:138:9
   5:     0x55c1eabb114f - core::fmt::write::hb1cb56105a082ad9
                               at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/fmt/mod.rs:1094:21
   6:     0x55c1eab8a071 - std::io::Write::write_fmt::h797fda7085c97e57
                               at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/io/mod.rs:1713:15
   7:     0x55c1eab8c1b5 - std::sys_common::backtrace::_print::h492d3c92d7400346
                               at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/sys_common/backtrace.rs:47:5
   8:     0x55c1eab8c1b5 - std::sys_common::backtrace::print::hf74aa2eef05af215
                               at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/sys_common/backtrace.rs:34:9
   9:     0x55c1eab8d537 - std::panicking::default_hook::{{closure}}::h8cad394227ea3de8
  10:     0x55c1eab8d324 - std::panicking::default_hook::h249cc184fec99a8a
                               at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/panicking.rs:288:9
  11:     0x55c1eab8d9ec - std::panicking::rust_panic_with_hook::h82ebcd5d5ed2fad4
                               at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/panicking.rs:705:13
  12:     0x55c1eab8d8e7 - std::panicking::begin_panic_handler::{{closure}}::h810bed8ecbe66f1a
                               at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/panicking.rs:597:13
  13:     0x55c1eab8c7d6 - std::sys_common::backtrace::__rust_end_short_backtrace::h1410008071796261
                               at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/sys_common/backtrace.rs:151:18
  14:     0x55c1eab8d632 - rust_begin_unwind
                               at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/panicking.rs:593:5
  15:     0x55c1ea3efef3 - core::panicking::panic_fmt::ha0a42a25e0cf258d
                               at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/panicking.rs:67:14
  16:     0x55c1ea3f0393 - core::result::unwrap_failed::h100c4d67576990cf
                               at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/result.rs:1651:5
  17:     0x55c1ea58111c - parquet::record::reader::Reader::advance_columns::he78d66a8310bbc6d
  18:     0x55c1ea581179 - parquet::record::reader::Reader::advance_columns::he78d66a8310bbc6d
  19:     0x55c1ea581971 - <parquet::record::reader::RowIter as core::iter::traits::iterator::Iterator>::next::h612da20bf81bedfa
  20:     0x55c1ea40307f - pqrs::utils::print_rows::h9bf7a7f08e6bc5ee
  21:     0x55c1ea3f9ec3 - pqrs::commands::head::execute::h2058003142e3c2ac
  22:     0x55c1ea427b06 - pqrs::main::h38253338d29d66ac
  23:     0x55c1ea3fea3d - std::sys_common::backtrace::__rust_begin_short_backtrace::h2f1f623026f1777f
  24:     0x55c1ea41a5b8 - std::rt::lang_start::{{closure}}::hb53e3cd4c57743d8
  25:     0x55c1eab84755 - core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once::h5ce27e764c284c0a
                               at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/ops/function.rs:284:13
  26:     0x55c1eab84755 - std::panicking::try::do_call::h4c1fc390ae241991
                               at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/panicking.rs:500:40
  27:     0x55c1eab84755 - std::panicking::try::h4d36e7eaed86af72
                               at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/panicking.rs:464:19
  28:     0x55c1eab84755 - std::panic::catch_unwind::h41cfb4dd65282b1e
                               at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/panic.rs:142:14
  29:     0x55c1eab84755 - std::rt::lang_start_internal::{{closure}}::hfed411c1c5fdb925
                               at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/rt.rs:148:48
  30:     0x55c1eab84755 - std::panicking::try::do_call::h6893f6f32a464342
                               at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/panicking.rs:500:40
  31:     0x55c1eab84755 - std::panicking::try::h52b7102f469a0567
                               at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/panicking.rs:464:19
  32:     0x55c1eab84755 - std::panic::catch_unwind::h62120054677916b5
                               at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/panic.rs:142:14
  33:     0x55c1eab84755 - std::rt::lang_start_internal::hd66bf6b7da144005
                               at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/rt.rs:148:20
  34:     0x55c1ea428fa5 - main
  35:     0x7ffbc5575d85 - __libc_start_main
  36:     0x55c1ea3f065e - _start
  37:                0x0 - <unknown>

Here the (zipped) file:
clinvar_chr1_pathogenic.vcf.gz.parquet.zip

Feature request: Compression algorithm information

Hello Manoj,

Very useful tool you have built!

One feature I would like to suggest is to display which compression algorithm was used on each column.
Currently, it is possible to see that compression was used from the difference between the "total compressed size" and "total uncompressed size" values, but the actual algorithm used doesn't seem to be displayed.

So the idea would be to show "GZIP", "LZO", "ZSTD", "Brotli", etc. in the schema description table.

`merge` uses a lot of memory

Feature request!

Is it possible for merge to merge files without decompressing and recompressing them?


My usecase:

My parquet generator makes 1GB row groups (in memory) and writes them to individual parquet files: <40MB on disk, one row group per file. (It does this because it can't be bothered to deal with schema variations, which is another problem.)

I'd like to concatenate these files: take the row groups out of any that share a schema and build one big file with multiple row groups and exactly the same schema.

The current merge implementation can do this, but needs >>200GB of memory to merge 8GB of parquet files, which is not ideal.

BUG

joserfjunior@Clodovil Downloads % pqrs cat FND_USER_CLEAN.parquet --json

############################
File: FND_USER_CLEAN.parquet
############################

thread 'main' panicked at 'No such local time', /Users/runner/.cargo/registry/src/github.com-1ecc6299db9ec823/chrono-0.4.19/src/offset/mod.rs:173:34
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace

nightly required?

Going forward will this tool only be available on nightly?

  Installing pqrs v0.2.2
error: failed to compile `pqrs v0.2.2`, intermediate artifacts can be found at `/tmp/cargo-installWmgUiA`

Caused by:
  failed to download `once_cell v1.15.0`

Caused by:
  unable to get packages from source

Caused by:
  failed to parse manifest at `/home/amooren/.cargo/registry/src/github.com-1ecc6299db9ec823/once_cell-1.15.0/Cargo.toml`

Caused by:
  feature `edition2021` is required

  this Cargo does not support nightly features, but if you
  switch to nightly channel you can add
  `cargo-features = ["edition2021"]` to enable this feature

Fails to read file(s)

When calling head or cat on a single largish file, pqrs v0.2.2 (and previous versions) seems to open lots of files. Why?
I only want to read one line at a time and translate .parquet to .csv:

$ pqrs cat --csv live_int8.parquet > live.csv

#######################
File: live_int8.parquet
#######################

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 24, kind: Uncategorized, message: "Too many open files" }', /home/runner/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-12.0.0/src/util/io.rs:82:50
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
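
Until the reader holds fewer handles open at once, raising the per-process file-descriptor limit for the shell session may work around the error (a generic POSIX workaround, not a pqrs feature):

ulimit -n 4096
pqrs cat --csv live_int8.parquet > live.csv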

Hi, need help with the project?

Hi,

I was looking for a command-line util to read Parquet files and found your project. I was wondering, would you be interested in contributions to the project, e.g. open issues, new features, etc.?

I'm a software engineer (mostly backend for professional work), and recently took an interest in learning Rust. Practice is the best way to learn so let me know what you think!

Thanks, have a nice day

Robo

Don't show file header when outputting json

When using pqrs cat --json the output still contains the file headers for each file, making it much less useful for quickly converting a file or a bunch of files to JSON.

I feel like the headers should not be present at all when outputting JSON or CSV. There could be an additional flag to add them.

Schema Command Should Support Structured Output

Stumbled upon this project and it looks like a real missing link in parquet tooling.

For the schema subcommand, it would be nice if there were an optional way to output the data in a structured form (e.g. JSON) for consumption by other tools. Something like pqrs schema data.parquet --json. Ideally both the simple and detailed outputs would support it, but even capturing only the non-detailed data would be handy.
