circulosmeos / gztool

Extract random-positioned data from gzip files with no penalty, including gzip tailing like with 'tail -f'!

Home Page: https://circulosmeos.wordpress.com/2019/08/11/continuous-tailing-of-a-gzip-file-efficiently/


gztool

GZIP files indexer, compressor and data retriever. Create small indexes for gzipped files and use them for quick and random data extraction. No more waiting when the end of a 10 GiB gzip is needed!

See Installation for Ubuntu, the Release page for executables for your platform, and Compilation in case you want to compile the tool.

Also, a magic file is provided to correctly identify gztool's index files with the Linux file command: you can append it to (or overwrite your empty) /etc/magic file, or append/copy it to your home directory as ~/.magic (note the dot at the start of the name).

Considerations

  • Please note that the initial complete index creation for an already gzip-compressed file (-i) still takes as much time as a complete decompression of the file.
    Once created, the index reduces access time to the gzip file data.

Nonetheless, note that gztool creates the index interleaved with data extraction (-b), so in practice no time is wasted. If data extraction or index creation is stopped at any moment, gztool will reuse the existing partial index on its next run over the same data, so time consumption is always minimized.

Also, gztool can monitor a growing gzip file (for example, a log created by rsyslog directly in gzip format) and generate the index on-the-fly, thus reducing index creation time to virtually zero. See the -S (Supervise) option.

  • Index size is about 0.33% of the compressed gzip file size if the index is created after the file was compressed, or 10-100 times smaller (just a few bytes or kiB) if gztool itself compresses the data (-c).

Note that the size of the index depends on the span between index points on the uncompressed stream - by default it is 10 MiB: this means that when retrieving randomly situated data, only 10/2 = 5 MiB of uncompressed data must be decompressed on average, no matter the size of the gzip file - which is a fairly low value!
The span between index points can be adjusted with the -s (span) option (the minimum is -s 1, i.e. 1 MiB).
For example, a span of -s 20 will create indexes half the size, and -s 5 will create indexes twice as large.
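The arithmetic above can be sketched quickly (the file size here is an illustrative assumption, and real index size also depends on how well each window compresses):

```python
# Rough model of the numbers discussed above (illustrative, not gztool internals).
SPAN_MiB = 10                        # default span between index points (-s 10)
uncompressed_MiB = 10 * 1024         # e.g. a 10 GiB uncompressed stream

points = uncompressed_MiB // SPAN_MiB   # one access point per span
avg_decompressed_MiB = SPAN_MiB / 2     # on average, half a span is decompressed
                                        # before reaching the requested byte

assert points == 1024
assert avg_decompressed_MiB == 5.0
```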

Background

By default, gzip-compressed files cannot be accessed randomly: reading any byte at position N requires decompressing the gzip file from the beginning up to byte N.
Nonetheless, Mark Adler, the author of zlib, provided years ago a cryptic file named zran.c that creates an "index" of "windows" filled with 32 kiB of uncompressed data at different positions along the un/compressed file, which can be used to initialize the zlib library and make it behave as if compressed data began there.
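The entry-point idea can be illustrated with Python's zlib (a simplified sketch: zran.c saves a 32 KiB window so it can resume anywhere, whereas here we use a full-flush point, which needs no window at all - the same kind of clean cut point that gztool's -Z option produces):

```python
import zlib

# Build a raw-deflate stream containing a "full flush" entry point.
co = zlib.compressobj(wbits=-15)            # raw deflate, no gzip header
part1 = co.compress(b"A" * 100_000)
part1 += co.flush(zlib.Z_FULL_FLUSH)        # window reset: a clean cut point
entry = len(part1)                          # compressed offset of the cut
stream = part1 + co.compress(b"B" * 100_000) + co.flush()

# Decompress starting at the recorded entry point: no earlier bytes needed.
tail = zlib.decompressobj(wbits=-15).decompress(stream[entry:])
assert tail == b"B" * 100_000
```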

gztool builds upon zran.c to provide a useful command line tool.
Also, some optimizations and brand new behaviours have been added:

  • gztool can correctly read incomplete gzip-concatenated-files (using -p), that is, a gzip composed of a concatenation of gzip files, some of which are not correctly terminated. This can happen, for example, when using rsyslog's veryRobustZip omfile option and the process that is logging is abruptly terminated and then restarted.
  • gztool can store line numbering information in the index (use only if source data is text!), and retrieve data from a specific line number using -L. (Using -[xXz] when creating the index selects Unix new line format (default), old Mac new line format, or no line information respectively.)
  • gztool can Supervise a still-growing gzip file (for example, a log created by rsyslog directly in gzip format) and generate the index on-the-fly, thus reducing index creation time to virtually zero. See -S.
  • extraction of data and index creation are interleaved, so there's no waste of time for the index creation.
  • index files are reusable, so they can be stopped at any time and reused and/or completed later.
  • a brand-new index file format was designed to store the index
  • span between index points is raised by default from 1 MiB to 10 MiB, and can be adjusted with -s (span).
  • windows are stored compressed in the index file
  • windows are not loaded in memory unless they're needed, so the application memory footprint is fairly low (< 1 MiB)
  • gztool can compress files (-c) and at the same time generate an index that is about 10-100 times smaller than if the index is generated after the file has already been compressed with gzip.
  • Compatible with bgzip files (short-uncompressed complete-gzip-block sizes)
  • Compatible with complete gzip-concatenated-files (aka gzip members)
  • Compatible with rsyslog's veryRobustZip omfile option (variable-short-uncompressed complete-gzip-block sizes)
  • data can be provided from/to stdin/stdout
  • gztool can be used to remotely retrieve just a small part of a bigger gzip-compressed file and successfully decompress it locally. See this stackexchange thread. Just note that the gztool index file must also be available.
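The "gzip members" compatibility mentioned above refers to a concatenation of complete gzip streams in one file, which any conforming reader decompresses transparently - Python's gzip module shows the behavior:

```python
import gzip

# Two complete, independent gzip streams ("members") glued together:
member1 = gzip.compress(b"first member\n")
member2 = gzip.compress(b"second member\n")

# A conforming reader decompresses the whole concatenation:
data = gzip.decompress(member1 + member2)
assert data == b"first member\nsecond member\n"
```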

Installation

Compilation

The zlib archive library (libz.a) is needed in order to compile gztool: on Debian/Ubuntu the package providing it is zlib1g-dev (the name may vary on your system):

$ sudo apt-get install zlib1g-dev

$ gcc -O3 -o gztool gztool.c -lz -lm

If you wish you can use autoconf to check the dependencies, build and test gztool:

$ autoreconf && ./configure && make check

This will produce a binary named gztool.

Compilation in Windows

Compilation in Windows is possible using gcc for Windows and compiling the original zlib code to obtain the needed archive library libz.a.
Please note that executables for different platforms are provided on the Release page.

  • download gcc for Windows: mingw-w64

  • Install it and add the path for gcc.exe to your Windows PATH

  • Download zlib code and compile it with your new gcc: zlib

  • The previous step generates the file libz.a that you need to compile gztool: copy gztool.c to the directory where you compiled zlib, and do:

    gcc -static -O3 -I. -o gztool gztool.c libz.a -lm

Usage

  gztool (v1.6.1)
  GZIP files indexer, compressor and data retriever.
  Create small indexes for gzipped files and use them
  for quick and random-positioned data extraction.
  No more waiting when the end of a 10 GiB gzip is needed!
  //github.com/circulosmeos/gztool (by Roberto S. Galende)

  $ gztool [-[abLnsv] #] [-[1..9]AcCdDeEfFhilpPrRStTwWxXzZ|u[cCdD]] [-I <INDEX>] <FILE>...

  Note that actions `-bcStT` proceed to an index file creation (if
  none exists) INTERLEAVED with data flow. As data flow and
  index creation occur at the same time there's no waste of time.
  Also you can interrupt actions at any moment and the remaining
  index file will be reused (and completed if necessary) on the
  next gztool run over the same data.

 -[1..9]: Factor of compression to use with `-[c|u[cC]]`, from
     best speed (`-1`) to best compression (`-9`). Default is `-6`.
 -a #: Await # seconds between reads when `-[ST]|Ec`. Default is 4 s.
 -A: Modifier for `-[rR]` to indicate the range of bytes/lines in
     absolute values, instead of the default incremental values.
 -b #: extract data from indicated uncompressed byte position of
     gzip file (creating or reusing an index file) to STDOUT.
     Accepts '0', '0x', and suffixes 'kmgtpe' (^10) 'KMGTPE' (^2).
 -C: always create a 'Complete' index file, ignoring possible errors.
 -c: compress a file like with gzip, creating an index at the same time.
 -d: decompress a file like with gzip.
 -D: do not delete original file when using `-[cd]`.
 -e: if multiple files are indicated, continue on error (if any).
 -E: end processing on first GZIP end of file marker at EOF.
     Nonetheless with `-c`, `-E` waits for more data even at EOF.
 -f: force file overwriting if index file already exists.
 -F: force index creation/completion first, and then action: if
     `-F` is not used, index is created interleaved with actions.
 -h: print brief help; `-hh` prints this help.
 -i: create index for indicated gzip file (For 'file.gz' the default
     index file name will be 'file.gzi'). This is the default action.
 -I string: index file name will be the indicated string.
 -l: check and list info contained in indicated index file.
     `-ll` and `-lll` increase the level of index checking detail.
 -L #: extract data from indicated uncompressed line position of
     gzip file (creating or reusing an index file) to STDOUT.
     Accepts '0', '0x', and suffixes 'kmgtpe' (^10) 'KMGTPE' (^2).
 -n #: indicates that the first byte on compressed input is #, not 1,
     and so truncated compressed inputs can be used if an index exists.
 -p: indicates that the gzip input stream may be composed of various
     incorrectly terminated GZIP streams, and so then a careful
     Patching of the input may be needed to extract correct data.
 -P: like `-p`, but when used with `-[ST]` implies that checking
     for errors in stream is made as quick as possible as the gzip file
     grows. Warning: this may lead to some errors not being patched.
 -r #: (range): Number of bytes to extract when using `-[bL]`.
     Accepts '0', '0x', and suffixes 'kmgtpe' (^10) 'KMGTPE' (^2).
 -R #: (Range): Number of lines to extract when using `-[bL]`.
     Accepts '0', '0x', and suffixes 'kmgtpe' (^10) 'KMGTPE' (^2).
 -s #: span in uncompressed MiB between index points when
     creating the index. By default is `10`.
 -S: Supervise indicated file: create a growing index,
     for a still-growing gzip file. (`-i` is implicit).
 -t: tail (extract last bytes) to STDOUT on indicated gzip file.
 -T: tail (extract last bytes) to STDOUT on indicated still-growing
     gzip file, and continue Supervising & extracting to STDOUT.
 -u [cCdD]: utility to compress (`-u c`) or decompress (`-u d`)
          zlib-format files to STDOUT. Use `-u C` and `-u D`
          to manage raw compressed files. No index involved.
 -v #: output verbosity: from `0` (none) to `5` (nuts).
     Default is `1` (normal).
 -w: wait for creation if file doesn't exist, when using `-[cdST]`.
 -W: do not Write index to disk. But if one is already available
     read and use it. Useful if the index is still under an `-S` run.
 -x: create index with line number information (win/*nix compatible).
     (Index counts last line even w/o newline char (`wc` does not!)).
     This is implicit unless `-X` or `-z` are indicated.
 -X: like `-x`, but newline character is '\r' (old mac).
 -z: create index without line number information.
 -Z: adjust index points to a byte boundary: no previous byte needed.

  EXAMPLE: Extract data from 1 GiB byte (byte 2^30) on,
  from `myfile.gz` to the file `myfile.txt`. Also gztool will
  create (or reuse, or complete) an index file named `myfile.gzi`:
  $ gztool -b 1G myfile.gz > myfile.txt

Please note that STDOUT is used for data extraction with the -bLtTu modifiers.

Examples of use

  • Make an index for test.gz. The index will be named test.gzi:

      $ gztool -i test.gz
    
  • Make an index for test.gz with name test.index, using -I:

      $ gztool -I test.index test.gz
    
  • -I can also be used to indicate the complete path to an index in another directory. This way the directory where the gzip file resides can be read-only, while the index is created in another, writable path:

      $ gztool -I /tmp/test.gzi test.gz
    
  • Retrieve data from uncompressed byte position 1000000 inside test.gz. An index file will be created at the same time (named test.gzi):

      $ gztool -b 1m test.gz
    
  • Supervise a still-growing gzip file and generate its index on-the-fly. The index file name will be openldap.log.gzi in this case. gztool will run until interrupted or until the supervised gzip file is overwritten from the beginning (this usually happens when compressed log files are rotated). It can also stop at the first end-of-gzip marker with -E.

      $ gztool -S openldap.log.gz
    
  • The previous command can be sent to the background with no verbosity, so we can forget about it:

      $ gztool -v0 -S openldap.log.gz &
    
  • Create an index for all "*gz" files in a directory:

      $ gztool -i *gz
    
      ACTION: Create index
    
      Index file 'data.gzi' already exists and will be used.
      (Use `-f` to force overwriting.)
      Processing 'data.gz' ...
      Index already complete. Nothing to do.
    
      Processing 'data_project.2.tar.gz' ...
      Built index with 73 access points.
      Index written to 'data_project.2.tar.gzi'.
    
      Processing 'project_2.gz' ...
      Built index with 3 access points.
      Index written to 'project_2.gzi'.
    
  • Extract data from project.gz byte at 1 GiB to STDOUT, and use grep on this output. Index file name will be project.gzi:

      $ gztool -b 1G project.gz | grep -i "balance = "
    
  • Please note that STDOUT is used for data extraction with the -bcdtT modifiers, so an explicit command line redirection is needed if output is to be stored in a file:

      $ gztool -b 99m project.gz > uncompressed.data
    
  • Extract data from a gzipped file whose index is still growing under a gztool -S process that is monitoring the (still-growing) gzip file: in this case -W prevents updating the index on disk, so the other process is not disturbed (this matters because gztool otherwise updates the index it uses whenever it thinks it necessary):

      $ gztool -Wb 100k still-growing-gzip-file.gz > mytext
    
  • Extract data from line 10 million, to STDOUT:

      $ gztool -L 10m compressed_text_file.gz
    
  • Nonetheless, note that if in the previous example an index had been created for the gzip file without the -x parameter (or without using -L), it contains no line-numbering info, so gztool will complain and stop. This can be circumvented by telling gztool to use a new index file name (-I), or by not using an index at all: combine -W (do not write index) with an index file name that doesn't exist (in this case None; it won't be created because of -W), so that (just this time) the gzip is processed from the beginning:

      $ gztool -L 10m -WI None compressed_text_file.gz
    
  • Extract all data from an rsyslog veryRobustZip file that contains dirty data. Such corrupted gzip files can arise when using rsyslog's veryRobustZip omfile option and the process that is logging is abruptly terminated and then restarted - this produces an incorrectly terminated gzip stream followed by another gzip stream in the same file. Neither gzip nor zlib can read these files beyond the point of error, but gztool can correctly extract all data (and only good data) using the -p (patch) parameter:

      $ gztool -p -b0 compressed_text_file.gz
    

This creates, as usual, the index file compressed_text_file.gzi. In order to not create it, -W (do not Write index) can be used:

    $ gztool -pWb0 compressed_text_file.gz

Note that -p can require up to twice the decompression time, because it performs two decompression passes: the usual one, and another that runs ahead of it to detect errors, mark them, and find new entry points at which decompression can end/restart, circumventing the problems.

Note also that these corrupted gzip files should always be decompressed with the -p parameter, even if a gztool index file exists for them, because the index file stores entry points but does not store where errors occur in the gzip file. That said, if the -[bL] point of extraction is beyond the point(s) of error in the gzip file and an index file exists, then decompression can proceed fine without -p, as the index points stored in the index file are always clean.

  • When tailing a still-growing gzip file (-T) that could contain errors at some point, one may still want to obtain output from the gzip stream as soon as possible - this is what the patching option -P is for (like -p but capitalized): with -p, gztool decompresses the stream about 48 kiB ahead of the output that is actually shown/written, in order to catch possible gzip-stream errors ahead of the output and so always maintain a clean output without error-introduced artifacts. This has the side effect that output must always wait for those 48 kiB of data to be available in advance, which can take a very long time if the file grows slowly. With -P the buffer-ahead restriction is relaxed to just as few bytes as are available before reaching end-of-file and waiting for new data, so responsiveness is as quick as without -p. The side effect of -P is that, depending on the gzip file, some errors may lead to incorrect output being shown/written - though in that case a "PATCHING WARNING" is shown (on stderr).

      $ gztool -PT application_log.gz
      ...
      PATCHING: Gzip stream error found @ 15745693 Byte.
      PATCHING WARNING:
          Part of data extracted after (compressed 15728640 / uncompressed 92783074) Byte
          collides with previously extracted data, after a badly-ended gzip stream
          was found @15745693 and a new starting point began @15700759.
      PATCHING: New valid gzip full flush found @ 15700759 Byte.
      ...
    

The same applies to -S though in this case there's no output, as only the index is being constructed:

    $ gztool -PS application_log.gz
    ...
    PATCHING: Gzip stream error found @ 15745693 Byte.
    PATCHING WARNING:
        Data extracted around the patching point may overlap.
    PATCHING: New valid gzip full flush found @ 15700759 Byte.
    ...
  • To tail to stdout, like tail -f, a still-growing gzip file (an index file will be created, named still-growing-gzip-file.gzi in this case):

      $ gztool -T still-growing-gzip-file.gz
    
  • More on files still being "Supervised" (-S) by another gztool instance: they can also be tailed à la tail -f without updating the index on disk, using -W:

      $ gztool -WT still-growing-gzip-file.gz
    
  • Compress (-c) a still-growing (-E) file: in this case both still-growing-file.gz and still-growing-file.gzi will be created on-the-fly as the source file grows. Note that in order to terminate compression, Ctrl+C must be used to kill gztool: this results in a gzip file that is incomplete as per the GZIP standard, but that is not important, as it will contain all the source data, and gzip, gztool, and any other tool can correctly and completely decompress it.

      $ gztool -Ec still-growing-file
    
  • If you have an incomplete index file (it simply doesn't store the length of the source data, as it didn't finish correctly) and want to complete it so that the length of the uncompressed data is stored, just unconditionally complete it with -C in a new -i run over your gzip file. Note that the existing index data is reused (in this case the file my-incomplete-gzip-data.gzi), so only the last compressed bytes are decompressed to complete this action:

      $ gztool -Ci my-incomplete-gzip-data.gz
    
  • Decompress a file as with gzip (-d), but do not delete (-D) the original: the decompressed file will be myfile. Note that the gzipped file must have a ".gz" extension or gztool will complain.

      $ gztool -Dd myfile.gz
    
  • Decompress a file that does not have ".gz" file extension, like with gzip (-d):

      $ cat mycompressedfile | gztool -d > my_uncompressed_file
    
  • Show the internals of all index files in this directory. -e is used so the process does not stop at the first error, in case some *.gzi file is not a valid gzip index. Repeating the list option (-ll) shows data about each index point; -lll also decompresses each point's window to verify index integrity:

      $ gztool -ell *.gzi
    
          Checking index file 'accounting.gzi' ...
          Size of index file (v0)  :   184577 Bytes (0.37%/gzip)
          Guessed gzip file name   : 'accounting.gz' (66.05%) ( 50172261 Bytes )
          Number of index points   : 15
          Size of uncompressed file: 147773440 Bytes
          Compression factor       : 66.05%
          List of points:
          #: @ compressed/uncompressed byte (window data size in Bytes @window's beginning at index file) !bits needed from previous byte, ...
          #1: @ 10 / 0 ( 0 @56 ) !0, #2: @ 3059779 / 10495261 ( 13127 @80 ) !2, #3: @ 6418423 / 21210594 ( 6818 @13231 ) !0, #4: @ 9534259 / 31720206 ( 7238 @20073 ) !7...
      ...
    

If gztool finds the gzip file companion of the index file, some statistics are shown, like the index/gzip size ratio, or the ratio of compression of the gzip file.

Also, if the gzip is complete, the size of the uncompressed data is shown. This number is interesting if the gzip file is bigger than 4 GiB, in which case gunzip -l cannot calculate it correctly as it is limited to a 32-bit counter, or if the gzip file is in bgzip format, in which case gunzip -l would only show data about the first block (< 64 kiB).

  • Note that gztool -l tries to guess the companion gzip file of the index looking for a file with the same name, but without the i of the .gzi file name extension, or without the .gzi. But the gzip file name can also be directly indicated with this format:

      $ gztool -l -I index_filename gzip_filename
    

In this latter case only one pair of index + gzip filenames can be indicated per invocation.

  • Use a truncated gzip file (the first 100000 bytes have been removed: not zeroed, removed; if they were zeroed the same cautions apply, but -n is not needed) to extract from byte 20 MiB, using a previously generated index. As long as the -b parameter refers to a byte after an index point (see -ll), and -n is less than the position of that needed index point, this is always possible. In this case -I gzip_filename.gzi is implicit:

      $ gztool -n 100001 -b 20M gzip_filename.gz
    

Take into account that, as shown, the first byte of the truncated gzip_filename.gz file is numbered 100001; that is, the bytes retain the ordinal they had in the original file (which is why the first byte is not numbered 1).

Please note that the index point positions stored in the index file may also require the previous byte to be available in the truncated gzip file, as a gzip stream is not byte-aligned but a stream of pure bits. Thus, if you're thinking of truncating a gzip file, always cut at least one byte before the indicated index point in the gzip: as said, that byte may not be needed, but in 7 of 8 cases it is. Another option is to use -Z when creating the index, as indicated below.

  • Create an index for a gzip file in which every index entry point is adjusted to a byte boundary, so no bits from the previous byte are needed. Note that in general the byte at which an index entry point begins is not a clean cut point, as the gzip window may need up to 7 bits from the previous byte: gzip is a bit-level stream compressor. With -Z the cut point is always clean and no bits from the previous byte are required. This will result in index points spaced by more than -s MiB between them (10 MiB by default), and thus possibly fewer points in the index; but it is completely safe and sound.

      $ gztool -Z my_gzip_file.gz
    

-Z exists since gztool v1.6.0.

  • Since v1.5.0, using -[fW] (-f: force index overwriting; -W: do not write index) with -[ST] (-S: create index on a still-growing gzip file; -T: tail and continue decompressing to stdout) tells gztool to continue operating even after the source file is overwritten. If -f is used, the index file will be overwritten too. For example:

      $ gztool -WT log_filename.gz
      ...
      File overwriting detected and restarting decompression...
      Processing 'log_filename.gz' ...
    

Index file format

Index files are created by default with extension '.gzi' appended to the original file name of the gzipped file:

filename.gz     ->     filename.gzi

If the original file doesn't have ".gz" extension, ".gzi" will be appended - for example:

filename.tgz     ->     filename.tgz.gzi
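The naming rule can be written down as a tiny helper (an illustrative sketch; `default_index_name` is a hypothetical name, not part of gztool):

```python
def default_index_name(gzip_name: str) -> str:
    """Default index file name per the rule above."""
    if gzip_name.endswith(".gz"):
        return gzip_name + "i"        # filename.gz  -> filename.gzi
    return gzip_name + ".gzi"         # filename.tgz -> filename.tgz.gzi

assert default_index_name("filename.gz") == "filename.gzi"
assert default_index_name("filename.tgz") == "filename.tgz.gzi"
```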

There's a special header to mark ".gzi" files as index files usable for this app:

+-----------------+-----------------+
|   0x0 64 bits   |    "gzipindx"   |     ~     16 bytes = 128 bits
+-----------------+-----------------+

This is the "version 0" header, which does not contain line information. The header indicating that the index contains line information is a "version 1" header, differing only in the capital "X" (each index point in this case contains an additional 64-bit counter to account for lines). Future versions (if any) would use the "gzipindx" string with lower- and upper-case letters following a binary count, as if they were binary digits.

+-----------------+-----------------+
|   0x0 64 bits   |    "gzipindX"   |     version 1 header (index was created with `-[xX]` parameter)
+-----------------+-----------------+

Note that this header has been designed so that the format is "compatible" with index files generated for bgzip-compressed files. bgzip files are totally compatible with gzip: they are simply made so that every 64 kiB of uncompressed data the zlib library is restarted, so they are composed of independent gzipped blocks one after another. The bgzip command can create index files for bgzipped files in less time and with much less space than this tool, as such files are already almost random-access-capable. The first 64-bit value of a bgzip index file is the count of index pairs that follow, so with this 0x0 header gztool index files can be ignored by the bgzip command, and thus bgzipped and gzipped files and their indexes can live in the same folder without collision.

All numbers are stored in big-endian byte order (platform independently). Big-endian numbers are easier to interpret than little-endian ones when inspecting them with a hex viewer (od, for example).
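A sketch of recognizing that 16-byte header in Python (the version strings come from the format description above; `index_version` is a hypothetical helper name):

```python
import struct

V0 = b"\x00" * 8 + b"gzipindx"   # version 0: no line-number info
V1 = b"\x00" * 8 + b"gzipindX"   # version 1: with line-number info (-[xX])

def index_version(header: bytes):
    """Return 0 or 1 for a gztool index header, None otherwise."""
    if len(header) < 16 or header[:8] != b"\x00" * 8:
        return None                  # bgzip would read an index-pair count here
    return {b"gzipindx": 0, b"gzipindX": 1}.get(header[8:16])

assert index_version(V0) == 0
assert index_version(V1) == 1

# All numbers in the index are big-endian, e.g. a 64-bit counter of 73:
assert struct.pack(">Q", 73).hex() == "0000000000000049"
```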

Next, an almost one-to-one pass of the struct access is serialized to the file. access->have and access->size are both written even though they'd always be equal. If the index file is generated with -S or -T on a still-growing gzip file (or the index somehow hasn't been completed because the gzip data was still incomplete), the values on disk for access->have and access->size will be, respectively, 0x0..0 and the number of index points actually written (both uint64_t), to mark this fact. access->size MAY be UINT64_MAX to avoid having to rewrite this value as index points are appended to the file: as the index is incremental, the number of points can be determined by reading the index until EOF. access->have MAY also be greater than zero but lower than access->size: this can occur when an already finished index is extended with new points (the source gzip may have grown) - in this case the index is also considered incomplete. When the index is correctly closed, both numbers will have the same value (a Ctrl+C before that would leave the index "incomplete", but usable for subsequent runs, in which it can be finished).
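The convention just described can be condensed into a small predicate (a sketch of the rule as stated above, not gztool's actual code):

```python
UINT64_MAX = 2**64 - 1

def index_is_complete(have: int, size: int) -> bool:
    """Per the on-disk convention above: have == 0 marks a still-growing
    index, size == UINT64_MAX defers the point count to EOF, and
    have < size marks an index being extended; only matching non-zero
    counters mean a correctly closed, complete index."""
    if have == 0 or size == UINT64_MAX:
        return False
    return have == size

assert index_is_complete(73, 73) is True
assert index_is_complete(0, 73) is False    # still growing / interrupted
assert index_is_complete(15, 73) is False   # finished index being extended
```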

After that comes all the struct point data. As previously said, windows are stored compressed, so each is preceded by a 32-bit field with its length. Note that an index point with a window of size zero is possible.

After all the struct point structures data, the original uncompressed data size of the gzipped file is stored (64 bits).

Please note that not all stored numbers are 64 bits long, because some counters will always fit in fewer bits. Refer to the code.

With 64 bit long numbers, the index could potentially manage files up to 2^64 = 16 EiB (16 777 216 TiB).

Line number counting

Regarding line number counting (-[xX]), note that gztool's index counts the last line in the uncompressed data even if the last char isn't a newline - whilst the wc command does not count it in this case! Nonetheless, line counting when extracting data with -[bLtT] does follow the wc convention, in order not to obtain different (+/-1) results between gztool's output info and wc counts.

Also note that line counting, when the gzip file / index file isn't complete yet, always starts at 1. This is coherent with the previous statement, and it's also reasonable because if line counting is activated (-[xX]) there will presumably be lines ending with a newline char (or chars, in the case of Windows: CR+LF) somewhere in the stream.
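The difference between the two counting conventions, as an illustrative sketch (both function names are hypothetical):

```python
def wc_lines(data: bytes) -> int:
    """`wc -l` convention: count newline characters only."""
    return data.count(b"\n")

def index_lines(data: bytes) -> int:
    """Index convention described above: a trailing chunk with no final
    newline still counts as a line."""
    n = data.count(b"\n")
    if data and not data.endswith(b"\n"):
        n += 1
    return n

text = b"one\ntwo\nlast line without newline"
assert wc_lines(text) == 2
assert index_lines(text) == 3
```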

magic file

A magic file is provided to correctly identify gztool's index files with the Linux file command: you can append it to (or overwrite your empty) /etc/magic file, or append/copy it to your home directory as ~/.magic (note the dot at the start of the name).

Other tools which try to provide random access to gzipped files

  • also based on zlib's zran.c:

    • Perl module Gzip::RandomAccess. It seems to create the index only in memory, after decompressing the whole gzip file.
    • Go package gzran. It seems to create the index only in memory, after decompressing the whole gzip file.
    • jzran. It's a Java library based on the zran.c sample from zlib.
  • gzindex.c. Code written also by Mark Adler: Build an index for a gzip file and then test it. This code demonstrates the use of new capabilities in zlib 1.2.3.4 or later to create and use a random access index. It is called with the name of a gzip file on the command line. That file is then indexed and the index is tested by accessing the blocks in reverse order.

  • bgzip command, available in linux with package tabix (used for chromosome handling). This discussion about the implementation is very interesting: random-access-to-zlib-compressed-files. I've developed also a bgztail command tool to tail bgzipped files, even as they grow.

  • dictzip command: a format compatible with gzip that stores an index in the header of the file. Uncompressed size is limited to 4 GiB - see also idzip below. The dictionary header cannot be expanded if more gzip data is added, nor added to an existent gzip file - both issues are successfully managed by gztool.

  • idzip Python command and function, builds upon dictzip to overcome the 4 GiB limit of dictzip by using multiple gzip members.

  • GZinga: Seekable and Splittable GZip, provides Java language classes to create gzip-compatible compressed files using the Z_FULL_FLUSH option, to later access or split them.

  • indexed_gzip Fast random access of gzip files in Python: it also creates file indexes, but they are not as small and they cannot be reused as easily as with gztool.

  • lzopfs. Random access to compressed files with a FUSE filesystem: allows mounting lzop, gzip, bzip2, lzo, lzma and xz compressed files for random read-only access, i.e. files can be seeked arbitrarily without a performance penalty. It hasn't been updated for years, but there's a fork that provides CMake installation. See also this description of its internals.

  • pymzML v2.0. One key feature of pymzML is the ability to write and read indexed gzip files. It has its own index format. Read also this.

  • zindex creates SQLite indexes for text files based on regular expressions.

Other interesting links

Version

This version is v1.6.1.

Please, read the Disclaimer. In case of any errors, please open an issue.

License

A work by Roberto S. Galende
distributed under the same License terms covering
zlib from Mark Adler (aka Zlib license):
  This software is provided 'as-is', without any express or implied
  warranty.  In no event will the authors be held liable for any damages
  arising from the use of this software.
  Permission is granted to anyone to use this software for any purpose,
  including commercial applications, and to alter it and redistribute it
  freely, subject to the following restrictions:
  1. The origin of this software must not be misrepresented; you must not
     claim that you wrote the original software. If you use this software
     in a product, an acknowledgment in the product documentation would be
     appreciated but is not required.
  2. Altered source versions must be plainly marked as such, and must not be
     misrepresented as being the original software.
  3. This notice may not be removed or altered from any source distribution.

/* zlib.h -- interface of the 'zlib' general purpose compression library
  version 1.2.11, January 15th, 2017
  Copyright (C) 1995-2017 Jean-loup Gailly and Mark Adler
  This software is provided 'as-is', without any express or implied
  warranty.  In no event will the authors be held liable for any damages
  arising from the use of this software.
  Permission is granted to anyone to use this software for any purpose,
  including commercial applications, and to alter it and redistribute it
  freely, subject to the following restrictions:
  1. The origin of this software must not be misrepresented; you must not
     claim that you wrote the original software. If you use this software
     in a product, an acknowledgment in the product documentation would be
     appreciated but is not required.
  2. Altered source versions must be plainly marked as such, and must not be
     misrepresented as being the original software.
  3. This notice may not be removed or altered from any source distribution.
  Jean-loup Gailly        Mark Adler
  [email protected]          [email protected]
  The data format used by the zlib library is described by RFCs (Request for
  Comments) 1950 to 1952 in the files http://tools.ietf.org/html/rfc1950
  (zlib format), rfc1951 (deflate format) and rfc1952 (gzip format).
*/

Disclaimer

This software is provided "as is", without warranty of any kind, express or implied. In no event will the authors be held liable for any damages arising from the use of this software.

Author

by Roberto S. Galende

on code by Mark Adler's zlib.

gztool's People

Contributors

circulosmeos, edwardbetts, skitt


gztool's Issues

gzip tailing like tail -F

Hi and thank you for this wonderful tool.

I was wondering if it was possible to continuously tail a gzipped file like you would with tail -F (note the capital -F, not -f!).
This allows the seamless restart of the tail process if the file disappears (e.g. if it is rotated).

This would be very useful for gzipped log files that are quickly rotated.
Currently, my only option, as far as I understand it, is to do something along the lines of:

while true; do
  gztool -wWTP -a1 -v0 test-log.log
done

which is admittedly quite ugly. Do you see any other option?

Question: line-oriented vs byte

Hi,

I work with large text files using commands like 'tail -n +40000 file.txt | head -n 1000', which retrieves lines 40000-40999. This feeds chunks of data to a program. The program needs to know which line numbers are being processed, and it operates on multiple files at once, getting the same line-number blocks from each even though they have different data.

Is it possible to retrieve in line-oriented mode instead of by byte? I imagine the index could proxy bytes for line numbers: for example, determine the byte offset at each 1,000-line block of text when creating the index, so it knows that line 41000 is at byte #xyz. When creating the index, one would specify a line-oriented block size, e.g. 1,000.

This probably seems like a special application, but I think many/most people work with text files line by line rather than by byte count; otherwise lines get truncated, data is lost, and there is no control over which lines are retrieved. Thanks for your consideration and work on this project!
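For a concrete baseline of the pattern described, here is the same extraction against a synthetic 50,000-line gzip. Note this pipeline decompresses the entire prefix each time, which is exactly the cost a line-aware index would avoid (any gztool line-extraction flag is deliberately left out, since its availability depends on the version):

```shell
# build a 50,000-line gzip and pull lines 40000-40999 the slow way:
# the whole prefix is decompressed just to reach line 40000
seq 1 50000 | gzip > /tmp/big.txt.gz
gzip -dc /tmp/big.txt.gz | tail -n +40000 | head -n 1000 > /tmp/chunk.txt
head -n 1 /tmp/chunk.txt   # first extracted line: 40000
wc -l < /tmp/chunk.txt     # 1000 lines extracted
```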

Segmentation fault -z -b 0

Hey, I'm reporting something where I want the block index, but I am doing my own line index with -b 0, so -z looks like the thing.

Then this happens:

$ gztool -z ~/work/fumes/data/galaxy_1day.json.gz -I FOOIdx.gzi -b 0 >/dev/null
ACTION: Extract from byte = 0

Processing '/home/jim/work/fumes/data/galaxy_1day.json.gz' ...
Processing index to 'FOOIdx.gzi'...
Segmentation fault


The OS distro is a recent Gentoo, world-built about 2 months ago with gcc -O2.

The gztool binary is from clang-15 -l{z,m} -O3 -flto -o gztool gztool.c.

It was also plain old gcc before, with the same result.

I think I could possibly get a cheap build with zig to test different libcs (which would output clang), but I'd need to rtfm and follow up on that.

Can we make it so I can 'write' gzi files to a different directory?

I have a situation where I have given a user read-only access to a growing log file, but don't want to give them write access to the logs directory/files. Is there a way to use gztool to reference an index in a DIFFERENT directory than the fully readable growing gzip file? I want them to be able to tail -f the file without read/write access to the gzip file.
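Since -I names the index file explicitly (as used throughout these issues), the index can live in a directory the user can write to while the log directory stays read-only. A sketch with illustrative paths; the gztool line itself is commented out because the exact tailing flags depend on the use case:

```shell
# simulate the setup: gzip readable, its directory not writable,
# index redirected elsewhere via -I (gztool invocation is illustrative)
mkdir -p /tmp/ro-logs /tmp/rw-indexes
printf 'log line\n' | gzip > /tmp/ro-logs/app.log.gz
chmod 555 /tmp/ro-logs                       # user can read, not write
# gztool -T -I /tmp/rw-indexes/app.log.gzi /tmp/ro-logs/app.log.gz
gzip -t /tmp/ro-logs/app.log.gz && echo ok   # the gzip itself stays readable
chmod 755 /tmp/ro-logs
```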

Feature request: Resume index creation for stdin

Hi Roberto,

Merry Christmas! I hope you are well.

Could you please enable the -n argument to be used during index creation?

Something like:
gztool.exe -n 56343039741 -I "file.gzi"

Then I pass in compressed data (starting at byte 56343039741) into stdin.

Thank you,
Fidel

Background
I have in-memory compressed data, and I am currently creating an index using:
gztool.exe -I "file.gzi"
During index creation, the user may close my program without warning. I'd like to resume index creation when they run the program again. So I plan to use -ll to see what compressed byte we got up to last time, then use the -n argument to resume where we left off. This would avoid having to send all the data into stdin again.

Feature request: Random access when using stdin

Hi Roberto,

Could you please implement a new argument, which informs gztool which compressed byte it's about to receive in stdin?

Something like:
gztool.exe -ecb 1501239288 -b 3410890441

Where -ecb stands for 'Expect compressed byte'

Background
The -b argument is great for telling gztool which uncompressed byte to start extracting from. That works really well when working with concrete files.

However, I'm writing a program which doesn't use concrete files, but rather in-memory content. In order to use the -b argument, I've used the technique you described here, which makes use of sparse files.

For example, if I would like to get uncompressed bytes 3,410,890,441 - 4,000,000,000, I first run gztool.exe -ll to work out which part of the compressed file contains that portion. In this case 1,501,239,288 - 1,630,240,888. I then create a sparse file and populate that section with the compressed data. (The sparse file in this example only takes up 123MB). Then finally I run gztool on the sparse file:
gztool -W -I "file.gzi" -b 3410890441 <sparseGzipFilename> and read from stdout.

That technique works perfectly. The only issue is that there is unnecessary IO (I have to populate a sparse file every time I want to decompress something). So using the -ecb argument would mean I could just send compressed data straight into gztool's stdin, and it would output decompressed data.

Cheers,
Fidel
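The sparse-file step described above can be sketched with dd: seeking past end-of-file leaves a hole, so only the written chunk consumes real disk space. Offsets are shrunk here for illustration:

```shell
# place a chunk at its original compressed offset inside a sparse file;
# everything before the seek offset is a hole, not real disk usage
printf 'compressed-chunk' > /tmp/chunk.bin
rm -f /tmp/sparse.gz
dd if=/tmp/chunk.bin of=/tmp/sparse.gz bs=1 seek=1000000 conv=notrunc 2>/dev/null
stat -c %s /tmp/sparse.gz   # apparent size: 1000016 bytes
```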

Dead code when checking input

The code at line 3844:

    if ( span_between_points != SPAN &&
        ( action == ACT_COMPRESS_CHUNK || action == ACT_DECOMPRESS_CHUNK || action == ACT_LIST_INFO )
        ) {
        printToStderr( VERBOSITY_NORMAL, "ERROR: `-s` parameter does not apply to `-[lu]`.\n" );
        return EXIT_INVALID_OPTION;
    }

will never run, as this case is covered by code at 3809-3827:

    if ( ( action == ACT_COMPRESS_CHUNK || action == ACT_DECOMPRESS_CHUNK ) &&
        ( force_action == 1 || force_strict_order == 1 || write_index_to_disk == 0 ||
            span_between_points != SPAN || index_filename_indicated == 1 ||
            end_on_first_proper_gzip_eof == 1 || always_create_a_complete_index == 1 ||
            waiting_time != WAITING_TIME )
        ) {
        printToStderr( VERBOSITY_NORMAL, "ERROR: `-[aCEfFIsW]` does not apply to `-u`\n" );
        return EXIT_INVALID_OPTION;
    }

    if ( ( action == ACT_LIST_INFO ) &&
        ( force_action == 1 || force_strict_order == 1 || write_index_to_disk == 0 ||
            span_between_points != SPAN ||
            end_on_first_proper_gzip_eof == 1 || always_create_a_complete_index == 1 ||
            waiting_time != WAITING_TIME )
        ) {
        printToStderr( VERBOSITY_NORMAL, "ERROR: `-[aCEfFsW]` does not apply to `-l`\n" );
        return EXIT_INVALID_OPTION;
    }

I suggest removing the handling of the span_between_points != SPAN case from lines 3809-3827, so that the more specific check at 3844 can actually run.

Feature request: zsttool

Hi Roberto,

Going out on a limb here, but do you think you can make a tool to index Zstandard files?

Functional question.

I would like to be able to unzip truncated data with a range request without using transfer encoding.
How do I raise the first index point so that I can make larger cuts?

Thanks : )
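For intuition on how the spacing of index points affects range requests (the -s option controls the span between points, per the option checks quoted in another issue here), a back-of-envelope count of index points for a given span; the 10 GiB figure and the ceiling formula are illustrative assumptions, not gztool internals:

```shell
# number of index points ~= ceil(uncompressed_size / span);
# a larger -s value means fewer, coarser points and larger minimum "cuts"
size=$(( 10 * 1024 * 1024 * 1024 ))        # 10 GiB uncompressed (example)
for span_mib in 10 100; do
    span=$(( span_mib * 1024 * 1024 ))
    echo "$span_mib MiB span -> $(( (size + span - 1) / span )) points"
done
```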

Action create index for a gzip from STDIN not working

Hello, thanks for maintaining this amazing tool :)

I need to use the reading-from-stdin feature, but it is not working any more in version 1.5.0.
I tried version 1.4.3 and it works fine.

Steps to reproduce:

$ git clone https://github.com/circulosmeos/gztool.git
$ git fetch
$ git checkout v1.5.0
$ automake --add-missing && autoreconf && ./configure && make check
$ cat tests/gplv3.txt.gz | ./gztool -I index.bin
ACTION: Create index for a gzip file

Processing STDIN ...
Processing index to 'index.bin'...
ERROR: Compressed data error in STDIN.
$ make clean
$ git checkout v1.4.3
$ automake --add-missing && autoreconf && ./configure && make check
$ cat tests/gplv3.txt.gz | ./gztool -I index.bin
ACTION: Create index for a gzip file

Index file 'index.bin' already exists and will be used.
Processing STDIN ...
Processing index to 'index.bin'...
$ ./gztool -ell index.bin
ACTION: Check & list info in index file

Checking index file 'index.bin' ...
	Size of index file (v1)  : 84.00 Bytes (84 Bytes)
	Number of index points   : 1
	Size of uncompressed file: 31.27 kiB (32024 Bytes)
	Number of lines          : 207
	Compression factor       : Not available
	List of points:
	#: @ compressed/uncompressed byte L#line_number (window data size in Bytes @window's beginning at index file), ...
#1: @ 20 / 0 L1 ( 0 @60 ), 

1 files processed

Thank you!

Only extract a subset of lines using gztool

Hi,

I've been searching for a little while for a way to perform seek and tell operations on gzip files, which I need for one of my projects, and finally found your work. Thank you so very much for this!

However, it seems like the extraction functionalities always output the whole file, starting from the given line / offset.
I would need a way to use gztool to, say, for instance, "Extract lines between 12 and 18", or "Extract content between offset 125 and offset 1500", but I don't seem to find any option allowing this?

I'm currently browsing through the source code trying to see if I can slightly alter it to fit my needs, but maybe I'm just overlooking an option that already exists?

Thanks,
Pierre

an attempt at porting an index client

This C struct has 2 plain ints:

struct access {
    uint64_t have;      /* number of list entries filled in */
    uint64_t size;      /* number of list entries allocated */
    uint64_t file_size; /* size of uncompressed file (useful for bgzip files) */
    struct point *list; /* allocated list */
    char *file_name;    /* path to index file */
    int index_complete; /* 1: index is complete; 0: index is (still) incomplete */
// index v1:
    int index_version;  /* 0: default; 1: index with line numbers */
    uint32_t line_number_format; /* 0: linux \r | windows \n\r; 1: mac \n */
    uint64_t number_of_lines; /* number of lines (only used with v1 index format) */
};

The code below is an attempt to port the index reading (and eventually apply an inflater library to the points), but I'm hitting buffer underruns with the ByteBuffers, and I cannot rule out that sizeof(int) might be other than 32 bits.

package ed.fumes

import java.io.File
import java.nio.ByteBuffer
import java.nio.ByteOrder
import java.nio.channels.FileChannel


object GzipIndexReader {
    data class Point(
        var out: Long,
        var `in`: Long,
        var bits: Int,
        var windowSize: Int,
        var window: ByteArray?,
        var windowBeginning: Long,
        var lineNumber: Long,
    )

    data class Access(
        var have: Long,
        var size: Long,
        var fileSize: Long,
        var list: MutableList<Point>,
        var fileName: String,
        var indexComplete: Int,
        var indexVersion: Int,
        var lineNumberFormat: Int,
        var numberOfLines: Long,
    )

    val GZIP_INDEX_HEADER_SIZE = 16
    val GZIP_INDEX_IDENTIFIER_STRING = "gzipindx"
    val GZIP_INDEX_IDENTIFIER_STRING_V1 = "gzipindX"

    fun ByteBuffer.readLongBE(): Long {
        order(ByteOrder.BIG_ENDIAN)
        return long
    }

    fun ByteBuffer.readIntBE(): Int {
        order(ByteOrder.BIG_ENDIAN)
        return int
    }

    fun ByteBuffer.readBytes(n: Int): ByteArray {
        val arr = ByteArray(n)
        get(arr)
        return arr
    }

    fun createEmptyIndex(): Access {
        return Access(
            have = 0L,
            size = 0L,
            fileSize = 0L,
            list = mutableListOf(),
            fileName = "",
            indexComplete = 0,
            indexVersion = 0,
            lineNumberFormat = 0,
            numberOfLines = 0L
        )
    }

    fun addPoint(index: Access, point: Point) {
        index.list.add(point)
        index.have++
        index.size++
    }

    fun deserializeIndexFromFile(indexFile: File, loadWindows: Boolean = false, gzFilename: String): Access? {
        indexFile.inputStream().use { inputStream ->
            val channel = inputStream.channel
            val header = ByteArray(GZIP_INDEX_HEADER_SIZE)
            inputStream.read(header)

            if (header.sliceArray(0 until 8)
                    .toString(Charsets.UTF_8) != "\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000" ||
                !(header.sliceArray(8 until 16)
                    .toString(Charsets.UTF_8) == GZIP_INDEX_IDENTIFIER_STRING || header.sliceArray(8 until 16)
                    .toString(Charsets.UTF_8) == GZIP_INDEX_IDENTIFIER_STRING_V1)
            ) {
                println("ERROR: File is not a valid gzip index file.")
                return null
            }

            val indexVersion =
                if (header.sliceArray(8 until 16).toString(Charsets.UTF_8) == GZIP_INDEX_IDENTIFIER_STRING_V1) 1 else 0
            val index = createEmptyIndex()
            index.indexVersion = indexVersion

            if (indexVersion == 1) index.lineNumberFormat = channel.readIntBE()

            val indexHave = channel.readLongBE()
            val indexSize = channel.readLongBE()

            if (indexHave != indexSize) {
                println("Index file is incomplete.")
                index.indexComplete = 0
            } else index.indexComplete = 1

            for (i in 0 until indexSize) {
                val out = channel.readLongBE()
                val `in` = channel.readLongBE()
                val bits = channel.readIntBE()
                val windowSize = channel.readIntBE()

                val window: ByteArray? = if (loadWindows) {
                    val windowBytes = ByteBuffer.allocate(windowSize)
                    channel.read(windowBytes)
                    windowBytes.array()
                } else {
                    channel.position(channel.position() + windowSize)
                    null
                }

                val windowBeginning = channel.readLongBE()
                val lineNumber = if (indexVersion == 1) channel.readLongBE() else 0L

                val point = Point(
                    out = out,
                    `in` = `in`,
                    bits = bits,
                    windowSize = windowSize,
                    window = window,
                    windowBeginning = windowBeginning,
                    lineNumber = lineNumber
                )

                addPoint(index, point)
            }

            index.fileName = gzFilename
            index.numberOfLines = if (indexVersion == 1) channel.readLongBE() else 0L
            index.fileSize = indexFile.length()

            return index
        }
    }

    fun FileChannel.readIntBE(): Int {
        val buffer = ByteBuffer.allocate(4).order(ByteOrder.BIG_ENDIAN)
        read(buffer)
        buffer.flip()
        return buffer.int
    }

    fun FileChannel.readLongBE(): Long {
        val buffer = ByteBuffer.allocate(8).order(ByteOrder.BIG_ENDIAN)
        read(buffer)
        buffer.flip()
        return buffer.long
    }

}

fun main(args: Array<String>) {
    val inputFile = args[0]
    val gzfile = args.getOrNull(1) ?: (inputFile.replace(".gzi$".toRegex(), ".gz"))

    val index = GzipIndexReader.deserializeIndexFromFile(File(inputFile), true, gzfile)

    if (index != null) {
        println("Index loaded successfully.")
        println("Points: ${index.list.size}")
    } else {
        println("Failed to load index.")
    }
}

The outcome is that a buffer underflow is reported on the second window read; however, the debugger shows what look entirely like misaligned values, in the millions and trillions, for windowSize. This overruns the index whether it's completely in memory or there's a short read from file. Any insight appreciated.

Exception in thread "main" java.nio.BufferUnderflowException
at java.base/java.nio.Buffer.nextGetIndex(Buffer.java:710)
at java.base/java.nio.HeapByteBuffer.getLong(HeapByteBuffer.java:494)
at ed.fumes.GzipIndexReader.readLongBE(GzipIndexReader.kt:153)
at ed.fumes.GzipIndexReader.deserializeIndexFromFile(GzipIndexReader.kt:118)
at ed.fumes.GzipIndexReaderKt.main(GzipIndexReader.kt:162)
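Independently of the JVM ByteBuffer handling, the first 32 bytes of the layout the port assumes can be rebuilt and re-read with standard tools. This is a sketch of the v0 header only (8 zero bytes, the magic, then two big-endian uint64 counters), based on the constants quoted above:

```shell
# rebuild a minimal v0 header: 8 NUL bytes + "gzipindx" + have/size as BE uint64
printf '\000\000\000\000\000\000\000\000gzipindx' > /tmp/hdr.bin
printf '\000\000\000\000\000\000\000\003\000\000\000\000\000\000\000\003' >> /tmp/hdr.bin
dd if=/tmp/hdr.bin bs=1 skip=8 count=8 2>/dev/null; echo    # magic: gzipindx
od --endian=big -An -tu8 -j16 -N16 /tmp/hdr.bin             # the two counters: 3 3
```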

request: byte-aligned index blocks for low-tech zlib inflater

I am able to read the created index files using a jdk client and to prime the window as needed and set up the streams to the needed positions.

The wall I run into is that the inflatePrime zlib function is absent from non-C libraries, which is true of at least 3 ports, including the official Oracle one.

The occurrence of non-zero bits in the index is roughly 7 in 8 (screenshot omitted).

In gzindex the indexes are not stored to disk; it's just a minimal unit test of what gztool does. The point struct stores the first 2 offsets in bits, not bytes.

I modified the loop conditionals of gzindex as shown below to change the input window to 1 and keep iterating the loop until arriving at a byte-aligned block boundary. I'm guessing this makes the block boundary slightly stochastic, up to an average of 4 bytes of variance. With gztool this isn't a simple modification.

diff --git a/gzindex.c b/gzindex.c
--- a/gzindex.c	(revision f1b7696c1e4757a7201009a2f3e02ed9e3536a56)
+++ b/gzindex.c	(revision 662eb8434ed5c3d18e4673621aafd9e0feb415bf)
@@ -207,6 +207,7 @@
     unsigned char *out, *out2;
     z_stream strm;
     unsigned char in[16384];
+size_t input_stride = sizeof(in); 

     /* position input file */
     ret = fseeko(gz, offset, SEEK_SET);
@@ -273,7 +274,7 @@
         do {
             /* if needed, get more input data */
             if (strm.avail_in == 0) {
-                strm.avail_in = fread(in, 1, sizeof(in), gz);
+                strm.avail_in = fread(in, 1, input_stride, gz);
                 if (ferror(gz)) {
                     (void)inflateEnd(&strm);
                     free(list);
@@ -304,6 +305,12 @@
 
             /* if at a block boundary, note the location of the header */
             if (strm.data_type & 128) {
+            out_alignment = (pos - strm.avail_in) & 7;
+            if (out_alignment) {
+                input_stride = 1;
+            } else {
+                input_stride = sizeof(in);
+            }
                 head = ((pos - strm.avail_in) << 3) - (strm.data_type & 63);
                 last = strm.data_type & 64; /* true at end of last block */
             }
...
        } while (strm.avail_out != 0 && !last && out_alignment); // keeps reading 1 byte at end of block-read until alignment

Feature request: Multiple input files

Hi Roberto,

Clonezilla splits gzip files into 4GB segments. Would it be possible to enable gztool to accept multiple input files and treat them as one big input file?

eg.
gztool -I myindex.gzi "sda4.ntfs-ptcl-img.gz.aa" "sda4.ntfs-ptcl-img.gz.ab" "sda4.ntfs-ptcl-img.gz.ac"

I want to avoid using cat to combine the files
eg.
cat *sda4* | gztool -I myindex.gzi
because I create the index over multiple gztool runs, and I want to avoid providing data which has already been indexed by gztool.

Thank you,
Fidel
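A possible stop-gap under the current single-stream design is to drop the already-indexed prefix from the concatenation before piping: tail -c +N starts output at byte N (1-based), with the real offset taken from a prior gztool -ll run. The offset and file names below are illustrative only:

```shell
# tail -c +N skips the first N-1 bytes of its input (demonstrated small)
printf 'abcdefghij' > /tmp/joined.bin
tail -c +4 /tmp/joined.bin; echo   # prints "defghij"
# real use (illustrative, not verified; pairs with the resume request above):
# cat sda4.ntfs-ptcl-img.gz.a? | tail -c +56343039742 | gztool -I myindex.gzi
```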

automake --add-missing

Thanks for the recent release and the mention

this brought me to clean the old source tree and attempt something proper

 jim@gentoo ~/work/gztool $ autoreconf && ./configure && make check
configure.ac:4: warning: The macro `AC_PROG_CC_C99' is obsolete.
configure.ac:4: You should run autoupdate.
./lib/autoconf/c.m4:1664: AC_PROG_CC_C99 is expanded from...
configure.ac:4: the top level
configure.ac:3: error: required file './compile' not found
configure.ac:3:   'automake --add-missing' can install 'compile'
[...]

Some fiddling led to:

jim@gentoo ~/work/gztool $ automake --add-missing
configure.ac:3: installing './compile'
configure.ac:2: installing './install-sh'
configure.ac:2: installing './missing'
Makefile.am: installing './depcomp'
parallel-tests: installing './test-driver'

The results changed after that:

jim@gentoo ~/work/gztool $ autoreconf && ./configure && make check
configure.ac:4: warning: The macro `AC_PROG_CC_C99' is obsolete.
configure.ac:4: You should run autoupdate.
./lib/autoconf/c.m4:1664: AC_PROG_CC_C99 is expanded from...
configure.ac:4: the top level
checking for a BSD-compatible install... /usr/bin/install -c

Eventually the Makefile was sane.

Just passing this along; I think I can retire my fork for a bit, until I'm ready to figure out an in-browser solution via range requests.
