
lz4's Introduction

lz4 : LZ4 compression in pure Go


Overview

This package provides a streaming interface to LZ4 data streams as well as low-level compress and uncompress functions for LZ4 data blocks. The implementation is based on the reference C implementation.

Install

Assuming you have the Go toolchain installed:

go get github.com/pierrec/lz4/v4

A command-line interface tool is also provided to compress and decompress LZ4 files.

go install github.com/pierrec/lz4/v4/cmd/lz4c@latest

Usage

Usage of lz4c:
  -version
        print the program version

Subcommands:
Compress the given files, or data from stdin, to stdout.
compress [arguments] [<file name> ...]
  -bc
        enable block checksum
  -l int
        compression level (0=fastest)
  -sc
        disable stream checksum
  -size string
        block max size [64K,256K,1M,4M] (default "4M")

Uncompress the given files, or data from stdin, to stdout.
uncompress [arguments] [<file name> ...]

Example

// Compress and uncompress an input string.
s := "hello world"
r := strings.NewReader(s)

// The pipe will uncompress the data from the writer.
pr, pw := io.Pipe()
zw := lz4.NewWriter(pw)
zr := lz4.NewReader(pr)

go func() {
	// Compress the input string.
	_, _ = io.Copy(zw, r)
	_ = zw.Close() // Make sure the writer is closed
	_ = pw.Close() // Terminate the pipe
}()

_, _ = io.Copy(os.Stdout, zr)

// Output:
// hello world

Contributing

Contributions are very welcome, from bug fixes to performance improvements!

  • Open an issue with a proper description
  • Send a pull request with appropriate test case(s)

Contributors

Thanks to all contributors so far!

Special thanks to @Zariel for his asm implementation of the decoder.

Special thanks to @greatroar for his work on the asm implementations of the decoder for amd64 and arm64.

Special thanks to @klauspost for his work on optimizing the code.


lz4's Issues

Broken data integrity on compression

Hello, commit d9ddf25 introduces data corruption upon compression; it took me about 20 hours of fiddling around to trace it. Since we had data integrity checks in the unmarshalling logic, I had to rule out the unmarshalling first.

Three things I dislike about this commit:

  1. Using the unsafe package bans the library from Google App Engine and removes any runtime guarantees whatsoever. This is not what I'd expect from a simple compression library.

  2. Relying on the endianness of the host machine is evil (see also Rob Pike's opinion: https://commandcenter.blogspot.ru/2012/04/byte-order-fallacy.html); I don't believe it is really necessary to gain a speedup.

  3. These changes break data integrity (no surprise).

I don't have a reproducible case to share, but maybe the tests should be improved to cover the code better, including the host endianness if you'd prefer to stick with it.

Find the details of my environment below

$ go version
go version go1.8 darwin/amd64

$ go env
GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
GOPATH="/Users/xlab/Documents/dev/go"
GORACE=""
GOROOT="/usr/local/go"
GOTOOLDIR="/usr/local/go/pkg/tool/darwin_amd64"
GCCGO="gccgo"
CC="clang"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/dr/2vrbtj0j7kl1yy32kp5cpn040000gn/T/go-build154721013=/tmp/go-build -gno-record-gcc-switches -fno-common"
CXX="clang++"
CGO_ENABLED="1"
PKG_CONFIG="pkg-config"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"

Anyway, nice package! I'm fine staying on the last working commit, 7f1eefc.

cannot find package "github.com/pierrec/lz4/v3"

Hi, I can't go get "pierrec/lz4/v3".

$ go get -v github.com/pierrec/lz4/v3
github.com/pierrec/lz4 (download)
package github.com/pierrec/lz4/v3: cannot find package "github.com/pierrec/lz4/v3" in any of:
	/usr/local/Cellar/go/1.13.5/libexec/src/github.com/pierrec/lz4/v3 (from $GOROOT)
	/Users/my-user/go/src/github.com/pierrec/lz4/v3 (from $GOPATH)

environment

$ uname -a
Darwin my-mac.local 18.7.0 Darwin Kernel Version 18.7.0: Tue Aug 20 16:57:14 PDT 2019; root:xnu-4903.271.2~2/RELEASE_X86_64 x86_64
$ go version
go version go1.13.5 darwin/amd64

jsonlz4 / mozlz4?

Mozilla has a weird lz4 format that is simply an lz4 compressed json file with a header, as far as I can tell. I haven't been able to find any documentation on the format but I did find this project, which works: https://github.com/badboy/jsonlz4cat. In the project there is a file that defines the format as:

[mozLz40\0] {u32 outsize} {compressed data}

I figured I could implement a reader in Go with something like the following:

package main

import (
	"fmt"
	"io"
	"os"

	"github.com/pierrec/lz4"
)

const magic_header = "mozLz40\x00"

func main() {
	if len(os.Args) != 2 {
		panic("must pass one arg")
	}
	file, err := os.Open(os.Args[1])
	if err != nil {
		panic(err)
	}

	header := make([]byte, len(magic_header))
	_, err = file.Read(header)
	if err != nil {
		panic(err)
	}
	if string(header) != magic_header {
		panic("wrong header")
	}

	size := make([]byte, 4)
	file.Read(size)
	x := lz4.NewReader(file)
	_, err = io.Copy(os.Stderr, x)
	if err != nil {
		panic(err)
	}
}

Unfortunately I consistently get lz4: bad magic number. I suspect that lz4 is looking for its own header, and that because this is some other file format it isn't there. Do you know of a way I can make this work?
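
For what it's worth, here is a minimal sketch of how this might work, assuming (as the format description above suggests) that the payload after the 8-byte magic and the 4-byte little-endian size is a raw LZ4 block rather than an LZ4 frame, so it has to go through lz4.UncompressBlock instead of lz4.NewReader (the v2-era two-argument signature used elsewhere in this document is assumed):

package main

import (
	"encoding/binary"
	"fmt"
	"io/ioutil"
	"os"

	"github.com/pierrec/lz4"
)

const magicHeader = "mozLz40\x00"

func main() {
	if len(os.Args) != 2 {
		panic("must pass one arg")
	}
	data, err := ioutil.ReadFile(os.Args[1])
	if err != nil {
		panic(err)
	}
	if len(data) < len(magicHeader)+4 || string(data[:len(magicHeader)]) != magicHeader {
		panic("wrong header")
	}
	// The u32 after the magic is assumed to be the little-endian uncompressed size.
	outSize := binary.LittleEndian.Uint32(data[len(magicHeader):])
	out := make([]byte, outSize)
	// The rest of the file is treated as a single raw LZ4 block.
	n, err := lz4.UncompressBlock(data[len(magicHeader)+4:], out)
	if err != nil {
		panic(err)
	}
	fmt.Print(string(out[:n]))
}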

Block compression incompatibility with C/Python library

Since commit a207029 (identified through bisecting), compressing some data causes the Python/C decompressor to fail with the error message _block.LZ4BlockError: Decompression failed: corrupt input or insufficient space in destination buffer. Error code: 179.

The problem seems to be that the decompressor cannot fit the decompressed data within the provided buffer size (which is the size of the original data), or is somehow tricked into thinking it cannot fit the data. Increasing the provided size by one resolves the problem in all cases observed so far.

Below is a Go compressor and Python decompressor pair that demonstrates the problem. The input data provided is the smallest I managed to identify that would reproduce the problem:

package main

import (
	"fmt"
	"github.com/pierrec/lz4"   // v2.3.0
	"os"
)


func check(err error) {
	if err != nil {
		panic(err)
	}
}

func main() {
	// 214 bytes
	originalInput := []byte(`[{"rqgulitaeas_bitmap":0,"mwsswng_afnce":null,"fm_rxposire_op":null,"lnpmct_abs_diff_pct":null,"pqrstuvwyx_pxtch_eating":null,"fm_wxpssxre_rbs_dwxx_pct":null,"ued_date":"2020-09-15","regglttabns_bitmap_op":0,"dfff_`)
	input := originalInput
	var ht [1 << 16]int
	dst := make([]byte, lz4.CompressBlockBound(len(input)))
	fmt.Printf("Dst size: %d, input size: %d, original size: %d\n", len(dst), len(input), len(originalInput))
       
	size, err := lz4.CompressBlock(input, dst, ht[:])
	check(err)

	if size == 0 {
		fmt.Println("Uncompressible!")
		return
	}

	// No problem decompressing using Go lz4 with the exact size of the input
	uncompressedBuf := make([]byte, len(input))
	_, err = lz4.UncompressBlock(dst[:size], uncompressedBuf)
	check(err)

	f, err := os.Create(fmt.Sprintf("test_compressed_%d.lz4", len(input)))
	check(err)
	defer f.Close()

	_, err = f.Write(dst[:size])
	check(err)
}
# Python 3.7.4
# LZ4 2.2.1

import lz4.block

size = 214
with open(f'test_compressed_{size}.lz4', mode='rb') as f:
    # This crashes
    lz4.block.decompress(f.read(), uncompressed_size=size)
   
    # This works
    # lz4.block.decompress(f.read(), uncompressed_size=size+1)

Is this a bug in github.com/pierrec/lz4, a bug in the C implementation, or something I just have to live with?

Buffer for compressed data is constantly reallocated

I had been using the lz4 commit 623b5a2 in our production environment since January 2019 to compress messages written to Kafka with the Sarama Kafka client library. Performance was very good, both in terms of CPU and memory usage.

I upgraded our lz4 dependency to commit 057d66e (tag v2.2.4) and saw dramatically worse performance in the HTTP response times for our service that receives beacon data and writes to Kafka:

[screenshot: HTTP response times increased dramatically after the upgrade]

Running a debug trace on the old and new versions of the application showed that the newer application was performing a whole lot of garbage collection:

[screenshot: execution trace showing heavy garbage collection]

So, 20% of the program execution time was spent performing garbage collection. Garbage collection didn't run even once during the 5 second trace I captured for the older app.

I ran a heap dump of the new application and identified the top two nodes:

→ go tool pprof heap.out
File: app
Build ID: 1532f41779e0d517df7a17dd48010b6e0f2f8061
Type: inuse_space
Time: Aug 29, 2019 at 6:01pm (PDT)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top 2
Showing nodes accounting for 17.59MB, 83.18% of 21.15MB total
Showing top 2 nodes out of 68
      flat  flat%   sum%        cum   cum%
      16MB 75.65% 75.65%       16MB 75.65%  github.com/pierrec/lz4.(*Writer).writeHeader
    1.59MB  7.53% 83.18%     1.59MB  7.53%  github.com/pierrec/lz4.NewWriter

Over 75% of the heap usage was attributed to the lz4 writeHeader function.

Benchmarks

I wrote a benchmark to see what the performance of the Write/Close/Reset functions would be when reusing a writer instance:

func BenchmarkWrite(b *testing.B) {
	writer := lz4.NewWriter(nil)

	src, err := ioutil.ReadFile("testdata/gettysburg.txt")
	if err != nil {
		b.Fatal(err)
	}

	for n := 0; n < b.N; n++ {
		var buf bytes.Buffer
		writer.Reset(&buf)

		if _, err := writer.Write(src); err != nil {
			b.Fatal(err)
		}
		if err := writer.Close(); err != nil {
			b.Fatal(err)
		}
	}
}

Old (623b5a2):

goos: darwin
goarch: amd64
pkg: github.com/pierrec/lz4
BenchmarkWrite-4   	  200000	      6366 ns/op	    1564 B/op	       3 allocs/op
PASS
ok  	github.com/pierrec/lz4	1.356s
Success: Benchmarks passed.

New (057d66e):

goos: darwin
goarch: amd64
pkg: github.com/pierrec/lz4
BenchmarkWrite-4   	    1000	   1282000 ns/op	 8390668 B/op	       4 allocs/op
PASS
ok  	github.com/pierrec/lz4	1.456s
Success: Benchmarks passed.

That's a 200x slowdown, and many orders of magnitude increase in the number of bytes allocated per operation.

I narrowed down the commit where this performance issue began: 6749706

This commit guarantees that the buffer for compressed data will be reallocated on every invocation of writeHeader. The z.zdata slice will always have a capacity that's less than 2x the z.Header.BlockMaxSize value, leading to the z.zdata slice being reassigned on every call. This is extremely wasteful and unnecessary. I'm still trying to understand why this change was even introduced.

Printf Debugging

I instrumented this code with some logging to show the slice capacities with each code revision:

Old (623b5a2):

Pass 1

Checking whether to allocate zdata memory, cap(z.data)=0 cap(z.zdata)=0 n=8388608
Allocating zdata memory, cap(z.data)=0 cap(z.zdata)=0 n=8388608
bSize: 4194304
before: z.data len=0 cap=0
before: z.zdata len=8388608 cap=8388608
after: z.data len=4194304 cap=4194304
after: z.zdata len=4194304 cap=8388608

Pass 2

Checking whether to allocate zdata memory, cap(z.data)=4194304 cap(z.zdata)=8388608 n=8388608
bSize: 4194304
before: z.data len=0 cap=4194304
before: z.zdata len=0 cap=8388608
after: z.data len=4194304 cap=4194304
after: z.zdata len=4194304 cap=8388608

New (057d66e):

Pass 1

Checking whether to allocate zdata memory, cap(z.data)=0 cap(z.zdata)=0 n=8388608
Allocating zdata memory, cap(z.data)=0 cap(z.zdata)=0 n=8388608
bSize: 4194304
before: z.data len=0 cap=0
before: z.zdata len=8388608 cap=8388608
after: z.data len=4194304 cap=8388608
after: z.zdata len=4194304 cap=4194304

Pass 2

Checking whether to allocate zdata memory, cap(z.data)=8388608 cap(z.zdata)=4194304 n=8388608
Allocating zdata memory, cap(z.data)=8388608 cap(z.zdata)=4194304 n=8388608
bSize: 4194304
before: z.data len=0 cap=8388608
before: z.zdata len=8388608 cap=8388608
after: z.data len=4194304 cap=8388608
after: z.zdata len=4194304 cap=4194304

LZ4 C library incompatible?

Hello, I am still working on debugging my (very simple) C++ program, which uses the LZ4 C library to decompress the output of this Go package.

I noticed that lz4cat and lz4c are both able to decompress this output, but the C library approach is not working. I was wondering if you know of some incompatibility there?

For what it's worth, I am using the latest C library at revision c4dbc37 (lz4/lz4@c4dbc37).

Cheers!

Error 66 : Decompression error : ERROR_contentChecksum_invalid

When decompressing files written by this package with the lz4 utility, I get content checksum invalid errors. v1.8.2 is the latest lz4 according to https://github.com/lz4/lz4/releases, so perhaps an update to the pierrec/lz4 checksum implementation is needed.

package main

import (
    "fmt"
    "github.com/pierrec/lz4"
    "os"
)

func main() {
    f, err := os.Create("chk.out.lz4")
    if err != nil {
        panic(err)
    }
    defer f.Close()
    w := lz4.NewWriter(f)

    for i := 0; i < 10; i++ {
        fmt.Fprintf(w, "Hello World.\n")
    }
    w.Close()
}

Run it, then decode with the command-line lz4 utility:

jaten@jatens-MacBook-Pro ~/go/src/github.com/pierrec/lz4/chk (master) $cat chk.out.lz4 |lz4 -d -c
Hello World.
Hello World.
Hello World.
Hello World.
Hello World.
Hello World.
Hello World.
Hello World.
Hello World.
Hello World.
Error 66 : Decompression error : ERROR_contentChecksum_invalid
jaten@jatens-MacBook-Pro ~/go/src/github.com/pierrec/lz4/chk (master) $  lz4 --version
*** LZ4 command line interface 64-bits v1.8.2, by Yann Collet ***
jaten@jatens-MacBook-Pro ~/go/src/github.com/pierrec/lz4/chk (master) $ go version
go version go1.10.2 darwin/amd64

Why is it much slower than snappy?

Hi,
My test case is a pure JSON file of 223179 bytes. I use this code:

zbuf := make([]byte, lz4.CompressBlockBound(len(data)))
winSize := 128 * 1024
zsize, err := lz4.CompressBlockHC(data, zbuf, winSize)

LZ4: 5.54 ms
Snappy: 258 µs

Snappy (github.com/golang/snappy) is more than 20x faster!

According to benchmarks, LZ4 should be faster than Snappy:
https://www.percona.com/blog/2016/04/13/evaluating-database-compression-methods-update/
https://gist.github.com/kevsmith/4217004
https://doordash.engineering/2019/01/02/speeding-up-redis-with-compression/
...
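
For what it's worth, CompressBlockHC is the high-compression, slower entry point; the fast path is CompressBlock, which is the one usually compared against Snappy. A rough sketch of the fast path, assuming the v2-style API with a hash-table argument shown elsewhere in this document:

// Sketch: fast block compression path (v2-style API assumed).
func compressFast(data []byte) ([]byte, error) {
	zbuf := make([]byte, lz4.CompressBlockBound(len(data)))
	ht := make([]int, 64<<10) // hash table, reusable across calls
	n, err := lz4.CompressBlock(data, zbuf, ht)
	if err != nil {
		return nil, err
	}
	if n == 0 {
		return nil, nil // incompressible
	}
	return zbuf[:n], nil
}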

Repeating uint64(1) sequence doesn't compress

With data that is a repeating sequence of [1 0 0 0 0 0 0 0], lz4.CompressBlock() compresses it as a bunch of 4-byte literal / 4-byte copies rather than one huge copy of the repeating 8-byte sequence. CompressBlockHC() works just fine.

Test case:

func TestRepeatingEightBytes(t *testing.T) {
	var buf [65536]byte

	for i := 0; i < 65536; i += 8 {
		buf[i] = 1
	}

	compressBuf := make([]byte, lz4.CompressBlockBound(len(buf)))
	n, _ := lz4.CompressBlock(buf[:], compressBuf, 0)
	t.Log(buf[:20])
	t.Logf("compress length %d", n)
	t.Log(compressBuf[:20])

	n, _ = lz4.CompressBlockHC(buf[:], compressBuf, 0)
	t.Logf("compressHC length %d", n)
	t.Log(compressBuf[:20])
}

output:

--- PASS: TestRepeatingEightBytes (0.00s)
	lz4_test.go:17: [1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0]
	lz4_test.go:18: compress length 57348
	lz4_test.go:19: [64 1 0 0 0 1 0 64 1 0 0 0 1 0 64 1 0 0 0 1]
	lz4_test.go:22: compressHC length 278
	lz4_test.go:23: [34 1 0 1 0 31 1 8 0 255 255 255 255 255 255 255 255 255 255 255]
PASS

I'm trying to understand why this is happening; I'm guessing the repeating sequences of 0 0 0 0 and 1 0 0 0 keep colliding in the hash table?

If I change the repeating sequence to [0 1 2 3 4 5 6 7], it compresses down to 280 bytes in CompressBlock(), 281 bytes in CompressBlockHC().

perf: go 1.11 regression

Benchmarks show significant slowdowns with Go 1.11; other tools show slowdowns of as much as 50%.
This seems to be related to additional allocations that were not there in 1.10.

benchmark old ns/op new ns/op delta
BenchmarkCompress-2 5868130 6017448 +2.54%
BenchmarkCompressHC-2 48995513 48250826 -1.52%
BenchmarkUncompress-2 140 184 +31.43%
BenchmarkUncompressPg1661-2 2139211 2333622 +9.09%
BenchmarkUncompressDigits-2 124482 145167 +16.62%
BenchmarkUncompressTwain-2 1362173 1507390 +10.66%
BenchmarkUncompressRand-2 6234 6696 +7.41%
BenchmarkCompressPg1661-2 184544 191268 +3.64%
BenchmarkCompressDigits-2 29301 29163 -0.47%
BenchmarkCompressTwain-2 118101 121249 +2.67%
BenchmarkCompressRand-2 4660 4825 +3.54%

benchmark old MB/s new MB/s speedup
BenchmarkUncompressPg1661-2 278.11 254.94 0.92x
BenchmarkUncompressDigits-2 803.35 688.88 0.86x
BenchmarkUncompressTwain-2 284.73 257.30 0.90x
BenchmarkUncompressRand-2 2628.13 2446.56 0.93x
BenchmarkCompressPg1661-2 3223.78 3110.46 0.96x
BenchmarkCompressDigits-2 3412.90 3429.09 1.00x
BenchmarkCompressTwain-2 3284.06 3198.80 0.97x
BenchmarkCompressRand-2 3515.75 3395.05 0.97x

benchmark old allocs new allocs delta
BenchmarkCompress-2 0 0 +0.00%
BenchmarkCompressHC-2 0 0 +0.00%
BenchmarkUncompress-2 0 0 +0.00%
BenchmarkUncompressPg1661-2 0 159 +Inf%
BenchmarkUncompressDigits-2 0 39 +Inf%
BenchmarkUncompressTwain-2 0 109 +Inf%
BenchmarkUncompressRand-2 0 17 +Inf%
BenchmarkCompressPg1661-2 0 4 +Inf%
BenchmarkCompressDigits-2 0 4 +Inf%
BenchmarkCompressTwain-2 0 4 +Inf%
BenchmarkCompressRand-2 0 4 +Inf%

benchmark old bytes new bytes delta
BenchmarkCompress-2 0 0 +0.00%
BenchmarkCompressHC-2 0 0 +0.00%
BenchmarkUncompress-2 0 0 +0.00%
BenchmarkUncompressPg1661-2 0 1296 +Inf%
BenchmarkUncompressDigits-2 0 336 +Inf%
BenchmarkUncompressTwain-2 0 896 +Inf%
BenchmarkUncompressRand-2 0 160 +Inf%
BenchmarkCompressPg1661-2 73 145 +98.63%
BenchmarkCompressDigits-2 21 93 +342.86%
BenchmarkCompressTwain-2 49 121 +146.94%
BenchmarkCompressRand-2 28 100 +257.14%

Failed to decompress

I tried using main.go ( https://github.com/pierrec/lz4/blob/master/lz4c/main.go ) to test the attached file. The file can be compressed, but decompression fails. The error message is:

//compress it:
$~/go/bin/lz4c t.png
//decompress it:
$ ~/go/bin/lz4c -d t.png.lz4
2017/05/12 09:50:00 Error while decompressing input: lz4: invalid source

( GO: go version go1.8.1 darwin/amd64 ; OS: macOS 10.12.4 (16E195))

go test results in error: invalid frame checksum

To reproduce I created a larger file:

for i in {1..100};do cat Mark.Twain-Tom.Sawyer.txt >> longer.txt; done

And added "longer.txt" to the test data files.

Result:
testdata/longer.txt : 24394635 / 18475893 / 38785100
--- FAIL: TestWriter (0.00s)
--- FAIL: TestWriter/testdata/longer.txt/lz4.Header{CompressionLevel:10} (3.72s)
writer_test.go:65: lz4: invalid frame checksum: got cdbda0bd; expected f1d62b24
FAIL
exit status 1
FAIL github.com/pierrec/lz4 18.054s

I believe it's got something to do with io.Reader and its quirk of not necessarily returning everything in one call.

CompressBlock error returns are confusing

CompressBlock and CompressBlockHC docs currently state:

// If the destination buffer size is lower than CompressBlockBound and
// the compressed size is 0 and no error, then the data is incompressible.
//
// An error is returned if the destination buffer is too small.

This was introduced to fix #71, but I find it confusing. What is the difference between incompressible and the destination buffer being too small?

Can lz4.Reader uncompress data compressed by lz4.CompressBlock?

dst := make([]byte, len(src))
ht := make([]int, 64<<10)
n, err := lz4.CompressBlock(src, dst, ht)
if err != nil {
	fmt.Println(err)
}
if n >= len(src) {
	fmt.Printf("`%s` is not compressible", src)
}
dst = dst[:n]

br := bytes.NewBuffer(dst)
zr := lz4.NewReader(br)
var out bytes.Buffer
_, err = io.Copy(&out, zr)
if err != nil {
	log.Fatal(err) // a "bad magic number" error is thrown here. Looks like this isn't possible?
}
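
For what it's worth, data produced by CompressBlock is in the raw block format and carries no frame header, which is why lz4.NewReader reports a bad magic number. Below is a sketch of a round trip at the block level (v2-style API, matching the snippet above), keeping in mind that UncompressBlock needs to know the uncompressed size up front:

// Sketch: round-trip a single raw LZ4 block instead of going through the frame Reader.
func roundTripBlock(src []byte) ([]byte, error) {
	dst := make([]byte, lz4.CompressBlockBound(len(src)))
	ht := make([]int, 64<<10)
	n, err := lz4.CompressBlock(src, dst, ht)
	if err != nil {
		return nil, err
	}
	if n == 0 {
		return src, nil // incompressible: keep the raw bytes
	}
	// The uncompressed size must be stored or known out of band.
	out := make([]byte, len(src))
	m, err := lz4.UncompressBlock(dst[:n], out)
	if err != nil {
		return nil, err
	}
	return out[:m], nil
}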

Not able to build v4

There's something wrong with the Go module versioning for v4; it is not able to resolve the internal package references:

/tmp/q$ ls -l
total 8
-rw-r--r-- 1 mdevan mdevan 60 Jul  6 20:48 go.mod
-rw-r--r-- 1 mdevan mdevan 86 Jul  6 20:48 main.go
/tmp/q$ cat go.mod
module q

go 1.14

require github.com/pierrec/lz4/v4 v4.0.0
/tmp/q$ cat main.go
package main

import "github.com/pierrec/lz4/v4"

func main() {
        _ = lz4.NewWriter
}

/tmp/q$ GOCACHE=`mktemp -d` GOPATH=`mktemp -d` go build -v .
go: downloading github.com/pierrec/lz4/v4 v4.0.0
go: finding module for package github.com/pierrec/lz4/internal/lz4stream
go: finding module for package github.com/pierrec/lz4/internal/lz4block
go: finding module for package github.com/pierrec/lz4/internal/lz4errors
go: downloading github.com/pierrec/lz4 v1.0.1
go: downloading github.com/pierrec/lz4 v2.5.2+incompatible
../tmp.YQV392cFcD/pkg/mod/github.com/pierrec/lz4/v4@v4.0.0/lz4.go:14:2: module github.com/pierrec/lz4@latest found (v2.5.2+incompatible), but does not contain package github.com/pierrec/lz4/internal/lz4block
../tmp.YQV392cFcD/pkg/mod/github.com/pierrec/lz4/v4@v4.0.0/lz4.go:15:2: module github.com/pierrec/lz4@latest found (v2.5.2+incompatible), but does not contain package github.com/pierrec/lz4/internal/lz4errors
../tmp.YQV392cFcD/pkg/mod/github.com/pierrec/lz4/v4@v4.0.0/reader.go:7:2: module github.com/pierrec/lz4@latest found (v2.5.2+incompatible), but does not contain package github.com/pierrec/lz4/internal/lz4stream

Multithreading support

I noticed that this library uses only one CPU core. That may be a problem when processing large data. Do you have ideas about how to add multithreading support?

Compressed then decompressed file is not the same

I call the function Lz4Compress, then call Lz4Decompress on the resulting []byte.
The result is not the same as the source []byte.

Why?

func Lz4Compress(in []byte) ([]byte, error) {
	buf := bytes.NewBuffer(nil)
	zw := lz4.NewWriter(buf)

	_, err := zw.Write(in)
	if err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func Lz4Decompress(in []byte) ([]byte, error) {
	zw := lz4.NewReader(bytes.NewBuffer(in))

	out, err := ioutil.ReadAll(zw)
	if err != nil {
		return nil, err
	}
	return out, nil
}

Comparing the files, the result file is missing some data at the end.
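
A likely cause is that Lz4Compress never calls Close on the writer, so the final block and the frame trailer are never flushed (the README example above closes the writer before reading the output). A minimal corrected sketch:

func Lz4Compress(in []byte) ([]byte, error) {
	buf := bytes.NewBuffer(nil)
	zw := lz4.NewWriter(buf)
	if _, err := zw.Write(in); err != nil {
		return nil, err
	}
	// Close flushes buffered data and writes the frame trailer;
	// without it the compressed output is truncated.
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}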

checksum mismatch with go mod

Seeing these errors today when trying to compile a project that depends on this lib. Was tag v4.0.0 updated by any chance?

go: github.com/pierrec/lz4/v4@v4.0.0/go.mod: verifying module: checksum mismatch
	downloaded: h1:gZWDp/Ze/IJXGXf23ltt2EXimqmTUXEy0GFuRQyBid4=
	sum.golang.org: h1:eJLuLNLPNm3xzyZth0nrJJr7QGTLLgDBqadqAFFWq/M=

SECURITY ERROR
This download does NOT match the one reported by the checksum server.
The bits may have been replaced on the origin server, or an attacker may
have intercepted the download attempt.

For more information, see 'go help module-auth'.

Vendor dependencies

Please vendor lz4's dependencies, so that users don't have to manually resolve missing deps.

Example:

$ cd $GOROOT/src/github.com/pierrec/lz4
$ git submodule update --init --recursive
$ git submodule add git@github.com:pierrec/xxHash.git vendor/github.com/pierrec/xxHash
$ git add .gitmodules vendor
$ git commit -m 'vendor github.com/pierrec/xxHash'
$ git push

go get github.com/pierrec/lz4 failed

github.com/pierrec/lz4

src/github.com/pierrec/lz4/block.go:120: undefined: mfLimit
src/github.com/pierrec/lz4/block.go:128: undefined: hashLog
src/github.com/pierrec/lz4/block.go:128: invalid array bound 1 << hashLog
src/github.com/pierrec/lz4/block.go:129: undefined: minMatch
src/github.com/pierrec/lz4/block.go:129: undefined: hashLog
src/github.com/pierrec/lz4/block.go:134: undefined: hasher
src/github.com/pierrec/lz4/block.go:140: undefined: skipStrength
src/github.com/pierrec/lz4/block.go:141: undefined: minMatch
src/github.com/pierrec/lz4/block.go:143: undefined: hasher
src/github.com/pierrec/lz4/block.go:159: undefined: skipStrength
src/github.com/pierrec/lz4/block.go:159: too many errors

seek inside uncompressed file

Hello, is it possible to seek to a position inside the uncompressed file?
As far as I can see, the reader has a pos; is this pos relative to the compressed file or the uncompressed one?

Go module support

Hi,

I'm not sure the new Go module system is integrated in this project.
Maybe I'm using it incorrectly, but in my project, when I add:

go.mod
require github.com/pierrec/lz4 v3.1.0

I get:
$ go build .
go: github.com/pierrec/lz4@v3.1.0: go.mod has post-v0 module path "github.com/pierrec/lz4/v3" at revision f25943004a39
go: error loading module requirements

Did I miss something, or is it not supported yet?

Regards,

EDIT : I'm using go1.12.9

Non-deterministic output observed

Hi,

we are observing a weird issue where different servers running the same version of the LZ4 package (github.com/pierrec/lz4 v2.3.0+incompatible) produce different compressed output from the same input, and I'm trying to understand what is going on.

In our codebase, we reuse *lz4.Writer instances and call (*lz4.Writer).Reset when starting a new output. The *lz4.Writer receives a single Write call with the entire input all at once. Inputs are ~2 MB; we use 64 KB buffers in LZ4, mostly to reduce memory usage during decompression.

When reading two files with lz4 debug enabled, I see these outputs:

LZ4: reader.go:145 Read buf len=32768
LZ4: reader.go:103 header block max size id=4 size=65536
LZ4: reader.go:132 header read: lz4.Header{BlockMaxSize: 65536 }
LZ4: reader.go:152 header read OK compressed buffer 65536 / 131072 uncompressed buffer 65536 : 65536 index=65536
LZ4: reader.go:164 reading block from writer
LZ4: reader.go:203 raw block size 15501
LZ4: reader.go:238 compressed block size 15501
LZ4: reader.go:274 current frame checksum 30916118
LZ4: reader.go:295 copied 32768 bytes to input
LZ4: reader.go:145 Read buf len=32768
LZ4: reader.go:295 copied 32768 bytes to input
LZ4: reader.go:145 Read buf len=32768
LZ4: reader.go:164 reading block from writer
LZ4: reader.go:203 raw block size 15400
LZ4: reader.go:238 compressed block size 15400
LZ4: reader.go:274 current frame checksum fd477d82
LZ4: reader.go:295 copied 32768 bytes to input
LZ4: reader.go:145 Read buf len=32768
LZ4: reader.go:295 copied 32768 bytes to input
LZ4: reader.go:145 Read buf len=32768
LZ4: reader.go:164 reading block from writer
LZ4: reader.go:203 raw block size 15396
LZ4: reader.go:238 compressed block size 15396
LZ4: reader.go:274 current frame checksum 564efa4
LZ4: reader.go:295 copied 32768 bytes to input
LZ4: reader.go:145 Read buf len=32768
LZ4: reader.go:295 copied 32768 bytes to input
LZ4: reader.go:145 Read buf len=32768
LZ4: reader.go:164 reading block from writer
LZ4: reader.go:203 raw block size 15388
LZ4: reader.go:238 compressed block size 15388
LZ4: reader.go:274 current frame checksum 20424901
LZ4: reader.go:295 copied 32768 bytes to input
LZ4: reader.go:145 Read buf len=32768
LZ4: reader.go:295 copied 32768 bytes to input
LZ4: reader.go:145 Read buf len=32768
LZ4: reader.go:164 reading block from writer
LZ4: reader.go:203 raw block size 13886
LZ4: reader.go:238 compressed block size 13886
LZ4: reader.go:274 current frame checksum bba29999
LZ4: reader.go:295 copied 32768 bytes to input
LZ4: reader.go:145 Read buf len=32768
LZ4: reader.go:295 copied 25551 bytes to input
LZ4: reader.go:145 Read buf len=32768
LZ4: reader.go:164 reading block from writer
LZ4: reader.go:185 frame checksum got=bba29999 / want=bba29999

Second:

LZ4: reader.go:145 Read buf len=32768
LZ4: reader.go:103 header block max size id=4 size=65536
LZ4: reader.go:132 header read: lz4.Header{BlockMaxSize: 65536 }
LZ4: reader.go:152 header read OK compressed buffer 65536 / 131072 uncompressed buffer 65536 : 65536 index=65536
LZ4: reader.go:164 reading block from writer
LZ4: reader.go:203 raw block size 15501
LZ4: reader.go:238 compressed block size 15501
LZ4: reader.go:274 current frame checksum 30916118
LZ4: reader.go:295 copied 32768 bytes to input
LZ4: reader.go:145 Read buf len=32768
LZ4: reader.go:295 copied 32768 bytes to input
LZ4: reader.go:145 Read buf len=32768
LZ4: reader.go:164 reading block from writer
LZ4: reader.go:203 raw block size 15400
LZ4: reader.go:238 compressed block size 15400
LZ4: reader.go:274 current frame checksum fd477d82
LZ4: reader.go:295 copied 32768 bytes to input
LZ4: reader.go:145 Read buf len=32768
LZ4: reader.go:295 copied 32768 bytes to input
LZ4: reader.go:145 Read buf len=32768
LZ4: reader.go:164 reading block from writer
LZ4: reader.go:203 raw block size 15399
LZ4: reader.go:238 compressed block size 15399
LZ4: reader.go:274 current frame checksum 564efa4
LZ4: reader.go:295 copied 32768 bytes to input
LZ4: reader.go:145 Read buf len=32768
LZ4: reader.go:295 copied 32768 bytes to input
LZ4: reader.go:145 Read buf len=32768
LZ4: reader.go:164 reading block from writer
LZ4: reader.go:203 raw block size 15386
LZ4: reader.go:238 compressed block size 15386
LZ4: reader.go:274 current frame checksum 20424901
LZ4: reader.go:295 copied 32768 bytes to input
LZ4: reader.go:145 Read buf len=32768
LZ4: reader.go:295 copied 32768 bytes to input
LZ4: reader.go:145 Read buf len=32768
LZ4: reader.go:164 reading block from writer
LZ4: reader.go:203 raw block size 13887
LZ4: reader.go:238 compressed block size 13887
LZ4: reader.go:274 current frame checksum bba29999
LZ4: reader.go:295 copied 32768 bytes to input
LZ4: reader.go:145 Read buf len=32768
LZ4: reader.go:295 copied 25551 bytes to input
LZ4: reader.go:145 Read buf len=32768
LZ4: reader.go:164 reading block from writer
LZ4: reader.go:185 frame checksum got=bba29999 / want=bba29999

The only difference is that the raw block sizes for the last two blocks are 15388, 13886 and 15386, 13887 respectively; otherwise both files decompress back to the same data.

Is there anything that can make LZ4 writers generate slightly different output? The Reset call seems to reset everything except the hash table; could that be the issue?

Thank you.

ps: So far I've been unable to reproduce the issue on my machine. :-(

Provide block compression functions that are guaranteed to succeed

Currently both CompressBlock and CompressBlockHC refuse to compress some blocks.

The size of the compressed data is returned. If it is 0 and no error, then the data is incompressible.

That causes interoperability problems when talking to software that uses the block format directly.

https://github.com/lz4/lz4/blob/v1.8.3/lib/lz4.h#L129

Compression is guaranteed to succeed if 'dstCapacity' >= LZ4_compressBound(srcSize).

The C implementation guarantees that the block will be compressed, provided the output buffer is large enough.
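
One possible user-side workaround, sketched below under the assumption that the caller controls the destination buffer: when CompressBlock reports incompressible data, hand-encode the input as a single literal-only sequence, which is valid LZ4 block data per the public block format and decodable by the C library. This is not part of the package's API; the v2-style signature is assumed, and dst must hold at least CompressBlockBound(len(src)) bytes.

// compressBlockAlways always produces valid LZ4 block data, falling back to a
// literal-only sequence when the input is reported as incompressible (sketch).
func compressBlockAlways(src, dst []byte, ht []int) (int, error) {
	n, err := lz4.CompressBlock(src, dst, ht)
	if err != nil || n > 0 {
		return n, err
	}
	// Fallback: one sequence with only literals and no match.
	di := 0
	if l := len(src); l < 15 {
		dst[di] = byte(l) << 4 // literal length fits in the token's high nibble
		di++
	} else {
		dst[di] = 0xF0 // token: literal length >= 15, extension bytes follow
		di++
		for rem := l - 15; ; rem -= 255 {
			if rem < 255 {
				dst[di] = byte(rem)
				di++
				break
			}
			dst[di] = 255
			di++
		}
	}
	di += copy(dst[di:], src)
	return di, nil
}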

module name in go.mod for v2

It should have v2 in go.mod

module github.com/pierrec/lz4/v2

and then this example works,

import "github.com/pierrec/lz4/v2"

func test() {
    // ...
    text := lz4.NewReader(compressed)
    // ...
}

Data race when decompressing blocks concurrently

I built the code with -race and got the following data race warning while uncompressing a stream:

WARNING: DATA RACE
Read by goroutine 31:
  github.com/pierrec/lz4.(*Reader).decompressBlock()
      /home/developer/gocode/src/github.com/pierrec/lz4/reader.go:281 +0x292

Previous write by goroutine 27:
  github.com/pierrec/lz4.(*Reader).readBlock()
      /home/developer/gocode/src/github.com/pierrec/lz4/reader.go:230 +0x194
  github.com/pierrec/lz4.(*Reader).Read()
      /home/developer/gocode/src/github.com/pierrec/lz4/reader.go:178 +0x578
  ...

At first glance it looks like z.Pos is modified concurrently without atomics or proper locking.
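
If the same *lz4.Reader really is being shared between goroutines, a quick workaround sketch (independent of this package) is to either give each goroutine its own Reader or serialize access to the shared one:

// lockedReader serializes concurrent Read calls on a shared reader (sketch).
type lockedReader struct {
	mu sync.Mutex
	r  io.Reader
}

func (l *lockedReader) Read(p []byte) (int, error) {
	l.mu.Lock()
	defer l.mu.Unlock()
	return l.r.Read(p)
}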

Very slow performance when sending on a socket.

I am writing an application that sends compressed messages through a socket. When using LZ4, performance is very low: about 7k messages/second for sending a trivial string. If I change BlockDependency from true to false, it rises to about 60k messages/second, still very far from the ~500k messages/second I get without compression.

It is very weird because, using your Node.js package for a pure-JavaScript sender and writer, I am getting more than 300k messages/second with compression enabled. Can you tell me what is going on? What am I doing wrong?

Full test code (sender and receiver) is attached. Usage: in two terminals:

$ go run receiver-test.go
$ go run sender-test.go

Thanks!

lz4-test.zip

TestCopy hangs on armhf

Hi,
we have build problems with lz4 on armhf (ARMv7, e.g. Cortex-A8); TestCopy hangs:

=== RUN   TestCopy
SIGQUIT: quit
PC=0x6fcfc m=0

goroutine 28155 [running]:
bytes.Equal(0x13060000, 0x400000, 0x700000, 0x10576000, 0x400000, 0xa00000)
	/usr/lib/go-1.7/src/runtime/asm_arm.s:893 +0x28 fp=0x10541dd4 sp=0x10541dd4
github.com/pierrec/lz4_test.TestCopy(0x10572100)
	/home/stender/sandbox/golang-github-pierrec-lz4-0.0~git20151216.222ab1f/obj-arm-linux-gnueabihf/src/github.com/pierrec/lz4/lz4_test.go:510 +0x830 fp=0x10541fa8 sp=0x10541dd4
testing.tRunner(0x10572100, 0x15c6f4)
	/usr/lib/go-1.7/src/testing/testing.go:610 +0xa4 fp=0x10541fcc sp=0x10541fa8
runtime.goexit()
	/usr/lib/go-1.7/src/runtime/asm_arm.s:998 +0x4 fp=0x10541fcc sp=0x10541fcc
created by testing.(*T).Run
	/usr/lib/go-1.7/src/testing/testing.go:646 +0x304

goroutine 1 [chan receive, 2 minutes]:
testing.(*T).Run(0x10572100, 0x14906a, 0x8, 0x15c6f4, 0xa1901)
	/usr/lib/go-1.7/src/testing/testing.go:647 +0x324
testing.RunTests.func1(0x10516100)
	/usr/lib/go-1.7/src/testing/testing.go:793 +0xb8
testing.tRunner(0x10516100, 0x1053eed4)
	/usr/lib/go-1.7/src/testing/testing.go:610 +0xa4
testing.RunTests(0x15c714, 0x1c3120, 0x8, 0x8, 0xa00001)
	/usr/lib/go-1.7/src/testing/testing.go:799 +0x388
testing.(*M).Run(0x1053ef7c, 0x100000)
	/usr/lib/go-1.7/src/testing/testing.go:743 +0x8c
main.main()
	github.com/pierrec/lz4/_test/_testmain.go:76 +0x118

trap    0x6
error   0x0
oldmask 0x0
r0      0x1332bbce
r1      0x13460000
r2      0x10841bcd
r3      0x400000
r4      0x6f
r5      0x4c
r6      0x400000
r7      0x15c858
r8      0x11a863c0
r9      0x0
r10     0x10500960
fp      0x1c4114
ip      0x1b2728
sp      0x10541dd4
lr      0x95284
pc      0x6fcfc
cpsr    0x20000010
fault   0x0
*** Test killed with quit: ran too long (10m0s).
FAIL	github.com/pierrec/lz4	600.079s

The package was being built against Go 1.7.
Best,
DS

Documentation about version differences.

Hello Pierre!

I've just realized there are v2, v3 and v4. I'm wondering what the differences between them are. Any reason to use v3 or v4 over v2?

Thank you !

The latest commit (move stuff to v2) breaks the package with Go 1.10 and earlier

The title says it all.

go get github.com/pierrec/lz4

package github.com/pierrec/lz4/v2/internal/xxh32: cannot find package "github.com/pierrec/lz4/v2/internal/xxh32" in any of:
        /usr/lib/go-1.9/src/github.com/pierrec/lz4/v2/internal/xxh32 (from $GOROOT)
        /***/gopath/src/github.com/pierrec/lz4/v2/internal/xxh32 (from $GOPATH)

EOF error when decompressing

func decompress(file string) []byte {
	data, err := ioutil.ReadFile(file)
	if err != nil {
		panic(err)
	}
	var buff bytes.Buffer
	r := lz4.NewReader(&buff)
	_, err = r.Read(data)
	if err != nil {
		panic(err)
	}
	return buff.Bytes()
}

And _, err = r.Read(data) returns EOF. Can you help me?
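
The reader above wraps an empty bytes.Buffer, and Read is handed the compressed bytes as a destination buffer, so there is nothing to decompress and EOF comes back. A corrected sketch that wraps the compressed file contents and reads the decompressed output from the reader:

func decompress(file string) []byte {
	data, err := ioutil.ReadFile(file)
	if err != nil {
		panic(err)
	}
	// Wrap the compressed bytes and read the decompressed output from the lz4 reader.
	zr := lz4.NewReader(bytes.NewReader(data))
	out, err := ioutil.ReadAll(zr)
	if err != nil {
		panic(err)
	}
	return out
}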

Hash tables can be made 2× or 4× smaller

CompressBlock uses a hash table of 1<<16 int-typed match positions, which is 256kiB on a 32-bit platform, 512kiB on a 64-bit one. That's wasteful, since match offsets are limited to 65536 by the LZ4 format. If the table entry type is changed to uint16, it uses only 128kiB, regardless of the platform.

The trick is to store only the lower 16 bits of the match position, meaning that it becomes an offset past an implicit 64kiB block boundary. Since the compressor doesn't remember where that boundary was, it has to check for every match in the hash table whether an actual match occurs at si-offset. That's not too much of a problem, because it has to check that position anyway in case of a hash collision and it's likely still in the cache (more likely when the hash table is smaller).

I have a working prototype of this idea. It passes all tests and its speed is largely unchanged on amd64:

name              old time/op    new time/op    delta
Compress-8          3.13ms ± 3%    3.05ms ± 1%    -2.42%  (p=0.000 n=20+20)
CompressRandom-8    17.9µs ± 1%     8.4µs ± 1%   -52.72%  (p=0.000 n=19+20)
CompressPg1661-8     685µs ± 2%     700µs ± 2%    +2.26%  (p=0.000 n=20+20)
CompressDigits-8     652µs ± 3%     649µs ± 2%      ~     (p=0.659 n=20+20)
CompressTwain-8      677µs ± 4%     690µs ± 3%    +1.90%  (p=0.000 n=20+20)
CompressRand-8       641µs ± 3%     634µs ± 2%    -1.16%  (p=0.006 n=20+19)

Results on the Raspberry Pi look worse, but because the allocation rate has high variance, I think the sync.Pools don't work as well on that tiny machine.

I'd like to clean up the code and submit a PR, but before I do so, I'd like to discuss the API with you. I think the current CompressBlock API doesn't work all too well anymore, given the type change. The prototype code casts a []int hash table to *[1<<16]uint16 with an unsafe hack. If I may make a proposal, I think a new API might look as follows:

// A BlockCompressor compresses byte slices into the LZ4 block format.
type BlockCompressor struct {/* ... */}

func (c *BlockCompressor) Compress(src, dst []byte) (int, error) // Or just int? #91

type BlockCompressor struct {
    Level int
    // implementation
}

func (c *BlockCompressor) Compress(src, dst []byte) (int, error)

Having these types allows us to get rid of the sync.Pools. The old functions can be compatibility shims that call these new ones:

func CompressBlock(src, dst []byte, _ []int) (int, error) {
    var c BlockCompressor // Maybe use a sync.Pool for this one.
    return c.Compress(src, dst)
}

WDYT?

Tag release v1.0.1?

There have been some important bug fixes since v1.0. Would you mind tagging a v1.0.1 release?

Can't build the binary

go version go1.11.11 linux/amd64

/go/src/github.com/pierrec/lz4# go build .

github.com/pierrec/lz4

./block.go:11:23: undefined: hashShift
./block.go:50:21: undefined: mfLimit
./block.go:72:31: undefined: winSize
./block.go:83:9: undefined: minMatch
./block.go:184:21: undefined: mfLimit
./block.go:192:29: undefined: winSize
./block.go:195:11: undefined: winSize
./block.go:207:74: undefined: winSize
./block.go:207:106: undefined: winMask
./block.go:225:12: undefined: minMatch
./block.go:225:12: too many errors

Alternating byte values don't compress

I was trying to run some benchmarks on the code by compressing some text files in the 500 KB - 2 MB range, and they just weren't compressing. So I wrote this test:


package main

import (
    "fmt"
    "github.com/pierrec/lz4"
)

func main() {
    input := make([]byte, 64 * 1024)

    for i := 0; i < 64*1024; i += 1 {
        input[i] = byte(i & 0x1)
    }
    output := make([]byte, 64 * 1024)

    c, e := lz4.CompressBlock(input, output, 0)

    if c == 0 && e == nil {
        fmt.Printf("Won't compress.\n")
    } else if e != nil {
        fmt.Printf("Error: %s\n", e)
    } else {
        fmt.Printf("Compressed to %d bytes.\n", c)
    }
}


This should produce a 64k buffer of alternating 1s and 0s. LZ4 reports that the block won't compress, which feels wrong to me. If I change the buffer to contain only 0s, it compresses to 50833 bytes, which seems a bit large.

Changing the code to use CompressBlockHC does appear to work, so I'm guessing there is an issue with the normal compression code?

Compressing a time formatted with time.RFC3339 fails, compressed byte slice len is 0

data := []byte(`{"CreatedAt":"2019-12-29T20:56:26.535143+08:00"}`)
buf := make([]byte, len(data))
ht := make([]int, 64<<10) // buffer for the compression table

n, err := lz4.CompressBlock(data, buf, ht)
if err != nil {
	fmt.Println(err)
}
if n >= len(data) {
	fmt.Printf("`%s` is not compressible", string(data))
}
fmt.Println("compressed len", n)
buf = buf[:n] // compressed data

// Allocated a very large buffer for decompression.
out := make([]byte, 10*len(data))
n, err = lz4.UncompressBlock(buf, out)
if err != nil {
	fmt.Println(err)
}
out = out[:n] // uncompressed data

fmt.Println(string(out[:len(data)]))

output:
compressed len 0
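
The destination buffer above is only len(data) bytes, which is smaller than CompressBlockBound(len(data)); this short JSON string barely compresses, so the output doesn't fit and CompressBlock reports 0 (the "incompressible" case described in the docs quoted earlier). A sketch of the likely fix, continuing from the snippet above, is to size the buffer with CompressBlockBound:

// Sketch: size the compression buffer with CompressBlockBound (v2-style API assumed).
buf := make([]byte, lz4.CompressBlockBound(len(data)))
ht := make([]int, 64<<10)

n, err := lz4.CompressBlock(data, buf, ht)
if err != nil {
	fmt.Println(err)
}
if n == 0 || n >= len(data) {
	fmt.Printf("`%s` is not compressible\n", string(data))
} else {
	buf = buf[:n] // compressed data
	fmt.Println("compressed len", n)
}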
