Comments (27)

jeroen commented on April 29, 2024

I can confirm these results. At least for the example above, brotli is about 10x slower than xz with the default settings and still compresses much worse:

> # Compare size
> length(x)
[1] 6779000
> system.time(print(length(brotli_compress(x))))
[1] 1369587
   user  system elapsed 
 26.702   0.672  27.371 
> system.time(print(length(brotli_compress(x, mode = "font"))))
[1] 1353634
   user  system elapsed 
 26.417   0.624  27.038 
> system.time(print(length(memCompress(x, "gzip"))))
[1] 2323522
   user  system elapsed 
  0.358   0.001   0.359 
> system.time(print(length(memCompress(x, "bzip2"))))
[1] 2289619
   user  system elapsed 
  0.661   0.004   0.664 
> system.time(print(length(memCompress(x, "xz"))))
[1] 937728
   user  system elapsed 
  2.006   0.045   2.050 

jeroen commented on April 29, 2024

@bhouston why are you stuck with JavaScript-based compression? There are C implementations of lzma.

bhouston commented on April 29, 2024

@jeroenooms I want to send LZMA compressed mesh data to our WebGL viewer that runs in JavaScript:

https://clara.io/view/d3b82831-d56b-462f-b30c-500ea1c7f870

Thus I need to have browser-based client-side decompression. I would love it if LZMA or an equivalent (I was hoping Brotli) were integrated into all browsers, but as it stands, only gzip and deflate are available. Thus you have to use LZMA.js for browser-based LZMA decompression.

jeroen commented on April 29, 2024

Perhaps their main selling point is that decompression is faster than for lzma, to make things easier on the client. But I agree that the difference in compression ratio is very big, given that the proposal states:

This specification defines a lossless compressed data format that compresses data using a combination of the LZ77 algorithm and Huffman coding, with efficiency comparable to the best currently available general-purpose compression methods.

bhouston commented on April 29, 2024

@jeroenooms I think that LZMA also requires significant maximum memory for decompression, which Brotli may be improving upon.

It would be neat if you could re-run your statistics with decompression times and maximum memory instead of compression times, if that is easy. Maybe that will better emphasize Brotli's benefits?

jeroen commented on April 29, 2024

@bhouston yes, it seems like brotli does outperform everyone in terms of decompression speed. It is similar to, and sometimes a bit faster than, gzip. But I have yet to find an example where the compression ratio is better than xz or bzip2.

eustas commented on April 29, 2024

Where can I read about the TriMesh binary format?
I have a feeling that transposing and demuxing the tables would make the file more compressible, especially for encoders with lower-order context modeling (zlib and brotli).

Metric commented on April 29, 2024

I can also confirm that brotli sucks with some binary and text files using the GENERIC / TEXT setting! I can zip up the brotli source code on my mac down to 8.3MB using the standard archiver utility. If I try to archive the brotli source code using brotli, it increases from 19.8MB to 33.3MB with a quality setting of 6! Even with a quality of 11 it is still 31MB! I have tried both ways: combining the binary of all files into one file and compressing it, or compressing individual files and combining them into one file. Either way it works out to the same amount somehow.

Now, I have to hand it to brotli when it comes to a 5200x2500 image file which is 2.8MB in jpeg form at 100% quality. Brotli is able to compress it down to 900KB with just a quality of 6. With a quality of 11 it gets it down to roughly 813KB. It is a simple graphic of a logo though. However, it is better than a PNG version of the logo, which is 1.98MB. So, hey, at least that is an improvement!

JarekDuda commented on April 29, 2024

I have checked lzturbo 1.2 ( https://sites.google.com/site/powturbo/home ) and LZA 0.82 ( http://encode.ru/threads/1969-LZA-archiver ) on this file:

  • lzturbo -49 mode: 894,729 bytes
  • lzturbo -39 mode: 1,256,602 bytes
  • LZA 0.82: 995,014 bytes

I think the main issue is that Brotli still uses Huffman coding, which is terrible for skewed probability distributions, characteristic of some types of data. More recent compressors (lzturbo, LZA, LZNA, ZSTD, Apple LZFSE) use ANS coding instead, which fixes this while maintaining speed.
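
To see why whole-bit codes lose on skewed distributions, here is a small illustrative calculation (not specific to Brotli's actual context modeling): a two-symbol source with p = 0.95 has an entropy of about 0.29 bits/symbol, but a Huffman code must still spend a whole bit on every symbol.

import math

# Hypothetical two-symbol source with a heavily skewed distribution.
p = 0.95
probs = [p, 1 - p]

# Shannon entropy: the lower bound any entropy coder can approach.
entropy = -sum(q * math.log2(q) for q in probs)

# A Huffman (prefix) code uses whole-bit codewords, so with only two
# symbols each one costs exactly 1 bit, no matter how skewed p is.
huffman_bits = 1.0

print("entropy  : %.3f bits/symbol" % entropy)        # ~0.286
print("huffman  : %.3f bits/symbol" % huffman_bits)
print("overhead : %.1fx" % (huffman_bits / entropy))  # ~3.5x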

eustas commented on April 29, 2024

My guess was right. Even without knowing the format, it is easy to improve compression ratio.

Step 1: split odd and even bytes, e.g. with a Python script

# Demultiplex "input" into K interleaved byte streams (K = 2: even and
# odd bytes) and write them back-to-back, so similar bytes end up adjacent.
f = open("input", "rb")
K = 2
d = [[] for k in range(K)]
i = 0
b = f.read(1)
while b != b"":          # f.read(1) returns an empty byte string at EOF
  d[i % K].append(b)
  i = i + 1
  b = f.read(1)
f.close()

f = open("output", "wb")
for k in range(K):
  f.write(b"".join(d[k]))  # stream 0 (even offsets) first, then stream 1
f.close()

Step 2: compress
Step 3: profit!

  • Zopfli: 1,847,682 bytes (~22% smaller)
  • Brotli: 1,172,693 bytes (~29% smaller)

I believe that with a smarter transform the results will be even better.

bhouston commented on April 29, 2024

My guess was right. Even without knowing the format, it is easy to improve compression ratio. Step 1: split odd and even bytes, e.g. with a Python script

Weird approach, but it gave good results. The file is a bit of a mish-mash of float and integer data. It will exhibit a 32-bit alignment pattern, as it is filled mostly with 32-bit floats and 32-bit integers with some exceptions, and we align all internal data array starts on 32-bit boundaries for JavaScript ArrayBuffer access efficiency.
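
Given that alignment, the odd/even split above generalizes naturally to a K = 4 byte transpose, grouping byte 0 of every 32-bit word together, then byte 1, and so on. A minimal sketch (file names are placeholders; whether it actually beats the K = 2 split here would need measuring):

# Transpose a 32-bit-aligned file by byte position, so the high
# (often near-constant) bytes of every word end up adjacent.
K = 4  # bytes per word, matching the 32-bit alignment described above

with open("input", "rb") as f:
    data = f.read()

with open("output", "wb") as f:
    for k in range(K):
        f.write(data[k::K])  # byte k of every 32-bit word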

As someone asked: a trimesh / polymesh format is for representing piecewise-linear objects in 3D. This format supports both trimeshes and polygon meshes:

https://en.wikipedia.org/wiki/Triangle_mesh
https://en.wikipedia.org/wiki/Polygon_mesh

It is composed of a list of 3D positions (XYZ 32-bit float triples), then a list of integers which state how those points are connected to form triangles or polygons (one list of integers which states how many points per face/polygon, and another list of the indices for each face), then normals (more XYZ 32-bit float triples) and UVs (UV 32-bit float pairs), and more integers that state how these are to be used by the faces/polygons (these are called polygon mappings). The format actually changes the precision of the integers it uses from 8 to 16 to 32 bits depending on their required maximum values, but I believe this file is mostly 32-bit integers because it is on the larger side (more than 65K vertices/points, which exceeds the range of 16-bit integers).
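
A rough sketch of the arrays that description implies, for a single quad (all names, ordering and the absence of any header are made up for illustration; the real format surely differs in framing details):

import struct

positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0),
             (1.0, 1.0, 0.0), (0.0, 1.0, 0.0)]            # XYZ float32 triples
face_sizes = [4]                                           # points per face/polygon
face_indices = [0, 1, 2, 3]                                # indices into positions
normals = [(0.0, 0.0, 1.0)] * 4                            # XYZ float32 triples
uvs = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]     # UV float32 pairs

blob = b"".join([
    struct.pack("<%df" % (3 * len(positions)), *[c for p in positions for c in p]),
    struct.pack("<%dI" % len(face_sizes), *face_sizes),
    struct.pack("<%dI" % len(face_indices), *face_indices),
    struct.pack("<%df" % (3 * len(normals)), *[c for n in normals for c in n]),
    struct.pack("<%df" % (2 * len(uvs)), *[c for uv in uvs for c in uv]),
])
print(len(blob), "bytes")  # long runs of 32-bit floats and integers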

JarekDuda commented on April 29, 2024

One of the reasons might be the fact that the first 9 bits of a float ( https://en.wikipedia.org/wiki/Single-precision_floating-point_format ) are the sign and exponent, which should be nearly the same for most of the points. The next 8 bits, the most significant bits of the fraction, should also be similar for neighboring points.
So if you group even or odd positions, you bring these similar bytes closer together, making it simpler for the compressor to exploit their similarity (most compressors work on bytes).
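
This is easy to see by dumping the bytes of two nearby coordinates (packed big-endian here so the sign/exponent byte comes first; it is identical, and the next byte nearly so):

import struct

# Two neighboring coordinates: the sign/exponent byte matches and the
# top mantissa byte is close, so grouping them helps a byte-oriented coder.
for x in (12.3456, 12.3789):
    print(x, struct.pack(">f", x).hex())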

But generally, these formats seem extremely wasteful - you could probably get an order of magnitude better compression with a cheap but dedicated compressor.
Start with storing differences instead of absolute positions; additionally, an entropy coder should be used for the most significant digits.
Storing the mesh is a graph compression problem, and a mesh is a very specific graph - it should be cheaply compressible.
My first thought is to cover all triangles with a tree (like in https://en.wikipedia.org/wiki/File:Icosaedro_desarrollo.gif ) - then go through the tree and add new information for each node: usually one vertex per node - just store the difference of its position from the previous ones.
Or even less, as it has to agree with its neighbors - one distance per node (the difference from an equilateral triangle) seems sufficient.

bhouston commented on April 29, 2024

But generally, these formats seem extremely wasteful - you could probably get an order of magnitude better compression with a cheap but dedicated compressor.

Our data is representative of a lot of non-video large binary streams -- it is composed of long streams of floats and integers, exactly what should be in a binary data stream. Our case is an easy one because it is even aligned on 32-bit boundaries. Of course we could write a custom compressor, but then it would be in JavaScript again and we would lose the speed and memory benefits of having it integrated into the browser. It would also take a lot of time for us and force that on all other 3D developers -- thus it is neither a globally efficient solution nor a locally efficient one.

I would prefer to just use LZMA.js in that case, or a browser-based compression tool that produces similar results. Basically, saying that I have to write a custom compression scheme because Brotli is inadequate compared to LZMA is not an acceptable answer. We have good general-purpose compression schemes for a reason, and right now Brotli isn't one, even though it claims to be.

JarekDuda commented on April 29, 2024

If decompression in JavaScript is your requirement, LZMA would still be much slower than the dedicated compressor I have sketched.
But generally I also think it is too early to declare Brotli a new standard - it is still far from state-of-the-art compressors ( http://encode.ru/threads/2313-Brotli?p=44970&viewfull=1#post44970 ) and uses Huffman coding, making it terrible when a file with skewed probabilities comes along.

bhouston commented on April 29, 2024

My first thought is to cover all triangles with a tree (like in https://en.wikipedia.org/wiki/File:Icosaedro_desarrollo.gif ) - then go through the tree and add new information for each node: usually one vertex per node - just store the difference of its position from the previous ones.
Or even less, as it has to agree with its neighbors - one distance per node (the difference from an equilateral triangle) seems sufficient.

This sounds like the triangle strip idea: https://en.wikipedia.org/wiki/Triangle_strip

But the standard storage format for triangle strips is still a list of vertices as XYZ 32-bit floats and then a long list of 32-bit integers.

I am a bit interested in general-purpose compressors, but only if they are easy to implement, ultra fast, and give better results than LZMA.

So if I understand correctly, one could easily write a special-purpose compressor for lists of 32-bit floats. Basically you want a reordering of the positions so that neighbors are close by, then an adjustment of their space to be fully positive in XYZ, and then a reordering of their internal bytes. Then I need to send this into a general-purpose compression stream of some type -- what would you recommend? One can do that fairly simply, I guess. This covers the compression of the 32-bit floats.
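
A toy version of those float steps, with zlib from the standard library standing in for whichever general-purpose back end gets chosen (that choice is exactly the open question; brotli or LZMA would slot in the same way, and the toy points are already in a neighborly order, so the reordering step is skipped):

import struct, zlib

# Shift into positive range, split into X/Y/Z streams, byte-transpose
# each stream, then hand the result to a general-purpose compressor.
points = [(1.25 + 0.01 * i, -3.0 + 0.02 * i, 7.5) for i in range(1000)]

mins = [min(p[a] for p in points) for a in range(3)]
shifted = [[p[a] - mins[a] for p in points] for a in range(3)]   # all values >= 0

streams = []
for axis in shifted:                                        # one stream per component
    raw = struct.pack("<%df" % len(axis), *axis)
    streams.append(b"".join(raw[k::4] for k in range(4)))   # byte transpose

payload = b"".join(streams)
print(len(payload), "->", len(zlib.compress(payload, 9)))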

What would you do with the 32-bit integers to maximize compression? If one does a neighbor reordering of the XYZ positions, that imposes an ordering on the 32-bit integers which reference those points. Basically one can only impose an order on either the integers or the XYZ positions, but not both easily.

JarekDuda commented on April 29, 2024

Indeed, imposing an order is difficult; it is better to directly encode the parameters of the "triangle strip":

  • first you have to encode the structure of the tree. For a triangular lattice you have a full binary tree - their number is a Catalan number, so you need ~1 bit/node to say whether it is a leaf or not.
    Compared to 3x32-bit vertex numbers, that means ~96x compression.
  • then you don't need 3 float numbers per vertex - imagine bending such a triangular lattice - there are lots of constraints due to the fact that the lengths of triangle edges have to agree. It is sufficient to store 1 length per node of the tree - e.g. the length of its edge on the right (the one on the left comes from its neighbor).
    Then the decoder would need to perform the bending - a few trigonometric operations per node - really cheap compared to what LZMA does.

Regarding using floats - you store mostly very similar numbers - their exponent and sign (9 of the 32 bits) are probably identical, as are the most significant bits of the fraction.
Using fixed-precision numbers instead, you should directly get something like 1/2 savings.
And their most significant digits usually come from some localized probability distribution, especially if you encode differences - you can use an entropy coder for them.
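
A minimal sketch of that fixed-point-plus-differences idea (the quantization step is arbitrary here, and a real encoder would feed the deltas to an entropy coder rather than printing them):

STEP = 0.001                                  # arbitrary quantization step
xs = [12.345, 12.347, 12.351, 12.348, 12.360]

q = [round(x / STEP) for x in xs]             # fixed-point integers
deltas = [q[0]] + [b - a for a, b in zip(q, q[1:])]
print(deltas)                                 # small, signed values near zero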

bhouston commented on April 29, 2024

@JarekDuda Very interesting. Our case is a bit more complex than what is outlined, and would require a fair bit of work to deduce, but I bet there is a good algorithm to be had in the end.

JarekDuda commented on April 29, 2024

I have started a general discussion about shape compression ( http://encode.ru/threads/2324-Data-compression-of-a-%283D%29-shape-e-g-of-TriMesh-or-a-chemical-molecule ) and was pointed to a nice paper:
http://www.cc.gatech.edu/~jarek/papers/Compression.pdf
It says we need ~1 byte/triangle, which agrees with what I have written.
In contrast, the standard you have described uses 3x4 bytes per vertex plus ~3x4 bytes/triangle for labels - we are talking about a cheap ~20x compression, 3 times better than the best discussed above (which are much more expensive).
It's sad that standards are so wasteful ... that's what happens when engineers don't talk with theoreticians ...

bhouston commented on April 29, 2024

@JarekDuda Just be careful: there are two major different mesh formats. One is purely for playback in a video game -- a limited trimesh format with a fixed number of channels -- and there is a second, more general polygon format with arbitrary named channels, also used for playback when one needs additional channels or knowledge of the polygonal structure. Both are heavily used for data transfer needs. I have interests in both.

BTW the current most popular open mesh compression format is OpenCTM. I think it is mostly LZMA: http://openctm.sourceforge.net/ And then there is this for in-browser decompression: https://github.com/jcmellado/js-openctm

jyrkialakuijala commented on April 29, 2024

I'm now seeing 1'236'531 bytes with --quality 11 --window 24. I have some ideas (using more LSB6/MSB6 context modeling and fewer blocks) that we will try out later this year, with an expected 5-10% more compression for this kind of data.

valette commented on April 29, 2024

BTW the current most popular open mesh compression format is OpenCTM. I think it is mostly LZMA:

it is LZMA plus vertex and triangle reordering, to improve compression:
https://github.com/Danny02/OpenCTM/blob/master/lib/compressMG2.c#L191-L212

This scheme could probably be improved, but there are several constraints to take into account. IMHO the input mesh topology is the biggest constraint: as an example, you cannot perform triangle strip compression on a triangle soup, where all vertices are disconnected.

bhouston commented on April 29, 2024

I understand that re-ordering the vertices, as well as splitting out the X, Y and Z into linear arrays so that it is X0, X1, X2, X3... Y0, Y1, Y2... Z0, Z1, Z2..., is better.

bhouston commented on April 29, 2024

But maybe you are correct that OpenCTM is the best solution; maybe it is better than I thought. I apologize.

eustas commented on April 29, 2024

Update: currently brotli at level 11 compresses the example file to 1'236'595 bytes. It is still bigger than the LZMA output, but there is space for further encoder improvements.

bhouston commented on April 29, 2024

Those are fairly significant improvements to brotli. That is nice. It doesn't have to hit LZMA's compression level, but a 25% improvement over what it was is hugely significant. What was the compression time for this file at level 11?

eustas commented on April 29, 2024

About 49s (with the fixed encoder) on an Intel Xeon E5-1650 @ 1200 MHz (powersave mode).
There is a problem in the encoder that makes encoding of this file unexpectedly slow. I'm preparing a fix; it will be published soon.

wallabyway commented on April 29, 2024

@bhouston: I wonder if "utf8 webgl-loader" mesh compression with brotli is a better option than gzip? I did a quick test on the utf8 data here: http://webgl-tests.crbs.ucsd.edu/obj-viewer/?model=test
and the brotli ratio seemed comparable to lzma.

  • 60MB obj file (original file)
  • 12MB utf8 file...
  • 4.5MB utf8.gz (using default)
  • 3.4MB utf8.br (using q11, w24)
  • 3.4MB utf8.xz (using default)

The ideas in Won's utf8 webgl loader do the specialized transform that @eustas (and @JarekDuda) mentioned, but it sounds like you want to avoid any deserializing in js land?
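
For anyone who wants to reproduce this kind of comparison outside the browser toolchain, a small sketch using Python's standard library plus the brotli bindings (assumed installed from PyPI; the file name is a placeholder, and the quality/lgwin parameters mirror the q11/w24 settings above):

import gzip, lzma

import brotli  # assumption: the PyPI "brotli" bindings are installed

with open("model.utf8", "rb") as f:   # placeholder file name
    data = f.read()

print("gzip  :", len(gzip.compress(data)))                         # default level
print("brotli:", len(brotli.compress(data, quality=11, lgwin=24)))
print("xz    :", len(lzma.compress(data)))                          # default preset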
