Git Product home page Git Product logo

bsc-m03's Introduction

bsc-m03

The bsc-m03 is experimental block sorting compressor based on M03 context aware compression algorithm invented by Michael Maniscalco:

  • Michael Maniscalco M03: A solution for context based blocksort (BWT) compression, 2004
  • Jurgen Abel Post BWT stages of the Burrows-Wheeler compression algorithm, 2010

Moreover, the bsc-m03 compressor is a practical implementation of Compression via Substring Enumeration for byte-oriented sources:

  • Danny Dube, Vincent Beaudoin Lossless Data Compression via Substring Enumeration, 2010
  • Takahiro Ota, Hiroyoshi Morita, Akiko Manada Compression by Substring Enumeration with a Finite Alphabet Using Sorting, 2018

Copyright (c) 2021-2023 Ilya Grebnov [email protected]

License

The bsc-m03 is released under the GNU General Public License

Changes

  • 2023-05-08 : Version 0.5.5
    • Fixed segmentation fault on Unix based systems.
  • 2022-11-27 : Version 0.5.0
    • Compression ratio improvements.
  • 2022-11-20 : Version 0.4.0
    • Compression ratio improvements.
  • 2022-11-10 : Version 0.3.0
    • Compression ratio improvements.
  • 2022-01-08 : Version 0.2.1
    • Performance improvements.
  • 2022-01-05 : Version 0.2
    • Memory usage improvements.
    • Compression ratio improvements.
  • 2021-12-07 : Version 0.1.1 - 0.1.2
    • Minor compression ratio improvements.
  • 2021-12-03 : Version 0.1.0
    • Initial public release of the bsc-m03.

Benchmarks

Calgary Corpus

File name Input size (bytes) Output size (bytes) Bits per symbol
bib 111261 24479 1.760
book1 768771 203745 2.120
book2 610856 138870 1.819
geo 102400 52465 4.099
news 377109 105621 2.241
obj1 21504 9775 3.637
obj2 246814 68003 2.204
paper1 53161 14957 2.251
paper2 82199 22594 2.199
pic 513216 44424 0.692
progc 39611 11257 2.274
progl 71646 13512 1.509
progp 49379 9248 1.498
trans 93695 15310 1.307

Canterbury Corpus

File name Input size (bytes) Output size (bytes) Bits per symbol
alice29.txt 152089 38562 2.028
asyoulik.txt 125179 35889 2.294
cp.html 24603 6872 2.235
fields.c 11150 2685 1.926
grammar.lsp 3721 1120 2.408
kennedy.xls 1029744 57440 0.446
lcet10.txt 426754 94823 1.778
plrabn12.txt 481861 129770 2.154
ptt5 513216 44424 0.692
sum 38240 11426 2.390
xargs.1 4227 1585 3.000

Large Canterbury Corpus

File name Input size (bytes) Output size (bytes) Bits per symbol
bible.txt 4047392 698395 1.380
E.coli 4638690 1126125 1.942
world192.txt 2473400 376173 1.217

Silesia Corpus

File name Input size (bytes) Output size (bytes) Bits per symbol
dickens 10192446 2199344 1.726
mozilla 51220480 15589159 2.435
mr 9970564 2156826 1.731
nci 33553445 1126386 0.269
ooffice 6152192 2503991 3.256
osdb 10085684 2223002 1.763
reymont 6627202 958772 1.157
samba 21606400 3794300 1.405
sao 7251944 4649723 5.129
webster 41458703 6253627 1.207
xml 5345280 357958 0.536
x-ray 8474240 3681388 3.475

Manzini Corpus

File name Input size (bytes) Output size (bytes) Bits per symbol
chr22.dna 34553758 7206269 1.668
etext99 105277340 21422251 1.628
gcc-3.0.tar 86630400 10046880 0.928
howto 39422105 7504315 1.523
jdk13c 69728899 2612434 0.300
linux-2.4.5.tar 116254720 16351863 1.125
rctail96 114711151 9707347 0.677
rfc 116421901 14871775 1.022
sprot34.dat 109617186 17157222 1.252
w3c2 104201579 5598687 0.430

Maximum Compression Corpus

File name Input size (bytes) Output size (bytes) Bits per symbol
A10.jpg 842468 823533 7.820
AcroRd32.exe 3870784 1555832 3.216
english.dic 465211 145096 2.495
FlashMX.pdf 4526946 3712716 6.561
FP.LOG 20617071 502648 0.195
MSO97.DLL 3782416 1878076 3.972
ohs.doc 4168192 803171 1.542
rafale.bmp 4149414 745470 1.437
vcfiu.hlp 4121418 604165 1.173
world95.txt 2988578 442271 1.184

Large Text Compression Benchmark Corpus

File name Input size (bytes) Output size (bytes) Bits per symbol
enwik8 100000000 20263925 1.621
enwik9 1000000000 160018905 1.280

Pizza & Chilli Corpus

File name Input size (bytes) Output size (bytes) Bits per symbol
dblp.xml 296135874 21926695 0.592
dna 403927746 86414423 1.711
english.1024MB 1073741824 193810792 1.444
pitches 55832855 16984071 2.434
proteins 1184051855 304486803 2.057
sources 210866607 29749020 1.129

Pizza & Chilli Repetitive Corpus

File name Input size (bytes) Output size (bytes) Bits per symbol
cere 461286644 8576879 0.149
coreutils 205281778 4293243 0.167
einstein.de.txt 92758441 132286 0.011
einstein.en.txt 467626544 336029 0.006
Escherichia_Coli 112689515 7928044 0.563
influenza 154808555 1760692 0.091
kernel 257961616 2955825 0.092
para 429265758 10730998 0.200
world_leaders 46968181 518220 0.088
fib41 267914296 83 0.000
rs.13 216747218 86 0.000
tm29 268435456 158 0.000

bsc-m03's People

Contributors

ilyagrebnov avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar

Forkers

check4game

bsc-m03's Issues

Segmentation fault on linux' logo.gif

file linux/Documentation/logo.gif from every linux kernel source starting from linux-2.0.tar
bsc-m03 is compiled for 64-bit arch (does not reproduced if compiled for 32-bit arch)

walt@myhost:/dev/shm$ /usr/local/src/compression/bsc-m03/bsc-m03-0.2.1/bsc-m03 e logo.gif logo.gif.m03-021
bsc-m03 is experimental block sorting compressor. Version 0.2.1 (08 January 2022).
Copyright (c) 2021-2022 Ilya Grebnov <[email protected]>. ABSOLUTELY NO WARRANTY.
This program is based on (at least) the work of Michael Maniscalco (see AUTHORS).

logo.gif compressed from 16335 into 16300 in 0.046 seconds (7.983 bps).
walt@myhost:/dev/shm$ /usr/local/src/compression/bsc-m03/bsc-m03-0.3.0/bsc-m03 e logo.gif logo.gif.m03-030
bsc-m03 is experimental block sorting compressor. Version 0.3.0 (10 November 2022).
Copyright (c) 2021-2022 Ilya Grebnov <[email protected]>. ABSOLUTELY NO WARRANTY.
This program is based on (at least) the work of Michael Maniscalco (see AUTHORS).

logo.gif compressed from 16335 into 16281 in 0.046 seconds (7.974 bps).
walt@myhost:/dev/shm$ /usr/local/src/compression/bsc-m03/bsc-m03-0.4.0/bsc-m03 e logo.gif logo.gif.m03-040
bsc-m03 is experimental block sorting compressor. Version 0.4.0 (20 November 2022).
Copyright (c) 2021-2022 Ilya Grebnov <[email protected]>. ABSOLUTELY NO WARRANTY.
This program is based on (at least) the work of Michael Maniscalco (see AUTHORS).

Compressing logo.gif(00%)Segmentation fault
walt@myhost:/dev/shm$ /usr/local/src/compression/bsc-m03/bsc-m03-0.5.0/bsc-m03 e logo.gif logo.gif.m03-050
bsc-m03 is experimental block sorting compressor. Version 0.5.0 (27 November 2022).
Copyright (c) 2021-2022 Ilya Grebnov <[email protected]>. ABSOLUTELY NO WARRANTY.
This program is based on (at least) the work of Michael Maniscalco (see AUTHORS).

Compressing logo.gif(00%)Segmentation fault
walt@myhost:/dev/shm$ ls -l logo.gif*
-rw-r--r-- 1 walt walt 16335 May  7 23:24 logo.gif
-rw-rw-r-- 1 walt walt 16300 May  8 00:40 logo.gif.m03-021
-rw-rw-r-- 1 walt walt 16281 May  8 00:41 logo.gif.m03-030
-rw-rw-r-- 1 walt walt     0 May  8 00:41 logo.gif.m03-040
-rw-rw-r-- 1 walt walt     0 May  8 00:41 logo.gif.m03-050
walt@myhost:/dev/shm$ /usr/local/src/compression/bsc-m03/bsc-m03-0.2.1/bsc-m03 d logo.gif.m03-021 logo.gif_021
bsc-m03 is experimental block sorting compressor. Version 0.2.1 (08 January 2022).
Copyright (c) 2021-2022 Ilya Grebnov <[email protected]>. ABSOLUTELY NO WARRANTY.
This program is based on (at least) the work of Michael Maniscalco (see AUTHORS).

logo.gif.m03-021 decompressed from 16300 into 16335 in 0.047 seconds.
walt@myhost:/dev/shm$ /usr/local/src/compression/bsc-m03/bsc-m03-0.3.0/bsc-m03 d logo.gif.m03-030 logo.gif_030
bsc-m03 is experimental block sorting compressor. Version 0.3.0 (10 November 2022).
Copyright (c) 2021-2022 Ilya Grebnov <[email protected]>. ABSOLUTELY NO WARRANTY.
This program is based on (at least) the work of Michael Maniscalco (see AUTHORS).

logo.gif.m03-030 decompressed from 16281 into 16335 in 0.048 seconds.
walt@myhost:/dev/shm$ sha256sum -b logo.gif logo.gif_*
4cdf8d34e001fc7f15b61823eee5617f5389e153d7d317471d0f9d982c0a2745 *logo.gif
4cdf8d34e001fc7f15b61823eee5617f5389e153d7d317471d0f9d982c0a2745 *logo.gif_021
4cdf8d34e001fc7f15b61823eee5617f5389e153d7d317471d0f9d982c0a2745 *logo.gif_030
walt@myhost:/dev/shm$ 

corrupted size vs. prev_size on linux/include/linux/user.h

file linux/include/linux/user.h (22 bytes) from every linux kernel source:

walt@myhost:/dev/shm$ /usr/local/src/compression/bsc-m03/bsc-m03-0.2.1/bsc-m03 e user.h user.h.m03-021
bsc-m03 is experimental block sorting compressor. Version 0.2.1 (08 January 2022).
Copyright (c) 2021-2022 Ilya Grebnov <[email protected]>. ABSOLUTELY NO WARRANTY.
This program is based on (at least) the work of Michael Maniscalco (see AUTHORS).

user.h compressed from 22 into 38 in 0.002 seconds (13.818 bps).
corrupted size vs. prev_size
Aborted
walt@myhost:/dev/shm$ /usr/local/src/compression/bsc-m03/bsc-m03-0.3.0/bsc-m03 e user.h user.h.m03-030
bsc-m03 is experimental block sorting compressor. Version 0.3.0 (10 November 2022).
Copyright (c) 2021-2022 Ilya Grebnov <[email protected]>. ABSOLUTELY NO WARRANTY.
This program is based on (at least) the work of Michael Maniscalco (see AUTHORS).

user.h compressed from 22 into 37 in 0.002 seconds (13.455 bps).
corrupted size vs. prev_size
Aborted
walt@myhost:/dev/shm$ /usr/local/src/compression/bsc-m03/bsc-m03-0.4.0/bsc-m03 e user.h user.h.m03-040
bsc-m03 is experimental block sorting compressor. Version 0.4.0 (20 November 2022).
Copyright (c) 2021-2022 Ilya Grebnov <[email protected]>. ABSOLUTELY NO WARRANTY.
This program is based on (at least) the work of Michael Maniscalco (see AUTHORS).

user.h compressed from 22 into 38 in 0.004 seconds (13.818 bps).
corrupted size vs. prev_size
Aborted
walt@myhost:/dev/shm$ /usr/local/src/compression/bsc-m03/bsc-m03-0.5.0/bsc-m03 e user.h user.h.m03-050
bsc-m03 is experimental block sorting compressor. Version 0.5.0 (27 November 2022).
Copyright (c) 2021-2022 Ilya Grebnov <[email protected]>. ABSOLUTELY NO WARRANTY.
This program is based on (at least) the work of Michael Maniscalco (see AUTHORS).

user.h compressed from 22 into 37 in 0.004 seconds (13.455 bps).
corrupted size vs. prev_size
Aborted
walt@myhost:/dev/shm$ /usr/local/src/compression/bsc-m03/bsc-m03-0.2.1/bsc-m03 d user.h.m03-021 user.h_021
bsc-m03 is experimental block sorting compressor. Version 0.2.1 (08 January 2022).
Copyright (c) 2021-2022 Ilya Grebnov <[email protected]>. ABSOLUTELY NO WARRANTY.
This program is based on (at least) the work of Michael Maniscalco (see AUTHORS).


Error: The compressed data is corrupted!
Decompressing user.h.m03-021(00%)walt@myhost:/dev/shm$ 
walt@myhost:/dev/shm$ /usr/local/src/compression/bsc-m03/bsc-m03-0.3.0/bsc-m03 d user.h.m03-030 user.h_030
bsc-m03 is experimental block sorting compressor. Version 0.3.0 (10 November 2022).
Copyright (c) 2021-2022 Ilya Grebnov <[email protected]>. ABSOLUTELY NO WARRANTY.
This program is based on (at least) the work of Michael Maniscalco (see AUTHORS).

Decompressing user.h.m03-030(00%)
Error: The compressed data is corrupted!
walt@myhost:/dev/shm$ /usr/local/src/compression/bsc-m03/bsc-m03-0.4.0/bsc-m03 d user.h.m03-040 user.h_040
bsc-m03 is experimental block sorting compressor. Version 0.4.0 (20 November 2022).
Copyright (c) 2021-2022 Ilya Grebnov <[email protected]>. ABSOLUTELY NO WARRANTY.
This program is based on (at least) the work of Michael Maniscalco (see AUTHORS).

Decompressing user.h.m03-040(00%)
Error: The compressed data is corrupted!
walt@myhost:/dev/shm$ /usr/local/src/compression/bsc-m03/bsc-m03-0.4.0/bsc-m03 d user.h.m03-050 user.h_050
bsc-m03 is experimental block sorting compressor. Version 0.4.0 (20 November 2022).
Copyright (c) 2021-2022 Ilya Grebnov <[email protected]>. ABSOLUTELY NO WARRANTY.
This program is based on (at least) the work of Michael Maniscalco (see AUTHORS).

Decompressing user.h.m03-050(00%)
Error: The compressed data is corrupted!
walt@myhost:/dev/shm$ sha256sum -b user.h*
25ea321ef82b8973b8c15e126e06dda702cb3f458a99f48c4ffd4c463c797487 *user.h
e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 *user.h_021
e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 *user.h_030
e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 *user.h_040
e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 *user.h_050
53f5b03293a37822fdf8ebdfd40d0aed7186088c049d3f57aac45b9f331d5741 *user.h.m03-021
0c9bd3733498caf58caa937cb01731609bdb081538e2a6b34c0a0911c33ab98d *user.h.m03-030
a9bce14d6bc60f57aca3fa601c9a3527806784aba55578c325c3ee2ccac6b685 *user.h.m03-040
14b9ed4479a456f5cd0aa7e81f8a2159f95c67c8c4280c701f90c9c50504631d *user.h.m03-050
walt@myhost:/dev/shm$ ls -l user.h*
-rw-r--r-- 1 walt walt 22 May  7 23:24 user.h
-rw-rw-r-- 1 walt walt  0 May  8 00:57 user.h_021
-rw-rw-r-- 1 walt walt  0 May  8 00:57 user.h_030
-rw-rw-r-- 1 walt walt  0 May  8 00:58 user.h_040
-rw-rw-r-- 1 walt walt  0 May  8 00:58 user.h_050
-rw-rw-r-- 1 walt walt 38 May  8 00:55 user.h.m03-021
-rw-rw-r-- 1 walt walt 37 May  8 00:55 user.h.m03-030
-rw-rw-r-- 1 walt walt 38 May  8 00:56 user.h.m03-040
-rw-rw-r-- 1 walt walt 37 May  8 00:56 user.h.m03-050
walt@myhost:/dev/shm$ 

encoe/decode_root_frequencies

m03_model.h: line 106 and 184
below code reduces iteration(especialy text file). is it correct?
for (ptrdiff_t p = 0;remaining_total;++p)

STDIN and STDOUT

Would there be a possibility for a fork or addition to the main code to be made so that BSC can read from STDIN and write to STDOUT for both compression and decompression (or at least decompression)?

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.