charlesdaniels / bitshuffle

License: BSD 3-Clause "New" or "Revised" License

Python 47.68% Shell 44.83% PowerShell 7.50%

bitshuffle's Issues

Decrease paste time by inserting line breaks

When pasting bitshuffle-encoded data, large files can take a long time to paste, since the application must read the text into memory. This is exacerbated by large line lengths.

The current standard is to add one line break after each packet. However, with large chunk sizes (>~1000), this increases load times significantly.

We should add an option, enabled by default, to insert line breaks approximately every 100 characters.
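A minimal sketch of what such an option could do, assuming the encoded payload itself never contains newlines (true for base64 text). The function names and the `width` parameter are hypothetical, not part of BitShuffle:

```python
def insert_line_breaks(encoded, width=100):
    """Split an encoded string into lines of at most `width` characters."""
    if width <= 0:
        raise ValueError("width must be positive")
    return "\n".join(encoded[i:i + width]
                     for i in range(0, len(encoded), width))


def strip_line_breaks(text):
    """Undo insert_line_breaks before the text is handed to the decoder."""
    return text.replace("\r", "").replace("\n", "")
```

The decoder would need to call something like `strip_line_breaks` first, so this change probably interacts with the packet-matching regex.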

Port to JavaScript

To avoid heavy-duty hosting like Heroku, it would be nice to have this in native JavaScript. This would also allow us to host on GitHub Pages, which only supports static sites.

Waiting on ~~#57 and #54~~ #51.

Encode command line arguments

Currently, adding extra arguments gives a usage error. It would be nice to take additional arguments to encode.

Make readpath portable

readlink -e is not supported on macOS and various other OSes. Possible solution in Charles' toolchest.

We specifically preclude Unicode support in encode_packet with the line data = data.decode(encoding="ascii"). Since UTF-8 is ASCII-compatible, this should be as simple as changing encoding="ascii" to encoding="utf-8".

Stable public API

A stable API to use BitShuffle as a library needs to be designed, implemented, and documented.

As a component of this, the entire API should be tested by unit tests (which will constitute the completion of #5).

The documentation should be generated automatically, and we can host it on Read the Docs.

Support for more compression types

Compression other than bz2 should be supported. We don't need to go overkill, but it's sufficiently easy to compress bytes() in Python that we may as well support some more. I would suggest maybe gzip, lz4, and an option to disable compression entirely (i.e. for input data that is already compressed).

When this feature is implemented, the compatibility level counter should be incremented.

I would make use of function pointers; i.e...

import bz2

compress_data = None
# Compare strings with ==; `is` tests object identity, not equality.
if compressiontype == "bz2":
    compress_data = bz2.compress
elif compressiontype == "lz4":
    compress_data = ...
...

if compress_data is None:
    # crash the program with an error
    ...

...

compressed_data = compress_data(data, compressionlevel)

Note that we will need at least one wrapper function to "compress" data for the uncompressed type, and we also might need some for any compression functions that don't support compression levels (or don't do so as the second positional argument).
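The wrapper idea could be sketched with a lookup table instead of an if/elif chain. This assumes Python 3 (where `gzip.compress` exists); the table name, the `"none"` key, and the `compress` helper are hypothetical, and lz4 is left out because it is not in the standard library:

```python
import bz2
import gzip


def _no_compression(data, level):
    """Stand-in 'compressor' for input that is already compressed."""
    return data


# Hypothetical table mapping a type name to a (data, level) callable.
COMPRESSORS = {
    "bz2": bz2.compress,
    "gzip": gzip.compress,
    "none": _no_compression,
}


def compress(data, compressiontype, compressionlevel=9):
    """Compress `data` with the named method, or raise on unknown names."""
    try:
        compress_data = COMPRESSORS[compressiontype]
    except KeyError:
        raise ValueError("unsupported compression type: %r" % compressiontype)
    return compress_data(data, compressionlevel)
```

Both `bz2.compress` and `gzip.compress` accept the level as the second positional argument, so only the uncompressed case needs a wrapper here.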

Test with both python 2 and 3

PyPI metadata states we support both 2 and 3. However, tests currently only run on 2. I've manually tested most parts of the application, but it would be nice to automate this.

Unit tests

Unit tests need to be added for as many of the functions as is possible, for obvious reasons.

Add error-handling

See discussion. Invalid data should be handled gracefully: the program should exit with an error and perhaps write the original data to a temporary file. Each error should have a UID for easy searching through the source code. My suggestion is to have errors organized by topic as described here, but I'm willing to change.

Fix failing unit tests

Add option to edit message

Currently, the message is hardcoded in the encode_packet method:

msg = ("This is encoded with BitShuffle, which you can download " +
       "from https://github.com/charlesdaniels/bitshuffle")

It would be nice to specify a message from the command line.
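One way this could look with argparse, as a sketch only: the `--message` flag name and the `build_parser` helper are assumptions, not existing BitShuffle options.

```python
import argparse

DEFAULT_MSG = ("This is encoded with BitShuffle, which you can download "
               "from https://github.com/charlesdaniels/bitshuffle")


def build_parser():
    """Build a parser with a hypothetical --message option."""
    parser = argparse.ArgumentParser(prog="bitshuffle")
    parser.add_argument("--message", default=DEFAULT_MSG,
                        help="Comment text placed at the start of each packet")
    return parser
```

The parsed value would then be passed to encode_packet in place of the hardcoded string.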

Editor check fails when non-interactive

Relevant commit.

Currently, we don't check the editor unless the program is interactive. This prevents automated testing of the editor. I made a Stack Overflow post while working out the issue; it contains more detail.

TODO: Find a way to test editor when non-interactive.

Code Quality Checks with ShellCheck

We have enough automation scripts that we should probably do automated code quality checks on them. I have used ShellCheck for other projects and found that it worked really well.

This should be added to the pre-commit checks, and also the test suite that TravisCI runs.

Package BitShuffle using PyPI

Official tutorial.

Rewrite tests in Python

Currently, unit tests and smoketests are written in Bash. This prevents them from being run in AppVeyor. Furthermore, the tests are in need of some rewriting (since tests NOT involving temp files require significant code duplication).

I propose we fix both problems at once by rewriting existing tests in Python.

Make hash tests flexible for spec changes

Currently, there are a bunch of encoded packets in scripts/test_bad_hashes.sh. As noted here, this ties the tests to the current spec.

I would suggest introducing errors in the stream directly, for example by doing bitshuffle --encode ... | tr '[:lower:]' '[:upper:]' | ....

Reed-Solomon Error Correction

One possible method for doing error correction in pure Python would be to use a package like unireedsolomon or some other Reed-Solomon error correction.

If this feature is added, I would suggest making RS-encoded packets bear a different encoding than those that are not (e.g. base64-rs); this way, non-RS-encoded packets can still be encoded and decoded with only the Python standard library.

Tests can only check non-interactive elements

See https://github.com/charlesdaniels/bitshuffle/pull/34/files/a4ced202750988db4bc9e500697a9323eb20c6ce#r157613283 and #31. This is a long-term issue.

Smoke testing for Windows

For BitShuffle to be of maximum utility, it should run on a wide variety of platforms, including MS Windows. While it should work, we should have some basic smoke testing for Windows to ensure it continues to do so.

It is not necessary to run through every test case, just enough to demonstrate that it can run without crashing and do a basic encode/decode test.

The specific features that should be tested are:

Run and display the version with --version without crashing.
Encode a file from --input, writing the encoded data to a temp file with --output; run another BitShuffle instance and have it do the reverse.
Encode a file from --input, pipe to another instance, decode, and write out with --output.
Encode a file from standard in, pipe to another instance, decode, and write to standard out.

I have some experience with doing CI on Windows systems, so I will handle this one myself.

I will probably use AppVeyor unless someone has a different suggestion?

Add `pylint` to pre-commit checks

Thoughts?

Pylint is a style-checker similar to but different from pycodestyle.

It can

check runtime errors
detect unused code
generate UML diagrams (!!)
integrate easily with vim and emacs
be customized easily

and perform many, many other checks.

Since the linter is extremely strict, I suggest starting at a required code health of 7.0 and slowly increasing the required value as the project matures.

Sample output:

 % pylint bitshuffle.py
No config file found, using default configuration
************* Module bitshuffle.bitshuffle
W:414, 0: TODO: check editor when encoding as well as decoding (fixme)
C:184, 0: Wrong continued indentation (remove 5 spaces).
                             "the input file and write it to the output.")
                        |    ^ (bad-continuation)
C:211, 0: Wrong continued indentation (remove 5 spaces).
                             "Ignored if decoding packets. " +
                        |    ^ (bad-continuation)
C:212, 0: Wrong continued indentation (remove 5 spaces).
                             "Currently supported: 'bz2', 'gzip'.")
                        |    ^ (bad-continuation)
W:334, 0: Bad indentation. Found 8 spaces, expected 4 (bad-indentation)
W:337, 0: Bad indentation. Found 8 spaces, expected 4 (bad-indentation)
W:338, 0: Bad indentation. Found 12 spaces, expected 8 (bad-indentation)
W:340, 0: Bad indentation. Found 8 spaces, expected 4 (bad-indentation)
W:341, 0: Bad indentation. Found 12 spaces, expected 8 (bad-indentation)
W:343, 0: Bad indentation. Found 8 spaces, expected 4 (bad-indentation)
W:344, 0: Bad indentation. Found 12 spaces, expected 8 (bad-indentation)
W:347, 0: Bad indentation. Found 8 spaces, expected 4 (bad-indentation)
W:350, 0: Bad indentation. Found 8 spaces, expected 4 (bad-indentation)
W:352, 0: Bad indentation. Found 8 spaces, expected 4 (bad-indentation)
W:354, 0: Bad indentation. Found 8 spaces, expected 4 (bad-indentation)
W:355, 0: Bad indentation. Found 8 spaces, expected 4 (bad-indentation)
W:356, 0: Bad indentation. Found 8 spaces, expected 4 (bad-indentation)
W:357, 0: Bad indentation. Found 12 spaces, expected 8 (bad-indentation)
W:358, 0: Bad indentation. Found 16 spaces, expected 12 (bad-indentation)
W:359, 0: Bad indentation. Found 16 spaces, expected 12 (bad-indentation)
W:360, 0: Bad indentation. Found 20 spaces, expected 16 (bad-indentation)
W:362, 0: Bad indentation. Found 20 spaces, expected 16 (bad-indentation)
W:363, 0: Bad indentation. Found 12 spaces, expected 8 (bad-indentation)
W:364, 0: Bad indentation. Found 16 spaces, expected 12 (bad-indentation)
W:366, 0: Bad indentation. Found 16 spaces, expected 12 (bad-indentation)
W:368, 0: Bad indentation. Found 12 spaces, expected 8 (bad-indentation)
W:369, 0: Bad indentation. Found 16 spaces, expected 12 (bad-indentation)
W:370, 0: Bad indentation. Found 20 spaces, expected 16 (bad-indentation)
W:371, 0: Bad indentation. Found 16 spaces, expected 12 (bad-indentation)
W:372, 0: Bad indentation. Found 20 spaces, expected 16 (bad-indentation)
W:374, 0: Bad indentation. Found 12 spaces, expected 8 (bad-indentation)
W:375, 0: Bad indentation. Found 12 spaces, expected 8 (bad-indentation)
W:376, 0: Bad indentation. Found 16 spaces, expected 12 (bad-indentation)
W:377, 0: Bad indentation. Found 16 spaces, expected 12 (bad-indentation)
C:378, 0: Wrong continued indentation (remove 5 spaces).
                          % (index, packet[packet_hash]))
                     |    ^ (bad-continuation)
W:380, 0: Bad indentation. Found 12 spaces, expected 8 (bad-indentation)
W:382, 0: Bad indentation. Found 8 spaces, expected 4 (bad-indentation)
W:383, 0: Bad indentation. Found 12 spaces, expected 8 (bad-indentation)
W:384, 0: Bad indentation. Found 8 spaces, expected 4 (bad-indentation)
W:385, 0: Bad indentation. Found 12 spaces, expected 8 (bad-indentation)
W:387, 0: Bad indentation. Found 8 spaces, expected 4 (bad-indentation)
W:391, 0: Bad indentation. Found 8 spaces, expected 4 (bad-indentation)
W:392, 0: Bad indentation. Found 12 spaces, expected 8 (bad-indentation)
W:393, 0: Bad indentation. Found 16 spaces, expected 12 (bad-indentation)
W:396, 0: Bad indentation. Found 12 spaces, expected 8 (bad-indentation)
W:397, 0: Bad indentation. Found 16 spaces, expected 12 (bad-indentation)
W:398, 0: Bad indentation. Found 8 spaces, expected 4 (bad-indentation)
W:338, 0: Anomalous backslash in string: '\('. String constant might be missing an r prefix. (anomalous-backslash-in-string)
W:338, 0: Anomalous backslash in string: '\('. String constant might be missing an r prefix. (anomalous-backslash-in-string)
W:338, 0: Anomalous backslash in string: '\)'. String constant might be missing an r prefix. (anomalous-backslash-in-string)
W:338, 0: Anomalous backslash in string: '\)'. String constant might be missing an r prefix. (anomalous-backslash-in-string)
W:463, 0: Redefining built-in 'hash' (redefined-builtin)
C:  1, 0: Missing module docstring (missing-docstring)
W: 25, 4: Wildcard import errors (wildcard-import)
W: 27, 4: Wildcard import errors (wildcard-import)
W: 35, 4: Statement seems to have no effect (pointless-statement)
W: 36, 4: Statement seems to have no effect (pointless-statement)
C: 39, 4: Missing function docstring (missing-docstring)
C: 41, 8: Variable name "c" doesn't conform to snake_case naming style (invalid-name)
C: 42,71: Variable name "f" doesn't conform to snake_case naming style (invalid-name)
C: 46, 4: Missing function docstring (missing-docstring)
C: 47,56: Variable name "f" doesn't conform to snake_case naming style (invalid-name)
C: 55, 4: Constant name "file_type" doesn't conform to UPPER_CASE naming style (invalid-name)
C: 57, 4: Class name "file_type" doesn't conform to PascalCase naming style (invalid-name)
C: 60, 0: Constant name "version" doesn't conform to UPPER_CASE naming style (invalid-name)
C: 62, 0: Constant name "stderr" doesn't conform to UPPER_CASE naming style (invalid-name)
C: 63, 0: Constant name "stdout" doesn't conform to UPPER_CASE naming style (invalid-name)
C: 64, 0: Constant name "stdin" doesn't conform to UPPER_CASE naming style (invalid-name)
C: 65, 0: Constant name "compress" doesn't conform to UPPER_CASE naming style (invalid-name)
C: 66, 0: Constant name "debug" doesn't conform to UPPER_CASE naming style (invalid-name)
C: 67, 0: Constant name "verbose" doesn't conform to UPPER_CASE naming style (invalid-name)
C: 68, 0: Constant name "compatlevel" doesn't conform to UPPER_CASE naming style (invalid-name)
C: 71, 0: Constant name "default_msg" doesn't conform to UPPER_CASE naming style (invalid-name)
W: 75,55: Redefining name 'compress' from outer scope (line 65) (redefined-outer-name)
C:102, 8: Variable name "c" doesn't conform to snake_case naming style (invalid-name)
C:103,11: Do not use `len(SEQUENCE)` to determine if a sequence is empty (len-as-condition)
W:109,40: Redefining name 'compress' from outer scope (line 65) (redefined-outer-name)
R:109, 0: Too many arguments (6/5) (too-many-arguments)
W:134,16: Redefining name 'compress' from outer scope (line 65) (redefined-outer-name)
C:146, 4: Variable name "e" doesn't conform to snake_case naming style (invalid-name)
C:153, 8: Variable name "c" doesn't conform to snake_case naming style (invalid-name)
W:146, 4: Unused variable 'e' (unused-variable)
W:248,12: Redefining name 'compress' from outer scope (line 65) (redefined-outer-name)
C:166, 0: Missing function docstring (missing-docstring)
C:255,12: Variable name "p" doesn't conform to snake_case naming style (invalid-name)
C:279,39: Variable name "tf" doesn't conform to snake_case naming style (invalid-name)
C:290, 8: Variable name "e" doesn't conform to snake_case naming style (invalid-name)
W:290, 8: Unused variable 'e' (unused-variable)
R:166, 0: Too many branches (18/12) (too-many-branches)
R:166, 0: Too many statements (71/50) (too-many-statements)
C:309, 0: Missing function docstring (missing-docstring)
C:318,38: Variable name "f" doesn't conform to snake_case naming style (invalid-name)
R:309, 0: Either all return statements in a function should return an expression, or none of them should. (inconsistent-return-statements)
C:333, 0: Missing function docstring (missing-docstring)
R:333, 0: Too many local variables (19/15) (too-many-locals)
C:343,11: Do not use `len(SEQUENCE)` to determine if a sequence is empty (len-as-condition)
R:348,39: Comparison to literal (literal-comparison)
W:382,11: Using possibly undefined loop variable 'packet' (undefined-loop-variable)
W:334, 8: Unused variable 'comment' (unused-variable)
W:334,17: Unused variable 'compatibility' (unused-variable)
W:334,32: Unused variable 'encoding' (unused-variable)
W:335,12: Unused variable 'seq_end' (unused-variable)
W:352, 8: Unused variable 'segments' (unused-variable)
R:333, 0: Too many branches (14/12) (too-many-branches)
R:420, 8: The if statement can be replaced with 'var = bool(test)' (simplifiable-if-statement)
C:437, 0: Missing function docstring (missing-docstring)
C:451, 8: Variable name "e" doesn't conform to snake_case naming style (invalid-name)
C:457, 8: Variable name "e" doesn't conform to snake_case naming style (invalid-name)
C:463, 0: Missing function docstring (missing-docstring)
C:467, 0: Missing function docstring (missing-docstring)
C:478, 0: Function name "exitWithError" doesn't conform to snake_case naming style (invalid-name)
C:478, 0: Missing function docstring (missing-docstring)
C:485, 0: Function name "exitSuccessfully" doesn't conform to snake_case naming style (invalid-name)
C:485, 0: Missing function docstring (missing-docstring)
C:491, 0: Missing function docstring (missing-docstring)

------------------------------------------------------------------
Your code has been rated at 5.91/10 (previous run: 5.02/10, +0.89)

Additional smoke tests

Additional smoke tests need to be added.

Representative samples of each compatibility level of packet should be stored. Any future version of BitShuffle needs to be able to decode such packets, and its ability to do so should be a smoke test (I would do one smoke test per sample).
Once #1 is implemented, it will need to be tested also. I'm not sure how to test the $EDITOR feature though.
A smoke test should be added that deliberately corrupts the data produced by BitShuffle, then tries to decode it. The test should assert that the decoding either errored or produced a warning of some kind (check for a nonzero exit code; don't hardcode a match on the actual warning message).
A smoke test should generate a valid packet stream, then remove an entire packet. The test case should assert that BitShuffle exits nonzero and does not write an output file.
A smoke test should generate a valid packet stream, then encode that, then decode the encoded packet stream, then decode the decoded packet stream, then get the original data back out. For example: bitshuffle --encode --input somefile | bitshuffle --encode | bitshuffle --decode | bitshuffle --decode --output anotherfile should cause somefile and anotherfile to have the same contents.
A smoke test should artificially generate a packet with a compatibility level greater than the one currently supported by BitShuffle and assert that BitShuffle refuses to decode this packet.
- To support this, a new command-line flag should be added that outputs the current compatibility level to standard out, then exits.
A smoke test should generate a single packet that contains chunksize many bytes exactly, then decode it without error.
A smoke test should generate a number of packets which are all exactly chunksize many bytes long (i.e. no overflow), then decode them without error.

Use sha256 instead of sha1

Currently, all hashing is done using hashlib.sha1(data).hexdigest(). SHA-1 is known to be insecure and should be replaced. SHA-256 should be used instead, or preferably its SHA-3 counterpart (SHA3-256).
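A small sketch of how the algorithm could be made a parameter, so that switching again later is a one-line change. The `hash_data` helper is hypothetical; `hashlib.new` and the algorithm names are standard library (SHA-3 names need Python 3.6 or later):

```python
import hashlib


def hash_data(data, algorithm="sha256"):
    """Return the hex digest of `data` using the named hashlib algorithm."""
    return hashlib.new(algorithm, data).hexdigest()
```

Note that the digest length changes (40 hex characters for SHA-1, 64 for SHA-256), so this would presumably require a compatibility level bump.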

When taking input from stdin, waits for two files before exiting

Example:

joshua@debian-thinkpad:/usr/local/src/bitshuffle$ ./bitshuffle.py -e
some message here
((<<This is encoded with BitShuffle, which you can download from https://github.com/charlesdaniels/bitshuffle|1|base64|bz2|0|0|stdin|9477a94f13fef4343e1fca75a1a814f870160dc6|QlpoNTFBWSZTWdy1BSkAAAhRgAAQQAAiwpgAIAAxADAp6A9JP28BA1onHC7kinChIblqClI=>>))

l
((<<This is encoded with BitShuffle, which you can download from https://github.com/charlesdaniels/bitshuffle|1|base64|bz2|0|0|stdin|cde25b5e10ad99822ac2c62b8e01b4d8af3e01d5|QlpoNTFBWSZTWX5LdcUAAADBAAAQAAQgACEAgrF3JFOFCQfkt1xQ>>))

--version flag should be added

After the first release (when that happens), it will be useful to have a --version flag. Example:

>>> import argparse
>>> parser = argparse.ArgumentParser(prog='PROG')
>>> parser.add_argument('--version', action='version', version='%(prog)s 2.0')
>>> parser.parse_args(['--version'])
PROG 2.0

source

Add Jeremy as contributor

account page

Use logging library

We've got sufficient logging that handrolling our output is probably more effort than it's worth. Python has a builtin logging library that will be suitable; it supports debug through critical levels with a special exception level for backtraces.

This will allow interoperability with other libraries and reduce the size of the codebase considerably.
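A minimal sketch of the switch, assuming the existing verbose and debug flags map onto logging levels. The logger name and the `configure_logging` helper are illustrative, not taken from the codebase:

```python
import logging

logger = logging.getLogger("bitshuffle")


def configure_logging(verbose=False, debug=False):
    """Map the hypothetical verbose/debug flags to a logging level."""
    if debug:
        level = logging.DEBUG
    elif verbose:
        level = logging.INFO
    else:
        level = logging.WARNING
    # basicConfig writes to stderr by default, keeping stdout free for packets.
    logging.basicConfig(format="%(levelname)s: %(message)s")
    logger.setLevel(level)
    return level
```

Inside an except block, `logger.exception("...")` logs at ERROR level and appends the traceback.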

Make '/dev/stdin' and '/dev/stdout' portable

As per https://github.com/charlesdaniels/bitshuffle/pull/26/files/9b902a6e66c03dd0902ed6dff674fe186a22cc37#r156245520
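One portable approach, sketched under the assumption that a missing path or "-" should mean the standard stream. The helper names are hypothetical:

```python
import sys


def open_input(path=None):
    """Return a binary file object for `path`, or stdin when path is None or '-'."""
    if path is None or path == "-":
        # Python 3 exposes the raw byte stream as .buffer; Python 2 does not.
        return getattr(sys.stdin, "buffer", sys.stdin)
    return open(path, "rb")


def open_output(path=None):
    """Return a binary file object for `path`, or stdout when path is None or '-'."""
    if path is None or path == "-":
        return getattr(sys.stdout, "buffer", sys.stdout)
    return open(path, "wb")
```

This avoids depending on the /dev/stdin and /dev/stdout device files, which do not exist on Windows.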

Wrapper not compatible with powershell

In fb39078, I transitioned to making the module directly executable, so that python -m bitshuffle would run the program instead of the previous error. For this to work with a wrapper, I changed it to a shell command: cd "$(dirname "$(realpath "$0")")" && exec python -m bitshuffle "$@". Unfortunately, sh does not exist on Windows.

Split up library and command line tool

As suggested by Jeremy:

Consider splitting up the cli and implementation into different files
Makes it a lot easier to port to other languages

Will also make #39 and #16 significantly easier.

`./setup.py install` doesn't work

Think it's somehow related to this line: warning: install_lib: 'build/lib.linux-x86_64-2.7' does not exist -- no Python modules to install

Full traceback:

joshua@debian-thinkpad:/usr/local/src/bitshuffle$ ./setup.py install
running install
running bdist_egg
running egg_info
writing bitshuffle.egg-info/PKG-INFO
writing top-level names to bitshuffle.egg-info/top_level.txt
writing dependency_links to bitshuffle.egg-info/dependency_links.txt
writing entry points to bitshuffle.egg-info/entry_points.txt
reading manifest file 'bitshuffle.egg-info/SOURCES.txt'
writing manifest file 'bitshuffle.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
warning: install_lib: 'build/lib.linux-x86_64-2.7' does not exist -- no Python modules to install

creating build/bdist.linux-x86_64/egg
creating build/bdist.linux-x86_64/egg/EGG-INFO
copying bitshuffle.egg-info/PKG-INFO -> build/bdist.linux-x86_64/egg/EGG-INFO
copying bitshuffle.egg-info/SOURCES.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying bitshuffle.egg-info/dependency_links.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying bitshuffle.egg-info/entry_points.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying bitshuffle.egg-info/top_level.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
zip_safe flag not set; analyzing archive contents...
creating 'dist/bitshuffle-0.0.1-py2.7.egg' and adding 'build/bdist.linux-x86_64/egg' to it
removing 'build/bdist.linux-x86_64/egg' (and everything under it)
Processing bitshuffle-0.0.1-py2.7.egg
Removing /usr/local/lib/python2.7/dist-packages/bitshuffle-0.0.1-py2.7.egg
Copying bitshuffle-0.0.1-py2.7.egg to /usr/local/lib/python2.7/dist-packages
bitshuffle 0.0.1 is already the active version in easy-install.pth
Installing bitshuffle script to /usr/local/bin

Installed /usr/local/lib/python2.7/dist-packages/bitshuffle-0.0.1-py2.7.egg
Processing dependencies for bitshuffle==0.0.1
Finished processing dependencies for bitshuffle==0.0.1
joshua@debian-thinkpad:/usr/local/src$ python2 -m bitshuffle
usage: bitshuffle.py [-h] [--input INPUT] [--output OUTPUT]
                    [--filename FILENAME] [--encode] [--decode] [--version]
                    [--chunksize CHUNKSIZE] [--compresslevel COMPRESSLEVEL]
                    [--editor EDITOR] [--compresstype COMPRESSTYPE]

optional arguments:
-h, --help            show this help message and exit
--input INPUT, -i INPUT
                        Input file. Default is stdin.
--output OUTPUT, -o OUTPUT
                        Output file. Default is stdout.
--filename FILENAME, -f FILENAME
                        Set filename to use when encoding explicitly
--encode, -e          Generate a BitShuffle data packet fromthe input file
--decode, -d, -D      Extract a BitShuffle data packet.
--version, -v         Displays the current version of bitshuffle
--chunksize CHUNKSIZE, -c CHUNKSIZE
                        Chunk size in bytes
--compresslevel COMPRESSLEVEL, -m COMPRESSLEVEL
                        Compression level when encoding. 1 is lowest, 9 is
                        highest
--editor EDITOR, -E EDITOR
                        Editor to use for pasting packets
--compresstype COMPRESSTYPE, -t COMPRESSTYPE
                        Type of compression to use. Defaults to bz2. Ignored
                        if decoding packets. Currently supported: 'bz2',
                        'gzip'
joshua@debian-thinkpad:/usr/local/src/bitshuffle$ cd ..
joshua@debian-thinkpad:/usr/local/src$ python2 -m bitshuffle
/usr/bin/python2: No module named bitshuffle

Project homepage

A project homepage should be added. It can be hosted as a static page on cdaniels.net with minimal difficulty.

It should also be added to my project page, including an OC mirror.

Add editor unit test using `expect`

The editor function is currently untested for most scenarios outside of a Linux machine. The command expect can access the X-server API to ensure the correct response is given to a command.

Smart detection of --encode / --decode

If neither --encode nor --decode is specified, then BitShuffle should attempt to select one based on a heuristic.

If --output is specified, --input is /dev/stdin, and standard input is a tty, then we can infer the user wants to paste in BitShuffle packets and have them decoded, so we should infer --decode.
If --input is specified and --output is not, then we can infer --encode, unless the input is standard in and it is a tty.

Note that if no arguments are given to BitShuffle, then we should not infer anything, as we wouldn't want someone trying to find the help message to have a bunch of packets barfed all over their terminal.
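The heuristic above could be sketched as a pure function, which also makes it easy to unit test. The signature and the convention that `None` means "not specified" are assumptions for illustration:

```python
def infer_mode(input_path, output_path, stdin_is_tty,
               default_input="/dev/stdin"):
    """Return 'encode', 'decode', or None when nothing can be inferred."""
    input_is_stdin = input_path is None or input_path == default_input

    # No arguments at all: infer nothing so the user gets the help message.
    if input_path is None and output_path is None:
        return None

    # Output given, reading from an interactive stdin: user is pasting packets.
    if output_path is not None and input_is_stdin and stdin_is_tty:
        return "decode"

    # Input given without output: encode, unless input is an interactive stdin.
    if input_path is not None and output_path is None:
        if input_is_stdin and stdin_is_tty:
            return None
        return "encode"

    return None
```

In main, `stdin_is_tty` would typically come from `sys.stdin.isatty()`.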

In `decode`, it should be clear that variables are indices

These variable names do not make it obvious that the values are indices. Reading file_hash, you would expect the variable to hold the hash, not the index thereof.

See #60 (comment)

Finalize chunk format

The format in which BitShuffle stores encoded data needs to be finalized. With the closure of #46, this should basically involve codifying the current BitShuffle behavior in a formal document and adding tests which assert future versions can decode packets from the current version. This can be accomplished by encoding a known magic string with the current BitShuffle version and having a test case which decodes it and asserts it is identical to the original string.

`python bitshuffle/bitshuffle.py < file` should work

Would involve changing infer_mode. Should be fairly easy to change.

Jeremy:

I expected python bitshuffle/bitshuffle.py < file to [encode the input file]
if I put in an encoded file I would expect it to decode

Me:

for that we'd have to look through the file before we decode it
grep ((<<.*>>)) isn't terribly hard though
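The grep idea translates to a short check in Python. This is a sketch: the pattern only looks for the ((<< ... >>)) delimiters and is not the project's actual packet regex, and `looks_encoded` is a hypothetical name.

```python
import re

# Non-greedy match between the packet delimiters; DOTALL lets the
# match continue across line breaks inside a packet.
PACKET_RE = re.compile(r"\(\(<<.*?>>\)\)", re.DOTALL)


def looks_encoded(text):
    """Return True if `text` appears to contain at least one packet."""
    return PACKET_RE.search(text) is not None
```

infer_mode could then choose to decode when `looks_encoded` is true for the input and encode otherwise.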

Static builds for Windows/macOS/Linux/BSD

In the interest of user friendliness, we should provide static builds for RELEASE versions of BitShuffle.

This can be accomplished using PyInstaller. I have personal experience with this tool and can handle producing the build scripts.

What remains unclear is how to actually produce the builds. We can certainly use TravisCI to run the build scripts, but I need to do some research on what getting build artifacts out of TravisCI looks like. Ideally, they would show up on the releases pages for any commit tagged with X.Y.Z-RELEASE.

`encode_file` fails if filehandle is of type bytes

This is not a problem with the function itself so much as the implementation. In main, we should decide whether to call encode_file or encode_data depending on the type of the input.

This will help significantly with #16.

Chunk-level checksums

Currently, each packet carries the checksum for the entire file with it. Instead, it should carry the checksum for the chunk it contains. A new, optional field should be added to the packet format which will contain the checksum for the whole file, so it is only transmitted once or twice. This will also be useful, because we can tell the user exactly which packet has been damaged.

I would suggest designing this feature such that all "whole file" checksums get accumulated in a list. If any of them do not match the others, then BitShuffle assumes that the entire stream is corrupt and should throw a warning (although the file should still be written, as the user may be using a parity tool like par2 independently).
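A rough sketch of both halves of the proposal: per-chunk digests that identify the damaged packet, and an agreement check over the accumulated whole-file checksums. All function names are hypothetical, and SHA-256 is used here only as an assumption:

```python
import hashlib


def chunk_checksums(data, chunksize):
    """Return one hex digest per chunk of `data`."""
    return [hashlib.sha256(data[i:i + chunksize]).hexdigest()
            for i in range(0, len(data), chunksize)]


def damaged_chunks(chunks, expected):
    """Return the indices of chunks whose digest does not match `expected`."""
    return [index
            for index, (chunk, digest) in enumerate(zip(chunks, expected))
            if hashlib.sha256(chunk).hexdigest() != digest]


def whole_file_checksums_agree(checksums):
    """True when every packet that carried a whole-file checksum agrees."""
    return len(set(checksums)) <= 1
```

When `whole_file_checksums_agree` returns False, the decoder would warn but still write the file, as suggested above.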

TravisCI should test setup.py using virtualenv

What it says on the tin. Just a basic sanity check. Something like python2 ./setup.py install, then /usr/local/bin/bitshuffle --version should exit zero. The same for Python 3 would be nice as well, although I don't see a clean way to uninstall it after installing it the first time... could use pythonX -m bitshuffle.bitshuffle I suppose, rather than /usr/local/bin.

Add HTML page

Something simple is fine; I like the look of https://www.hackterms.com/ but something uglier like http://jsbeautifier.org/ would suffice. We shouldn't need any site generators or the like, we only need a single page.

We would host on GitHub pages.

Check for ~/.selected_editor

Many Debian installations have a program called 'select-editor' which runs the first time an editor is needed in the terminal. It saves the selected editor to ~/.selected_editor in the following format:

# Generated by /usr/bin/select-editor
SELECTED_EDITOR=

TODO: Check for this after $VISUAL and $EDITOR
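A sketch of the lookup order described in the TODO. The parser assumes the value may be wrapped in quotes; the function names and the /bin/nano value in the usage note are illustrative only:

```python
def parse_selected_editor(text):
    """Extract the SELECTED_EDITOR value from the file's contents, or None."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep and key.strip() == "SELECTED_EDITOR":
            value = value.strip().strip('"').strip("'")
            return value or None
    return None


def find_editor(environ, selected_editor_text=None):
    """Prefer $VISUAL, then $EDITOR, then the ~/.selected_editor contents."""
    for var in ("VISUAL", "EDITOR"):
        if environ.get(var):
            return environ[var]
    if selected_editor_text is not None:
        return parse_selected_editor(selected_editor_text)
    return None
```

A caller would pass `os.environ` and the text read from `os.path.expanduser("~/.selected_editor")` when that file exists; with no environment variables set and a file containing SELECTED_EDITOR="/bin/nano", the result would be /bin/nano.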

Exit with error code on error

Currently most errors are handled with quit(message), which returns success. If used with pipes, this would not have the intended effects.
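A minimal sketch of an explicit failure path, assuming the message belongs on standard error and the status code should be chosen by the caller. The helper name is hypothetical:

```python
import sys


def exit_with_error(message, code=1):
    """Print `message` to stderr and exit with a nonzero status."""
    sys.stderr.write("Error: %s\n" % message)
    sys.exit(code)
```

Keeping the message off standard out matters in pipelines, where stdout may be feeding another BitShuffle instance.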

Regex parser breaks if it hits a newline

In 5f390a0, a newline was added to the encoding comment, breaking the regex, as referenced in 9f9459d. This should not happen.

Support for Base56

It could potentially be useful to have support for Base56. This could be added alongside support for base64. If we do this, we will need to write a library for it, and we should probably do so in a separate project which can be uploaded to PyPI separately.

This will involve a compatibility level bump.

This is marked as hard because while the implementation in BitShuffle should be fairly straightforward, implementing the Base56 encoder/decoder will be really annoying.
