Comments (7)
I'm not sure that is possible without losing a lot of performance. The core code assumes the memory is contiguous, and adding checks for the end of the buffer would likely slow down the code significantly. Specifically, for compression the algorithms perform a single bounds check to verify that there is plenty of space in the output buffer, and do not perform internal bounds checks, which makes them very fast.
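A minimal sketch of that "check once up front, then run unchecked" pattern. The worst-case formula is Snappy's documented bound (32 + n + n/6); the class and method names here are illustrative, not aircompressor's actual code:

```java
// Illustrative sketch: one bounds check before compressing guarantees the
// worst-case output fits, so the inner compression loops need no per-byte
// checks. The formula is Snappy's published worst-case bound; the guard
// itself is a simplified stand-in, not aircompressor's real implementation.
public final class UpfrontBoundsCheck
{
    // Worst-case compressed size for n input bytes (Snappy: 32 + n + n/6)
    static int maxCompressedLength(int uncompressedSize)
    {
        return 32 + uncompressedSize + uncompressedSize / 6;
    }

    // Single check before the unchecked fast path runs
    static void checkOutputCapacity(int inputLength, int outputCapacity)
    {
        if (outputCapacity < maxCompressedLength(inputLength)) {
            throw new IllegalArgumentException(
                    "output capacity " + outputCapacity
                            + " < worst case " + maxCompressedLength(inputLength));
        }
        // ...from here the compressor can write without further bounds checks
    }

    public static void main(String[] args)
    {
        System.out.println(maxCompressedLength(256 * 1024)); // prints 305866
    }
}
```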
Generally, I have found that for Hadoop formats this API is good enough, since compression is typically done in smaller chunks and decompression typically knows the output size up front.
Is there another shape that could work that doesn't need a multi-part output buffer?
from aircompressor.
I'm breaking the input stream into 256k buffers to compress, which for Snappy means that I need an output buffer of 306k. If we assume the compressed output is ~40%, that means that the output is 102k. To compress the next 256k block I either need to copy out the 102k or waste the empty 204k and make a new 306k buffer. I can't afford to use 3x the memory so I need to copy.
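Spelling out the buffer math above (the worst case uses Snappy's 32 + n + n/6 bound; the 40% ratio is the rough figure assumed in the comment):

```java
// The buffer arithmetic from the comment above, made concrete.
// The worst-case bound is Snappy's published formula; the 40% compression
// ratio is the comment's rough assumption, not a measured number.
public final class BlockBufferMath
{
    static int worstCase(int n)
    {
        return 32 + n + n / 6;
    }

    public static void main(String[] args)
    {
        int input = 256 * 1024;            // one 256k input block
        int worst = worstCase(input);      // 305866 bytes, the ~306k above
        int typical = input * 40 / 100;    // ~40% ratio -> ~102k of output
        int idle = worst - typical;        // ~200k of the buffer left unused
        System.out.println(worst + " / " + typical + " / " + idle);
    }
}
```

So each new 306k buffer is expected to carry only ~102k of data, which is why the choice reduces to copying the output out or wasting roughly two-thirds of every allocation.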
That said, I don't need to use every single byte in the output buffer as long as I know what is used. So if the compressor checked for remaining space at whatever points are reasonable for the algorithm, that would be fine.
Decompression is fine since I know exactly that the output will always be less than 256k. :)
In the case of snappy the algorithm will only compress 65536 bytes in one shot, and then uses an internal framing format, so a 256k buffer doesn't really help. For other algorithms, it can help.
I find that when I am compressing, I'm writing into a larger output buffer (1-8MB) for a disk or network write, and I compress directly into that larger buffer until it is full. This shares the allocation across the entire stream lifetime. In your code, do you have access to a larger output buffer?
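A sketch of that shared-buffer pattern: each block is compressed straight into one large buffer and recorded as an (offset, length) slice. The `Slice` record and `compressBlock` are hypothetical names, and `compressBlock` here just copies its input; a real implementation would call an actual compressor such as aircompressor's Snappy.

```java
// Sketch of compressing successive blocks into one shared output buffer,
// tracking each result as an (offset, length) slice. compressBlock is a
// stand-in that merely copies; the names are illustrative, not aircompressor's.
import java.util.ArrayList;
import java.util.List;

public final class SharedOutputBuffer
{
    record Slice(int offset, int length) {}

    static final int WORST_CASE_OVERHEAD = 32;  // simplified bound for the stub

    // Stand-in "compressor": copies input into output, returns bytes written
    static int compressBlock(byte[] input, byte[] output, int outputOffset)
    {
        System.arraycopy(input, 0, output, outputOffset, input.length);
        return input.length;
    }

    public static void main(String[] args)
    {
        byte[] shared = new byte[8 * 1024 * 1024];   // one 8MB buffer per stream
        List<Slice> slices = new ArrayList<>();
        int used = 0;
        for (int block = 0; block < 3; block++) {
            byte[] inputBlock = new byte[256 * 1024];
            // if the worst case for the next block would not fit, this is where
            // shared[0..used) would be flushed to disk/network and reset
            if (shared.length - used < inputBlock.length + WORST_CASE_OVERHEAD) {
                used = 0;
                slices.clear();
            }
            int written = compressBlock(inputBlock, shared, used);
            slices.add(new Slice(used, written));
            used += written;
        }
        System.out.println(slices.size() + " slices, " + used + " bytes used");
    }
}
```

The point of the pattern is that the big allocation amortizes over the whole stream, and each compressed block is just a view into it rather than its own buffer.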
On a side note, I should finish a patch to add aircompressor support to ORC today. The code runs; I just need to add some new tests to cover the new codecs and verify compatibility with the old Snappy code.
https://issues.apache.org/jira/browse/ORC-77
The problem is that I'm compressing all of the columns as they are written, and I can't afford to have 1MB buffers per column (users sometimes have thousands of columns).
Yeah, I knew that Snappy was putting in the 64k limit (unfortunately). I guess I should have Snappy restrict the buffer size, although as you point out it doesn't really change the underlying problem.
How does the current compression system work? I didn't think the Snappy C code supported this kind of buffer management.
The old code had the copy, but I had left a comment to fix it at some point:
// I should work on a patch for Snappy to support an overflow buffer
// to prevent the extra buffer copy.
The current code for Snappy in ORC is really sloppy: it allocates a new output buffer for each block and then copies it into the final buffer. I'll fix that as part of ORC-77, moving to aircompressor.
I guess ORC could allocate shared 8MB buffers and then capture the output of the compressor as slices of that buffer.