cowtowncoder / java-merge-sort
Basic stand-alone disk-based N-way merge sort component for Java
License: Apache License 2.0
I'd like to raise what I think is a doubling of this library's expected memory usage. We recently hit an OOM during a sort because our estimatedSize calculations were wrong (they did not truly reflect the cost of the object structure; you can see the fix here: https://github.com/Aconex/scrutineer/commit/47f304267426bdd44fe1b396770449ee1e0271ce )
While debugging what we might have done wrong, we stumbled upon code within Sorter.java that appears to hold two Object[] arrays, each sized according to the sort-memory configuration item. What we think is happening is:
Sorter.java line 194 - sort(...):
....
Object[] items = _readMax(inputReader, buffer, _config.getMaxMemoryUsage(), null);
This uses one buffer the size of the configured memory; if everything fits in memory, the whole job is done. If it doesn't, execution continues to:
Sorter.java line 207 - sort(...) :
....
List presorted = presort(inputReader, buffer, items, next);
This passes in the items array that was already read using the configured memory budget, so that array is still retained.
Sorter.java line 299 - presort(...)
....
Object[] items = _readMax(inputReader, buffer, _config.getMaxMemoryUsage(), nextValue);
This now creates a second Object[] of the requested size, while the firstSortedBatch array reference passed in is still retained in scope.
When you combine this doubling of the configured memory with a bad object-size estimate, it's an episode of Air Crash Investigations waiting to be filmed.
Is this the correct reading of how the code works? Looking at the presort method, only the first two lines need the firstSortedBatch reference; the do/while loop making up the rest of the method doesn't. The firstSortedBatch reference really needs to be released somehow (nulled out so it becomes eligible for GC), thereby honouring the configured memory utilisation. Some code refactoring in order?
thoughts?
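To make the refactoring idea concrete, here is a toy model (not the actual Sorter code; readBatch, release, and the counters are invented purely for illustration): dropping the first batch reference before reading the next keeps peak retention at one maxMemoryUsage-sized array instead of two.

```java
import java.util.ArrayList;
import java.util.List;

public class PresortSketch {
    static int live = 0;       // batch arrays currently reachable
    static int peakLive = 0;   // worst case observed

    static Object[] readBatch(int size) {
        live++;
        peakLive = Math.max(peakLive, live);
        return new Object[size];
    }

    static void release(Object[] batch) {
        live--;
    }

    // models presort(): spill the first batch, then drop its reference
    // BEFORE allocating the next one, so at most one batch is live
    static int presort(Object[] firstSortedBatch, int totalBatches) {
        List<Integer> spilled = new ArrayList<>();
        spilled.add(firstSortedBatch.length);   // "write to temp file"
        release(firstSortedBatch);
        firstSortedBatch = null;                // eligible for GC from here on
        for (int i = 1; i < totalBatches; i++) {
            Object[] items = readBatch(16);
            spilled.add(items.length);
            release(items);
        }
        return spilled.size();
    }

    public static void main(String[] args) {
        Object[] first = readBatch(16);
        int batches = presort(first, 4);
        System.out.println("batches=" + batches + " peakLive=" + peakLive);
    }
}
```

With the release moved before the next read, the peak stays at one batch; keeping the reference alive across the do/while loop is exactly what pushes it to two.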
Greetings! I am using the library and really like it: very flexible, and it performs quite well. However, there's always room for optimization. I noticed that activity on this project is low and most comments are quite old, so I should ask first: should I be looking at a newer project that has effectively replaced this one?
That said, my experience with C++-based sorting showed that two improvements can produce very significant results:
I haven't dug deep enough into the code to see if some of this is already supported, please tell me to RTFM if I've missed something.
Thanks,
john
I would like a way to enhance the sort or merge process that allows removing duplicate values from the input.
For example, if my input is: a, b, c, b, a the output would be a, b, c
Currently the Merger constructor is buried too deep in the Sorter code to easily override the creation process and supply my own Merger subclass.
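In the meantime, since merge-sort output is sorted, duplicates come out adjacent, so a thin wrapper over the result iterator can drop them without touching Merger at all. A sketch (DedupIterator is an invented name, not part of the library's API):

```java
import java.util.*;

public class DedupIterator<T> implements Iterator<T> {
    private final Iterator<T> delegate;
    private final Comparator<T> cmp;
    private T next;
    private boolean hasNext;

    public DedupIterator(Iterator<T> delegate, Comparator<T> cmp) {
        this.delegate = delegate;
        this.cmp = cmp;
        advance(null);
    }

    // find the next element that differs from prev; duplicates are adjacent
    // in sorted input, so one comparison against the previous value suffices
    private void advance(T prev) {
        while (delegate.hasNext()) {
            T candidate = delegate.next();
            if (prev == null || cmp.compare(prev, candidate) != 0) {
                next = candidate;
                hasNext = true;
                return;
            }
        }
        hasNext = false;
    }

    @Override public boolean hasNext() { return hasNext; }

    @Override public T next() {
        if (!hasNext) throw new NoSuchElementException();
        T result = next;
        advance(result);
        return result;
    }

    public static void main(String[] args) {
        // the input a, b, c, b, a sorts to a, a, b, b, c; dedup yields a, b, c
        List<String> sorted = Arrays.asList("a", "a", "b", "b", "c");
        Iterator<String> it =
            new DedupIterator<>(sorted.iterator(), Comparator.<String>naturalOrder());
        List<String> out = new ArrayList<>();
        it.forEachRemaining(out::add);
        System.out.println(out);
    }
}
```

This doesn't reduce temp-file I/O the way in-merge deduplication would, but it produces the requested output with no changes to the library.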
In the method Sorter#_merge(), the list of data readers is never closed. This means that in the finally block where the input files are deleted, deletion fails: the files won't actually be deleted until the GC finally notices the open streams are no longer used, closes them, and the files can be removed.
I'm using the merge-sort as part of a much larger, long-running process, which means the files may be left open, and on disk, for a significant period of time.
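A minimal sketch of the pattern the fix would follow (generic resource handling, not the actual _merge() code): close every reader in the finally block before deleting the files, so deletion succeeds immediately instead of waiting on the GC.

```java
import java.io.*;
import java.util.*;

public class CloseThenDelete {
    public static void main(String[] args) throws IOException {
        // simulate merger inputs: temp files, each with an open reader
        List<File> inputs = new ArrayList<>();
        List<Reader> readers = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            File f = File.createTempFile("j-merge-sort-", ".tmp");
            inputs.add(f);
            readers.add(new FileReader(f));
        }
        try {
            // ... the merge would consume the readers here ...
        } finally {
            // close first: an open handle blocks deletion on Windows
            for (Reader r : readers) {
                try { r.close(); } catch (IOException e) { /* best effort */ }
            }
            for (File f : inputs) {
                f.delete();  // now succeeds promptly
            }
        }
        for (File f : inputs) {
            System.out.println(f.exists());
        }
    }
}
```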
The attached program sorts 1000 items in a very small amount of memory. So small that it creates 501 temp files, which are pre-merged into six files, which I can see in my temp folder:
-rwx------+ 1 jlilley Domain Users 2600 Sep 17 18:28 j-merge-sort-3015695007988242713.tmp
-rwx------+ 1 jlilley Domain Users 13 Sep 17 18:28 j-merge-sort-3903033334011859833.tmp
-rwx------+ 1 jlilley Domain Users 2600 Sep 17 18:28 j-merge-sort-4734711900361229763.tmp
-rwx------+ 1 jlilley Domain Users 2600 Sep 17 18:28 j-merge-sort-5609443384695274597.tmp
-rwx------+ 1 jlilley Domain Users 2477 Sep 17 18:28 j-merge-sort-8785061516420350339.tmp
-rwx------+ 1 jlilley Domain Users 2600 Sep 17 18:28 j-merge-sort-9037742217201471621.tmp
However, when I finish the output iterator and call sort.close(), it attempts to remove the original 501 temp files, not the files resulting from the pre-merge.
for (File input : _mergerInputs) {
input.delete();
}
I believe the problem is here:
public Iterator<T> sort(DataReader<T> inputReader)
...
_mergerInputs = presorted;
_merger = _createMergeReader(merge(presorted));
It should look more like
_mergerInputs = merge(presorted);
_merger = _createMergeReader(_mergerInputs);
If this seems right, I'll post a patch.
For a k-way merge with k > 2, consider using a PriorityQueue instead of multiple 2-way mergers, for better performance.
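For illustration, a self-contained k-way merge driven by a PriorityQueue might look like this (a sketch, not code from this library): each queue entry pairs an input's current head with its iterator, giving O(n log k) comparisons overall.

```java
import java.util.*;

public class KWayMerge {
    static <T> List<T> merge(List<Iterator<T>> inputs, Comparator<T> cmp) {
        // queue ordered by each input's current head element
        PriorityQueue<Map.Entry<T, Iterator<T>>> pq =
            new PriorityQueue<>((a, b) -> cmp.compare(a.getKey(), b.getKey()));
        for (Iterator<T> it : inputs) {
            if (it.hasNext()) pq.add(new AbstractMap.SimpleEntry<>(it.next(), it));
        }
        List<T> out = new ArrayList<>();
        while (!pq.isEmpty()) {
            Map.Entry<T, Iterator<T>> e = pq.poll();
            out.add(e.getKey());                 // emit smallest head
            Iterator<T> it = e.getValue();
            if (it.hasNext())                    // refill from the same input
                pq.add(new AbstractMap.SimpleEntry<>(it.next(), it));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Iterator<Integer>> runs = Arrays.asList(
            Arrays.asList(1, 4, 7).iterator(),
            Arrays.asList(2, 5, 8).iterator(),
            Arrays.asList(3, 6, 9).iterator());
        System.out.println(merge(runs, Comparator.naturalOrder()));
    }
}
```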
Due to the fix for #21 we need to raise the JDK baseline from Java 6 to at least Java 7.
But at this point in time Java 8 is the oldest supported version, so let's just move to that.
The change needs at least a minor version bump, so let's also move this project from 1.0.x to 1.1.0.
The ASL2.0 license boilerplate for this repository doesn't make it clear about a specific copyright owner:
Line 190 in 405cd63
Copyright [yyyy] [name of copyright owner]
I'd presume this is @cowtowncoder himself; it would be great if you could update this repo with that corrected. (I've made the same mistake myself in another repo, and I'm sure we're both not the only ones out there.)
RawTextFileReader corrupts lines that span more than 32000 bytes (twice the size of the allocated buffer).
When no CR/LF is found within the first 16000 bytes, _readNextSlow is called. This method has a while loop, but the loop ignores any 16000-byte chunk that does not contain a CR/LF, leading to corrupted input.
In com.fasterxml.sort.std.RawTextLineReader#readNext, the start index is stored before the call to _skipCR(), so even when _skipCR() skips the LF byte following the CR (by incrementing ++_inputPtr), the Arrays.copyOfRange call below still includes it.
On Windows, this results in sorted files being written as:
<LF>
data...data...data<CRLF>
<LF>
data...data...data<CRLF>
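A toy reproduction of the ordering issue described above (illustrative only; the buffer handling is simplified far below what RawTextLineReader actually does): capturing the slice start before the CR/LF skip lets the previous line's terminator leak into the next slice.

```java
public class SkipCrSketch {
    public static void main(String[] args) {
        byte[] buf = "one\r\ntwo".getBytes();

        // buggy order: remember the start, THEN skip the prior line's CR/LF
        int ptr = 3;                      // positioned at the '\r' ending "one"
        int startBuggy = ptr;             // start captured too early
        ptr++;                            // skip CR
        if (buf[ptr] == '\n') ptr++;      // the skip also consumes the LF
        String buggy = new String(buf, startBuggy, buf.length - startBuggy);

        // fixed order: skip first, then capture the start of the next line
        int startFixed = ptr;
        String fixed = new String(buf, startFixed, buf.length - startFixed);

        System.out.println(buggy.startsWith("\r\n"));  // terminator leaked in
        System.out.println(fixed);
    }
}
```

The leaked "\r\n" prefix is what shows up on Windows as the stray <LF> lines in the sorted output above.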
Currently there is no support for estimating sortable entry lengths. It would be nice to have a helper object for calculating object sizes, using conservative estimates.
We should probably have separate instances for 32-bit and 64-bit machines, but let the caller specify which one is wanted.
Since this object is performance-sensitive, it will probably need to stay stateless (i.e. the caller must pass all data as arguments) instead of builder-style (which would otherwise be nice).
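One possible shape for such a helper, sketched under assumed object layouts (the constants, method names, and the pre-Java-9 String-over-char[] model are all invented for illustration, not proposed API):

```java
public final class SizeEstimator {
    private final int refSize;     // 4 on 32-bit, 8 on 64-bit (uncompressed oops)
    private final int headerSize;  // object header, conservative

    private SizeEstimator(int refSize, int headerSize) {
        this.refSize = refSize;
        this.headerSize = headerSize;
    }

    // caller picks the flavor; both instances are immutable and stateless
    public static final SizeEstimator FOR_32_BIT = new SizeEstimator(4, 8);
    public static final SizeEstimator FOR_64_BIT = new SizeEstimator(8, 16);

    // every method takes all inputs as arguments: no builder-style state
    public long objectBase(int referenceFields, int primitiveBytes) {
        return align(headerSize + (long) referenceFields * refSize + primitiveBytes);
    }

    public long stringSize(int length) {
        // String object (one char[] ref, one int hash)
        // plus char[] (header, length field, 2 bytes per char)
        return objectBase(1, 4) + align(headerSize + 4 + 2L * length);
    }

    private static long align(long n) {
        return (n + 7) & ~7L;  // round up to 8-byte alignment
    }

    public static void main(String[] args) {
        System.out.println(SizeEstimator.FOR_64_BIT.stringSize(10));
    }
}
```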
$ wc -l ngram.data; wc -l ngram_sort.data
10069731 ngram.data
10067458 ngram_sort.data
After sorting I have lost about 2273 lines. Why?
Suggestion: it might be worthwhile to parallelize the in-memory sort phase for better performance. For example, using jsr166y, an array can be sorted in parallel simply like this: ParallelArray.createUsingHandoff(Object[], ForkJoinPool).sort(Comparator).getArray()
Note, however, that with that algorithm implementation the sort might no longer be "stable", i.e. the relative order of equal elements might change.
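For what it's worth, on JDK 8 and later the same effect is available from the standard library via Arrays.parallelSort, whose Object[] overload is documented as a stable parallel merge sort, which would address the stability caveat above:

```java
import java.util.*;

public class ParallelSortSketch {
    public static void main(String[] args) {
        Integer[] data = { 5, 3, 9, 1, 7, 2, 8, 4, 6 };
        // parallel merge sort over the common ForkJoinPool; stable for objects
        Arrays.parallelSort(data, Comparator.naturalOrder());
        System.out.println(Arrays.toString(data));
    }
}
```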