
java-merge-sort's People

Contributors

cowtowncoder, davidandrzej, jlleitschuh, nathanlws, ndikan


java-merge-sort's Issues

Possible 'overallocation' of memory

I just wanted to raise that I think this library uses double the expected amount of memory. We recently hit an OOM during the sort because our estimatedSize calculations were wrong (they did not truly reflect the cost of the object structure; you can see the fix here: https://github.com/Aconex/scrutineer/commit/47f304267426bdd44fe1b396770449ee1e0271ce )

While debugging what we might have done wrong, we stumbled upon code in Sorter.java that appears to hold two Object[] arrays, each sized according to the configured sort memory. What we think is happening is:

  • Sorter.java line 194 - sort(...):
    ....
    Object[] items = _readMax(inputReader, buffer, _config.getMaxMemoryUsage(), null);

    This reads one buffer up to the configured memory limit; if everything fits in memory, the whole job is done. If not, execution continues to:

  • Sorter.java line 207 - sort(...) :
    ....
    List presorted = presort(inputReader, buffer, items, next);

    This passes in the items array already read using the configured memory, so that array is still retained.

  • Sorter.java line 299 - presort(...)
    ....
    Object[] items = _readMax(inputReader, buffer, _config.getMaxMemoryUsage(), nextValue);

    This now creates a second Object[] of the configured size, while the firstSortedBatch array reference passed in is still held in scope.

When you combine this 'doubling' of the configured memory with a bad object size estimate, it's an episode of Air Crash Investigations waiting to be filmed.

Is this a correct reading of how the code works? From reading the presort method, only its first two lines need the firstSortedBatch reference; the remaining do/while loop does not. The firstSortedBatch reference should somehow be released (nulled out, making it eligible for GC), thereby honouring the configured memory utilisation. Is some code refactoring in order?

thoughts?
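The refactoring suggested above can be sketched as follows. Note that the names here (spillThenReadNext, readNextBatch) are hypothetical illustrations, not the actual Sorter.java internals; the point is simply that the first batch is released before the second full-size batch is allocated, so peak usage stays near one batch rather than two.

```java
import java.util.function.IntFunction;

// Illustrative sketch only, not the actual Sorter.java code: spill the
// first sorted batch, null the reference, THEN read the next batch, so
// that two full-size Object[] batches are never live at the same time.
public class PresortSketch {
    public static int spillThenReadNext(Object[] firstSortedBatch,
                                        IntFunction<Object[]> readNextBatch) {
        int spilled = firstSortedBatch.length; // stand-in for writing a temp file
        firstSortedBatch = null;               // GC-eligible before the next allocation
        Object[] nextBatch = readNextBatch.apply(spilled);
        return spilled + nextBatch.length;
    }
}
```

Nulling the local reference only helps if the caller does not also retain the array, which is exactly the change the issue asks for inside presort().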

ENHANCEMENT: Performance ideas: pipelining and compression

Greetings! I am using the library and really like it: very flexible, and it performs quite well. However, there's always room for optimization. I noticed that activity on this project is low and most comments are quite old, so let me ask first: should I be looking at a newer project that has effectively replaced this one?

That being said, my experience in C++ based sorting showed two improvements can produce very significant results:

  • Pipelining: during the block sort phase, instead of doing the accumulate/blocksort/write in a procedural loop, fill each block and then hand it to an execution pipeline that runs the sort and the write as separate parallel tasks.
  • Compression: a light compression scheme like Snappy can reduce temp space by around 70%, and can result in faster I/O (especially if compression is done in parallel using the above pipeline technique).

I haven't dug deep enough into the code to see if some of this is already supported, please tell me to RTFM if I've missed something.
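The pipelining idea can be sketched roughly like this. This is a standalone illustration, not part of java-merge-sort's API: blocks are handed to a small thread pool so sorting overlaps with the (here simulated) write phase, which drains results in order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch of pipelined block sorting: each filled block becomes a sort task,
// while the consumer drains sorted blocks in submission order ("write" phase).
public class PipelineSketch {
    public static List<int[]> sortBlocks(List<int[]> blocks) {
        ExecutorService sorters = Executors.newFixedThreadPool(2);
        try {
            List<Future<int[]>> pending = new ArrayList<>();
            for (int[] block : blocks) {
                pending.add(sorters.submit(() -> { Arrays.sort(block); return block; }));
            }
            List<int[]> sorted = new ArrayList<>();
            for (Future<int[]> f : pending) {
                sorted.add(f.get()); // drain in order, as a writer thread would
            }
            return sorted;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            sorters.shutdown();
        }
    }
}
```

A real pipeline would bound the queue of pending blocks so memory stays within the configured budget; that back-pressure is omitted here for brevity.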

Thanks,
john

Sort/Merge process that removes duplicates

I would like a way to enhance the sort or merge process that allows removing duplicate values from the input.

For example, if my input is: a, b, c, b, a the output would be a, b, c

Currently the Merger constructor is invoked too deep inside the Sorter code to easily override the creation process and plug in my own Merger subclass.
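Until Merger creation is overridable, one workaround is to deduplicate after the fact: since the merged output is sorted, duplicates are adjacent, so a thin Iterator wrapper over Sorter's result can drop them. A sketch (not part of the library):

```java
import java.util.Comparator;
import java.util.Iterator;

// Wraps a sorted iterator and skips elements that compare equal to the
// previously returned element, turning a, a, b, c, c into a, b, c.
public class DedupIterator<T> implements Iterator<T> {
    private final Iterator<T> delegate;
    private final Comparator<T> cmp;
    private T next;
    private boolean hasNext;

    public DedupIterator(Iterator<T> delegate, Comparator<T> cmp) {
        this.delegate = delegate;
        this.cmp = cmp;
        advance(null); // prime the first element
    }

    private void advance(T last) {
        while (delegate.hasNext()) {
            T candidate = delegate.next();
            if (last == null || cmp.compare(last, candidate) != 0) {
                next = candidate;
                hasNext = true;
                return;
            } // else: duplicate of the last returned value, skip it
        }
        hasNext = false;
    }

    public boolean hasNext() { return hasNext; }

    public T next() {
        T result = next;
        advance(result);
        return result;
    }
}
```

Doing the deduplication inside the merge itself would still be better, since it also shrinks intermediate temp files; this wrapper only cleans the final output.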

Sorter#_merge() not closing streams correctly.

In the method Sorter#_merge(), the list of data readers is never closed. As a result, in the finally block where the input files are deleted, the deletes fail: the files won't actually be removed until the GC eventually notices the open streams are no longer referenced, closes them, and only then can the files be deleted.

I'm using the merge-sort as part of a much larger, long-running process, which means the files may be left open, and on disk, for a significant period of time.
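The suggested behavior can be sketched with a hypothetical helper (this is not the actual Sorter#_merge() code): close every reader explicitly before deleting the backing files, rather than relying on GC and finalization.

```java
import java.io.Closeable;
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;

// Illustrative cleanup helper: release OS file handles first, then delete.
// On Windows in particular, delete() fails while the file is still open.
public class CloseBeforeDelete {
    public static boolean closeAndDelete(List<Closeable> readers, List<File> files) {
        try {
            for (Closeable r : readers) {
                r.close(); // explicit close instead of waiting for GC
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        boolean allDeleted = true;
        for (File f : files) {
            allDeleted &= f.delete(); // now succeeds promptly
        }
        return allDeleted;
    }
}
```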

Temp files that result from two-phase merge are not deleted

The attached program sorts 1000 items in a very small amount of memory: so small that it creates 501 temp files, which are pre-merged into six files, visible in my temp folder:

-rwx------+ 1 jlilley Domain Users 2600 Sep 17 18:28 j-merge-sort-3015695007988242713.tmp
-rwx------+ 1 jlilley Domain Users   13 Sep 17 18:28 j-merge-sort-3903033334011859833.tmp
-rwx------+ 1 jlilley Domain Users 2600 Sep 17 18:28 j-merge-sort-4734711900361229763.tmp
-rwx------+ 1 jlilley Domain Users 2600 Sep 17 18:28 j-merge-sort-5609443384695274597.tmp
-rwx------+ 1 jlilley Domain Users 2477 Sep 17 18:28 j-merge-sort-8785061516420350339.tmp
-rwx------+ 1 jlilley Domain Users 2600 Sep 17 18:28 j-merge-sort-9037742217201471621.tmp

However, when I finish the output iterator and call sort.close(), it attempts to remove the original 501 temp files, not the files resulting from the pre-merge.

            for (File input : _mergerInputs) {
                input.delete();
            }

I believe that this is the fix:

public Iterator<T> sort(DataReader<T> inputReader)
    ...
    _mergerInputs = presorted;
    _merger = _createMergeReader(merge(presorted));

It should look more like

    _mergerInputs = merge(presorted);
    _merger = _createMergeReader(_mergerInputs);

If this seems right, I'll post a patch.

Test.zip

Increase JDK baseline from Java 6 to Java 8

Due to the fix for #21 we need to raise the JDK baseline from Java 6 to at least Java 7. But at this point in time, Java 8 is the oldest supported version, so let's just move to it.

The change needs to result in at least a minor version bump, so let's also move this project from 1.0.x to 1.1.0.

copyright clarity - license boilerplate issue

The ASL2.0 license boilerplate for this repository doesn't make it clear about a specific copyright owner:

Copyright [yyyy] [name of copyright owner]

I'd presume this is @cowtowncoder himself. It would be great if you could update this repo with the corrected notice. (I've made the same mistake myself in another repo, and I'm sure we're not the only ones out there.)

Long lines are corrupted when read by _readNextSlow

RawTextFileReader corrupts lines which span more than 32000 bytes (twice the size of the allocated buffer).

When no CR/LF is found within the first 16000 bytes, _readNextSlow is called. That method has a while loop, but the loop discards any 16000-byte chunk that does not contain a CR/LF, leading to corrupted input.
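A sketch of what the slow path should do instead: accumulate every chunk until a line feed is found, so no bytes are dropped. This is illustrative code, not the actual RawTextFileReader implementation, and the push-back of bytes after the LF that a real reader needs is deliberately omitted.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Reads one line from `in` using fixed-size chunks, retaining chunks that
// contain no line terminator instead of discarding them.
public class SlowLineReader {
    public static String readLine(InputStream in, int chunkSize) {
        ByteArrayOutputStream acc = new ByteArrayOutputStream();
        byte[] chunk = new byte[chunkSize];
        try {
            while (true) {
                int n = in.read(chunk);
                if (n < 0) break; // end of input
                for (int i = 0; i < n; i++) {
                    if (chunk[i] == '\n') {
                        acc.write(chunk, 0, i); // keep everything before the LF
                        // NOTE: a real reader must also retain chunk[i+1..n)
                        // for the next line; that bookkeeping is omitted here
                        return new String(acc.toByteArray(), StandardCharsets.UTF_8);
                    }
                }
                acc.write(chunk, 0, n); // no LF in this chunk: retain, don't drop
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return acc.size() == 0 ? null : new String(acc.toByteArray(), StandardCharsets.UTF_8);
    }
}
```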

RawTextLineReader doesn't skip line feeds correctly on Windows

In com.fasterxml.sort.std.RawTextLineReader#readNext, the start index is recorded before the call to _skipCR(). So even when _skipCR() skips the LF byte following the CR by incrementing ++_inputPtr, the Arrays.copyOfRange call below still includes that LF.

On Windows, this results in sorted files being written as:
<LF>
data...data...data<CRLF>
<LF>
data...data...data<CRLF>
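The ordering fix described above can be sketched like this (illustrative, not the actual RawTextLineReader code): the start index is taken only after the previous line's CR and LF have both been consumed, so no stray LF ends up at the front of the copied range.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Splits a buffer into lines, consuming CRLF / CR / LF terminators fully
// BEFORE recording the next line's start index.
public class CrLfSplit {
    public static List<String> splitLines(byte[] buf) {
        List<String> out = new ArrayList<>();
        int ptr = 0;
        while (ptr < buf.length) {
            int start = ptr; // recorded after any terminator was skipped
            while (ptr < buf.length && buf[ptr] != '\r' && buf[ptr] != '\n') {
                ptr++;
            }
            out.add(new String(buf, start, ptr - start, StandardCharsets.US_ASCII));
            if (ptr < buf.length && buf[ptr] == '\r') ptr++; // consume CR
            if (ptr < buf.length && buf[ptr] == '\n') ptr++; // consume LF of CRLF
        }
        return out;
    }
}
```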

Add helper methods for calculating approximate object mem usage

Currently there is no support for estimating sortable entry lengths. It would be nice to have a helper object for calculating approximate object sizes, using conservative estimates like:

  • Each object has some base per-instance overhead (say, 16 bytes)
  • Arrays also have base overhead (possibly less than objects, but let's say 16 bytes) -- plus per-entry overhead (4 or 8 bytes per reference, 1/2/4/8 bytes for primitives)
  • Fields and array entries could perhaps be combined

We should probably have separate instances for 32-/64-bit machines; but can let caller specify which one is wanted.

Since this object is performance-sensitive, it will probably need to be stateless (i.e. the caller must pass all data as arguments) instead of builder-style (which would otherwise be nicer).
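A minimal sketch of such a stateless helper, using the conservative constants suggested above for a 64-bit JVM. All numbers are the issue's guesses, not measured values, and the String example assumes a classic header-plus-char[] layout purely for illustration.

```java
// Stateless size estimator sketch: all inputs are passed as arguments,
// no builder state. Constants are conservative 64-bit guesses.
public class MemEstimator {
    static final int OBJECT_OVERHEAD = 16; // per-instance header estimate
    static final int ARRAY_OVERHEAD  = 16; // array header estimate

    // Object with the given total primitive-field bytes and reference count
    public static long objectSize(int primitiveFieldBytes, int referenceFields) {
        return OBJECT_OVERHEAD + primitiveFieldBytes + 8L * referenceFields;
    }

    public static long arraySize(int length, int bytesPerEntry) {
        return ARRAY_OVERHEAD + (long) length * bytesPerEntry;
    }

    // Example composite: header + int hash + one reference + backing char[]
    public static long stringSize(int length) {
        return objectSize(4, 1) + arraySize(length, 2);
    }
}
```

Separate 32-bit constants (4-byte references, smaller headers) could be provided the same way, with the caller choosing which variant to use.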

Data loss after sort

$ wc -l ngram.data; wc -l ngram_sort.data
 10069731 ngram.data
 10067458 ngram_sort.data

After sorting, about 2273 lines are missing. Why?

Parallelize main memory sort phase

Suggestion: it might be worthwhile to parallelize the main in-memory sort phase for better performance. For example, using jsr166y, an array can be sorted in parallel simply like this: ParallelArray.createUsingHandoff(Object[], ForkJoinPool).sort(Comparator).getArray()

Note, however, that with that algorithm implementation the sort might not be "stable" anymore, i.e. the relative order of equal elements might change.
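On Java 8 and later the same idea no longer needs jsr166y: java.util.Arrays.parallelSort runs on the common ForkJoinPool, and its documentation guarantees the Object[] overloads are stable, which also addresses the stability concern above. A minimal sketch:

```java
import java.util.Arrays;

// Sorts the array in place on the common ForkJoinPool and returns it.
// Arrays.parallelSort(T[]) is a parallel sort-merge documented as stable.
public class ParallelSortSketch {
    public static Integer[] sorted(Integer[] items) {
        Arrays.parallelSort(items);
        return items;
    }
}
```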
