npgall / concurrent-trees
Concurrent Radix and Suffix Trees for Java
License: Apache License 2.0
The current documentation does not mention these methods in InvertedRadixTree:
getLongestKeyPrefixing(CharSequence var1)
getValueForLongestKeyPrefixing(CharSequence var1)
getKeyValuePairForLongestKeyPrefixing(CharSequence var1)
I had a use case for creating simple URL wildcard matching, and didn't realise these methods existed until after digging deep into the Javadoc.
I'm happy to create a pull request for this if it seems useful.
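For reference, a minimal plain-Java sketch of what "longest key prefixing" means for the URL-matching use case: among the stored keys, find the longest one that is a prefix of the input. The real tree answers this in a single traversal of the input; the linear scan and the `longestKeyPrefixing` helper name here are purely illustrative, not part of the library.

```java
import java.util.*;

public class LongestPrefixDemo {
    // Among the stored keys, return the longest one that is a prefix of the
    // input, or null if no stored key prefixes the input.
    static String longestKeyPrefixing(Collection<String> keys, String input) {
        String best = null;
        for (String key : keys) {
            if (input.startsWith(key) && (best == null || key.length() > best.length())) {
                best = key;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> routes = Arrays.asList("/a", "/a/b", "/c");
        System.out.println(longestKeyPrefixing(routes, "/a/b/c")); // prints /a/b
        System.out.println(longestKeyPrefixing(routes, "/x"));     // prints null
    }
}
```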
Can I restrict tree nodes to only string tokens from a given set, e.g. {"apple", "orange", "world wide web", "how are you", "how", "world"}?
I want to use this library. How can I add and import it into my project?
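The library is published to Maven Central, so a dependency declaration along these lines should work (the version shown is an assumption; check Maven Central for the current release):

```xml
<dependency>
    <groupId>com.googlecode.concurrent-trees</groupId>
    <artifactId>concurrent-trees</artifactId>
    <version>2.6.1</version> <!-- verify the latest version on Maven Central -->
</dependency>
```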
Please add support for querying the longest prefix match (this is not currently
exposed in the public API). A variation of this which only matches if the key
is in fact truly a prefix would be useful, otherwise the caller would have to
run an additional prefix.startsWith(key). For example, if the tree only
contains "foo", it's the longest prefix match for anything. But in the use-case
I have, I would only want to match "foo*".
Original issue reported on code.google.com by phraktle
on 19 Nov 2012 at 10:33
Java's default UTF-16, 2-bytes-per-character string encoding, is inefficient
for strings which otherwise could be encoded with a single byte per character.
It should be possible to represent characters in the trees using only a single
byte per character, when working with compatible strings. This may reduce
memory overhead by 50%.
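A minimal JDK-only sketch of the idea: keys whose characters all fit in a single-byte charset such as ISO-8859-1 can be stored as one byte per character instead of two. The class and method names here are illustrative, not part of the library.

```java
import java.nio.charset.StandardCharsets;

public class SingleByteDemo {
    // Encode a string as one byte per character, rejecting strings that
    // contain characters outside the single-byte range.
    static byte[] toSingleByte(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 255) {
                throw new IllegalArgumentException("not single-byte encodable: " + s);
            }
        }
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] encoded = toSingleByte("hello");
        System.out.println(encoded.length); // 5 bytes, vs 10 for a char[]
        System.out.println(new String(encoded, StandardCharsets.ISO_8859_1)); // round-trips losslessly
    }
}
```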
Original issue reported on code.google.com by [email protected]
on 20 Oct 2013 at 10:20
I have a use case where I want to prepare a large tree as a pre-processing step and save the pre-processed tree to a file, so that the trie can be created faster. Is there a mechanism to do so?
What steps will reproduce the problem?
1. Create a ConcurrentSuffixTree
2. Insert some keys
3. Attempt to retrieve all keys by using .getKeysEndingWith("")
What is the expected output? What do you see instead?
I expect an iterable with all keys; I get null.
What version of the product are you using? On what operating system?
2.4.0 on Java 1.8.0_20
Please provide any additional information below.
.getKeysStartingWith("") and .getKeysEndingWith("") return all keys for
ConcurrentRadixTree and ConcurrentReversedRadixTree respectively.
Original issue reported on code.google.com by [email protected]
on 25 Oct 2014 at 9:30
The current implementation is not serializable. If we load a huge amount of
data each time when starting, this may limit the usage. However, if it is
serializable, we can load it once and serialize the entire tree onto the disk.
During the start-up time, we only have to de-serialize it to load the whole
tree quickly.
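While the tree itself is not Serializable, one workaround is to persist just the key/value entries and rebuild the tree on startup by replaying put() calls. The sketch below uses a plain Map as a stand-in for the tree; the class and helper names are illustrative.

```java
import java.io.*;
import java.util.*;

public class TreeSnapshotDemo {
    // Persist the entries (not the tree structure) with Java serialization.
    static void save(Map<String, Integer> entries, File file) {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(new HashMap<>(entries));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Load the entries back; the caller would then replay put() into a fresh tree.
    @SuppressWarnings("unchecked")
    static Map<String, Integer> load(File file) {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            return (Map<String, Integer>) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    static File tempFile() {
        try {
            return File.createTempFile("tree", ".bin");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> entries = new HashMap<>();
        entries.put("foo", 1);
        File f = tempFile();
        save(entries, f);
        System.out.println(load(f).get("foo")); // prints 1
    }
}
```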
Original issue reported on code.google.com by [email protected]
on 25 Mar 2013 at 2:29
Why not implement the trie directly in a byte array?
java.lang.IllegalStateException: Unexpected failure to classify
SearchResult: SearchResult{key=/storage/emulated/0,
nodeFound=Node{edge=/storage/emulated/0, value=0, edges=[]},
charsMatched=19, charsMatchedInNodeFound=19, parentNode=Node{edge=,
value=null, edges=[Node{edge=/storage/emulated/0, value=0, edges=[]},
Node{edge=0, value=3, edges=[]}, Node{edge=1, value=4, edges=[]},
Node{edge=2, value=5, edges=[]}, Node{edge=3, value=6, edges=[]},
Node{edge=4, value=7, edges=[]}, Node{edge=5, value=1, edges=[]},
Node{edge=6, value=2, edges=[]}]}, parentNodesParent=null,
classification=null}
at com.googlecode.concurrenttrees.radix.ConcurrentRadixTree$SearchResult.classify(ConcurrentRadixTree.java:989)
at com.googlecode.concurrenttrees.radix.ConcurrentRadixTree$SearchResult.<init>(ConcurrentRadixTree.java:969)
at com.googlecode.concurrenttrees.radix.ConcurrentRadixTree.searchTree(ConcurrentRadixTree.java:932)
at com.googlecode.concurrenttrees.radix.ConcurrentRadixTree.putInternal(ConcurrentRadixTree.java:456)
at com.googlecode.concurrenttrees.radix.ConcurrentRadixTree.put(ConcurrentRadixTree.java:83)
at com.googlecode.concurrenttrees.radixinverted.ConcurrentInvertedRadixTree.put(ConcurrentInvertedRadixTree.java:185)
at
I was wondering if it would be possible to implement an optimized initial load of a ConcurrentRadixTree, where multiple mappings are supplied up front (maybe using a builder pattern).
I suppose the design with "mostly" immutable nodes would make this kind of difficult, as there is basically no way around the patching for every put(). Or is there? Only the locking could be skipped, but that probably doesn't make much of a difference.
Currently I just make sure to call put() in lexicographic order of the keys. I assume that should help a bit.
Regarding the patching I have a question. The documentation states that a reading thread will always see a consistent state, even with concurrent writes. I was assuming this meant reads would essentially be "statement-level consistent reads", as provided by databases like Oracle. But looking at the implementation, this doesn't seem to be quite the case, as the patching doesn't go all the way up to the root node (as it does, e.g., in Git, I think). Did I understand this correctly?
There is a bug where an element that is added in a concurrent thread can be found by getValueForExactKey but not found by a subsequent getValuesForKeysEndingWith.
The bug can be reproduced with Lincheck.
The scenario that can produce incorrect results:
Parallel part:
| put(aa, 4): null | getValueForExactKey(aa): 4 |
| | getValuesForKeysEndingWith(a): [] |
Looking at the implementation of ConcurrentRadixTree#getDescendantValues() (as used by #getValuesForKeysStartingWith()) I get the impression that the performance could be improved by not using lazyTraverseDescendants(), as it creates all these intermediate NodeKeyPair objects, while here only the Node objects are relevant.
Maybe it would even be possible / make sense to support a visitor pattern. Then it should also be possible to skip the "stack" which now has to be maintained by the Iterator. For my use case that would also work very nicely, since I in the end want to process all returned values.
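A minimal sketch of the visitor idea described above: values are pushed straight to a callback during a depth-first walk, with no intermediate pair objects and no explicit stack maintained by an Iterator (recursion uses the call stack instead). The `Node` class here is a simplified stand-in, not the library's node type.

```java
import java.util.*;
import java.util.function.Consumer;

public class VisitorDemo {
    // Simplified stand-in for a tree node: an optional value plus children.
    static class Node {
        final Integer value;                       // null for internal nodes
        final List<Node> children = new ArrayList<>();
        Node(Integer value) { this.value = value; }
    }

    // Depth-first walk that hands each value directly to the visitor,
    // avoiding intermediate allocations per visited node.
    static void visitDescendantValues(Node node, Consumer<Integer> visitor) {
        if (node.value != null) {
            visitor.accept(node.value);
        }
        for (Node child : node.children) {
            visitDescendantValues(child, visitor);
        }
    }

    public static void main(String[] args) {
        Node root = new Node(null);
        Node a = new Node(1);
        root.children.add(a);
        a.children.add(new Node(2));
        List<Integer> out = new ArrayList<>();
        visitDescendantValues(root, out::add);
        System.out.println(out); // prints [1, 2]
    }
}
```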
I'd like to (be able to) set custom NodeCharacterProvider for locale sensitive
Character comparison. I use RadixTree to provide autocompletion. Right now,
RadixTree#getKeysStartingWith(CharSequence) returns keys in "not so useful"
order.
NodeCharacterComparator, the only available comparator implementation, is used only 10 times in the whole source code. It should be relatively easy to substitute it with something configurable.
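The kind of locale-sensitive ordering being asked for can be sketched with the JDK's own java.text.Collator, which compares strings by locale rules rather than raw char values. This is only an illustration of the desired comparison behaviour, not a tree integration.

```java
import java.text.Collator;
import java.util.*;

public class LocaleSortDemo {
    // Sort keys with a locale-aware Collator, so that e.g. accented
    // characters sort next to their base letters instead of after 'z'.
    static List<String> sortWithCollator(List<String> keys, Locale locale) {
        List<String> copy = new ArrayList<>(keys);
        copy.sort(Collator.getInstance(locale));
        return copy;
    }

    public static void main(String[] args) {
        List<String> sorted = sortWithCollator(
                Arrays.asList("zebra", "éclair", "apple"), Locale.FRENCH);
        // Raw char comparison would put "éclair" last, since 'é' > 'z'.
        System.out.println(sorted); // prints [apple, éclair, zebra]
    }
}
```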
Original issue reported on code.google.com by [email protected]
on 19 Jun 2015 at 9:44
Hi Niall,
I've started to basically port your code to C# initially without any support for thread-safety nor consistency for concurrent modifications. I'd like to know if you have any objections since I'm also planning to open-source the code. Any thoughts or comments?
Thanks,
Rodrigo Souza
Hi.
It would be nice to have an API to get the keys and values in the tree, similar to Map.keySet() and Map.values().
These APIs would be helpful for iteration and traversal.
Thanks.
First, thank you for the implementation. It is very capable of handling large data sets with efficient running time. I tried a concurrent suffix tree to store around 4 million strings, each around 50 characters long, using only a single thread. However, it consumes around 50 GB of memory. Is there any way to modify the code to make the package use less memory?
Deploy to Maven Central per:
https://docs.sonatype.org/display/Repository/Sonatype+OSS+Maven+Repository+Usage+Guide
Original issue reported on code.google.com by [email protected]
on 4 Jul 2012 at 2:46
Hi,
I have looked everywhere in your repository but could not find any information about the license you are using. Could you specify it in the readme.md? Thanks!
Hello, unless I am confused about what to expect, I believe
getKeyValuePairsForKeysPrefixing() is not returning the correct information.
In fact, I believe it is returning "keys" that aren't even in the tree. It is
returning the values at a node, but not the full key that was stored.
What steps will reproduce the problem?
1. Run the attached TreeTest class.
Here is the output I am getting (using the most recent jar downloaded
yesterday):
**** Constructing new tree
Added key/value pair: /a/b/ -> 1
Added key/value pair: /a/blob/ -> 2
Added key/value pair: /a/blog/ -> 3
○
└── ○ /a/b
├── ○ / (1)
└── ○ lo
├── ○ b/ (2)
└── ○ g/ (3)
Keys prefixing /: {/, 1}
Keys prefixing /a/: {/, 1}
Keys prefixing /a/b/: {/a/b/, 1}
Keys prefixing /a/bl/:
Keys prefixing /a/blo/:
Keys prefixing /a/blob/: {/a/blob/, 2}
Keys prefixing /a/blog/: {/a/blog/, 3}
**** Constructing new tree
Added key/value pair: /a/b -> 1
Added key/value pair: /a/blob -> 2
Added key/value pair: /a/blog -> 3
○
└── ○ /a/b (1)
└── ○ lo
├── ○ b (2)
└── ○ g (3)
Keys prefixing /:
Keys prefixing /a:
Keys prefixing /a/b: {/a/b, 1}
Keys prefixing /a/bl: {/a/b, 1}
Keys prefixing /a/blo: {/a/b, 1}
Keys prefixing /a/blob: {/a/b, 1} {/a/blob, 2}
Keys prefixing /a/blog: {/a/b, 1} {/a/blog, 3}
It looks to me like the tree structure is correct, but
getKeyValuePairsForKeysPrefixing() is returning the incorrect key/value pairs
for several values. For example, with the first tree in the example above:
Keys prefixing /: {/, 1} <- No key "/" stored; this is the node for /a/b/
Keys prefixing /a/: {/, 1} <- Ditto; no key / was stored
I am using concurrent-trees-2.1.0.jar on Fedora 17.
Original issue reported on code.google.com by [email protected]
on 5 Oct 2013 at 7:29
Attachments:
It would be useful to support wildcard queries.
Two approaches to be investigated (both of which will be tracked in this issue):
(1) A permuterm index on top of the ConcurrentRadixTree. This would support
queries such as "<prefix>*<suffix>" on a single tree. It may be more memory
efficient than a hash-dictionary approach. See:
http://nlp.stanford.edu/IR-book/html/htmledition/permuterm-indexes-1.html
(2) A composite of a ConcurrentRadixTree and a ConcurrentReversedRadixTree. One
tree would support prefix lookup, the other suffix lookup. Query
"prefix*suffix" may return the intersection of the results from both trees,
after some post-filtering. This second approach however, is near the territory
of a query engine on top of multiple indexes, so if implemented would not
belong in this project, but in http://code.google.com/p/cqengine/
Example usage for (1) would be:
public static void main(String[] args) {
    PermutermTree<Integer> tree = new ConcurrentPermutermTree<Integer>(new DefaultCharArrayNodeFactory());
    tree.put("TEST", 1);
    tree.put("TOAST", 2);
    tree.put("TEAM", 3);
    System.out.println("Keys matching 'T*T': " + Iterables.toString(tree.getKeysMatching("T", "T"))); // prefix, suffix
}
Output would be:
Keys matching 'T*T': [TOAST, TEST]
Original issue reported on code.google.com by [email protected]
on 24 Mar 2013 at 10:19
The way that it's implemented right now isn't very GC friendly. It seems like it would be trivial to actually implement binary search directly.
For my particular tree I have run some performance measurements indicating that around 15% of the time constructing a tree is spent in NodeUtil#ensureNoDuplicateEdges(). I think that is a rather steep price to be paying.
AFAICT it shouldn't be possible to violate this constraint in the first place (the ConcurrentRadixTree implementation looks correct). So if this is to help people implementing their own implementation, maybe there could be some debug flag to guard this? Alternatives would be: Speed it up (e.g. for edge cases like size() <= 2) or move it into the node constructor, where once the children are sorted it would only be a matter of looping over all children and finding any two adjacent identical ones. Should be cheaper.
Thoughts?
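The "cheaper check" suggested above can be sketched in plain Java: once outgoing edges are sorted by their first character, a duplicate first character can only occur in an adjacent pair, so a single linear pass suffices instead of building a set. The class and method names are illustrative, not the library's.

```java
import java.util.*;

public class DuplicateEdgeCheckDemo {
    // Given edges already sorted by first character, detect duplicates by
    // comparing each adjacent pair: O(n) with no extra allocation.
    static boolean hasDuplicateFirstChars(List<String> sortedEdges) {
        for (int i = 1; i < sortedEdges.size(); i++) {
            if (sortedEdges.get(i - 1).charAt(0) == sortedEdges.get(i).charAt(0)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> edges = new ArrayList<>(Arrays.asList("apple", "banana", "avocado"));
        edges.sort(Comparator.comparingInt((String s) -> s.charAt(0)));
        System.out.println(hasDuplicateFirstChars(edges)); // prints true ("apple"/"avocado")
    }
}
```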
Currently:
Iterables.count(myTree.getValuesForClosestKeys(""))
can be used to count the number of keys/values in the radix tree.
This ticket is to add a size() method to the trees, to simplify this; it may also be more efficient than calculating the size as above.
Note that calculating the size of a radix tree is an expensive operation with O(n) time complexity. However, the method may be useful for debugging purposes.
Original issue reported on code.google.com by [email protected]
on 3 Dec 2013 at 10:45
In my opinion a very useful method would be a boolean contains() method, which
checks whether a query is contained in the tree. This would be similar to
getKeysStartingWith(query), but would stop at the first path matching the
query and return true.
For example:
tree.put("TEST", 1);
tree.put("TOAST", 2);
tree.put("TEAM", 3);
tree.contains("TO") -> returns true.
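The short-circuit behaviour being requested can be sketched with a plain collection: stop at the first key that starts with the query instead of collecting all matches. The linear scan and the helper name are illustrative only; the tree would answer this in a single traversal of the query.

```java
import java.util.*;

public class ContainsPrefixDemo {
    // Return true as soon as any key starting with the query is found,
    // instead of materializing the full result set.
    static boolean containsKeyStartingWith(Collection<String> keys, String query) {
        for (String key : keys) {
            if (key.startsWith(query)) {
                return true; // short-circuit on first match
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("TEST", "TOAST", "TEAM");
        System.out.println(containsKeyStartingWith(keys, "TO")); // prints true
    }
}
```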
Original issue reported on code.google.com by [email protected]
on 25 Sep 2014 at 1:34
Expose an API to scan the input for keys stored in the tree which are prefixes
of the input.
See discussion in forum:
https://groups.google.com/forum/#!topic/concurrent-trees-discuss/_IpLEzNDFWs
Example: tree contains keys 123, 1234568, 1234569
Input: 12345690
API would return keys 123, 1234569.
This could be used for processing phone numbers.
This could be calculated in a single scan through the input, thus finding keys
which are prefixes of the input very quickly. This functionality is a subset of
InvertedRadixTree.getKeysContainedIn, and can use the same traversal algorithm.
Unit test demonstrating desired functionality:
@Test
public void testGetKeysPrefixing() throws Exception {
    ConcurrentInvertedRadixTree<Integer> tree = new ConcurrentInvertedRadixTree<Integer>(nodeFactory);
    tree.put("1234567", 1);
    tree.put("1234568", 2);
    tree.put("123", 3);
    // ○
    // └── ○ 123 (3)
    //     └── ○ 456
    //         ├── ○ 7 (1)
    //         └── ○ 8 (2)
    assertEquals("[123, 1234567]", Iterables.toString(tree.getKeysPrefixing("1234567")));
    assertEquals("[123, 1234567]", Iterables.toString(tree.getKeysPrefixing("12345670")));
    assertEquals("[123, 1234568]", Iterables.toString(tree.getKeysPrefixing("1234568")));
    assertEquals("[123, 1234568]", Iterables.toString(tree.getKeysPrefixing("12345680")));
    assertEquals("[123]", Iterables.toString(tree.getKeysPrefixing("1234569")));
    assertEquals("[123]", Iterables.toString(tree.getKeysPrefixing("123456")));
    assertEquals("[123]", Iterables.toString(tree.getKeysPrefixing("123")));
    assertEquals("[]", Iterables.toString(tree.getKeysPrefixing("12")));
    assertEquals("[]", Iterables.toString(tree.getKeysPrefixing("")));
}
Original issue reported on code.google.com by [email protected]
on 7 Aug 2013 at 9:03
This request may also apply to other NodeFactory implementations, but so far I have only been using the ByteArrayNodeFactory and CharArrayNodeFactory implementations.
In my use case I build a tree for 1.5M+ mappings, resulting in a RadixTree with 1.9M+ nodes. Of these nodes, more than 33% have an incoming edge containing only a single character. Wrapping that single byte (or char) in an array incurs a rather big overhead: 4 bytes for the reference + 16 bytes for the array (12 bytes of object overhead, 1 byte of payload, 3 bytes of padding). So a total of 20 bytes vs. a single byte (or two bytes for CharArrayNodeFactory). Depending on the padding required for the concrete node implementation, I suppose one could save between 16 and 20 bytes of memory when dealing with nodes with a single character in the incoming edge. For my particular tree that makes quite a difference.
Since I can supply my own NodeFactory implementation, I can of course extend the mentioned implementations accordingly. I was just wondering if this improvement may be of general interest.
Is there a way to save the index to disk or to a different storage?
If the whole suffix tree is too big, is there a way to load a part of it dynamically?
Right now, when we search for keys, we get ALL the keys that match the term.
It would be great to limit the result. A solution would be to return an iterator instead of a collection.
Also, if the first element could be the exact match if it exists that would be great!
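The "limit via iterator" idea can be sketched in plain Java: consume results lazily and stop after a cap, so the producer never has to materialize the full match set. The helper name is illustrative only.

```java
import java.util.*;

public class LimitResultsDemo {
    // Truncate a lazy result stream after `limit` elements; the iterator
    // is never advanced past what the caller actually needs.
    static <T> List<T> firstN(Iterator<T> results, int limit) {
        List<T> out = new ArrayList<>();
        while (results.hasNext() && out.size() < limit) {
            out.add(results.next());
        }
        return out;
    }

    public static void main(String[] args) {
        Iterator<String> matches = Arrays.asList("team", "test", "toast").iterator();
        System.out.println(firstN(matches, 2)); // prints [team, test]
    }
}
```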
There are a few classes and interfaces which declare methods with Character parameters or return types. Examples are NodeCharacterProvider#getIncomingEdgeFirstCharacter(), Node#getOutgoingEdge(), and NodeUtil#binarySearchForEdge().
But it seems like the character value always originates as a char, typically from a call to CharSequence#charAt().
Particularly Node#getOutgoingEdge() is called a lot. So there would at least be a little bit to gain by declaring the parameters and return types as char instead, as the JVM wouldn't always have to autobox any more.
I am however a bit unclear about the backwards compatibility requirements in this project. In the previous release there were methods added to RadixTree, which will have broken any client projects providing their own implementation (not subclassing ConcurrentRadixTree), even though there was only an increment in the minor version. I assume that is because most clients will only "use" the APIs and should thus remain compatible.
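The boxing cost being described can be shown with a minimal JDK-only example: a Character-typed parameter forces boxing at every call site, while a char parameter stays primitive throughout. The class and method names are illustrative, not the library's API.

```java
public class AutoboxDemo {
    // Character-typed parameter: the char literal at the call site is boxed
    // via Character.valueOf(), and unboxed again for the comparison.
    static boolean firstCharEqualsBoxed(CharSequence s, Character c) {
        return s.charAt(0) == c;
    }

    // char-typed parameter: no boxing anywhere, purely primitive comparison.
    static boolean firstCharEquals(CharSequence s, char c) {
        return s.charAt(0) == c;
    }

    public static void main(String[] args) {
        System.out.println(firstCharEqualsBoxed("abc", 'a')); // true, but boxes 'a'
        System.out.println(firstCharEquals("abc", 'a'));      // true, no boxing
    }
}
```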
Hi,
I'm seeing a need for an InvertedSuffixTree. The scenario is where I have a list of rule suffix strings that I would insert into an InvertedSuffixTree, and then I receive input on a string where I'd like to know whether that input string ends in any of the rule suffixes.
I was wondering if not having an InvertedSuffixTree is because there's an alternative approach to this problem, or just that it's an uncommon problem that could be implemented some day?
Thanks!
03-30 12:31:41.152 3068 9801 E AndroidRuntime: java.lang.IllegalStateException: Unexpected failure to classify SearchResult: SearchResult{key=537742223652646, nodeFound=Node{edge=23652646, value=[29], edges=[]}, charsMatched=15, charsMatchedInNodeFound=8, parentNode=Node{edge=422, value=null, edges=[Node{edge=23652646, value=[29], edges=[]}, Node{edge=78335, value=[26], edges=[]}]}, parentNodesParent=Node{edge=77, value=null, edges=[Node{edge=334262, value=[10], edges=[]}, Node{edge=422, value=null, edges=[Node{edge=23652646, value=[29], edges=[]}, Node{edge=78335, value=[26], edges=[]}]}]}, classification=null}
03-30 12:31:41.152 3068 9801 E AndroidRuntime: at com.googlecode.concurrenttrees.radix.ConcurrentRadixTree$SearchResult.classify(ConcurrentRadixTree.java:989)
03-30 12:31:41.152 3068 9801 E AndroidRuntime: at com.googlecode.concurrenttrees.radix.ConcurrentRadixTree$SearchResult.<init>(ConcurrentRadixTree.java:969)
03-30 12:31:41.152 3068 9801 E AndroidRuntime: at com.googlecode.concurrenttrees.radix.ConcurrentRadixTree.searchTree(ConcurrentRadixTree.java:932)
03-30 12:31:41.152 3068 9801 E AndroidRuntime: at com.googlecode.concurrenttrees.radix.ConcurrentRadixTree.getValueForExactKey(ConcurrentRadixTree.java:102)
I'm afraid I can't supply a standalone test case since we're building the tree from a contacts database to facilitate T9 searches. All updates are made on the same thread (reads occur on another thread). Under what circumstances can this exception be thrown? Judging from the classification code and the search result string, it should be returning Classification.EXACT_MATCH.
It'd be nice to have a clean way to prune a tree of all nodes, similar to Map.clear().
Otherwise it is not clear to the API user whether it is enough/OK to simply set the tree reference to null.