npgall / concurrent-trees
Concurrent Radix and Suffix Trees for Java
License: Apache License 2.0
The current documentation does not mention these methods in InvertedRadixTree:
getLongestKeyPrefixing(CharSequence var1)
getValueForLongestKeyPrefixing(CharSequence var1)
getKeyValuePairForLongestKeyPrefixing(CharSequence var1)
I had a use case for creating simple URL wildcard matching, and didn't realise these methods existed until after digging deep into the Javadoc.
I'm happy to create a pull request for this if it seems useful.
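For reference, a minimal plain-Java sketch of what "longest key prefixing" means for the URL-matching use case: among the stored keys, find the longest one that is a prefix of the input. The real tree answers this in a single traversal of the input; the linear scan and the `longestKeyPrefixing` helper name here are purely illustrative, not part of the library.

```java
import java.util.*;

public class LongestPrefixDemo {
    // Among the stored keys, return the longest one that is a prefix of the
    // input, or null if no stored key prefixes the input.
    static String longestKeyPrefixing(Collection<String> keys, String input) {
        String best = null;
        for (String key : keys) {
            if (input.startsWith(key) && (best == null || key.length() > best.length())) {
                best = key;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> routes = Arrays.asList("/a", "/a/b", "/c");
        System.out.println(longestKeyPrefixing(routes, "/a/b/c")); // prints /a/b
        System.out.println(longestKeyPrefixing(routes, "/x"));     // prints null
    }
}
```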
Can I restrict tree nodes to only string tokens from a given set, e.g. {"apple", "orange", "world wide web", "how are you", "how", "world"}?
I want to use this library. How can I add and import it into my project?
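The library is published to Maven Central, so a dependency declaration along these lines should work (the version shown is an assumption; check Maven Central for the current release):

```xml
<dependency>
    <groupId>com.googlecode.concurrent-trees</groupId>
    <artifactId>concurrent-trees</artifactId>
    <version>2.6.1</version> <!-- verify the latest version on Maven Central -->
</dependency>
```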
Please add support for querying the longest prefix match (this is not currently
exposed in the public API). A variation of this which only matches if the key
is in fact truly a prefix would be useful, otherwise the caller would have to
run an additional prefix.startsWith(key). For example, if the tree only
contains "foo", it's the longest prefix match for anything. But in the use-case
I have, I would only want to match "foo*".
Original issue reported on code.google.com by phraktle
on 19 Nov 2012 at 10:33
Java's default UTF-16, 2-bytes-per-character string encoding, is inefficient
for strings which otherwise could be encoded with a single byte per character.
It should be possible to represent characters in the trees using only a single
byte per character, when working with compatible strings. This may reduce
memory overhead by 50%.
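A minimal JDK-only sketch of the idea: keys whose characters all fit in a single-byte charset such as ISO-8859-1 can be stored as one byte per character instead of two. The class and method names here are illustrative, not part of the library.

```java
import java.nio.charset.StandardCharsets;

public class SingleByteDemo {
    // Encode a string as one byte per character, rejecting strings that
    // contain characters outside the single-byte range.
    static byte[] toSingleByte(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 255) {
                throw new IllegalArgumentException("not single-byte encodable: " + s);
            }
        }
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] encoded = toSingleByte("hello");
        System.out.println(encoded.length); // 5 bytes, vs 10 for a char[]
        System.out.println(new String(encoded, StandardCharsets.ISO_8859_1)); // round-trips losslessly
    }
}
```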
Original issue reported on code.google.com by [email protected]
on 20 Oct 2013 at 10:20
I have a use case where I want to prepare a large tree as a pre-processing step and save the pre-processed tree to a file, so that the trie can be created faster. Is there a mechanism to do so?
What steps will reproduce the problem?
1. Create a ConcurrentSuffixTree
2. Insert some keys
3. Attempt to retrieve all keys by using .getKeysEndingWith("")
What is the expected output? What do you see instead?
I expect an iterable with all keys; I get null.
What version of the product are you using? On what operating system?
2.4.0 on Java 1.8.0_20
Please provide any additional information below.
.getKeysStartingWith("") and .getKeysEndingWith("") return all keys for
ConcurrentRadixTree and ConcurrentReversedRadixTree respectively.
Original issue reported on code.google.com by [email protected]
on 25 Oct 2014 at 9:30
The current implementation is not serializable. If we load a huge amount of
data each time when starting, this may limit the usage. However, if it is
serializable, we can load it once and serialize the entire tree onto the disk.
During the start-up time, we only have to de-serialize it to load the whole
tree quickly.
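While the tree itself is not Serializable, one workaround is to persist just the key/value entries and rebuild the tree on startup by replaying put() calls. The sketch below uses a plain Map as a stand-in for the tree; the class and helper names are illustrative.

```java
import java.io.*;
import java.util.*;

public class TreeSnapshotDemo {
    // Persist the entries (not the tree structure) with Java serialization.
    static void save(Map<String, Integer> entries, File file) {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(new HashMap<>(entries));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Load the entries back; the caller would then replay put() into a fresh tree.
    @SuppressWarnings("unchecked")
    static Map<String, Integer> load(File file) {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            return (Map<String, Integer>) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    static File tempFile() {
        try {
            return File.createTempFile("tree", ".bin");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> entries = new HashMap<>();
        entries.put("foo", 1);
        File f = tempFile();
        save(entries, f);
        System.out.println(load(f).get("foo")); // prints 1
    }
}
```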
Original issue reported on code.google.com by [email protected]
on 25 Mar 2013 at 2:29
Why not implement the trie directly in a byte array?
java.lang.IllegalStateException: Unexpected failure to classify
SearchResult: SearchResult{key=/storage/emulated/0,
nodeFound=Node{edge=/storage/emulated/0, value=0, edges=[]},
charsMatched=19, charsMatchedInNodeFound=19, parentNode=Node{edge=,
value=null, edges=[Node{edge=/storage/emulated/0, value=0, edges=[]},
Node{edge=0, value=3, edges=[]}, Node{edge=1, value=4, edges=[]},
Node{edge=2, value=5, edges=[]}, Node{edge=3, value=6, edges=[]},
Node{edge=4, value=7, edges=[]}, Node{edge=5, value=1, edges=[]},
Node{edge=6, value=2, edges=[]}]}, parentNodesParent=null,
classification=null}
at com.googlecode.concurrenttrees.radix.ConcurrentRadixTree$SearchResult.classify(ConcurrentRadixTree.java:989)
at com.googlecode.concurrenttrees.radix.ConcurrentRadixTree$SearchResult.<init>(ConcurrentRadixTree.java:969)
at com.googlecode.concurrenttrees.radix.ConcurrentRadixTree.searchTree(ConcurrentRadixTree.java:932)
at com.googlecode.concurrenttrees.radix.ConcurrentRadixTree.putInternal(ConcurrentRadixTree.java:456)
at com.googlecode.concurrenttrees.radix.ConcurrentRadixTree.put(ConcurrentRadixTree.java:83)
at com.googlecode.concurrenttrees.radixinverted.ConcurrentInvertedRadixTree.put(ConcurrentInvertedRadixTree.java:185)
at
I was wondering if it would be possible to implement an optimized initial load of a ConcurrentRadixTree, where multiple mappings are supplied up front (maybe using a builder pattern).
I suppose the design with "mostly" immutable nodes would make this kind of difficult, as there is basically no way around the patching for every put(). Or is there? Only the locking could be skipped, but that probably doesn't make much of a difference.
Currently I just make sure to call put() in lexicographic order of the keys. I assume that should help a bit.
Regarding the patching I have a question. The documentation states that a reading thread will always see a consistent state, even with concurrent writes. I was assuming this meant reads would essentially be "statement-level consistent reads", as provided by databases like Oracle. But looking at the implementation, this doesn't seem to be quite the case, as the patching doesn't go all the way up to the root node (as it does, e.g., in Git, I think). Did I understand this correctly?
There is a bug where an element that is added in a concurrent thread can be found by getValueForExactKey but not found by a subsequent getValuesForKeysEndingWith.
The bug can be reproduced with Lincheck.
The scenario that can produce incorrect results:
Parallel part:
| put(aa, 4): null | getValueForExactKey(aa): 4 |
| | getValuesForKeysEndingWith(a): [] |
Looking at the implementation of ConcurrentRadixTree#getDescendantValues() (as used by #getValuesForKeysStartingWith()) I get the impression that the performance could be improved by not using lazyTraverseDescendants(), as it creates all these intermediate NodeKeyPair objects, while here only the Node objects are relevant.
Maybe it would even be possible / make sense to support a visitor pattern. Then it should also be possible to skip the "stack" which now has to be maintained by the Iterator. For my use case that would also work very nicely, since I in the end want to process all returned values.
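A minimal sketch of the visitor idea described above: values are pushed straight to a callback during a depth-first walk, with no intermediate pair objects and no explicit stack maintained by an Iterator (recursion uses the call stack instead). The `Node` class here is a simplified stand-in, not the library's node type.

```java
import java.util.*;
import java.util.function.Consumer;

public class VisitorDemo {
    // Simplified stand-in for a tree node: an optional value plus children.
    static class Node {
        final Integer value;                       // null for internal nodes
        final List<Node> children = new ArrayList<>();
        Node(Integer value) { this.value = value; }
    }

    // Depth-first walk that hands each value directly to the visitor,
    // avoiding intermediate allocations per visited node.
    static void visitDescendantValues(Node node, Consumer<Integer> visitor) {
        if (node.value != null) {
            visitor.accept(node.value);
        }
        for (Node child : node.children) {
            visitDescendantValues(child, visitor);
        }
    }

    public static void main(String[] args) {
        Node root = new Node(null);
        Node a = new Node(1);
        root.children.add(a);
        a.children.add(new Node(2));
        List<Integer> out = new ArrayList<>();
        visitDescendantValues(root, out::add);
        System.out.println(out); // prints [1, 2]
    }
}
```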
I'd like to (be able to) set custom NodeCharacterProvider for locale sensitive
Character comparison. I use RadixTree to provide autocompletion. Right now,
RadixTree#getKeysStartingWith(CharSequence) returns keys in "not so useful"
order.
NodeCharacterComparator, the only available comparator implementation, is used only 10 times in the whole source code. It should be relatively easy to substitute it with something configurable.
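The kind of locale-sensitive ordering being asked for can be sketched with the JDK's own java.text.Collator, which compares strings by locale rules rather than raw char values. This is only an illustration of the desired comparison behaviour, not a tree integration.

```java
import java.text.Collator;
import java.util.*;

public class LocaleSortDemo {
    // Sort keys with a locale-aware Collator, so that e.g. accented
    // characters sort next to their base letters instead of after 'z'.
    static List<String> sortWithCollator(List<String> keys, Locale locale) {
        List<String> copy = new ArrayList<>(keys);
        copy.sort(Collator.getInstance(locale));
        return copy;
    }

    public static void main(String[] args) {
        List<String> sorted = sortWithCollator(
                Arrays.asList("zebra", "éclair", "apple"), Locale.FRENCH);
        // Raw char comparison would put "éclair" last, since 'é' > 'z'.
        System.out.println(sorted); // prints [apple, éclair, zebra]
    }
}
```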
Original issue reported on code.google.com by [email protected]
on 19 Jun 2015 at 9:44
Hi Niall,
I've started to basically port your code to C# initially without any support for thread-safety nor consistency for concurrent modifications. I'd like to know if you have any objections since I'm also planning to open-source the code. Any thoughts or comments?
Thanks,
Rodrigo Souza
Hi.
It would be nice to have an API to get the keys and values in the tree, similar to Map.keySet() and Map.values().
These APIs would be helpful for iteration and traversal.
Thanks.
First, thank you for the implementation. It is very capable of handling large data sets with efficient running time. I tried a concurrent suffix tree to store around 4 million strings, each around 50 characters long, using only a single thread. However, it consumes around 50 GB of memory. Is there any way to modify the code to make the package use less memory?
Deploy to Maven Central per:
https://docs.sonatype.org/display/Repository/Sonatype+OSS+Maven+Repository+Usage+Guide
Original issue reported on code.google.com by [email protected]
on 4 Jul 2012 at 2:46
Hi,
I have looked everywhere in your repository but could not find any information about the license you are using. Could you specify it in the readme.md? Thanks!
Hello, unless I am confused about what to expect, I believe
getKeyValuePairsForKeysPrefixing() is not returning the correct information.
In fact, I believe it is returning "keys" that aren't even in the tree. It is
returning the values at a node, but not the full key that was stored.
What steps will reproduce the problem?
1. Run the attached TreeTest class.
Here is the output I am getting (using the most recent jar downloaded
yesterday):
**** Constructing new tree
Added key/value pair: /a/b/ -> 1
Added key/value pair: /a/blob/ -> 2
Added key/value pair: /a/blog/ -> 3
○
└── ○ /a/b
├── ○ / (1)
└── ○ lo
├── ○ b/ (2)
└── ○ g/ (3)
Keys prefixing /: {/, 1}
Keys prefixing /a/: {/, 1}
Keys prefixing /a/b/: {/a/b/, 1}
Keys prefixing /a/bl/:
Keys prefixing /a/blo/:
Keys prefixing /a/blob/: {/a/blob/, 2}
Keys prefixing /a/blog/: {/a/blog/, 3}
**** Constructing new tree
Added key/value pair: /a/b -> 1
Added key/value pair: /a/blob -> 2
Added key/value pair: /a/blog -> 3
○
└── ○ /a/b (1)
└── ○ lo
├── ○ b (2)
└── ○ g (3)
Keys prefixing /:
Keys prefixing /a:
Keys prefixing /a/b: {/a/b, 1}
Keys prefixing /a/bl: {/a/b, 1}
Keys prefixing /a/blo: {/a/b, 1}
Keys prefixing /a/blob: {/a/b, 1} {/a/blob, 2}
Keys prefixing /a/blog: {/a/b, 1} {/a/blog, 3}
It looks to me like the tree structure is correct, but
getKeyValuePairsForKeysPrefixing() is returning the incorrect key/value pairs
for several values. For example, with the first tree in the example above:
Keys prefixing /: {/, 1} <- No key "/" stored; this is the node for /a/b/
Keys prefixing /a/: {/, 1} <- Ditto; no key / was stored
I am using concurrent-trees-2.1.0.jar on Fedora 17.
Original issue reported on code.google.com by [email protected]
on 5 Oct 2013 at 7:29
Attachments:
It would be useful to support wildcard queries.
Two approaches to be investigated (both of which will be tracked in this issue):
(1) A permuterm index on top of the ConcurrentRadixTree. This would support
queries such as "<prefix>*<suffix>" on a single tree. It may be more memory
efficient than a hash-dictionary approach. See:
http://nlp.stanford.edu/IR-book/html/htmledition/permuterm-indexes-1.html
(2) A composite of a ConcurrentRadixTree and a ConcurrentReversedRadixTree. One
tree would support prefix lookup, the other suffix lookup. Query
"prefix*suffix" may return the intersection of the results from both trees,
after some post-filtering. This second approach however, is near the territory
of a query engine on top of multiple indexes, so if implemented would not
belong in this project, but in http://code.google.com/p/cqengine/
Example usage for (1) would be:
public static void main(String[] args) {
    PermutermTree<Integer> tree = new ConcurrentPermutermTree<Integer>(new DefaultCharArrayNodeFactory());
    tree.put("TEST", 1);
    tree.put("TOAST", 2);
    tree.put("TEAM", 3);
    System.out.println("Keys matching 'T*T': " + Iterables.toString(tree.getKeysMatching("T", "T"))); // prefix, suffix
}
Output would be:
Keys matching 'T*T': [TOAST, TEST]
Original issue reported on code.google.com by [email protected]
on 24 Mar 2013 at 10:19
The way that it's implemented right now isn't very GC friendly. It seems like it would be trivial to actually implement binary search directly.
For my particular tree I have run some performance measurements indicating that around 15% of the time constructing a tree is spent in NodeUtil#ensureNoDuplicateEdges(). I think that is a rather steep price to be paying.
AFAICT it shouldn't be possible to violate this constraint in the first place (the ConcurrentRadixTree implementation looks correct). So if this is to help people implementing their own implementation, maybe there could be some debug flag to guard this? Alternatives would be: Speed it up (e.g. for edge cases like size() <= 2) or move it into the node constructor, where once the children are sorted it would only be a matter of looping over all children and finding any two adjacent identical ones. Should be cheaper.
Thoughts?
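The "cheaper check" suggested above can be sketched in plain Java: once outgoing edges are sorted by their first character, a duplicate first character can only occur in an adjacent pair, so a single linear pass suffices instead of building a set. The class and method names are illustrative, not the library's.

```java
import java.util.*;

public class DuplicateEdgeCheckDemo {
    // Given edges already sorted by first character, detect duplicates by
    // comparing each adjacent pair: O(n) with no extra allocation.
    static boolean hasDuplicateFirstChars(List<String> sortedEdges) {
        for (int i = 1; i < sortedEdges.size(); i++) {
            if (sortedEdges.get(i - 1).charAt(0) == sortedEdges.get(i).charAt(0)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> edges = new ArrayList<>(Arrays.asList("apple", "banana", "avocado"));
        edges.sort(Comparator.comparingInt((String s) -> s.charAt(0)));
        System.out.println(hasDuplicateFirstChars(edges)); // prints true ("apple"/"avocado")
    }
}
```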
Currently:
Iterables.count(myTree.getValuesForClosestKeys(""))
can be used to count the number of keys/values in the radix tree.
This ticket is to add a size() method to the trees, to simplify this; it may also be more efficient than calculating the size as above.
Note that calculating the size of a radix tree is an expensive operation with O(n) time complexity. However, the method may be useful for debugging purposes.
Original issue reported on code.google.com by [email protected]
on 3 Dec 2013 at 10:45
In my opinion a very useful method would be a boolean contains() method, which
checks whether a query is contained in the tree. This would be similar to
getKeysStartingWith(query), but would stop at the first path matching the
query and return true.
For example:
tree.put("TEST", 1);
tree.put("TOAST", 2);
tree.put("TEAM", 3);
tree.contains("TO") -> returns true.
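The short-circuit behaviour being requested can be sketched with a plain collection: stop at the first key that starts with the query instead of collecting all matches. The linear scan and the helper name are illustrative only; the tree would answer this in a single traversal of the query.

```java
import java.util.*;

public class ContainsPrefixDemo {
    // Return true as soon as any key starting with the query is found,
    // instead of materializing the full result set.
    static boolean containsKeyStartingWith(Collection<String> keys, String query) {
        for (String key : keys) {
            if (key.startsWith(query)) {
                return true; // short-circuit on first match
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("TEST", "TOAST", "TEAM");
        System.out.println(containsKeyStartingWith(keys, "TO")); // prints true
    }
}
```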
Original issue reported on code.google.com by [email protected]
on 25 Sep 2014 at 1:34
Expose an API to scan the input for keys stored in the tree which are prefixes
of the input.
See discussion in forum:
https://groups.google.com/forum/#!topic/concurrent-trees-discuss/_IpLEzNDFWs
Example: tree contains keys 123, 1234568, 1234569
Input: 12345690
API would return keys 123, 1234569.
This could be used for processing phone numbers.
This could be calculated in a single scan through the input, thus finding keys
which are prefixes of the input very quickly. This functionality is a subset of
InvertedRadixTree.getKeysContainedIn, and can use the same traversal algorithm.
Unit test demonstrating desired functionality:
@Test
public void testGetKeysPrefixing() throws Exception {
    ConcurrentInvertedRadixTree<Integer> tree = new ConcurrentInvertedRadixTree<Integer>(nodeFactory);
    tree.put("1234567", 1);
    tree.put("1234568", 2);
    tree.put("123", 3);
    // ○
    // └── ○ 123 (3)
    //     └── ○ 456
    //         ├── ○ 7 (1)
    //         └── ○ 8 (2)
    assertEquals("[123, 1234567]", Iterables.toString(tree.getKeysPrefixing("1234567")));
    assertEquals("[123, 1234567]", Iterables.toString(tree.getKeysPrefixing("12345670")));
    assertEquals("[123, 1234568]", Iterables.toString(tree.getKeysPrefixing("1234568")));
    assertEquals("[123, 1234568]", Iterables.toString(tree.getKeysPrefixing("12345680")));
    assertEquals("[123]", Iterables.toString(tree.getKeysPrefixing("1234569")));
    assertEquals("[123]", Iterables.toString(tree.getKeysPrefixing("123456")));
    assertEquals("[123]", Iterables.toString(tree.getKeysPrefixing("123")));
    assertEquals("[]", Iterables.toString(tree.getKeysPrefixing("12")));
    assertEquals("[]", Iterables.toString(tree.getKeysPrefixing("")));
}
Original issue reported on code.google.com by [email protected]
on 7 Aug 2013 at 9:03
This request may also apply to other NodeFactory implementations, but so far I have only been using the ByteArrayNodeFactory and CharArrayNodeFactory implementations.
In my use case I build a tree for 1.5M+ mappings, resulting in a RadixTree with 1.9M+ nodes. Of these nodes, more than 33% have an incoming edge containing only a single character. Wrapping that single byte (or char) in an array incurs a rather big overhead: 4 bytes for the reference + 16 bytes for the array (12 bytes of object overhead, 1 byte of payload, 3 bytes of padding). So a total of 20 bytes vs. a single byte (or two bytes for CharArrayNodeFactory). Depending on the padding required for the concrete node implementation, I suppose one could save between 16 and 20 bytes of memory when dealing with nodes with a single character in the incoming edge. For my particular tree that makes quite a difference.
Since I can supply my own NodeFactory implementation, I can of course extend the mentioned implementations accordingly. I was just wondering if this improvement may be of general interest.
Is there a way to save the index to disk or to a different storage?
If the whole suffix tree is too big, is there a way to load a part of it dynamically?
Right now, when we search for keys, we get ALL the keys that match the term.
It would be great to limit the result. A solution would be to return an iterator instead of a collection.
Also, if the first element could be the exact match if it exists that would be great!
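The "limit via iterator" idea can be sketched in plain Java: consume results lazily and stop after a cap, so the producer never has to materialize the full match set. The helper name is illustrative only.

```java
import java.util.*;

public class LimitResultsDemo {
    // Truncate a lazy result stream after `limit` elements; the iterator
    // is never advanced past what the caller actually needs.
    static <T> List<T> firstN(Iterator<T> results, int limit) {
        List<T> out = new ArrayList<>();
        while (results.hasNext() && out.size() < limit) {
            out.add(results.next());
        }
        return out;
    }

    public static void main(String[] args) {
        Iterator<String> matches = Arrays.asList("team", "test", "toast").iterator();
        System.out.println(firstN(matches, 2)); // prints [team, test]
    }
}
```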
There are a few classes and interfaces which declare methods with Character parameters or return types. Examples are NodeCharacterProvider#getIncomingEdgeFirstCharacter(), Node#getOutgoingEdge(), and NodeUtil#binarySearchForEdge().
But it seems like the character value always originates as a char, typically from a call to CharSequence#charAt().
Particularly Node#getOutgoingEdge() is called a lot. So there would at least be a little bit to gain by declaring the parameters and return types as char instead, as the JVM wouldn't always have to autobox any more.
I am however a bit unclear about the backwards compatibility requirements in this project. In the previous release there were methods added to RadixTree, which will have broken any client projects providing their own implementation (not subclassing ConcurrentRadixTree), even though there was only an increment in the minor version. I assume that is because most clients will only "use" the APIs and should thus remain compatible.
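The boxing cost being described can be shown with a minimal JDK-only example: a Character-typed parameter forces boxing at every call site, while a char parameter stays primitive throughout. The class and method names are illustrative, not the library's API.

```java
public class AutoboxDemo {
    // Character-typed parameter: the char literal at the call site is boxed
    // via Character.valueOf(), and unboxed again for the comparison.
    static boolean firstCharEqualsBoxed(CharSequence s, Character c) {
        return s.charAt(0) == c;
    }

    // char-typed parameter: no boxing anywhere, purely primitive comparison.
    static boolean firstCharEquals(CharSequence s, char c) {
        return s.charAt(0) == c;
    }

    public static void main(String[] args) {
        System.out.println(firstCharEqualsBoxed("abc", 'a')); // true, but boxes 'a'
        System.out.println(firstCharEquals("abc", 'a'));      // true, no boxing
    }
}
```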
Hi,
I'm seeing a need for an InvertedSuffixTree. The scenario is where I have a list of rule suffix strings that I would insert into an InvertedSuffixTree, and then I receive input on a string where I'd like to know whether that input string ends in any of the rule suffixes.
I was wondering if not having an InvertedSuffixTree is because there's an alternative approach to this problem, or just that it's an uncommon problem that could be implemented some day?
Thanks!
03-30 12:31:41.152 3068 9801 E AndroidRuntime: java.lang.IllegalStateException: Unexpected failure to classify SearchResult: SearchResult{key=537742223652646, nodeFound=Node{edge=23652646, value=[29], edges=[]}, charsMatched=15, charsMatchedInNodeFound=8, parentNode=Node{edge=422, value=null, edges=[Node{edge=23652646, value=[29], edges=[]}, Node{edge=78335, value=[26], edges=[]}]}, parentNodesParent=Node{edge=77, value=null, edges=[Node{edge=334262, value=[10], edges=[]}, Node{edge=422, value=null, edges=[Node{edge=23652646, value=[29], edges=[]}, Node{edge=78335, value=[26], edges=[]}]}]}, classification=null}
03-30 12:31:41.152 3068 9801 E AndroidRuntime: at com.googlecode.concurrenttrees.radix.ConcurrentRadixTree$SearchResult.classify(ConcurrentRadixTree.java:989)
03-30 12:31:41.152 3068 9801 E AndroidRuntime: at com.googlecode.concurrenttrees.radix.ConcurrentRadixTree$SearchResult.<init>(ConcurrentRadixTree.java:969)
03-30 12:31:41.152 3068 9801 E AndroidRuntime: at com.googlecode.concurrenttrees.radix.ConcurrentRadixTree.searchTree(ConcurrentRadixTree.java:932)
03-30 12:31:41.152 3068 9801 E AndroidRuntime: at com.googlecode.concurrenttrees.radix.ConcurrentRadixTree.getValueForExactKey(ConcurrentRadixTree.java:102)
I'm afraid I can't supply a standalone test case since we're building the tree from a contacts database to facilitate T9 searches. All updates are made on the same thread (reads occur on another thread). Under what circumstances can this exception be thrown? Judging from the classification code and the search result string, it should be returning Classification.EXACT_MATCH.
It'd be nice to have a clean way to prune a tree of all nodes, similar to Map.clear().
Otherwise it is not clear to the API user whether it is enough/OK to simply set the tree reference to null.