armon / libart Goto Github PK
View Code? Open in Web Editor NEWAdaptive Radix Trees implemented in C
License: Other
Adaptive Radix Trees implemented in C
License: Other
When trying to add a key which is a full prefix of another key the following valgrind warning is issued:
==2159== Conditional jump or move depends on uninitialised value(s)
==2159== at 0x4015A8: add_child4 (art.c:420)
==2159== by 0x401CBD: recursive_insert (art.c:519)
==2159== by 0x401CBD: art_insert (art.c:583)
==2159== by 0x400842: append (test.c:36)
==2159== by 0x400842: main (test.c:73)
==2159==
==2159== Conditional jump or move depends on uninitialised value(s)
==2159== at 0x4010EA: find_child (art.c:138)
==2159== by 0x40193A: art_search (art.c:249)
==2159== by 0x400852: append (test.c:38)
==2159== by 0x400852: main (test.c:73)
==2159==
While that action is not allowed by the API (according to issue #12), it would be preferable to fail instead of triggering undefined behaviour.
The program triggering that behaviour is at:
http://paste.fedoraproject.org/281806/42043514
I am actually in the geosciences using geohashes -
https://en.wikipedia.org/wiki/Geohash which are 64 bit integers. I
would like to store these geohashes in memory and then do range scans.
I want to be able to do range queries with geohash ranges and these
should be very fast.
As an example if I have latitude and longitude = 61.5,N 172 E
and another latitude and longitude = 61.4. 172.1 E both of these will
geohash to similar values with a similar prefix. We can calculate the
prefix with an lower and upper bound and retrieve all the geohashes in
that range. Is it possible with your implementation of ART ?
I do not see an API for range queries.
Hi,
I'm currently thinking about using a Java based ART implementation as an in-memory index for my Open Source temporal data store (https://sirix.io), which uses an ART like trie for storing paged variable sized document index on flash drives. I'm however also using Bitmap nodes as in hash array mapped tries or Judy arrays.
Now, besides a super cache unfriendly AVL tree as a secondary in memory data structure I'm considering using ART or HOT. However, I don't understand why performance for 95% range scans and 5% inserts is worse in comparison with a TreeMap, which is a red/black binary tree implementation:
https://github.com/rohansuri/adaptive-radix-tree
You'll find the performance comparison diagram when you scroll down. Maybe you have an idea...
It would be useful to be able to serialize my ART to a file and grab it from file again without expanding to a full (k,v) list. My current data collection process is somewhat expensive and I read directly into ART, but every time I want to iterate and do some new analysis on the ART I have to re-collect data, and that sucks.
I'd prefer to just serialize the ART itself instead of the uncompressed (k,v) list. Anyway, just a thought.
So the 8.1 bytes used are besides the key's 4 bytes length?
With a specific data set, the art_iter_prefix find a full_match and goes deeper of the key_len.
// Bail if the prefix does not match
if (n->partial_len) {
[...]
// if there is a full match, go deeper
depth = depth + n->partial_len;
}
[...]
child = find_child(n, key[depth]); //KABOOUM when key_len is less than the depth
Simply change? :
// If we've matched the prefix, iterate on this node
if (prefix_len && ((depth + prefix_len) == key_len)) {
return recursive_iter(n, cb, data);
// If mismatch or no match, search is terminated
} else if(prefix_len < n->partial_len) {
return 0;
}
Not 100% sure is the optimal solution.
#include <stdio.h>
#include "art.h"
void main()
{
unsigned char* key = (unsigned char*)"abcd";
art_tree tree;
init_art_tree(&tree);
art_insert(&tree, key, 3, (void*)0x5678);
printf("%p\n", art_search(&tree, key, 3));
art_insert(&tree, key, 2, (void*)0x1234);
printf("%p\n", art_search(&tree, key, 2));
destroy_art_tree(&tree);
}
$ gcc test.c art.c -std=c99 -o test
$ ./test
I was expecting
0x5678
0x1234
But got
0x5678
(nil)
We have an application where we store lots of URIs and map them to integer IDs. Since almost all URIs start with the same prefix, we decided to use the ART trie to do this mapping. Unfortunately, I realised that ART still keeps a complete copy of the key in its leaf nodes - wouldn't it be possible to dynamically reconstruct keys by traversing the trie from the root when needed (e.g. for the iteration)? In our case this would safe tremendous amounts of main memory.
Hi ! Your current structure:
typedef struct {
uint8_t type;
uint8_t num_children;
uint32_t partial_len;
unsigned char partial[MAX_PREFIX_LEN];
} art_node;
is 20-byte sized (if MAX_PREFIX_LEN = 10) because of implicit padding for aligning addresses of fields. So, why not ordering fields by their type size ? Like:
typedef struct {
uint32_t partial_len;
unsigned char partial[MAX_PREFIX_LEN];
uint8_t type;
uint8_t num_children;
} art_node;
is now 16 bytes. You earned 4 bytes (so 25%). As consequence, the structure
art_node256
will reduce its size from 2072 to 2064 (earned 8 bytes), as well for art_node48
(664 bytes --> 656) and art_node16
(168 bytes --> 160). This can be nice for micro-controllers with little memory space.
Note 01: My architecture is an AMD 64 bytes. So addresses are 8 bytes (art_node*) and I did not check with a 32-bits archi (but I have an old computer).
With art_node4
you are always loosing 4 bytes (even with my structure) for 64-bits archi. Its 'ideal' size is 52 bytes but 52 bytes / 8 bytes = 6.5 rounding up to 7 and therefore 7 * 8 = 56 bytes. For 32-bits archi it seems ok with my structure !
Note 02: I have not made unit tests with my changes.
Note 03: I'm not a specialist of padding.
You can unit tests like:
printf("%lu\n", sizeof (art_node));
assert((sizeof(uint32_t) + 2 * sizeof(uint8_t) + MAX_PREFIX_LEN * sizeof(unsigned char)) == sizeof (art_node));
Would you like to add more error handling for return values from functions like the following?
art_size() is defined as an inline
function in art.h
. However, the inline
keyword was not introduced to C until C99.
Suggest one of (in descending order of preference):
inline
keyword, define symbol in art.c
if not compiling in c99 modeart_size()
with a macro (ART_SIZE
?)Minor nitpicking as usual :-).
Hi,
This library looks very interesting. Since I do not want to reconstruct the tree everytime I load my program, I was wondering if there is a way I can go about serializing the tree once I build it. Is there a way to do this efficiently?
Thanks!
Hi @armon,
I bother you again with another question. Looking your examples (tc), I could see that you insert only '\0' terminated keys. I was wondering if it's necessary since you pass the key length to every art operation. I did a simple test, and sometimes it fails and sometimes don't. It never fails if keys are '\0' terminated.
I wan't to know because my keys are Erlang binaries that aren't '\0' terminated. This forces me to copy the key just to add the '\0' at the end.
Here is a small test. As you'll see this test fails with key1 and pass with key2.
START_TEST(test_art_insert_non0)
{
art_tree t;
int res = init_art_tree(&t);
fail_unless(res == 0);
unsigned char key1[1] = {1};
unsigned char key2[2] = {1, 2};
fail_unless(NULL == art_insert(&t, key1, 1, (void*)key1));
fail_unless(NULL == art_insert(&t, key2, 2, (void*)key2));
fail_unless(art_size(&t) == 2);
fail_unless(NULL != art_search(&t, key2, 2));
fail_unless(NULL != art_search(&t, key1, 1));
res = destroy_art_tree(&t);
fail_unless(res == 0);
}
END_TEST
Unfortunately we weren't able to reproduce the crash locally, but it happens on arm devices in
art.c:578 -
add_child4(new_node, ref, l->key[depth+longest_prefix], SET_LEAF(l));
As far as I can tell, it's the access to l->key[depth+longest_prefix]
The assembly produced:
art.c:578 633e6: 4435 add r5, r6 633e8: 9803 ldr r0, [sp, #12] 633ea: f044 0301 orr.w r3, r4, #1 633ee: 4659 mov r1, fp 633f0: 4428 add r0, r5 *633f2: 7a02 ldrb r2, [r0, #8]
I don't understand how it can happen, but it happens. Any ideas?
The case NODE48, is not matching the recursive_iter(), so it leave some of the leaf not traversed in it.
The function prefix_mismatch on partial longer than MAX_PREFIX_LEN is incorrect. If the key search is the lowest key of the next node, a match occur but shouldn't. So a lot more of results are selected.
In art_iter_prefix, we need to cap the prefix_len found to the partial_len of the current node. Only one line is impacted:
prefix_len = prefix_mismatch(n, key, key_len, depth);
if(prefix_len > n->partial_len) prefix_len = n->partial_len;
In the following example, I insert two keys and do a search, I expect get right value, but it will raising error.
#include "art.h"
#include "assert.h"
#include "stdio.h"
void test(){
art_tree tree;
art_tree_init(&tree);
unsigned char key1[1] = {0};
unsigned char key2[2] = {0, 1};
int value1 = 1, value2 = 2;
art_insert(&tree, key1, 1, &value1);
art_insert(&tree, key2, 2, &value2);
int* value_ptr = (int*)art_search(&tree, key1, 1);
assert(value_ptr != NULL);
assert(*value_ptr == value1);
}
int main(){
test();
}
I find if we insert a string whose prefix is a existed key, the previous key may be missed due to following code in art.c:580 :
add_child4(new_node, ref, l->key[depth+longest_prefix], SET_LEAF(l));
add_child4(new_node, ref, l2->key[depth+longest_prefix], SET_LEAF(l2));
depth + longest_prefix overflows in my example
Hi!
I am trying to use this implementation of ART for 64-bit unsigned integer keys.
Since the interface accept unsigned char array, I would like to confirm that integer keys needs to have their byte-ordering swap in little-endian machine?
i.e. instead of
uint64_t key = 123456789
void* val_ptr = art_search(&t, (unsigned char*) &key, sizeof(uint64_t));
I should do
uint64_t key = 123456789
uint64_t reversed_key = __builtin_bswap64(key);
void* val_ptr = art_search(&t, (unsigned char*) &reversed_key, sizeof(uint64_t));
Thank you!
Hi,
Is there any way to serialize the tree ahead of time ?
If not, what would be the best approach for this ?
My C is a bit rusty, but I'd love to help on this.
Hi @armon,
I was wrapping libart to use it in Erlang (btw excellent pice of work!). Everything went fine until I started playing with long binaries as keys in Erlang getting the same key inserted twice.
I've done the same test in C to find out if it was a mistake in my wrapper, and the error persisted. I don´t know if I'm doing something wrong...
Here's de C unit test, it fails because key2 gets inserted twice so the size assert fails. If you can have a look I'll we appreciate.
Best regards.
START_TEST(test_art_insert2)
{
art_tree t;
int res = init_art_tree(&t);
fail_unless(res == 0);
char key1[300] = {16,0,0,0,7,10,0,0,0,2,17,10,0,0,0,120,10,0,0,0,120,10,0,
0,0,216,10,0,0,0,202,10,0,0,0,194,10,0,0,0,224,10,0,0,0,
230,10,0,0,0,210,10,0,0,0,206,10,0,0,0,208,10,0,0,0,232,
10,0,0,0,124,10,0,0,0,124,2,16,0,0,0,2,12,185,89,44,213,
251,173,202,211,95,185,89,110,118,251,173,202,199,101,0,
8,18,182,92,236,147,171,101,150,195,112,185,218,108,246,
139,164,234,195,58,177,0,8,16,0,0,0,2,12,185,89,44,213,
251,173,202,211,95,185,89,110,118,251,173,202,199,101,0,
8,18,180,93,46,151,9,212,190,95,102,178,217,44,178,235,
29,190,218,8,16,0,0,0,2,12,185,89,44,213,251,173,202,
211,95,185,89,110,118,251,173,202,199,101,0,8,18,180,93,
46,151,9,212,190,95,102,183,219,229,214,59,125,182,71,
108,180,220,238,150,91,117,150,201,84,183,128,8,16,0,0,
0,2,12,185,89,44,213,251,173,202,211,95,185,89,110,118,
251,173,202,199,101,0,8,18,180,93,46,151,9,212,190,95,
108,176,217,47,50,219,61,134,207,97,151,88,237,246,208,
8,18,255,255,255,219,191,198,134,5,223,212,72,44,208,
250,180,14,1,0,0,8, '\0'};
char key2[303] = {16,0,0,0,7,10,0,0,0,2,17,10,0,0,0,120,10,0,0,0,120,10,0,
0,0,216,10,0,0,0,202,10,0,0,0,194,10,0,0,0,224,10,0,0,0,
230,10,0,0,0,210,10,0,0,0,206,10,0,0,0,208,10,0,0,0,232,
10,0,0,0,124,10,0,0,0,124,2,16,0,0,0,2,12,185,89,44,213,
251,173,202,211,95,185,89,110,118,251,173,202,199,101,0,
8,18,182,92,236,147,171,101,150,195,112,185,218,108,246,
139,164,234,195,58,177,0,8,16,0,0,0,2,12,185,89,44,213,
251,173,202,211,95,185,89,110,118,251,173,202,199,101,0,
8,18,180,93,46,151,9,212,190,95,102,178,217,44,178,235,
29,190,218,8,16,0,0,0,2,12,185,89,44,213,251,173,202,
211,95,185,89,110,118,251,173,202,199,101,0,8,18,180,93,
46,151,9,212,190,95,102,183,219,229,214,59,125,182,71,
108,180,220,238,150,91,117,150,201,84,183,128,8,16,0,0,
0,3,12,185,89,44,213,251,133,178,195,105,183,87,237,150,
155,165,150,229,97,182,0,8,18,161,91,239,50,10,61,150,
223,114,179,217,64,8,12,186,219,172,150,91,53,166,221,
101,178,0,8,18,255,255,255,219,191,198,134,5,208,212,72,
44,208,250,180,14,1,0,0,8, '\0'};
fail_unless(NULL == art_insert(&t, key1, 299, (void*)key1));
fail_unless(NULL == art_insert(&t, key2, 302, (void*)key2));
art_insert(&t, key2, 302, (void*)key2);
fail_unless(art_size(&t) == 2);
res = destroy_art_tree(&t);
fail_unless(res == 0);
}
END_TEST
This is the current definition of art_insert()
:
/**
* Inserts a new value into the ART tree
* @arg t The tree
* @arg key The key
* @arg key_len The length of the key
* @arg value Opaque value.
* @return NULL if the item was newly inserted, otherwise
* the old value pointer is returned.
*/
void* art_insert(art_tree *t, const unsigned char *key, int key_len, void *value);
This is a really odd C return pattern. I don't like the idea that this could never fail. I'm sure there are some failure conditions and should return a return code. This function should tell me in the return code if something failed or if a key already exists.
Alternatively there could be a art_get_or_insert()
function with a void**
return argument for times you want to insert if it doesn't exist.
Also, why is there no set()
function? What if I want to overwrite a value?
test_art.c is missing at least string.h
(strlen(3)
).
Hi,
cmp = _mm_cmplt_epi8(_mm_set1_epi8(c),
_mm_loadu_si128((__m128i*)n->keys));
isn't it comparing them as signed bytes?
I found http://www.alfredklomp.com/programming/sse-intrinsics/ with nice tips how to do unsigned compares - I hope it helps :) (or maybe I'm missing something?)
nice code thanks for sharing!
Was wondering if you have considered using the low bits on each child pointer to indicate the node type. Would remove the need for a 8 bit node type. There are only 4 node types I believe, and on 64 bit systems with 8 byte alignment you can use just the lower 2 bits.
Sorry not an issue really just a question :)
Hello!
ATM I'm trying to get into the same paper you implemented here and noticed a problem:
Problem:
I think num_children should be an uint16_t since at most we can store 256 children. Thus num_children should cover the integer range 0..256 (257 numbers) which don't fit into an uint8_t.
Consider 2 key-value pairs:
"tester" => 1
"test" => 2
Inserting both of these in the radix tree, in either order, results in a tree that fails to match on lookups using key "test".
Leaf nodes are always nodes representing the right-most bytes of the key, and a leaf node is regarded as synonymous with a match. However, the node representing the superset key is not split to allow a node representing the end of the subset key (or a "string end" leaf node if the leaf == match semantics are to be preserved).
art_insert(&tree, "example", strlen("example"));
/* follow will success */
art_search(&tree, "example", strlen("example"));
/* follow will fail */
art_search(&tree, "example.com", strlen("example"));
void* art_search(const art_tree *t, const unsigned char *key, int key_len) {
...skip...
// Recursively search
child = find_child(n,
depth < key_len ? key[depth]: 0);
n = (child) ? *child : NULL;
...skip...
}
after insert "1", "2", "1\0\1", we'll get null if we search "1\0\1"。
void* reproduce_issue() {
art_tree tree;
art_tree_init(&tree);
int v1 = 1;
int v2 = 2;
int v3 = 3;
art_insert(&tree, "1", 1, &v1);
art_insert(&tree, "2", 1, &v2);
art_insert(&tree, "1\0\1", 3, &v3);
int* ret = art_search(&tree, "1\0\1", 3);
return ret;
}
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.