Git Product home page Git Product logo

libart's People

Contributors

amarince avatar armon avatar asmuth avatar cemeyer avatar jameshloving avatar mcprentiss avatar postwait avatar rishid avatar romanbsd avatar timgates42 avatar wuxb45 avatar zxjcarrot avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

libart's Issues

Adding a key which is a full prefix of another leads to valgrind warning

When trying to add a key which is a full prefix of another key the following valgrind warning is issued:
==2159== Conditional jump or move depends on uninitialised value(s)
==2159== at 0x4015A8: add_child4 (art.c:420)
==2159== by 0x401CBD: recursive_insert (art.c:519)
==2159== by 0x401CBD: art_insert (art.c:583)
==2159== by 0x400842: append (test.c:36)
==2159== by 0x400842: main (test.c:73)
==2159==
==2159== Conditional jump or move depends on uninitialised value(s)
==2159== at 0x4010EA: find_child (art.c:138)
==2159== by 0x40193A: art_search (art.c:249)
==2159== by 0x400852: append (test.c:38)
==2159== by 0x400852: main (test.c:73)
==2159==

While that action is not allowed by the API (according to issue #12), it would be preferable to fail instead of triggering undefined behaviour.

The program triggering that behaviour is at:
http://paste.fedoraproject.org/281806/42043514

Possible to do range scans with ART ?

I am actually in the geosciences using geohashes -
https://en.wikipedia.org/wiki/Geohash which are 64 bit integers. I
would like to store these geohashes in memory and then do range scans.

I want to be able to do range queries with geohash ranges and these
should be very fast.
As an example if I have latitude and longitude = 61.5,N 172 E
and another latitude and longitude = 61.4. 172.1 E both of these will
geohash to similar values with a similar prefix. We can calculate the
prefix with an lower and upper bound and retrieve all the geohashes in
that range. Is it possible with your implementation of ART ?

I do not see an API for range queries.

Range scan performance

Hi,

I'm currently thinking about using a Java based ART implementation as an in-memory index for my Open Source temporal data store (https://sirix.io), which uses an ART like trie for storing paged variable sized document index on flash drives. I'm however also using Bitmap nodes as in hash array mapped tries or Judy arrays.

Now, besides a super cache unfriendly AVL tree as a secondary in memory data structure I'm considering using ART or HOT. However, I don't understand why performance for 95% range scans and 5% inserts is worse in comparison with a TreeMap, which is a red/black binary tree implementation:

https://github.com/rohansuri/adaptive-radix-tree

You'll find the performance comparison diagram when you scroll down. Maybe you have an idea...

Enhancement Request: Serialization, deserialization

It would be useful to be able to serialize my ART to a file and grab it from file again without expanding to a full (k,v) list. My current data collection process is somewhat expensive and I read directly into ART, but every time I want to iterate and do some new analysis on the ART I have to re-collect data, and that sucks.

I'd prefer to just serialize the ART itself instead of the uncompressed (k,v) list. Anyway, just a thought.

art_iter_prefix depth goes over key_len

With a specific data set, the art_iter_prefix find a full_match and goes deeper of the key_len.

// Bail if the prefix does not match
if (n->partial_len) {
[...]
// if there is a full match, go deeper
depth = depth + n->partial_len;
}
[...]
child = find_child(n, key[depth]); //KABOOUM when key_len is less than the depth

Simply change? :

            // If we've matched the prefix, iterate on this node
            if (prefix_len && ((depth + prefix_len) == key_len)) {
                return recursive_iter(n, cb, data);

            // If mismatch or no match, search is terminated
            } else if(prefix_len < n->partial_len) {
                return 0;
            }

Not 100% sure is the optimal solution.

Is this a bug?

#include <stdio.h>
#include "art.h"

void main()
{
    unsigned char* key = (unsigned char*)"abcd";
    art_tree tree;
    init_art_tree(&tree);

    art_insert(&tree, key, 3, (void*)0x5678);
    printf("%p\n", art_search(&tree, key, 3));

    art_insert(&tree, key, 2, (void*)0x1234);
    printf("%p\n", art_search(&tree, key, 2));

    destroy_art_tree(&tree);
}

$ gcc test.c art.c -std=c99 -o test
$ ./test

I was expecting

0x5678
0x1234

But got

0x5678
(nil)

avoid storing full keys in leafs

We have an application where we store lots of URIs and map them to integer IDs. Since almost all URIs start with the same prefix, we decided to use the ART trie to do this mapping. Unfortunately, I realised that ART still keeps a complete copy of the key in its leaf nodes - wouldn't it be possible to dynamically reconstruct keys by traversing the trie from the root when needed (e.g. for the iteration)? In our case this would safe tremendous amounts of main memory.

Eliminate implicit padding in art_node struct

Hi ! Your current structure:

typedef struct {
    uint8_t type;
    uint8_t num_children;
    uint32_t partial_len;
    unsigned char partial[MAX_PREFIX_LEN];
} art_node; 

is 20-byte sized (if MAX_PREFIX_LEN = 10) because of implicit padding for aligning addresses of fields. So, why not ordering fields by their type size ? Like:

typedef struct {
  uint32_t partial_len;
  unsigned char partial[MAX_PREFIX_LEN];
  uint8_t type;
  uint8_t num_children;
} art_node;

is now 16 bytes. You earned 4 bytes (so 25%). As consequence, the structure
art_node256 will reduce its size from 2072 to 2064 (earned 8 bytes), as well for art_node48 (664 bytes --> 656) and art_node16 (168 bytes --> 160). This can be nice for micro-controllers with little memory space.

Note 01: My architecture is an AMD 64 bytes. So addresses are 8 bytes (art_node*) and I did not check with a 32-bits archi (but I have an old computer).

With art_node4 you are always loosing 4 bytes (even with my structure) for 64-bits archi. Its 'ideal' size is 52 bytes but 52 bytes / 8 bytes = 6.5 rounding up to 7 and therefore 7 * 8 = 56 bytes. For 32-bits archi it seems ok with my structure !

Note 02: I have not made unit tests with my changes.
Note 03: I'm not a specialist of padding.

You can unit tests like:

printf("%lu\n", sizeof (art_node));
assert((sizeof(uint32_t) + 2 * sizeof(uint8_t) + MAX_PREFIX_LEN * sizeof(unsigned char)) == sizeof (art_node));

Not actually ANSI C

art_size() is defined as an inline function in art.h. However, the inline keyword was not introduced to C until C99.

Suggest one of (in descending order of preference):

  • Augment README claim from "ANSI C" to "mostly ANSI C", or just drop ANSI and say "C99 (MSVC compatible)"?
  • Drop inline keyword, define symbol in art.c if not compiling in c99 mode
  • Replace art_size() with a macro (ART_SIZE?)

Minor nitpicking as usual :-).

Serializing the tree

Hi,

This library looks very interesting. Since I do not want to reconstruct the tree everytime I load my program, I was wondering if there is a way I can go about serializing the tree once I build it. Is there a way to do this efficiently?

Thanks!

Non '\0' terminating keys

Hi @armon,

I bother you again with another question. Looking your examples (tc), I could see that you insert only '\0' terminated keys. I was wondering if it's necessary since you pass the key length to every art operation. I did a simple test, and sometimes it fails and sometimes don't. It never fails if keys are '\0' terminated.

I wan't to know because my keys are Erlang binaries that aren't '\0' terminated. This forces me to copy the key just to add the '\0' at the end.

Here is a small test. As you'll see this test fails with key1 and pass with key2.

START_TEST(test_art_insert_non0)
{
    art_tree t;
    int res = init_art_tree(&t);
    fail_unless(res == 0);

    unsigned char key1[1] = {1};
    unsigned char key2[2] = {1, 2};

    fail_unless(NULL == art_insert(&t, key1, 1, (void*)key1));
    fail_unless(NULL == art_insert(&t, key2, 2, (void*)key2));
    fail_unless(art_size(&t) == 2);
    fail_unless(NULL != art_search(&t, key2, 2));
    fail_unless(NULL != art_search(&t, key1, 1));

    res = destroy_art_tree(&t);
    fail_unless(res == 0);
}
END_TEST

Crash in recursive_insert

Unfortunately we weren't able to reproduce the crash locally, but it happens on arm devices in
art.c:578 -
add_child4(new_node, ref, l->key[depth+longest_prefix], SET_LEAF(l));

As far as I can tell, it's the access to l->key[depth+longest_prefix]

The assembly produced:

art.c:578
   633e6: 4435        add r5, r6
   633e8: 9803        ldr r0, [sp, #12]
   633ea: f044 0301   orr.w r3, r4, #1
   633ee: 4659        mov r1, fp
   633f0: 4428        add r0, r5
  *633f2: 7a02        ldrb  r2, [r0, #8]

I don't understand how it can happen, but it happens. Any ideas?

prefix_mismatch not considering partial_len on longer partial

The function prefix_mismatch on partial longer than MAX_PREFIX_LEN is incorrect. If the key search is the lowest key of the next node, a match occur but shouldn't. So a lot more of results are selected.

In art_iter_prefix, we need to cap the prefix_len found to the partial_len of the current node. Only one line is impacted:
prefix_len = prefix_mismatch(n, key, key_len, depth);
if(prefix_len > n->partial_len) prefix_len = n->partial_len;

bugs for art_insert in some case

In the following example, I insert two keys and do a search, I expect get right value, but it will raising error.

#include "art.h"
#include "assert.h"
#include "stdio.h"
void test(){
  art_tree tree;
  art_tree_init(&tree);
  unsigned char key1[1] = {0};
  unsigned char key2[2] = {0, 1};
  int value1 = 1, value2 = 2;
  art_insert(&tree, key1, 1, &value1);
  art_insert(&tree, key2, 2, &value2);

  int* value_ptr = (int*)art_search(&tree, key1, 1);
  assert(value_ptr != NULL);
  assert(*value_ptr == value1);
}

int main(){
  test();
}

I find if we insert a string whose prefix is a existed key, the previous key may be missed due to following code in art.c:580 :

add_child4(new_node, ref, l->key[depth+longest_prefix], SET_LEAF(l));
add_child4(new_node, ref, l2->key[depth+longest_prefix], SET_LEAF(l2));

depth + longest_prefix overflows in my example

Integer keys need byte-ordering swap?

Hi!

I am trying to use this implementation of ART for 64-bit unsigned integer keys.
Since the interface accept unsigned char array, I would like to confirm that integer keys needs to have their byte-ordering swap in little-endian machine?

i.e. instead of

uint64_t key = 123456789
void* val_ptr = art_search(&t, (unsigned char*) &key, sizeof(uint64_t));

I should do

uint64_t key = 123456789
uint64_t reversed_key = __builtin_bswap64(key);
void* val_ptr = art_search(&t, (unsigned char*) &reversed_key, sizeof(uint64_t));

Thank you!

Serialization

Hi,
Is there any way to serialize the tree ahead of time ?
If not, what would be the best approach for this ?
My C is a bit rusty, but I'd love to help on this.

Error inserting long key

Hi @armon,

I was wrapping libart to use it in Erlang (btw excellent pice of work!). Everything went fine until I started playing with long binaries as keys in Erlang getting the same key inserted twice.

I've done the same test in C to find out if it was a mistake in my wrapper, and the error persisted. I don´t know if I'm doing something wrong...

Here's de C unit test, it fails because key2 gets inserted twice so the size assert fails. If you can have a look I'll we appreciate.

Best regards.

START_TEST(test_art_insert2)
{
    art_tree t;
    int res = init_art_tree(&t);
    fail_unless(res == 0);

    char key1[300] = {16,0,0,0,7,10,0,0,0,2,17,10,0,0,0,120,10,0,0,0,120,10,0,
      0,0,216,10,0,0,0,202,10,0,0,0,194,10,0,0,0,224,10,0,0,0,
      230,10,0,0,0,210,10,0,0,0,206,10,0,0,0,208,10,0,0,0,232,
      10,0,0,0,124,10,0,0,0,124,2,16,0,0,0,2,12,185,89,44,213,
      251,173,202,211,95,185,89,110,118,251,173,202,199,101,0,
      8,18,182,92,236,147,171,101,150,195,112,185,218,108,246,
      139,164,234,195,58,177,0,8,16,0,0,0,2,12,185,89,44,213,
      251,173,202,211,95,185,89,110,118,251,173,202,199,101,0,
      8,18,180,93,46,151,9,212,190,95,102,178,217,44,178,235,
      29,190,218,8,16,0,0,0,2,12,185,89,44,213,251,173,202,
      211,95,185,89,110,118,251,173,202,199,101,0,8,18,180,93,
      46,151,9,212,190,95,102,183,219,229,214,59,125,182,71,
      108,180,220,238,150,91,117,150,201,84,183,128,8,16,0,0,
      0,2,12,185,89,44,213,251,173,202,211,95,185,89,110,118,
      251,173,202,199,101,0,8,18,180,93,46,151,9,212,190,95,
      108,176,217,47,50,219,61,134,207,97,151,88,237,246,208,
      8,18,255,255,255,219,191,198,134,5,223,212,72,44,208,
      250,180,14,1,0,0,8, '\0'};
    char key2[303] = {16,0,0,0,7,10,0,0,0,2,17,10,0,0,0,120,10,0,0,0,120,10,0,
      0,0,216,10,0,0,0,202,10,0,0,0,194,10,0,0,0,224,10,0,0,0,
      230,10,0,0,0,210,10,0,0,0,206,10,0,0,0,208,10,0,0,0,232,
      10,0,0,0,124,10,0,0,0,124,2,16,0,0,0,2,12,185,89,44,213,
      251,173,202,211,95,185,89,110,118,251,173,202,199,101,0,
      8,18,182,92,236,147,171,101,150,195,112,185,218,108,246,
      139,164,234,195,58,177,0,8,16,0,0,0,2,12,185,89,44,213,
      251,173,202,211,95,185,89,110,118,251,173,202,199,101,0,
      8,18,180,93,46,151,9,212,190,95,102,178,217,44,178,235,
      29,190,218,8,16,0,0,0,2,12,185,89,44,213,251,173,202,
      211,95,185,89,110,118,251,173,202,199,101,0,8,18,180,93,
      46,151,9,212,190,95,102,183,219,229,214,59,125,182,71,
      108,180,220,238,150,91,117,150,201,84,183,128,8,16,0,0,
      0,3,12,185,89,44,213,251,133,178,195,105,183,87,237,150,
      155,165,150,229,97,182,0,8,18,161,91,239,50,10,61,150,
      223,114,179,217,64,8,12,186,219,172,150,91,53,166,221,
      101,178,0,8,18,255,255,255,219,191,198,134,5,208,212,72,
      44,208,250,180,14,1,0,0,8, '\0'};


    fail_unless(NULL == art_insert(&t, key1, 299, (void*)key1));
    fail_unless(NULL == art_insert(&t, key2, 302, (void*)key2));
    art_insert(&t, key2, 302, (void*)key2);
    fail_unless(art_size(&t) == 2);

    res = destroy_art_tree(&t);
    fail_unless(res == 0);
}
END_TEST

art_insert() should return a return code

This is the current definition of art_insert():

/**
 * Inserts a new value into the ART tree
 * @arg t The tree
 * @arg key The key
 * @arg key_len The length of the key
 * @arg value Opaque value.
 * @return NULL if the item was newly inserted, otherwise
 * the old value pointer is returned.
 */
void* art_insert(art_tree *t, const unsigned char *key, int key_len, void *value);

This is a really odd C return pattern. I don't like the idea that this could never fail. I'm sure there are some failure conditions and should return a return code. This function should tell me in the return code if something failed or if a key already exists.

Alternatively there could be a art_get_or_insert() function with a void** return argument for times you want to insert if it doesn't exist.

Also, why is there no set() function? What if I want to overwrite a value?

Tag pointers to avoid node type byte

nice code thanks for sharing!

Was wondering if you have considered using the low bits on each child pointer to indicate the node type. Would remove the need for a 8 bit node type. There are only 4 node types I believe, and on 64 bit systems with 8 byte alignment you can use just the lower 2 bits.

Sorry not an issue really just a question :)

Node Header: MAX_PREFIX_LEN & uint8_t num_children

Hello!

ATM I'm trying to get into the same paper you implemented here and noticed a problem:

  1. You defined the node header with a contant size of 16 Byte like the paper.
  2. The paper proposed to store a compressed path of 8 Bytes whilst your implementation defines it through MAX_PREFIX_LEN as 10 Bytes.
  3. Due to the necessity to store a compressed path of 10 Bytes & it's length (4 Bytes), you only have 2 Bytes left to store the node type & the amount of non null children (num_children).

Problem:
I think num_children should be an uint16_t since at most we can store 256 children. Thus num_children should cover the integer range 0..256 (257 numbers) which don't fit into an uint8_t.

Insertion of a key, A, that is a complete subset of another key, B, prevents A from being found

Consider 2 key-value pairs:

"tester" => 1
"test" => 2

Inserting both of these in the radix tree, in either order, results in a tree that fails to match on lookups using key "test".

Leaf nodes are always nodes representing the right-most bytes of the key, and a leaf node is regarded as synonymous with a match. However, the node representing the superset key is not split to allow a node representing the end of the subset key (or a "string end" leaf node if the leaf == match semantics are to be preserved).

[BUG] key_len not work with longer key

BUG:

art_insert(&tree, "example", strlen("example"));
/* follow will success */
art_search(&tree, "example", strlen("example"));
/* follow will fail */
art_search(&tree, "example.com", strlen("example"));

FIX:

void* art_search(const art_tree *t, const unsigned char *key, int key_len) {
...skip...
// Recursively search
child = find_child(n,depth < key_len ? key[depth]: 0);
n = (child) ? *child : NULL;
...skip...
}

search get null when str2 = str1 + '\0'

after insert "1", "2", "1\0\1", we'll get null if we search "1\0\1"。

void* reproduce_issue() {
  art_tree tree;
  art_tree_init(&tree);
  int v1 = 1;
  int v2 = 2;
  int v3 = 3;
  
  art_insert(&tree, "1", 1, &v1);
  art_insert(&tree, "2", 1, &v2);
  art_insert(&tree, "1\0\1", 3, &v3);
  int* ret = art_search(&tree, "1\0\1", 3);

  return ret;
}

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.