Git Product home page Git Product logo

Comments (13)

keithchew avatar keithchew commented on June 5, 2024

Hmm, my apologies, I actually have 170K more entries in the dataset which has totalSold below 100000. I will prepare a more concrete test case, please hold.

from redisearch.

keithchew avatar keithchew commented on June 5, 2024

OK, I have a better test case below:

In bash, create 200,000 records:

for i in {1..200000}
do
redis-cli hset namespace:test$i address test$i totalSold $i
done

Create index:

FT.CREATE testidx ON hash PREFIX 1 'namespace:' SCHEMA address TAG totalSold numeric SORTABLE

Then, execute the queries:

ft.profile testidx search query "(-@address:{test200000}) @totalSold:[200000 200000]" SORTBY totalSold asc WITHCOUNT LIMIT 0 5 RETURN 2 totalSold address
ft.profile testidx search query "(-@address:{test200000}) @totalSold:[200000 200000]" SORTBY totalSold asc WITHOUTCOUNT LIMIT 0 5 RETURN 2 totalSold address

First one will return 0 items (since address is negated), but the second optimized query will return the record.

from redisearch.

keithchew avatar keithchew commented on June 5, 2024

If it helps, the query:

ft.profile testidx search query "(-@address:{test200000}) @totalSold:[199999 200000]" SORTBY totalSold asc WITHOUTCOUNT LIMIT 0 5 RETURN 2 totalSold address

returns record 199999 and excluded 200000 as expected. Perhaps this bug is only triggered when the optimizer has to SkipTo the a record which is part of the negation, in which case it accepts the record incorrectly as a result.

from redisearch.

keithchew avatar keithchew commented on June 5, 2024

Hmm, not all the time. If I drop and re-create the index, [199999 200000] still includes record 200000, so it is dependent on where the document sits in the numeric index. Will keep testing and post here if I find anything else useful to fix the bug.

from redisearch.

keithchew avatar keithchew commented on June 5, 2024

OK, I have dug deeper into the code, and NOT iterator's NI_SkipTo() function returns INDEXREAD_NOTFOUND if it is not a match. But the loop in OPT_Read() does not actually check the results, and treats it as good if the docid's match, ie:

...
      if (childRes->docId == numericRes->docId) {
...

As a quick test, I updated the logic to be:

...
      if (childRes->docId == numericRes->docId && rc1 == INDEXREAD_OK) {
...

and this fixes the issue. Will keep testing to see this breaks other test cases which passed perviously.

from redisearch.

keithchew avatar keithchew commented on June 5, 2024

The fix above is not quite right. NOT iterators return INDEXREAD_NOTFOUND when the next record does not match, but for other iterators, it points to the next valid record when INDEXREAD_NOTFOUND is returned.

To cater for this, the fix should be:

      bool readOk = child->type == NOT_ITERATOR ? rc1 == INDEXREAD_OK : true;
      if (childRes->docId == numericRes->docId && readOk) {

I am not sure what happens if the iterator is an intersect or union and the NOT iterator is further down. Will need to keep testing.

from redisearch.

keithchew avatar keithchew commented on June 5, 2024

Tested with NOT clause within a UNION (below), and it breaks the fix above... So, not too sure what is a general robust way to handle NOTs in the query:

ft.profile testidx search query "((-@address:{test200000}) | @address:{test199999}) @totalSold:[199999 200000]" SORTBY totalSold asc WITHOUTCOUNT LIMIT 0 5 RETURN 2 totalSold address

Will keep playing around with it, but if anyone has some pointers to this bug, please let me know.

from redisearch.

keithchew avatar keithchew commented on June 5, 2024

I have abandoned the above fix, and found a simpler one, but with greater risks as it is further down at the iterator level. In index.c, we can see the NotInterator's skip to function NI_SkipTo():

  // If the child docId is the one we are looking for, it's an anti match!
  if (childId == docId) {
    nc->base.current->docId = docId;
    nc->lastDocId = docId;
    *hit = nc->base.current;
    return INDEXREAD_NOTFOUND;
  }

Since the the Optimizer ignores INDEXREAD_NOTFOUND and INDEXREAD_OK and only reads the docId, I updated the above to:

  // If the child docId is the one we are looking for, it's an anti match!
  if (childId == docId) {
    nc->base.current->docId = docId + 1; <---- Added + 1 here
    nc->lastDocId = docId + 1; <---- Added + 1 here
    *hit = nc->base.current;
    return INDEXREAD_NOTFOUND;
  }

Surprisingly, this has passed all the Optimizer test cases that failed before! I would like to ask if there is test suite in RediSearch which I can run to see the effect of the change above? And if so, how do I run it?

from redisearch.

meiravgri avatar meiravgri commented on June 5, 2024

Hi @keithchew
Thank you for bringing this to our attention! I have confirmed that there is indeed a bug.
We are aware of it and are looking into potential solutions.
As a temporary workaround, I recommend using the non-optimized version.

from redisearch.

keithchew avatar keithchew commented on June 5, 2024

Hi @meiravgri

Thank you for your confirmation of the bug. I have a few updates. The docId + 1 fix above broke some test cases in redisearch's "make pytest", so that fix is not good, I thought it was too deep in anyway.

I am currently testing the original fix above, but applying it in a few more places. In OPT_Read(), UI_SkipToHigh(), UI_SkipTo() and II_SkipTo(), I am checking if the type is NOT and only setting the minId on INDEXREAD_OK, ie:

      bool readOk = it->type == NOT_ITERATOR ? rc == INDEXREAD_OK : true;
      if (res && readOk) {
        it->minId = res->docId;
      }

So far, it seems to have passed all the test cases, but I am not comfortable with it as it touches quite a number of critical places in the code. Please advise here when a PR is ready, so I can help test the official fix and remove my changes.

Thanks again.

from redisearch.

keithchew avatar keithchew commented on June 5, 2024

Hi @meiravgri

Just wanted to note here another test case (can probably be classified as another bug since it is further down at the index level). In index.c for NI_SkipTo(), it says:

  // If the child is ahead of the skipto id, it means the child doesn't have this id.
  // So we are okay!
  if (childId > docId || !IITER_HAS_NEXT(nc->child)) {
    goto ok;
  }

This logic might apply to other iterators, but for the NOT iterator, because it can either be INDEXREAD_OK or INDEXREAD_NOTFOUND if the childId is > dockId (you can verify this by by reading the code further down the snipper above). Here is a case where the Optimizer's OPT_Read() is requesting these SkipTos to the NOT iterator:

10:M 04 Feb 2024 23:33:03.842 # <module> rc1 [2] SkipTo [2169]
10:M 04 Feb 2024 23:33:03.842 # <module> rc1 [2] SkipTo [2175]
10:M 04 Feb 2024 23:33:03.842 # <module> rc1 [1] SkipTo [2169]

You can see above the first call's rc value is INDEXREAD_NOTFOUND but the third call with the same skip to docId returns INDEXREAD_OK, which is incorrect.

from redisearch.

keithchew avatar keithchew commented on June 5, 2024

Hi @meiravgri

Do you have any ideas/suggestions on how to solve this. I had another good look at the code, and this appears to be non-trivial to fix. Without the fix, it is too dangerous to use, eg we can have a query like "-@type:{critical}" to purge non critical records, but the optimizer can return critical records that will be accidentally wiped out! And without using the optimizer, there can be a huge performance hit (up to 10x to 100x) depending on the dataset.

Since I have had a good look into the code, let me know if you need me to try out some ideas/suggestions, or even contribute to the fix.

from redisearch.

meiravgri avatar meiravgri commented on June 5, 2024

@keithchew fixing this bug is on our roadmap.
If you have specific suggestions or ideas, please feel free to open a PR with your proposed fix.
Thanks!

from redisearch.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.