Comments (28)

nppoly commented on August 21, 2024

It writes the AC's size into the file. When initializing the AC from a buffer, it checks whether the size of the buffer is smaller than the AC's size; if it is, it throws this exception. Maybe you can print ac.buff_size() and the file's size. Then I can debug it.
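
The two numbers requested here can be gathered with a small helper. The following is only a sketch: `check_saved_size` is a hypothetical name, not part of cyac, and `expected_size` is assumed to be the value returned by `ac.buff_size()`.

```python
import os


def check_saved_size(path, expected_size):
    """Return (file_size, expected_size, ok) for a saved file.

    `expected_size` is meant to be the value of ac.buff_size();
    this helper is hypothetical and not part of cyac.
    """
    file_size = os.path.getsize(path)
    # Loading is expected to fail when the buffer is smaller than the size
    # recorded for the automaton, so flag that case explicitly.
    return file_size, expected_size, file_size >= expected_size
```

It could be called as `check_saved_size("ac1", ac.buff_size())` right after `ac.save("ac1")`.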

from cyac.

nppoly commented on August 21, 2024

Why don't you save all the ACs first? It seems your code has a bug.

OblackatO commented on August 21, 2024

> It writes the AC's size into the file. When initializing the AC from a buffer, it checks whether the size of the buffer is smaller than the AC's size; if it is, it throws this exception. Maybe you can print ac.buff_size() and the file's size. Then I can debug it.

Yes, I understood this. Before posting the issue, I even changed the code where the exception is raised in my fork of this library.

> Why don't you save all the ACs first? It seems your code has a bug.

I wish it were just that, ahaha :p It should change nothing as long as the ACs are properly saved to disk. Anyway, I tried your suggestion and it did not work.

Here is a specific practical example where the exception will occur.

import json
import mmap
from cyac import AC
from multiprocessing import Process


words_file = "static_ioc_sample_10k.json"
fields_dict = {
    0 : ["domain", "email"],
    1 : ["domain", "email", "sha1", "md5", "sha2_256"],
    2 : ["domain"]
}
ac_file_names = ["ac1", "ac2", "ac3"]

total_ac=3
total_processes_per_ac = 4
for x in range(0,total_ac):
    with open(words_file, "r", encoding="utf-8") as words_file_buf:
        words = json.load(words_file_buf)
    ac_patterns = list()
    for key in words:
        if key in fields_dict[x]:
            for ele in words[key]:
                ac_patterns.append(ele)
    ac = AC.build(ac_patterns)
    ac.save(ac_file_names[x])
    print("ac_name:{} -- buff_size:{}Megabytes".format(ac_file_names[x], ac.buff_size()/(1024*1024)))

# Function for child processes
def search_something(ac_file_name):
    with open(ac_file_name, "r+b") as bf:
        buff_object = mmap.mmap(bf.fileno(), 0)
        automaton = AC.from_buff(buff_object, copy=False)

for ac_name in ac_file_names:
    for x in range(0, total_processes_per_ac):
        p = Process(
          target=search_something,
          args=(ac_name,)
        )
        p.start()

I am doing what you suggested: I create all the ACs first and only then load them. I am pretty sure you can easily understand the code, but let me know if something is not clear. Here is the json file used on the code. It is just random data; it has no meaning or value.
The size of the AC is printed on each iteration. You can easily compare it with the size of the created files using ls -lh.

nppoly commented on August 21, 2024

I found your bug. Please put your code into a function:

def init_ac():
    pass

if __name__ == '__main__':
    init_ac()
    for ac_name in ac_file_names:
        pass

Otherwise every process will try to write the files.

OblackatO commented on August 21, 2024

The exception is still raised. Updated code:

import json
import mmap
from cyac import AC
from multiprocessing import Process


def init_ac():
    total_ac=3
    for x in range(0,total_ac):
        with open(words_file, "r", encoding="utf-8") as words_file_buf:
            words = json.load(words_file_buf)
        ac_patterns = list()
        for key in words:
            if key in fields_dict[x]:
                for ele in words[key]:
                    ac_patterns.append(ele)
        ac = AC.build(ac_patterns)
        ac.save(ac_file_names[x])
        print("ac_name:{} -- buff_size:{}Megabytes".format(ac_file_names[x], ac.buff_size()/(1024*1024)))

# Function for child processes
def search_something(ac_file_name):
    with open(ac_file_name, "r+b") as bf:
        buff_object = mmap.mmap(bf.fileno(), 0)
        automaton = AC.from_buff(buff_object, copy=False)


if __name__ == '__main__':
    words_file = "static_ioc_sample_10k.json"
    fields_dict = {
        0 : ["domain", "email"],
        1 : ["domain", "email", "sha1", "md5", "sha2_256"],
        2 : ["domain"]
    }
    ac_file_names = ["ac1", "ac2", "ac3"]
    init_ac()
    total_processes_per_ac = 4
    for ac_name in ac_file_names:
        for x in range(0, total_processes_per_ac):
            p = Process(
                target=search_something,
                args=(ac_name,)
            )
            p.start()

I also tried moving the code under if __name__ == '__main__': into a "main" function, and moving the code that launches the processes into its own function. Neither helped.

nppoly commented on August 21, 2024

I cannot reproduce the exception with your code.

OblackatO commented on August 21, 2024

> I cannot reproduce the exception with your code.

Hmm, that is pretty weird. I can reproduce it on Windows 10 x64, Ubuntu 20.04 x64 and Debian Buster, with both code samples I provided.
Thank you for your help anyway.

spindensity commented on August 21, 2024

I can confirm this bug, and it has nothing to do with multiprocessing. A minimal case to reproduce it is as follows. Please use the attached test.txt, since the bug is only triggered by some specific data, and run the code multiple times, since not every execution triggers it:

import mmap
import os
from cyac import AC


def init_ac(persistent_file, words_file):
    with open(words_file, "r", encoding="utf-8") as f:
        words = list(f)

    print("Nr of words: {}".format(len(words)))

    ac = AC.build(words)
    ac.save(persistent_file)


def search_something(persistent_file):
    with open(persistent_file, "r+b") as bf:
        buff_object = mmap.mmap(bf.fileno(), 0)
        automaton = AC.from_buff(buff_object, copy=False)


if __name__ == "__main__":
    words_file = "test.txt"
    persistent_file = "ac.dump"
    if os.path.exists(persistent_file):
        os.remove(persistent_file)
    init_ac(persistent_file, words_file)
    search_something(persistent_file)

Output:

Nr of words: 1500
Traceback (most recent call last):
  File "C:\Dev\Projects\tools-test\cyac\issue_7.py", line 31, in <module>
    search_something(persistent_file)
  File "C:\Dev\Projects\tools-test\cyac\issue_7.py", line 22, in search_something
    automaton = AC.from_buff(buff_object, copy=False)
  File "lib\cyac\ac.pyx", line 413, in cyac.ac.AC.from_buff
  File "lib\cyac\ac.pyx", line 465, in cyac.ac.ac_from_buff
Exception: invalid data, buf size is not correct
Press any key to continue . . .

nppoly commented on August 21, 2024

still cannot reproduce....

nppoly commented on August 21, 2024

@OblackatO I tested on Ubuntu 20.04 x64 and Mac.

OblackatO commented on August 21, 2024

> I can confirm this bug, and it has nothing to do with multiprocessing. A minimal case to reproduce it is as follows. Please use the attached test.txt, since the bug is only triggered by some specific data, and run the code multiple times, since not every execution triggers it:

I can also reproduce it without using emails at all, with just hashes and domains.

OblackatO commented on August 21, 2024

> still cannot reproduce....

Weird. Unfortunately, this sudden bug makes this library unusable for me. It has great potential, so it is quite a shame.
I will close this issue soon if the author does not do it first.

nppoly commented on August 21, 2024

@OblackatO maybe you can add more print statements and debug it yourself.

OblackatO commented on August 21, 2024

> @OblackatO maybe you can add more print statements and debug it yourself.

I did try that already: I added prints and compiled from source. I first compared the size of the created AC files with the size of the AC before saving. The files are actually smaller than the AC. This means that the problem is in the save method. The save method calls the write method:

cdef write(self, FILE* ptr_fw):
        print("mn size: {}".format(sizeof(magic_number)))
        fwrite(<void*>&magic_number, sizeof(magic_number), 1, ptr_fw)
        cdef int size = self.buff_size()
        print("buff_size: {} in write function".format(size))
        fwrite(<void*>&size, sizeof(size), 1, ptr_fw)
        self.trie.write(ptr_fw)
        fwrite(<void*>self.output, sizeof(OutNode), self.trie.array_size, ptr_fw)
        fwrite(<void*>self.fails, sizeof(int), self.trie.array_size, ptr_fw)
        fwrite(<void*>self.key_lens, sizeof(unsigned int), self.trie.array_size, ptr_fw)

I added a couple of other prints, but I did not find anything abnormal. The size printed here is the same as the size printed in the main logic, and you seem to include all the necessary data to be saved to the file. The bug only happens randomly, for no apparent reason, and sometimes the "invalid magic number" exception is raised as well, not only "invalid size buff". No single factor points to where the bug is, especially because I cannot reproduce it every time, even with the same files, the same system and the same Python interpreter.
In the write function of the trie, you wrote something like "# for arm processor, we should align data in 4bytes." Would this make a difference on a non-ARM processor? That is my case, and yours as well if you used macOS, I guess, unless you have one of the new Macs with the Apple M1 chip ahahah.
Anyway, I really do not see why the bug is raised. I could of course dive deeper into this, but it seems to me it is going to be like finding a needle in a haystack, mostly because of the random factor.
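
One way to inspect a suspicious file without recompiling is to read the header back in Python. The sketch below follows the order of the write method quoted above (magic number first, then an int holding the size). It rests on assumptions the thread does not confirm: that the magic number occupies 4 bytes (`magic_size` is a guess and a parameter for that reason) and that the size is a native C int. `read_header` is a hypothetical helper, not part of cyac.

```python
import struct


def read_header(path, magic_size=4):
    """Read the leading magic bytes and the recorded size from a saved file.

    Assumptions (not confirmed by the thread): the magic number takes
    `magic_size` bytes and the size is stored as a native C int.
    """
    int_size = struct.calcsize("i")
    with open(path, "rb") as f:
        magic = f.read(magic_size)
        raw_size = f.read(int_size)
    if len(magic) < magic_size or len(raw_size) < int_size:
        raise ValueError("file is too short to contain a header")
    # Native byte order, matching a plain fwrite of an int on the same machine.
    (recorded_size,) = struct.unpack("i", raw_size)
    return magic, recorded_size
```

Comparing `recorded_size` with `os.path.getsize(path)` would show whether the file on disk is shorter than the size it claims to hold.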

nppoly commented on August 21, 2024

@OblackatO you can also send me the incorrectly saved file. I will look at it again.

OblackatO commented on August 21, 2024

OK, tell me if you need help.
The file I used in the code samples I posted is in my second post on this issue, linked from the sentence: Here is the json file used on the code

spindensity commented on August 21, 2024

@nppoly

I think the cause of the problem is a mismatch between the size used to allocate the key_lens array and the size used when writing it.

The size used to allocate the key_lens array is trie.leaf_size:

key_lens = <unsigned int*> malloc(sizeof(unsigned int) * trie.leaf_size)

But the size used in the write function is trie.array_size:

fwrite(<void*>self.key_lens, sizeof(unsigned int), self.trie.array_size, ptr_fw)

trie.array_size is actually larger than trie.leaf_size, so the write reads past the end of key_lens. The contents of the memory beyond key_lens are random, and what actually gets written to the file is implementation-dependent. It seems MSVC would not accept this condition.

Here is a patch that should fix the bug:

--- a/lib/cyac/ac.pyx
+++ b/lib/cyac/ac.pyx
@@ -355,7 +355,7 @@ cdef class AC(object):
         self.trie.write(ptr_fw)
         fwrite(<void*>self.output, sizeof(OutNode), self.trie.array_size, ptr_fw)
         fwrite(<void*>self.fails, sizeof(int), self.trie.array_size, ptr_fw)
-        fwrite(<void*>self.key_lens, sizeof(unsigned int), self.trie.array_size, ptr_fw)
+        fwrite(<void*>self.key_lens, sizeof(unsigned int), self.trie.leaf_size, ptr_fw)

     def save(self, fname):
         """
@@ -375,7 +375,7 @@ cdef class AC(object):
         """
         return the memory size of buffer needed for exporting to external buffer.
         """
-        return self.trie.buff_size() + sizeof(magic_number) + sizeof(int) + (sizeof(OutNode) + sizeof(int) + sizeof(unsigned int)) * self.trie.array_size
+        return self.trie.buff_size() + sizeof(magic_number) + sizeof(int) + (sizeof(OutNode) + sizeof(int)) * self.trie.array_size + sizeof(unsigned int) * self.trie.leaf_size;


     def to_buff(self, buff):
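
The change to buff_size in the diff can be restated in plain Python to see how far the two formulas drift apart. This is a sketch only: the function names are illustrative, the byte sizes of `OutNode` and the magic number are passed in because the thread does not state them, and 4-byte `int` and `unsigned int` are assumed defaults.

```python
def buff_size_before(trie_size, magic_size, out_node_size, array_size,
                     int_size=4, uint_size=4):
    """Size formula before the patch: key_lens counted with array_size."""
    per_slot = out_node_size + int_size + uint_size
    return trie_size + magic_size + int_size + per_slot * array_size


def buff_size_after(trie_size, magic_size, out_node_size, array_size,
                    leaf_size, int_size=4, uint_size=4):
    """Size formula after the patch: key_lens counted with leaf_size."""
    return (trie_size + magic_size + int_size
            + (out_node_size + int_size) * array_size
            + uint_size * leaf_size)


# Illustrative numbers only; none of them come from a real automaton.
before = buff_size_before(1000, 4, 8, array_size=256)
after = buff_size_after(1000, 4, 8, array_size=256, leaf_size=100)
print(before - after)  # uint_size * (array_size - leaf_size)
```

The two formulas differ by exactly `uint_size * (array_size - leaf_size)`, which is the number of bytes the old write call read from beyond the allocated key_lens array.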

OblackatO commented on August 21, 2024

I compiled cyac with your patch, @spindensity, but at least for me it did not solve the issue on win10 or ubuntu20.04.
I also tried changing the line:

key_lens = <unsigned int*> malloc(sizeof(unsigned int) * trie.leaf_size)

to:

key_lens = <unsigned int*> malloc(sizeof(unsigned int) * trie.array_size)

and left the rest of the code, including the lines where your patch replaces array_size, unchanged.

Basically, if I use your patch to run the sample code from my second post, the output is:

ac_name:ac1 -- buff_size:1.2047004699707031Megabytes
ac_name:ac2 -- buff_size:5.716983795166016Megabytes
ac_name:ac3 -- buff_size:0.3254508972167969Megabytes
===================================
Exception:
...
  File "lib\cyac\trie.pyx", line 1103, in cyac.trie.trie_from_buff
    raise Exception("invalid data, magic number is not correct")
Exception: invalid data, magic number is not correct
===================================
ls (to get saved files)
-a----        11/15/2020   3:01 PM        1263220 ac1 (1.26Megabytes)
-a----        11/15/2020   3:01 PM        5994692 ac2 (5.99Megabytes)
-a----        11/15/2020   3:01 PM         341260 ac3  (0.34Megabytes)

If I do what I explained above, only replacing leaf_size with array_size and leaving the rest intact, the output is:

...
ac_name:ac1 -- buff_size:1.33502197265625Megabytes
ac_name:ac2 -- buff_size:6.56744384765625Megabytes
ac_name:ac3 -- buff_size:0.36187744140625Megabytes
===================================
Exception:
...
  File "lib\cyac\trie.pyx", line 1103, in cyac.trie.trie_from_buff
    raise Exception("invalid data, magic number is not correct")
Exception: invalid data, magic number is not correct
===================================
ls (to get saved files)
-a----        11/15/2020   2:30 PM        1399872 ac1 (1.39Megabytes)
-a----        11/15/2020   2:30 PM        6886464 ac2 (6.8Megabytes)
-a----        11/15/2020   2:30 PM         379456  ac3 (0.37Megabytes)

If I use the stock cyac version installed from PyPI, the output is:

ac_name:ac1 -- buff_size:1.33502197265625Megabytes
ac_name:ac2 -- buff_size:6.56744384765625Megabytes
ac_name:ac3 -- buff_size:0.36187744140625Megabytes
===================================
Exception: invalid data, buf size is not correct
  File "lib\cyac\ac.pyx", line 465, in cyac.ac.ac_from_buff
    raise Exception("invalid data, buf size is not correct")
  File "lib\cyac\ac.pyx", line 465, in cyac.ac.ac_from_buff
    raise Exception("invalid data, buf size is not correct")
Exception: invalid data, buf size is not correct
===================================
ls (to get saved files)
-a----        11/15/2020   3:11 PM        1253376 ac1 (1.2Megabytes)
-a----        11/15/2020   3:11 PM        5967872 ac2 (5.9 Megabytes)
-a----        11/15/2020   3:11 PM         339968 ac3 (0.33 Megabytes)

It seems that your patch is the version that brings the size of the AC to be saved and the size of the saved file closest to each other.
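
Converting the printed buff_size values back to bytes makes the comparison with the directory listings exact. The snippet below is a sketch that uses only numbers quoted in this comment; `to_bytes` and the `runs` table are illustrative names, and it assumes buff_size was divided by 1024*1024 as in the print statement of the sample code.

```python
MIB = 1024 * 1024

# (buff_size in MiB as printed, file size in bytes from the listing)
runs = {
    "patched": [
        (1.2047004699707031, 1263220),
        (5.716983795166016, 5994692),
        (0.3254508972167969, 341260),
    ],
    "pypi": [
        (1.33502197265625, 1253376),
        (6.56744384765625, 5967872),
        (0.36187744140625, 339968),
    ],
}


def to_bytes(mib):
    """Convert a size printed in MiB back to a whole number of bytes."""
    return round(mib * MIB)


for label, pairs in runs.items():
    for mib, file_size in pairs:
        expected = to_bytes(mib)
        # A positive difference means the file is shorter than buff_size.
        print(label, expected, file_size, expected - file_size)
```

With these numbers the patched build shows no difference between buff_size and the file size, while the PyPI build produces files that are smaller than buff_size. The sizes in parentheses in the listings are in decimal megabytes, so they are not directly comparable with the printed buff_size values.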

spindensity commented on August 21, 2024

@OblackatO

Maybe you did not apply the patch correctly. Could you post your ac.pyx after patching?

spindensity commented on August 21, 2024

@OblackatO

Also make sure you uninstall the old package, then build and install the new one. I can use the patched package without any problems.

nppoly commented on August 21, 2024

@OblackatO I mean the saved binary file. I can analyze it.

nppoly commented on August 21, 2024

@spindensity yes. you are correct

spindensity commented on August 21, 2024

@OblackatO

If you use 64-bit Python 3.8 on Windows, you can install the attached cyac-1.1-cp38-cp38-win_amd64.zip package and try again.

Unzip it and install the wheel package:

pip uninstall cyac
pip install cyac-1.1-cp38-cp38-win_amd64.whl

nppoly commented on August 21, 2024

@spindensity I committed the patch.

spindensity commented on August 21, 2024

@nppoly

OK.

@OblackatO

@nppoly committed a more complete patch. The root cause and the analysis in my post are right, but my patch missed some other uses of array_size that should have been changed to leaf_size. So use the code in the fix-buf-size branch and disregard my patch and the package above.

OblackatO commented on August 21, 2024

@spindensity I confirm that your patch works for me, on both win10 and ubuntu20.04. Thank you for your time.

I added a test to the file ac_test that checks that the size of the AC to be saved and the size of the saved AC are equal, and I opened pull request #9.

@nppoly the test I added needs the "test.txt" file that spindensity used here. Could you please add it when you merge, before or after? Otherwise the test will fail.

nppoly commented on August 21, 2024

@OblackatO I already have a test, but it runs correctly on my Mac. ಥ_ಥ

OblackatO commented on August 21, 2024

> @OblackatO I already have a test, but it runs correctly on my Mac. ಥ_ಥ

You don't have any test in the file ac_test that checks the sizes.
Great that it runs on Mac as well.
