Comments (7)

nishihatapalmer commented on May 20, 2024

Hi,

I think byteseek should be able to do what you want. My documentation is not very good though! You almost got what's needed.

First, instead of using a Parser, use a Compiler. The parser only parses the syntax and produces a parse tree from it. This itself isn't executable - but the compiler will turn it into something you can use to match or search with. And we should use a Matcher compiler, rather than the RegexCompiler. I did say my documentation is pretty bad - none of this is clear.

SequenceMatcherCompiler compiler = new SequenceMatcherCompiler();
SequenceMatcher matcher = compiler.compile("64 84 71 94 00 18 .{512}");

If you just want to match that byte sequence at a particular position, you can call match methods on a SequenceMatcher directly. However, you want to search in the file for that sequence. To do this efficiently, we need to use a Searcher. These use efficient search algorithms which significantly outperform simply matching at each position in turn. The Horspool searcher is generally the fastest of the algorithms currently in byteseek.

HorspoolFinalFlagSearcher searcher = new HorspoolFinalFlagSearcher(matcher);

Now you can use the methods on the Searcher to search over the file. It will return SearchResult objects that tell you where a match is located and the length of the match.

One more thing - instead of using the InputStreamReader, why not use the FileReader? Having direct random access to the file is usually more efficient than copying it into a stream first.

Hope that helps, any other questions please feel free to ask.

from byteseek.

nishihatapalmer commented on May 20, 2024

One thing byteseek won't do is return the actual data for you that matched. It returns the match position and length of a match, but not the data itself. You would have to extract those byte sequences from the file once you found a match.
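
As a rough illustration of that extraction step, here is a minimal sketch that uses only the standard library. It assumes you already have a match position and length from a search; the class and method names (MatchExtractor, readMatch) are made up for this example and are not part of byteseek.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper, not part of byteseek.
class MatchExtractor {

    // Reads exactly `length` bytes starting at `position`.
    // Throws EOFException if the file ends before `length` bytes are available.
    static byte[] readMatch(Path file, long position, int length) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            byte[] data = new byte[length];
            raf.seek(position);
            raf.readFully(data); // unlike read(), this never returns a short read
            return data;
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("match", ".bin");
        try {
            Files.write(tmp, new byte[] {10, 20, 30, 40, 50, 60});
            // Pretend a search reported a match at position 2 with length 3.
            byte[] match = readMatch(tmp, 2, 3);
            System.out.println(Arrays.toString(match)); // prints [30, 40, 50]
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```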

It's not a bad idea to build that capability in - I'll consider adding that for a future release.

nishihatapalmer commented on May 20, 2024

I just realised that there is a problem with searching for your pattern, and it's because of some algorithmic issues. It's not hard to solve though. You're searching for 64 84 71 94 00 18 .{512}.

The 512 wildcard bytes at the end are essentially impossible to search for efficiently with most sub-linear search algorithms. The way this is usually dealt with is to search for the non-wildcard prefix, then extract the bytes after it.

So you should search for 64 84 71 94 00 18 and if you find a match, extract the 512 bytes after it (assuming there are still 512 bytes, of course).
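
A small sketch of that "search for the prefix, then take what follows" idea, working on an in-memory byte array. A plain linear scan stands in for the searcher here, and the names (PrefixThenPayload, extractPayloads) are invented for the example. The payload length is a parameter, so the demo can use 4 bytes instead of 512.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of byteseek.
class PrefixThenPayload {

    // Returns the `payloadLength` bytes that follow each occurrence of `prefix`.
    // Occurrences too close to the end to have a full payload are skipped.
    static List<byte[]> extractPayloads(byte[] data, byte[] prefix, int payloadLength) {
        List<byte[]> payloads = new ArrayList<>();
        int lastStart = data.length - prefix.length - payloadLength;
        for (int pos = 0; pos <= lastStart; pos++) {
            if (Arrays.equals(data, pos, pos + prefix.length, prefix, 0, prefix.length)) {
                int payloadStart = pos + prefix.length;
                payloads.add(Arrays.copyOfRange(data, payloadStart, payloadStart + payloadLength));
            }
        }
        return payloads;
    }

    public static void main(String[] args) {
        byte[] prefix = {0x64, (byte) 0x84, 0x71, (byte) 0x94, 0x00, 0x18};
        byte[] data = {
            0x01, 0x02,
            0x64, (byte) 0x84, 0x71, (byte) 0x94, 0x00, 0x18, // first prefix
            0x0A, 0x0B, 0x0C, 0x0D,                           // its 4-byte payload
            0x64, (byte) 0x84, 0x71, (byte) 0x94, 0x00, 0x18, // second prefix
            0x0E, 0x0F                                        // only 2 bytes left: skipped
        };
        List<byte[]> payloads = extractPayloads(data, prefix, 4);
        System.out.println(payloads.size());                  // prints 1
        System.out.println(Arrays.toString(payloads.get(0))); // prints [10, 11, 12, 13]
    }
}
```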

If you're interested, the reason it's hard to efficiently search for wildcards at the end of a sequence is that most sub-linear search algorithms work from the end of a search pattern, rather than the start. Since the wildcard pattern at the end matches everywhere it looks, it prevents the algorithmic optimisations from skipping ahead in the file. Conversely, a wildcard pattern at the start of a sequence doesn't really impact performance at all.
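
To make that concrete, here is a sketch of a textbook Horspool shift table, naively extended so that a pattern position can be a wildcard. This is an illustration only and not how byteseek is implemented; the ANY marker and the class name are assumptions for the example, and the 512 wildcards are scaled down to 4.

```java
import java.util.Arrays;

// Illustration only, not byteseek's implementation.
class WildcardShifts {

    static final int ANY = -1; // hypothetical marker for "matches any byte"

    // For each byte value, how far the search window may safely move when
    // that byte is aligned with the last position of the pattern.
    static int[] shiftTable(int[] pattern) {
        int m = pattern.length;
        int[] shift = new int[256];
        Arrays.fill(shift, m); // bytes that occur nowhere allow a full-length skip
        for (int i = 0; i < m - 1; i++) {
            int distance = m - 1 - i;
            if (pattern[i] == ANY) {
                Arrays.fill(shift, distance); // a wildcard limits the shift for every byte
            } else {
                shift[pattern[i]] = distance;
            }
        }
        return shift;
    }

    public static void main(String[] args) {
        int[] trailing = {0x64, 0x84, 0x71, 0x94, 0x00, 0x18, ANY, ANY, ANY, ANY};
        int[] leading = {ANY, ANY, ANY, ANY, 0x64, 0x84, 0x71, 0x94, 0x00, 0x18};
        int maxTrailing = Arrays.stream(shiftTable(trailing)).max().getAsInt();
        int maxLeading = Arrays.stream(shiftTable(leading)).max().getAsInt();
        System.out.println("trailing wildcards: largest shift = " + maxTrailing); // 1
        System.out.println("leading wildcards: largest shift = " + maxLeading);   // 6
    }
}
```

With the wildcards at the end, every byte value ends up with a shift of 1, so the search degrades to checking each position in turn. With the wildcards at the start, the shift is only capped at the length of the fixed part.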

nishihatapalmer commented on May 20, 2024

So - better documentation is needed, but also a higher level interface, more like a normal regex so specialist knowledge isn't required to use it safely.

nishihatapalmer commented on May 20, 2024

One more thought. It's actually faster in the Horspool algorithm to search for longer patterns than shorter ones.

So if you could expand the number of bytes to recognise beyond 64 84 71 94 00 18, the search would work a lot faster.

This is because if the search finds a byte that doesn't occur anywhere in your pattern, it can skip ahead by the entire length of the pattern. Essentially, the longer the pattern, the longer the skip you can get, up to a few hundred bytes at least before the advantage goes away.
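
The effect can be seen with a plain textbook Horspool search that counts how many window positions it has to examine. This is a sketch for illustration, not byteseek's searcher; the class and method names are invented, and the test data is 10,000 bytes of 0xFF, a value that appears in neither pattern.

```java
import java.util.Arrays;

// Illustration only, not byteseek's searcher.
class HorspoolSkipDemo {

    // Returns {position of first match or -1, number of window positions examined}.
    static int[] findFirst(byte[] data, byte[] pattern) {
        int m = pattern.length;
        int[] shift = new int[256];
        Arrays.fill(shift, m); // bytes not in the pattern skip the whole pattern length
        for (int i = 0; i < m - 1; i++) {
            shift[pattern[i] & 0xFF] = m - 1 - i;
        }
        int alignments = 0;
        int pos = 0;
        while (pos <= data.length - m) {
            alignments++;
            int j = m - 1;
            while (j >= 0 && data[pos + j] == pattern[j]) {
                j--; // compare from the end of the pattern backwards
            }
            if (j < 0) {
                return new int[] {pos, alignments};
            }
            pos += shift[data[pos + m - 1] & 0xFF];
        }
        return new int[] {-1, alignments};
    }

    public static void main(String[] args) {
        byte[] data = new byte[10_000];
        Arrays.fill(data, (byte) 0xFF);
        byte[] shortPattern = {0x64, (byte) 0x84, 0x71, (byte) 0x94, 0x00, 0x18};
        byte[] longPattern = {0x64, (byte) 0x84, 0x71, (byte) 0x94, 0x00, 0x18,
                              0x60, (byte) 0x8c, 0x0e, (byte) 0x86, 0x71, (byte) 0x94};
        System.out.println(findFirst(data, shortPattern)[1]); // prints 1666
        System.out.println(findFirst(data, longPattern)[1]);  // prints 833
    }
}
```

Doubling the pattern length halves the number of positions examined in this best case. Real data will contain some bytes from the pattern, so the gain is usually smaller.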

Yaldabaoth64 commented on May 20, 2024

Thanks a lot for all the help.
I could extend the pattern a little bit, but after that only more wildcards would follow. The results are pretty good already, though.

This is what I have so far:


            FileReader reader = new FileReader(new File("test.bin"));

            SequenceMatcherCompiler smc = new SequenceMatcherCompiler();
            SequenceMatcher matcher = smc.compile("64 84 71 94 00 18 60 8c 0e 86 71 94");

            HorspoolFinalFlagSearcher searcher = new HorspoolFinalFlagSearcher(matcher);

            // Collect the position of every match, restarting one byte after the previous match.
            ArrayList<Long> results = new ArrayList<>();

            List<SearchResult<SequenceMatcher>> result = searcher.searchForwards(reader);
            while (!result.isEmpty()) {
                results.add(result.get(0).getMatchPosition());
                result = searcher.searchForwards(reader, results.get(results.size() - 1) + 1);
            }

            for (long position : results) {
                System.out.println(position);
            }

            // Get packages from offset: each package starts at the match position,
            // so it begins with the 12 matched bytes.
            ArrayList<byte[]> packages = new ArrayList<>();
            File file = new File("test.bin");
            try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
                for (long position : results) {
                    byte[] data = new byte[512];
                    raf.seek(position);
                    raf.read(data, 0, 512); // may fill fewer than 512 bytes near the end of the file
                    packages.add(data);
                }
            }

            for (byte[] data : packages) {
                System.out.println(new String(data));
            }

All match positions are saved in results, and the packages are then extracted with a RandomAccessFile.
It's really fast. Thanks a lot for that.

nishihatapalmer commented on May 20, 2024

Great, I'm glad it's working for you.
