Comments (7)
Hi,
I think byteseek should be able to do what you want. My documentation is not very good though! You almost got what's needed.
First, instead of using a Parser, use a Compiler. The parser only parses the syntax and produces a parse tree from it. This itself isn't executable - but the compiler will turn it into something you can use to match or search with. And we should use a Matcher compiler, rather than the RegexCompiler. I did say my documentation is pretty bad - none of this is clear.
SequenceMatcherCompiler compiler = new SequenceMatcherCompiler();
SequenceMatcher matcher = compiler.compile("64 84 71 94 00 18 .{512}");
If you just want to match that byte sequence at a particular position, you can call match methods on a SequenceMatcher directly. However, you want to search in the file for that sequence. To do this efficiently, we need to use a Searcher. These use efficient search algorithms which significantly outperform simply matching at each position in turn. The Horspool searcher is generally the fastest of the algorithms currently in byteseek.
HorspoolFinalFlagSearcher searcher = new HorspoolFinalFlagSearcher(matcher);
Now you can use the methods on the Searcher to search over the file. It will return SearchResult objects that tell you where a match is located and the length of the match.
One more thing - instead of using the InputStreamReader, why not use the FileReader? Having direct random access to the file is usually more efficient than copying it into a stream first.
Hope that helps, any other questions please feel free to ask.
from byteseek.
One thing byteseek won't do is return the actual data for you that matched. It returns the match position and length of a match, but not the data itself. You would have to extract those byte sequences from the file once you found a match.
It's not a bad idea to build that capability in - I'll consider adding that for a future release.
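Until something like that exists, pulling the matched bytes out by hand could look roughly like the sketch below. It uses only the standard library's RandomAccessFile; MatchExtractor and readMatch are hypothetical names for illustration, not part of byteseek, and the position and length are assumed to come from whatever the search reported.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical helper, not part of byteseek: reads the bytes of one match,
// given the position and length that a search reported.
final class MatchExtractor {

    static byte[] readMatch(RandomAccessFile file, long position, int length) throws IOException {
        // Reject matches that would run past the start or end of the file.
        if (position < 0 || length < 0 || position + length > file.length()) {
            throw new IllegalArgumentException("match lies outside the file");
        }
        byte[] data = new byte[length];
        file.seek(position);
        // readFully either fills the whole array or throws; plain read() may stop short.
        file.readFully(data);
        return data;
    }
}
```

Calling MatchExtractor.readMatch(raf, position, length) once per match gives one byte array per match.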
from byteseek.
I just realised that there is a problem with searching for your pattern, and it's because of some algorithmic issues. It's not hard to solve though. You're searching for 64 84 71 94 00 18 .{512}.
The 512 wildcard bytes at the end are essentially impossible to search for efficiently with most sub-linear search algorithms. The way this is usually dealt with is to search for the non wildcard prefix, then extract the bytes after it.
So you should search for 64 84 71 94 00 18 and if you find a match, extract the 512 bytes after it (assuming there are still 512 bytes, of course).
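As a rough illustration of that two-step approach, here is a minimal sketch that works on an in-memory byte array. The naive indexOf loop is only a stand-in for a real searcher, and the class and method names are made up for this example; none of it is byteseek API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: find a fixed prefix, then copy the bytes that follow it.
final class PrefixThenPayload {

    // Stand-in for a real searcher: naive scan for the fixed prefix, starting at 'from'.
    static int indexOf(byte[] data, byte[] prefix, int from) {
        for (int i = Math.max(from, 0); i <= data.length - prefix.length; i++) {
            int j = 0;
            while (j < prefix.length && data[i + j] == prefix[j]) {
                j++;
            }
            if (j == prefix.length) {
                return i;
            }
        }
        return -1;
    }

    // Returns the payloadLength bytes after each prefix match,
    // skipping matches that have fewer bytes than that left after them.
    static List<byte[]> payloadsAfter(byte[] data, byte[] prefix, int payloadLength) {
        List<byte[]> payloads = new ArrayList<>();
        int pos = indexOf(data, prefix, 0);
        while (pos >= 0) {
            int start = pos + prefix.length;
            if (start + payloadLength <= data.length) {
                payloads.add(Arrays.copyOfRange(data, start, start + payloadLength));
            }
            pos = indexOf(data, prefix, pos + 1);
        }
        return payloads;
    }
}
```

For the pattern in question, the prefix would be the six bytes 64 84 71 94 00 18 and payloadLength would be 512.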
If you're interested, the reason it's hard to search efficiently for wildcards at the end of a sequence is that most sub-linear search algorithms work from the end of a search pattern, rather than the start. Since the wildcards at the end match everywhere, they prevent the algorithm from skipping ahead in the file. Conversely, a wildcard pattern at the start of a sequence doesn't really impact performance at all.
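To make that concrete, here is a small sketch of a textbook Horspool shift table extended with wildcard positions. It is a generic illustration, not byteseek's internal code; the ANY marker and the class name are assumptions made for this example.

```java
import java.util.Arrays;

// Hypothetical sketch of a Horspool-style shift table for patterns with wildcards.
final class WildcardShifts {

    static final int ANY = -1; // marks a wildcard position; other entries are byte values 0..255

    // shift[b] is how far the search may move forward when the byte
    // aligned with the last pattern position is b.
    static int[] shiftTable(int[] pattern) {
        int m = pattern.length;
        int[] shift = new int[256];
        Arrays.fill(shift, m); // a byte that appears nowhere allows a full-length skip
        for (int i = 0; i < m - 1; i++) {
            int distance = m - 1 - i;
            if (pattern[i] == ANY) {
                // A wildcard matches every byte, so it caps the shift for all of them.
                Arrays.fill(shift, distance);
            } else {
                shift[pattern[i]] = distance;
            }
        }
        return shift;
    }

    static int maxShift(int[] shift) {
        return Arrays.stream(shift).max().getAsInt();
    }
}
```

With the pattern 64 84 71 followed by three wildcards, every entry in the table is 1, so the search can only move one byte at a time. With three wildcards followed by 64 84 71, the table is identical to the one for 64 84 71 alone, with a largest shift of 3.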
from byteseek.
So - better documentation is needed, but also a higher level interface, more like a normal regex so specialist knowledge isn't required to use it safely.
from byteseek.
One more thought. It's actually faster in the Horspool algorithm to search for longer patterns than shorter ones.
So if you could expand the number of bytes to recognise beyond 64 84 71 94 00 18, the search would work a lot faster.
This is because if the search finds something that isn't in your pattern, it can skip ahead the entire length of the pattern. Essentially, the longer the pattern, the longer the skip you can get, up to a few hundred bytes at least before the advantage goes away.
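The effect is easy to see with a plain textbook Horspool search that counts how many alignments it has to examine. This is a minimal sketch for illustration only; it is not how byteseek implements its searchers, and the class name and return convention are made up here.

```java
import java.util.Arrays;

// Hypothetical sketch: textbook Horspool search that also counts alignments tried.
final class HorspoolDemo {

    // Returns {matchPosition or -1, number of alignments examined}.
    static int[] search(byte[] text, byte[] pattern) {
        int m = pattern.length;
        int[] shift = new int[256];
        Arrays.fill(shift, m); // bytes not in the pattern allow a skip of the full length
        for (int i = 0; i < m - 1; i++) {
            shift[pattern[i] & 0xFF] = m - 1 - i;
        }
        int tried = 0;
        int pos = 0;
        while (pos <= text.length - m) {
            tried++;
            // Compare from the end of the pattern backwards.
            int j = m - 1;
            while (j >= 0 && text[pos + j] == pattern[j]) {
                j--;
            }
            if (j < 0) {
                return new int[] {pos, tried};
            }
            // Skip ahead based on the text byte under the last pattern position.
            pos += shift[text[pos + m - 1] & 0xFF];
        }
        return new int[] {-1, tried};
    }
}
```

Over 1000 zero bytes, a 3-byte pattern containing no zero byte is tried at 333 alignments before the search gives up, while a 12-byte pattern with no zero byte is tried at only 83.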
from byteseek.
Thanks a lot for that much help.
I could extend the pattern a little bit, but after that there would just follow more wildcards. But the results are pretty good already.
This is what I've got so far:
FileReader reader = new FileReader(new File("test.bin"));
SequenceMatcherCompiler smc = new SequenceMatcherCompiler();
SequenceMatcher matcher = smc.compile("64 84 71 94 00 18 60 8c 0e 86 71 94");
HorspoolFinalFlagSearcher searcher = new HorspoolFinalFlagSearcher(matcher);
// collect the position of every match, resuming one byte past the previous one
ArrayList<Long> results = new ArrayList<>();
List<SearchResult<SequenceMatcher>> result = searcher.searchForwards(reader);
while (!result.isEmpty()) {
    results.add(result.get(0).getMatchPosition());
    result = searcher.searchForwards(reader, results.get(results.size() - 1) + 1);
}
for (long position : results) {
    System.out.println(position);
}
// get packages from offset
ArrayList<byte[]> packages = new ArrayList<>();
File file = new File("test.bin");
RandomAccessFile raf = new RandomAccessFile(file, "r");
for (long position : results) {
    byte[] data = new byte[512];
    raf.seek(position);
    raf.read(data, 0, 512); // may fill fewer than 512 bytes near the end of the file
    packages.add(data);
}
raf.close();
for (byte[] data : packages) {
    System.out.println(new String(data));
}
All matches are saved in results and extracted with a RandomAccessFile.
It's really fast. Thanks a lot for that.
from byteseek.
Great, I'm glad it's working for you.
from byteseek.