singrep

singe's grep - a fast grep using single-file parallelism

singrep makes use of deterministic kernel file caching to read the file fast enough to make multi-threading useful. It instructs the kernel to cache sections of the file in memory, then memory-maps them for fast reads. Chunks of the file are then sent to separate threads to do the matching. On a modern multi-core system, this can be significantly faster than other fast grep utilities.
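The chunk-per-thread idea can be sketched with the standard library alone. This is a minimal illustration and not singrep's actual code: it works on an in-memory byte slice instead of a memory-mapped file, spawns one thread per chunk for simplicity, and the name `parallel_find` and its parameters are hypothetical. Each chunk is extended by a few bytes so that a match straddling a chunk boundary is still found exactly once.

```rust
use std::thread;

/// Return the byte offsets of every occurrence of `pattern` in `data`,
/// searching `chunk_size`-byte chunks on separate threads.
/// (Hypothetical helper; one thread per chunk keeps the sketch short.)
fn parallel_find(data: &[u8], pattern: &[u8], chunk_size: usize) -> Vec<usize> {
    if pattern.is_empty() || data.len() < pattern.len() {
        return Vec::new();
    }
    let chunk_size = chunk_size.max(1);
    let overlap = pattern.len() - 1;

    thread::scope(|scope| {
        let handles: Vec<_> = (0..data.len())
            .step_by(chunk_size)
            .map(|start| {
                // Extend each chunk by `overlap` bytes so a match that
                // straddles a chunk boundary is still seen by one thread.
                let end = (start + chunk_size + overlap).min(data.len());
                let chunk = &data[start..end];
                scope.spawn(move || {
                    chunk
                        .windows(pattern.len())
                        .enumerate()
                        .filter(|(_, w)| *w == pattern)
                        .map(|(i, _)| start + i)
                        .collect::<Vec<usize>>()
                })
            })
            .collect();
        // Joining in spawn order keeps the offsets ascending.
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("worker thread panicked"))
            .collect()
    })
}
```

For example, `parallel_find(b"abcabcabc", b"abc", 4)` yields the offsets 0, 3 and 6, even though the second and third matches cross chunk boundaries.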

This only works on Linux and macOS.

Compiling

You'll need a Rust installation; the easiest way to get one is to use rustup.

In the cloned repository run:

cargo build --release

The resulting binary will be in target/release/singrep.

Usage

singrep <pattern> <file>

This searches for occurrences of the pattern in the supplied file.

Advanced usage

  • Regex Match --regex, -r - will match using a regular expression
  • Exact Match --exact, -e - will only match lines that entirely match the pattern, incompatible with regex
  • First Match --first, -f - will exit after the first match is found, incompatible with regex
  • Byte Position --position, -p - will display the byte (not line) number where the pattern was found
  • Verbose --verbose, -v - will display some extra information
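To make the difference between the default, --exact, --first and --position behaviours concrete, here is a small sketch of one plausible reading of them. It is an assumption, not singrep's implementation: the `Mode` enum and `find_matches` function are hypothetical, the input is a string in memory, and only the first occurrence on each line is reported.

```rust
/// How a line is compared with the pattern (hypothetical names).
#[derive(Clone, Copy)]
enum Mode {
    /// Default: the pattern may appear anywhere in the line.
    Substring,
    /// Like --exact: the whole line must equal the pattern.
    Exact,
}

/// Return the byte offset (not the line number) of each match in `data`.
/// Stops after the first match when `first_only` is set, like --first.
fn find_matches(data: &str, pattern: &str, mode: Mode, first_only: bool) -> Vec<usize> {
    let mut hits = Vec::new();
    let mut line_start = 0; // byte offset of the current line within `data`
    for line in data.split('\n') {
        let hit = match mode {
            Mode::Substring => line.find(pattern).map(|i| line_start + i),
            Mode::Exact => (line == pattern).then_some(line_start),
        };
        if let Some(offset) = hit {
            hits.push(offset);
            if first_only {
                break;
            }
        }
        line_start += line.len() + 1; // +1 for the '\n' separator
    }
    hits
}
```

With the input "foo\nfoobar\nbar foo\n" and the pattern "foo", substring mode reports byte offsets 0, 4 and 15, while exact mode reports only offset 0, because only the first line consists of the pattern and nothing else.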

Performance Tuning

Block Size --block, -b

The block size controls how much of the file is read at a time. The best value depends on the read size at which your drive performs fastest. By default it is 8M (8_388_608). One way to find a good value is to run the following on a large file and compare the timings:

for x in 1M 1M 2M 4M 8M 12M; do time dd if=somefile of=/dev/null bs=$x; done

Running in --verbose mode will give stats on how fast the file was read from disk, for optimisation.
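The default of 8M (8_388_608) is a binary multiple, that is 8 × 1024 × 1024 bytes. The sketch below shows that arithmetic as a small parser. It is a hypothetical helper for illustration only; whether singrep itself accepts suffixed values, and how it parses them, is not stated here.

```rust
/// Parse a size such as "8M" or "2G" into bytes, using binary multiples.
/// Hypothetical helper for illustration; singrep's own parser may differ.
fn parse_size(text: &str) -> Option<u64> {
    let text = text.trim();
    let (digits, multiplier) = match text.chars().last()? {
        'K' | 'k' => (&text[..text.len() - 1], 1u64 << 10),
        'M' | 'm' => (&text[..text.len() - 1], 1u64 << 20),
        'G' | 'g' => (&text[..text.len() - 1], 1u64 << 30),
        _ => (text, 1),
    };
    // Accept Rust-style digit separators such as 8_388_608.
    let digits: String = digits.chars().filter(|c| *c != '_').collect();
    // Return None on non-numeric input or on overflow.
    digits.parse::<u64>().ok()?.checked_mul(multiplier)
}
```

Under these assumptions, `parse_size("8M")` gives `Some(8_388_608)` and `parse_size("2G")` gives `Some(2_147_483_648)`.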

Cache Size --cache, -c

The cache size controls how large the sections of the file cached in the kernel's file pages are. On the systems I tested, the most that could be cached was about 68% of total system memory. However, if a lot of other software is running, the file cache will have less space available (MS Teams is a great way to test this). By default it is set to 2G (2_147_483_648).

You can find total memory with:

Linux: cat /proc/meminfo | head -n1

macOS: sysctl hw.memsize
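As a rough worked example of the 68% figure on Linux, the sketch below pulls MemTotal out of the text of /proc/meminfo and computes a percentage of it. The function names are hypothetical, the 68% value is only the author's observation on the systems tested, and the result is an upper estimate, not a recommended setting.

```rust
/// Extract MemTotal (reported in kB, i.e. units of 1024 bytes) from the
/// text of /proc/meminfo. Hypothetical helper for illustration.
fn mem_total_kib(meminfo: &str) -> Option<u64> {
    let line = meminfo.lines().find(|l| l.starts_with("MemTotal:"))?;
    line.split_whitespace().nth(1)?.parse().ok()
}

/// Estimate a cache budget as a percentage of total memory, in bytes.
/// Uses u128 for the intermediate product to avoid overflow.
fn cache_budget_bytes(total_bytes: u64, percent: u64) -> u64 {
    (total_bytes as u128 * percent as u128 / 100) as u64
}
```

For a machine reporting `MemTotal:       16384000 kB`, the total is 16_777_216_000 bytes and `cache_budget_bytes(16_777_216_000, 68)` comes to 11_408_506_880 bytes, a little over 10.6 GiB.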

Shard Size --shard, -s

The shard size controls how big the blocks of data to send to the threads should be. Running with --verbose and examining the thread waits can help to optimise this for your system. Fewer waits means the threads spend less time waiting for a new chunk to arrive.
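One way to cut a block into shards, sketched below, is to aim for the requested shard size and then extend each shard to the next newline so that no line is split between two threads. This is an assumption about how sharding could work, not a description of singrep's internals, and `split_on_lines` is a hypothetical name.

```rust
/// Split `data` into shards of roughly `shard_size` bytes, extending each
/// shard to the end of its last line so that no line is cut in half.
/// (Hypothetical helper; singrep may shard differently.)
fn split_on_lines(data: &[u8], shard_size: usize) -> Vec<&[u8]> {
    let mut shards = Vec::new();
    let mut start = 0;
    while start < data.len() {
        let mut end = (start + shard_size.max(1)).min(data.len());
        // Grow the shard until it ends with '\n' or reaches the end of data.
        while end < data.len() && data[end - 1] != b'\n' {
            end += 1;
        }
        shards.push(&data[start..end]);
        start = end;
    }
    shards
}
```

Smaller shards spread work more evenly across threads but add hand-off overhead, while larger shards do the opposite, which is why the thread-wait figures from --verbose are a useful guide when choosing a value.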
