wikinaut / agrep

AGREP - approximate GREP for fast fuzzy string searching. Files are searched for a string or regular expression, with approximate matching capabilities and user-definable records. Developed 1989-1991 by Udi Manber, Sun Wu et al. at the University of Arizona. ISC open source license since Sept. 2014.

License: Other

C 94.35% Makefile 5.00% REXX 0.65%

agrep's Introduction

AGREP - an approximate GREP.

Searches files quickly for a string or regular expression, with approximate matching capabilities and user-definable records.

Developed 1989-1991 by Udi Manber, Sun Wu et al. at the University of Arizona.

For Glimpse and WebGlimpse - AGREP is an essential part of them - see

Usage

Type

agrep

to get the six built-in help pages.

           Approximate Pattern Matching GREP -- Get Regular Expression
Usage:
AGREP [-#cdehi[a|#]klnprstvwxyABDGIRS] [-f patternfile] [-H dir] pattern [files]
-#  find matches with at most # errors     -A  always output filenames
-b  print byte offset of match
-c  output the number of matched records   -B  find best match to the pattern
-d  define record delimiter                -Dk deletion cost is k
-e  for use when pattern begins with -     -G  output the files with a match
-f  name of file containing patterns       -Ik insertion cost is k
-h  do not display file names              -Sk substitution cost is k
-i  case-insensitive search; ISO <> ASCII  -ia ISO chars mapped to lower ASCII
-i# digits-match-digits, letters-letters   -i0 case-sensitive search
-k  treat pattern literally - no meta-characters
-l  output the names of files that contain a match
-n  print line numbers of matches          -q  print buffer byte offsets
-p  supersequence search                   -CP 850|437 set codepage
-r  recurse subdirectories (UNIX style)    -s  silent
-t  for use when delimiter is at the end of records
-v  output those records without matches   -V[012345V] version / verbose more
-w  pattern has to match as a word: "win" will not match "wind"
-u  suppress record output                 -x  pattern must match a whole line
-y  suppresses the prompt when used with -B best match option
@listfile  use the filenames in listfile                              <1>23456Q
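
The -w rule in the help text above ("win" will not match "wind") can be illustrated with a small C sketch. This is not agrep's code: the function names are hypothetical, it handles exact matches only, and it assumes that a word character is an ASCII letter, digit, or underscore, which may differ from agrep's own definition.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Assumption: a "word character" is an ASCII letter, digit, or underscore. */
static int is_word_char(int c)
{
    return c == '_' || isalnum((unsigned char)c);
}

/* Hypothetical helper: returns 1 if `pattern` occurs in `text` with a
 * non-word character (or the start/end of the text) on both sides,
 * and 0 otherwise. Exact matching only. */
static int matches_as_word(const char *text, const char *pattern)
{
    size_t plen;
    const char *hit;

    assert(text != NULL && pattern != NULL);
    plen = strlen(pattern);
    if (plen == 0)
        return 0;

    for (hit = strstr(text, pattern); hit != NULL; hit = strstr(hit + 1, pattern)) {
        int left_ok = (hit == text) || !is_word_char(hit[-1]);
        int right_ok = !is_word_char(hit[plen]);  /* '\0' is not a word character */
        if (left_ok && right_ok)
            return 1;
    }
    return 0;
}
```

With this sketch, matches_as_word("the wind blows", "win") returns 0, while matches_as_word("you win today", "win") returns 1.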

Branches

The present repository contains three different branches:

  • master: agrep based on agrep 3.0, ported to OS/2, DOS and Windows in the 1990s, and backported to Linux (the version you are currently viewing)
  • agrep3.0-as-found-in-glimpse4.18.6-20130216: agrep 3.0 as it was found in the glimpse software
  • agrep2.04: the first published and original agrep version

Installation

git clone git@github.com:Wikinaut/agrep.git
cd agrep
make

Algorithms

COPYRIGHT

As of Sept 18, 2014, Webglimpse, Glimpse and Agrep are available under the ISC open source license, thanks to the University of Arizona Office of Technology Transfer and all the developers, who were more than happy to release it.

Sources: http://webglimpse.net/sublicensing/licensing.html http://opensource.org/licenses/ISC

Anyone distributing the AGREP code should include the following license which is applicable since September 2014:

Copyright 1996, Arizona Board of Regents on behalf of The University of Arizona.

Permission to use, copy, modify, and/or distribute this software for any purpose with or without fee is hereby granted, provided that the above copyright notice and this permission notice appear in all copies.

THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS.

IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.

Contributors

History

Alternatives to AGREP

  • TRE is a lightweight, robust, and efficient POSIX compliant regexp matching library with some exciting features such as approximate (fuzzy) matching.
  • AGREPY: Python port of agrep string matching with errors
  • The bitap library, another new and fresh implementation of the bitap algorithm. Windows - C - Cygwin
  • Perl module String::Approx. Perl extension for approximate matching (fuzzy matching) by Jarkko Hietaniemi, Finland
  • ugrep https://github.com/Genivia/ugrep

Further stuff with the same name (agrep)

  • aGrep, published in 2012, is an Android implementation of grep (but not agrep).

Homepage and references

agrep's People

Contributors

con-mo8, timemachine3030, wikinaut

agrep's Issues

Remove duplicate files

There are several duplicate or nearly-duplicate files in the repository:

  • bitap.c and bitap.c.orig
  • checkfil.c and checkfile.c
  • checkfil.h and checkfile.h
  • dummyfil.c and dummyfilters.c
  • io.c and io.c.orig
  • preproce.c and preprocess.c
  • recursiv.c and recursive.c
  • utilitie.c and utilities.c

What's going on? Can the differences between the two files in each set be resolved and the duplicate files be removed?

Remove OS, compiler version, date, and time from version string

agrep -V prints this on macOS:

AGREP 3.41.5/TG for NATIVE LINUX compiled with GCC Clang 17.0.5 (Nov 25 2023 18:24:29). Manber/Wu/Gries et al.

See:

#define AGREP_VERSION_STRING "AGREP "AGREP_VERSION" for "AGREP_OS" compiled with GCC "__VERSION__" ("__DATE__" "__TIME__"). Manber/Wu/Gries et al."

Please remove the OS name from this line since it is obviously fictitious on operating systems like macOS that you didn't explicitly consider and is not necessary since everybody already knows what operating system they're using.

Please remove "GCC" and the compiler version since it is confusing when the compiler that was used is not GCC and nobody cares what compiler or version a program was compiled with.

Please remove the compilation date and time since that renders your build non-reproducible and nobody cares when a program was compiled.
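
A minimal sketch of what the requested change could look like. The value of AGREP_VERSION is inferred from the output quoted above, and the helper function agrep_version_string() is hypothetical (added here only so the banner can be checked); neither is taken from the repository.

```c
#include <assert.h>
#include <string.h>

/* Assumption: version value inferred from the "agrep -V" output quoted in this issue. */
#define AGREP_VERSION "3.41.5/TG"

/* No AGREP_OS, __VERSION__, __DATE__ or __TIME__: the string is identical in every build. */
#define AGREP_VERSION_STRING "AGREP " AGREP_VERSION ". Manber/Wu/Gries et al."

/* Hypothetical accessor that returns the fixed banner. */
static const char *agrep_version_string(void)
{
    static const char banner[] = AGREP_VERSION_STRING;

    /* Guard against build details creeping back into the banner. */
    assert(strstr(banner, "compiled") == NULL);
    return banner;
}
```

Because the banner no longer depends on the build environment, two builds of the same source produce the same string.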

Add missing options to help text and reformat

In #27 I submitted some minor corrections to the help text but more extensive editing is still needed. After that PR, the first help screen from README.md looks like this:

           Approximate Pattern Matching GREP -- Get Regular Expression
Usage:
AGREP [-#cdehi[a|#]klnprstvwxyABDGIRS] [-f patternfile] [-H dir] pattern [files]
-#  find matches with at most # errors     -A  always output filenames
-b  print byte offset of match
-c  output the number of matched records   -B  find best match to the pattern
-d  define record delimiter                -Dk deletion cost is k
-e  for use when pattern begins with -     -G  output the files with a match
-f  name of file containing patterns       -Ik insertion cost is k
-h  do not display file names              -Sk substitution cost is k
-i  case-insensitive search; ISO <> ASCII  -ia ISO chars mapped to lower ASCII
-i# digits-match-digits, letters-letters   -i0 case-sensitive search
-k  treat pattern literally - no meta-characters
-l  output the names of files that contain a match
-n  print line numbers of matches          -q  print buffer byte offsets
-p  supersequence search                   -CP 850|437 set codepage
-r  recurse subdirectories (UNIX style)    -s  silent
-t  for use when delimiter is at the end of records
-v  output those records without matches   -V[012345V] version / verbose more
-w  pattern has to match as a word: "win" will not match "wind"
-u  suppress record output                 -x  pattern must match a whole line
-y  suppress the prompt when used with -B best match option
@listfile  use the filenames in listfile                              <1>23456Q
  • The -H option is mentioned in the one-line help but not in the detailed help.
  • The -b, -i0, -q, -CP codepage, and -V[012345V] options are mentioned in the detailed help but not in the one-line help.
  • The -g, -m, -o, -z, -L, -M, -O, and -P options are not documented anywhere in the help.
  • The -R option mentioned in the one-line help does not appear to exist (but is requested in #15). (How is this different from the -r option?)
  • The -d, -e, and -k options apparently require an argument which the help does not make obvious. (Compare with how the -f option is documented in the one-line help.)
  • The detailed help would be easier to navigate if it were sorted alphabetically by option (either case-sensitively or case-insensitively). It is currently somewhat alphabetical though with many deviations.
  • The detailed help would be easier to read if it were arranged either in a single column (my preference) or two columns (if compactness is valued more highly than legibility). The current mix of sometimes one-column and sometimes two-column, varying by line, requires more effort to understand.

If you don't want to fix this yourself, I or someone else might be able to submit a PR if you could describe the undocumented options and indicate which of the various alternatives you prefer.

build error

When I try to build on CentOS, I get this error:

checkfil.c:53: error: storage size of ‘buf’ isn’t known

Please suggest how to resolve this.

output by cost instead of # errors

Hi,

I am a postdoc at Stanford University, working as a bioinformatician in cancer genomics.
Thank you for making this great program.
I am wondering whether I can select records that have at most # cost instead of # errors.
Such an option would let me differentiate between substitutions and insertions/deletions.
It would help my research a lot!
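
To make the request concrete, here is a sketch of cost-based approximate matching using the classic dynamic-programming approach for approximate substring search. It is not agrep's implementation; the names (edit_costs, best_match_cost, MAX_PATTERN) are hypothetical. It assumes that "deletion" means a pattern character missing from the text and "insertion" means an extra character in the text, which may not match how agrep defines -D and -I.

```c
#include <assert.h>
#include <string.h>

#define MAX_PATTERN 64  /* arbitrary limit chosen for this sketch */

/* Hypothetical cost settings, in the spirit of the -I, -D and -S options. */
struct edit_costs {
    int insertion;     /* extra character in the text */
    int deletion;      /* pattern character missing from the text */
    int substitution;  /* pattern character replaced by another */
};

static int min3(int a, int b, int c)
{
    int m = a < b ? a : b;
    return m < c ? m : c;
}

/* Returns the lowest total cost at which `pattern` matches any substring of `text`. */
static int best_match_cost(const char *text, const char *pattern, struct edit_costs costs)
{
    int col[MAX_PATTERN + 1];
    size_t m = strlen(pattern);
    size_t i, j;
    int best;

    assert(m >= 1 && m <= MAX_PATTERN);

    /* Column for the empty text: every pattern character must be deleted. */
    for (i = 0; i <= m; i++)
        col[i] = (int)i * costs.deletion;
    best = col[m];

    for (j = 0; text[j] != '\0'; j++) {
        int diag = col[0];
        col[0] = 0;  /* a match may start at any text position for free */
        for (i = 1; i <= m; i++) {
            int old = col[i];
            int sub = diag + (pattern[i - 1] == text[j] ? 0 : costs.substitution);
            int ins = old + costs.insertion;
            int del = col[i - 1] + costs.deletion;
            col[i] = min3(sub, ins, del);
            diag = old;
        }
        if (col[m] < best)
            best = col[m];
    }
    return best;
}
```

A record would then be selected when best_match_cost() is at most the chosen cost limit. For the pattern "hello" and the text "xxheloxx", unit costs give 1 (one deletion), whereas costs of 5 for insertion and deletion and 1 for substitution give 2 (two substitutions), so the two kinds of error are distinguished.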

utf-8 support

Is it possible to search in UTF-8 files? I have some foreign names in my file and the -ia option doesn't return those names. I also tried the codepages 437, 850 and 8859 with the -CP option. Is there any workaround?

Does not make on mac os x

On 10.9.2, make ends with the following:

^~~~~~~~
10 warnings generated.
gcc -DMEASURE_TIMES=0 -DAGREP_POINTER=1 -DDOTCOMPRESSED=0 -c -DHAVE_DIRENT_H=1 -DHAVE_SYS_DIR_H=0 -DHAVE_SYS_NDIR_H=0 -DHAVE_NDIR_H=0 -DUTIME=1 -DISO_CHAR_SET=1 -DS_IFLNK=-1 -Dlstat=stat -O3   -c -o checkfil.o checkfil.c
checkfil.c:53:14: error: variable has incomplete type 'struct stat'
        struct stat buf;
                    ^
checkfil.c:53:9: note: forward declaration of 'struct stat'
        struct stat buf;
               ^
checkfil.c:55:6: warning: implicit declaration of function 'stat' is invalid in C99 [-Wimplicit-function-declaration]
        if (stat(fname, &buf) != 0) {
            ^
1 warning and 1 error generated.
make: *** [checkfil.o] Error 1

Make errors — MacOS Catalina, v10.15.7

A sequence of errors occurred during installation when calling the 'make' command. Most of them seem to be related to implicit declarations, for example:

  • newmgrep.c:980:61: error: implicit declaration of function 'eval_tree' is invalid in C99 [-Werror,-Wimplicit-function-declaration]
  • recursiv.c:118:3: error: implicitly declaring library function 'strcpy' with type 'char *(char *, const char *)'
    [-Werror,-Wimplicit-function-declaration]
  • bitap.c:106:11: error: implicit declaration of function 're' is invalid in C99 [-Werror,-Wimplicit-function-declaration]
    return re(fd, M, D); /* SUN: need to find a even point */

To correct this for my machine, I modified the files to either remove or move the preprocessor blocks beginning with '#ifdef _WIN32', though there is likely a better solution that doesn't affect other operating systems. To make it build successfully on my machine, I changed the following:

  • asearch.c - Comment out '#ifdef _WIN32' on line 26 and '#endif' on line 30
  • asearch1.c - Comment out '#ifdef _WIN32' on line 23 and '#endif' on line 26
  • agrep.c - Moved '#endif' up to line 225, effectively moving the declarations outside of the if block
  • bitap.c - Moved '#endif' up to line 68, effectively moving the declarations outside of the if block
  • main.c - Comment out '#ifdef _WIN32' on line 23 and '#endif' on line 25
  • parse.c - Comment out '#ifdef _WIN32' on line 7 and '#endif' on line 10
  • preproce.c - Comment out '#ifdef _WIN32' on line 45 and '#endif' on line 51
  • recursiv.c - Moved 'int exec();' outside of the if block and added '#include <stdlib.h>' near the top of the file
  • sgrep.c - Comment out '#ifdef _WIN32' on line 126 and '#endif' on line 136
  • newmgrep.c - Comment out '#ifdef _WIN32' on line 122 and '#endif' on line 127
  • utilitie.c - Moved '#include <string.h>' outside of the if block and added '#include <stdlib.h>' near the top of the file
  • agrephlp.c - Moved '#endif' up to line 13, effectively moving the declarations outside of the if block

I may have missed a change or two while writing this issue, but the same actions can be used to correct other similar errors.
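
The general pattern behind these changes can be sketched as follows: standard headers and prototypes that every platform needs stay outside the platform guard, and only genuinely Windows-specific headers remain inside it. The function copy_name() is a hypothetical example, not code from the repository.

```c
#include <assert.h>
#include <stdlib.h>  /* malloc, free: needed on every platform */
#include <string.h>  /* strlen, strcpy: needed on every platform */

#ifdef _WIN32
#include <io.h>      /* Windows-only header stays inside the guard */
#endif

/* The prototype is visible on all platforms, so no caller
 * has to rely on an implicit declaration. */
char *copy_name(const char *name);

/* Hypothetical helper: returns a heap-allocated copy of `name`,
 * or NULL if allocation fails. The caller frees the result. */
char *copy_name(const char *name)
{
    char *copy;

    assert(name != NULL);
    copy = malloc(strlen(name) + 1);
    if (copy != NULL)
        strcpy(copy, name);
    return copy;
}
```

Compilers that treat implicit function declarations as errors, as clang does in the messages above, accept this layout on both Windows and non-Windows systems.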

get match position in record

I would like to know the position of the match within the record. Is there something I don't get?

I don't get exactly what -b and -q do and what they can be used for.

Thanks for this great tool.

Possible missing breaks in switch statement in agrep_init

Is the lack of break; in the 'O' and 'M' cases here intentional? They're undocumented (see #28) so I don't know how they're intended to function.

agrep/agrep.c

Lines 2707 to 2713 in b7d180f

case 'O':
    POST_FILTER = ON;
case 'M':
    MULTI_OUTPUT = ON;
case 'Z': break; /* no-op: used by glimpse */

When I compile with clang with -Werror=implicit-fallthrough in CFLAGS I get:

agrep.c:2710:4: error: unannotated fall-through between switch labels [-Werror,-Wimplicit-fallthrough]
 2710 |                         case 'M':
      |                         ^
agrep.c:2710:4: note: insert '__attribute__((fallthrough));' to silence this warning
 2710 |                         case 'M':
      |                         ^
      |                         __attribute__((fallthrough)); 
agrep.c:2710:4: note: insert 'break;' to avoid fall-through
 2710 |                         case 'M':
      |                         ^
      |                         break; 
agrep.c:2713:4: error: unannotated fall-through between switch labels [-Werror,-Wimplicit-fallthrough]
 2713 |                         case 'Z': break;        /* no-op: used by glimpse */
      |                         ^
agrep.c:2713:4: note: insert 'break;' to avoid fall-through
 2713 |                         case 'Z': break;        /* no-op: used by glimpse */
      |                         ^
      |                         break; 

For both cases, please either add break; if its omission was unintentional or annotate the fallthrough if it was intentional (or indicate the intention in a comment in this issue so someone can submit a PR).

There are various syntaxes for annotating an intentional fallthrough, depending on the compiler, so if you go that route you may need to define a macro for it. For example, LLVM defines a macro like this (yours can probably be simpler since you can skip the C++ variants; for older compilers, however, you'll need to check first whether __has_attribute exists):

#if defined(__cplusplus) && __cplusplus > 201402L && LLVM_HAS_CPP_ATTRIBUTE(fallthrough)
#define LLVM_FALLTHROUGH [[fallthrough]]
#elif LLVM_HAS_CPP_ATTRIBUTE(gnu::fallthrough)
#define LLVM_FALLTHROUGH [[gnu::fallthrough]]
#elif __has_attribute(fallthrough)
#define LLVM_FALLTHROUGH __attribute__((fallthrough))
#elif LLVM_HAS_CPP_ATTRIBUTE(clang::fallthrough)
#define LLVM_FALLTHROUGH [[clang::fallthrough]]
#else
#define LLVM_FALLTHROUGH
#endif
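
A simpler C-only variant might look like the sketch below. The macro name AGREP_FALLTHROUGH and the function set_option() are hypothetical, and the sketch assumes that 'O' is meant to fall through into 'M'; if that fallthrough was unintentional, a break; belongs there instead.

```c
#include <assert.h>

/* Older compilers do not provide __has_attribute; treat every query as "no". */
#ifndef __has_attribute
#define __has_attribute(x) 0
#endif

#if __has_attribute(fallthrough)
#define AGREP_FALLTHROUGH __attribute__((fallthrough))
#else
#define AGREP_FALLTHROUGH ((void)0)  /* harmless no-op statement */
#endif

enum { OFF = 0, ON = 1 };
static int POST_FILTER = OFF;
static int MULTI_OUTPUT = OFF;

/* Hypothetical stand-in for the option switch in agrep_init. */
static void set_option(char opt)
{
    assert(opt != '\0');
    switch (opt) {
    case 'O':
        POST_FILTER = ON;
        AGREP_FALLTHROUGH;  /* assumption: -O is meant to imply -M */
    case 'M':
        MULTI_OUTPUT = ON;
        break;
    case 'Z':               /* no-op: used by glimpse */
        break;
    default:
        break;
    }
}
```

In the original code the 'M' case falls into 'Z', which only breaks, so an explicit break; after MULTI_OUTPUT = ON; leaves the behavior unchanged.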

deletions in output beginning 1023 characters after record start

I am using agrep on a Unix system (someone else in our group installed it) and have run into a strange bug. With the -d option, a few of the matching records in the output have deletions that begin after the first 1023 characters and continue for varying lengths. The behavior is reproducible: given the same input file and command, exactly the same output problems occur.
In https://github.com/peterwang0/agrep-test I have put test and output files that demonstrate the problem.
A detailed description is in https://github.com/peterwang0/agrep-test/blob/main/agrep_deletion_bug_github.pdf
Thanks!

Remove references to webglimpse.net

There are four references to webglimpse.net in three files in this repository. That web site no longer exists so those references should be replaced with references to wherever that information can now be found.

32 characters limit: 64bit?

Would it be possible to get past agrep's ~32-character limit by using a 64-bit unsigned long instead of unsigned?

I experimented with using unsigned long, doubling some of the defines in agrep.h, and replacing (unsigned)037777777777 with its 64-bit equivalent, but it didn't work.
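
One portability point is that unsigned long is only 32 bits wide on some 64-bit platforms (64-bit Windows, for example), so a fixed-width type such as uint64_t is the safer choice for 64-bit masks. The sketch below shows a bit-parallel (bitap-style) search with up to k errors for patterns of up to 64 characters. It is an independent illustration, not a patch for agrep: the names bitap_search and MAX_ERRORS are hypothetical, and agrep's own code has more places that depend on the word size than this sketch covers.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_ERRORS 8  /* arbitrary limit chosen for this sketch */

/* Shift the state left by one and allow a new match to start at bit 0. */
static uint64_t shift1(uint64_t x)
{
    return (x << 1) | 1;
}

/* Returns the index of the last text character of the first match of
 * `pattern` with at most k errors, or -1 if there is none.
 * Bit i of r[d] is set when pattern[0..i] matches a suffix of the text
 * read so far with at most d errors. */
static long bitap_search(const char *text, const char *pattern, int k)
{
    uint64_t mask[256] = {0};
    uint64_t r[MAX_ERRORS + 1];
    uint64_t accept;
    size_t m = strlen(pattern);
    size_t i;
    int d;

    assert(m >= 1 && m <= 64);
    assert(k >= 0 && k <= MAX_ERRORS);

    for (i = 0; i < m; i++)
        mask[(unsigned char)pattern[i]] |= (uint64_t)1 << i;
    accept = (uint64_t)1 << (m - 1);

    /* With d errors allowed, the first d pattern characters may be skipped. */
    for (d = 0; d <= k; d++)
        r[d] = ((uint64_t)1 << d) - 1;

    for (i = 0; text[i] != '\0'; i++) {
        uint64_t b = mask[(unsigned char)text[i]];
        uint64_t old_prev = r[0];

        r[0] = shift1(r[0]) & b;
        for (d = 1; d <= k; d++) {
            uint64_t old = r[d];
            r[d] = (shift1(old) & b)   /* characters match */
                 | old_prev            /* extra character in the text */
                 | shift1(old_prev)    /* substituted character */
                 | shift1(r[d - 1]);   /* pattern character missing from the text */
            old_prev = old;
        }
        if (r[k] & accept)
            return (long)i;
    }
    return -1;
}
```

All shifts are written as (uint64_t)1 << i rather than 1 << i, since shifting a plain int by 32 or more bits is undefined behavior; leftover 32-bit constants of this kind are one common reason a partial conversion fails.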

illegal pattern with "-w -i B" and a comma in expression

I use agrep in ding.
There I searched for "Straße", which resulted in the following search command:
agrep -h -w -i -B -y -e "Strasse,Straße" /usr/share/trans/de-en
This produces the message
illegal pattern: cannot handle OR (',') and AND (';')/regular-expressions simultaneously
while the 4.17 version found in glimpse accepts the same command without problems.
I reduced the options to a minimum and found that the combination of the options -w -i -B and a comma in the search pattern triggers this error message.
I have not yet dug deeper into the code to find the root cause.

Recursivity?

This tool has no option of being recursive while matching patterns.
