UniversalCodeGrep (ucg) is an extremely fast grep-like tool specialized for searching large bodies of source code.

Home Page: https://gvansickle.github.io/ucg/

License: GNU General Public License v3.0


UniversalCodeGrep


Introduction

UniversalCodeGrep (ucg) is an extremely fast grep-like tool specialized for searching large bodies of source code. It is intended to be largely command-line compatible with Ack, to some extent with ag, and where appropriate with grep. Search patterns are specified as PCRE regexes.

Speed

ucg is intended to address the impatient programmer's code searching needs. ucg is written in C++11 and takes advantage of the concurrency (and other) support of the language to increase scanning speed while reducing reliance on third-party libraries and increasing portability. Regex scanning is provided by the PCRE2 library, with its JIT compilation feature providing a huge performance gain on most platforms. Directory tree traversal is performed by multiple threads, reducing the impact of waiting for I/O completions. Critical functions are implemented with hand-rolled vectorized (SSE2/4.2/etc.) versions selected at program load-time based on what the system supports, with non-vectorized fallbacks.

As a consequence of its overall design for maximum concurrency and speed, ucg is extremely fast. As an example, under Fedora 25, one of the benchmarks in the test suite which scans the Boost 1.58.0 source tree with ucg and a selection of similar utilities yields the following results:

Benchmark: '#include\s+".*"' on Boost source

| Command | Program Version | Elapsed Real Time, Average of 10 Runs | Num Matched Lines | Num Diff Chars |
|---|---|---|---|---|
| /usr/bin/ucg --noenv --cpp '#include\s+.*' ~/src/boost_1_58_0 | 0.3.0 | 0.228973 | 9511 | 189 |
| /usr/bin/rg -Lun -t cpp '#include\s+.*' ~/src/boost_1_58_0 | 0.2.9 | 0.167586 | 9509 | 0 |
| /usr/bin/ag --cpp '#include\s+.*' ~/src/boost_1_58_0 | 0.32.0 | 2.29074 | 9511 | 189 |
| grep -Ern --color --include=\*.cpp --include=\*.hpp --include=\*.h --include=\*.cc --include=\*.cxx '#include\s+.*' ~/src/boost_1_58_0 | grep (GNU grep) 2.26 | 0.370082 | 9509 | 0 |

Note that UniversalCodeGrep is in fact somewhat faster than grep itself, even when grep is only using Extended Regular Expressions. And ucg certainly wins the ease-of-use contest.

License

GPL (Version 3 only)

Installation

UniversalCodeGrep packages are currently available for Fedora 23/24/25/26, Arch Linux, and OS X.

Fedora Copr Repository

If you are a Fedora user, the easiest way to install UniversalCodeGrep is from the Fedora Copr-hosted dnf/yum repository here. Installation is as simple as:

# Add the Copr repo to your system:
sudo dnf copr enable grvs/UniversalCodeGrep
# Install UniversalCodeGrep:
sudo dnf install universalcodegrep

Arch Linux User Repository

If you are an Arch Linux user, the easiest way to install UniversalCodeGrep is from the Arch Linux User Repository (AUR) here. Installation is as simple as:

# Install using yaourt:
yaourt -S ucg

Or you can install manually:

# Install manually:
cd /tmp/
curl -L -O https://aur.archlinux.org/cgit/aur.git/snapshot/ucg.tar.gz
tar -xvf ucg.tar.gz
cd ucg
makepkg -sri

OS X

ucg has been accepted into homebrew-core, so installing it is as easy as:

brew install ucg

Building the Source Tarball

If a ucg package is not available for your platform, UniversalCodeGrep can be built and installed from the distribution tarball (available here) in the standard autotools manner:

tar -xaf universalcodegrep-0.3.3.tar.gz
cd universalcodegrep-0.3.3
./configure
make
make install

This will install the ucg executable in /usr/local/bin. If you wish to install it elsewhere or don't have permissions on /usr/local/bin, specify an installation prefix on the ./configure command line:

./configure --prefix=~/<install-root-dir>

*BSD Note

On at least PC-BSD 10.3, g++48 can't find its own libstdc++ without a little help. Configure the package like this:

./configure LDFLAGS='-Wl,-rpath=/usr/local/lib/gcc48'

Build Prerequisites

gcc and g++ versions 4.8 or greater.

Versions of gcc prior to 4.8 do not have sufficiently complete C++11 support to build ucg. clang/clang++ is also known to work, but is not the primary development compiler.

PCRE: libpcre2-8 version 10.20 or greater, or libpcre version 8.21 or greater.

One or both of these should be available from your Linux/OS X/*BSD distro's package manager. You'll need the -devel versions if they're packaged separately. Prefer libpcre2-8; while ucg will currently work with either PCRE2 or PCRE, you'll get better performance with PCRE2, and further development will be concentrated on PCRE2.

OS X Prerequisites

OS X additionally requires the installation of argp-standalone, which provides the argp argument parser that is normally part of glibc on Linux systems. This can be installed along with a pcre2 library from Homebrew:

$ brew update
$ brew install pcre2 argp-standalone

Supported OSes and Distributions

UniversalCodeGrep 0.3.3 should build and run on any reasonably POSIX-compliant platform where the prerequisites are available. It has been built and tested on the following OSes/distros:

  • Linux:
    • Fedora 23, 24, 25, 26
    • Arch Linux
    • Ubuntu 16.04 (Xenial), 14.04 (Trusty Tahr)
  • OS X:
    • OS X 10.10, 10.11, 10.12, with Xcode 6.4, 7.3.1, 8gm, 8.1, and 8.2 resp.
  • *BSDs:
    • TrueOS (nee PC-BSD) 12.0 (FreeBSD 12.0)
  • Windows:
    • Windows 7 + Cygwin 64-bit

Note that at this time, only x86-64/amd64 architectures are fully supported. 32-bit x86 builds are also occasionally tested.

Usage

Invoking ucg is the same as with ack or ag:

ucg [OPTION...] PATTERN [FILES OR DIRECTORIES]

...where PATTERN is a PCRE-compatible regular expression.

If no FILES OR DIRECTORIES are specified, searching starts in the current directory.

Command Line Options

Version 0.3.3 of ucg supports a significant subset of the options supported by ack. In general, options specified later on the command line override options specified earlier on the command line.

Searching

| Option | Description |
|---|---|
| --[no]smart-case | Ignore case if PATTERN is all lowercase (default: enabled). |
| -i, --ignore-case | Ignore case distinctions in PATTERN. |
| -Q, --literal | Treat all characters in PATTERN as literal. |
| -w, --word-regexp | PATTERN must match a complete word. |

Search Output

| Option | Description |
|---|---|
| --column | Print column of first match after line number. |
| --nocolumn | Don't print column of first match (default). |

File presentation

| Option | Description |
|---|---|
| --color, --colour | Render the output with ANSI color codes. |
| --nocolor, --nocolour | Render the output without ANSI color codes. |

File/directory inclusion/exclusion:

| Option | Description |
|---|---|
| --[no]ignore-dir=name, --[no]ignore-directory=name | [Do not] exclude directories with this name. |
| --exclude=GLOB, --ignore=GLOB | Files matching GLOB will be ignored. |
| --ignore-file=FILTER:FILTERARGS | Files matching FILTER:FILTERARGS (e.g. ext:txt,cpp) will be ignored. |
| --include=GLOB | Only files matching GLOB will be searched. |
| -k, --known-types | Only search in files of recognized types (default: on). |
| -n, --no-recurse | Do not recurse into subdirectories. |
| -r, -R, --recurse | Recurse into subdirectories (default: on). |
| --type=[no]TYPE | Include only [exclude all] TYPE files. Types may also be specified as --[no]TYPE: e.g., --cpp is equivalent to --type=cpp. May be specified multiple times. |

File type specification:

| Option | Description |
|---|---|
| --type-add=TYPE:FILTER:FILTERARGS | Files FILTERed with the given FILTERARGS are treated as belonging to type TYPE. Any existing definition of type TYPE is appended to. |
| --type-del=TYPE | Remove any existing definition of type TYPE. |
| --type-set=TYPE:FILTER:FILTERARGS | Files FILTERed with the given FILTERARGS are treated as belonging to type TYPE. Any existing definition of type TYPE is replaced. |

Performance Tuning:

| Option | Description |
|---|---|
| --dirjobs=NUM_JOBS | Number of directory traversal jobs (std::thread<>s) to use. Default is 2. |
| -j, --jobs=NUM_JOBS | Number of scanner jobs (std::thread<>s) to use. Default is the number of cores on the system. |

Miscellaneous:

| Option | Description |
|---|---|
| --noenv | Ignore .ucgrc files. |

Informational options:

| Option | Description |
|---|---|
| -?, --help | give this help list |
| --help-types, --list-file-types | Print list of supported file types. |
| --usage | give a short usage message |
| -V, --version | print program version |

Configuration (.ucgrc) Files

UniversalCodeGrep supports configuration files with the name .ucgrc, in which command-line options can be stored on a per-user and per-directory-hierarchy basis.

Format

.ucgrc files are text files with a simple format. Each line of text can be either:

  1. A single-line comment. The line must start with a # and the comment continues for the rest of the line.
  2. A command-line parameter. This must be exactly as if it was given on the command line.
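The two line kinds above can be told apart with very little code. The following is a minimal sketch, not the parser ucg actually uses: `read_rc_options` is a hypothetical name, and skipping blank lines is an assumption, since the format description does not mention them.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not from the ucg sources): collect the option lines
// of a .ucgrc-style stream, skipping '#' comment lines.
std::vector<std::string> read_rc_options(std::istream& in)
{
    std::vector<std::string> options;
    std::string line;
    while (std::getline(in, line))
    {
        // Tolerate Windows line endings.
        if (!line.empty() && line.back() == '\r')
            line.pop_back();
        // Assumption: blank lines are ignored; the format text does not say.
        if (line.empty())
            continue;
        // A comment line must start with '#'.
        if (line[0] == '#')
            continue;
        // Anything else is taken verbatim as one command-line parameter.
        options.push_back(line);
    }
    return options;
}
```

Feeding it a `std::istringstream` containing a comment line followed by `--cpp` and `--ignore-dir=build` would yield just those two options, in order.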

Location and Read Order

When ucg is invoked, it looks for command-line options from the following locations in the following order:

  1. The .ucgrc file in the user's $HOME directory, if any.
  2. The first .ucgrc file found, if any, by walking up the component directories of the current working directory. This traversal stops at either the user's $HOME directory or the root directory. This is called the project config file, and is intended to live in the top-level directory of a project directory hierarchy.
  3. The command line itself.

Options read later will override earlier options.
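One way to picture this rule is to concatenate the three sources in read order and let the last occurrence of an option win. The sketch below assumes options of the form `--name=value`; both function names are hypothetical and nothing here is taken from the ucg sources.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: concatenate option lists in the documented read order
// ($HOME .ucgrc, then project .ucgrc, then the command line).
std::vector<std::string> merge_option_sources(
    const std::vector<std::string>& home_rc,
    const std::vector<std::string>& project_rc,
    const std::vector<std::string>& command_line)
{
    std::vector<std::string> merged;
    merged.insert(merged.end(), home_rc.begin(), home_rc.end());
    merged.insert(merged.end(), project_rc.begin(), project_rc.end());
    merged.insert(merged.end(), command_line.begin(), command_line.end());
    return merged;
}

// Hypothetical helper: value of the last "--name=value" occurrence, or
// `fallback` if the option never appears. Later entries override earlier ones.
std::string last_value_of(const std::vector<std::string>& options,
                          const std::string& name,
                          const std::string& fallback)
{
    const std::string prefix = "--" + name + "=";
    std::string value = fallback;
    for (const std::string& opt : options)
    {
        if (opt.compare(0, prefix.size(), prefix) == 0)
            value = opt.substr(prefix.size());
    }
    return value;
}
```

With `--jobs=2` in the home file, `--jobs=4` in the project file, and `--jobs=8` on the command line, `last_value_of(merged, "jobs", "1")` returns `"8"`.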

User-Defined File Types

ucg supports user-defined file types with the --type-set=TYPE:FILTER:FILTERARGS and --type-add=TYPE:FILTER:FILTERARGS command-line options. Three FILTERs are currently supported: ext (extension list), is (literal filename), and glob (glob pattern).

Extension List Filter

The extension list filter allows you to specify a comma-separated list of file extensions which are to be considered as belonging to file type TYPE. Example: --type-set=type1:ext:abc,xqz,def

Literal Filename Filter

The literal filename filter simply specifies a single literal filename which is to be considered as belonging to file type TYPE. Example: --type-add=autoconf:is:configure.ac

Glob filter

The glob filter allows you to specify a glob pattern to match against filenames. If the glob matches, the file is considered as belonging to the file type TYPE. Example: --type-set=mk:glob:?akefile*
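To make the three filter kinds concrete, here is a rough sketch of how each test could be expressed. The helper names are hypothetical, matching is assumed to be case-sensitive and applied to the bare filename, and the glob check borrows POSIX `fnmatch()`; ucg's own implementation may differ on all of these points.

```cpp
#include <fnmatch.h>   // POSIX fnmatch()
#include <sstream>
#include <string>

// All three helpers are hypothetical illustrations of the documented filter
// semantics; they are not taken from the ucg sources.

// ext: true if the filename's extension appears in the comma-separated list.
bool matches_ext(const std::string& filename, const std::string& ext_list)
{
    const std::string::size_type dot = filename.rfind('.');
    if (dot == std::string::npos || dot + 1 == filename.size())
        return false;   // No extension at all.
    const std::string ext = filename.substr(dot + 1);
    std::istringstream list(ext_list);
    std::string candidate;
    while (std::getline(list, candidate, ','))
    {
        if (candidate == ext)
            return true;
    }
    return false;
}

// is: true if the filename equals the literal name exactly.
bool matches_is(const std::string& filename, const std::string& literal)
{
    return filename == literal;
}

// glob: true if the filename matches the glob pattern.
bool matches_glob(const std::string& filename, const std::string& pattern)
{
    return fnmatch(pattern.c_str(), filename.c_str(), 0) == 0;
}
```

Under these assumptions, `matches_glob("Makefile", "?akefile*")` and `matches_ext("widget.xqz", "abc,xqz,def")` are both true, while a file with no extension never satisfies an ext filter.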

Author

Gary R. Van Sickle

Contributors

gvansickle, ismail, kenorb, larryhynes, silvernexus


ucg's Issues

Coverity CID 53715: Not checking fstat() return code on known-good file descriptor

I suppose fstat() could fail here, leading to problems:

File::File(const std::string &filename)
{
    // open() the file. We have to do this regardless of whether we'll subsequently mmap() or read().
    m_file_descriptor = open(filename.c_str(), O_RDONLY);

    // Coverity: 1. Condition this->m_file_descriptor == -1, taking false branch
    if(m_file_descriptor == -1)
    {
        // Couldn't open the file, throw exception.
        throw std::system_error(errno, std::generic_category());
    }

    // Check the file size.
    struct stat st;
    // Coverity: CID 53715 (#1 of 1): Unchecked return value from library (CHECKED_RETURN)
    // 2. check_return: Calling fstat(this->m_file_descriptor, &st) without checking return value.
    // This library function may fail and return an error code.
    // [Note: The source code implementation of the function has been overridden by a builtin model.]
    fstat(m_file_descriptor, &st);
    m_file_size = st.st_size;
    // If filesize is 0, skip.
    if(m_file_size == 0)
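One possible shape for a fix, sketched under the assumption that a failed fstat() should be reported the same way the constructor already reports a failed open(). `checked_file_size` is a hypothetical helper name, not something in the ucg sources.

```cpp
#include <sys/stat.h>
#include <sys/types.h>
#include <cerrno>
#include <system_error>

// Hypothetical helper: return the size of the open file, or throw
// std::system_error carrying errno if fstat() fails.
off_t checked_file_size(int fd)
{
    struct stat st;
    if (fstat(fd, &st) == -1)
    {
        // Same exception style the constructor uses for a failed open().
        throw std::system_error(errno, std::generic_category());
    }
    return st.st_size;
}
```

The constructor could then assign `m_file_size = checked_file_size(m_file_descriptor);`, although whether throwing is the right reaction for a file that opened successfully is a design decision the report leaves open.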

Coverity CID 53718: File descriptor leak in the m_start_paths logic of Globber::Run()

Not much of a leak (O(n) in number of paths on command line), but still:

void Globber::Run()
{
    char * dirs[m_start_paths.size()+1];

    int i = 0;
    // Coverity: 1. Iterating over another element of this->m_start_paths
    for(const std::string& path : m_start_paths)
    {
        dirs[i] = const_cast<char*>(path.c_str());

        // Check if this start path exists and is a file or directory.
        DIR *d = opendir(dirs[i]);
        // Coverity: 2. open_fn: Returning handle opened by open.
        // Coverity: 3. var_assign: Assigning: f = handle returned from open(dirs[i], 0).
        int f = open(dirs[i], O_RDONLY);
        if(d != NULL)
        {
            closedir(d);
        }
        else if(f != -1)
        {
            close(f);
        }
        else
        {
            m_bad_path = dirs[i];
            return;
        }

        ++i;
        // Coverity: CID 53718 (#1 of 1): Resource leak (RESOURCE_LEAK)
        // 4. leaked_handle: Handle variable f going out of scope leaks the handle.
    }
    dirs[m_start_paths.size()] = 0;
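The leak arises when a start path is a directory: opendir() succeeds, open() with O_RDONLY can succeed as well, and only the DIR handle is closed. The smallest fix would be to close f in the first branch too. An alternative, sketched below, avoids opening anything by asking stat() what the path is. `is_file_or_directory` is a hypothetical name, and treating only regular files and directories as valid start paths is an assumption; unlike the original probe, it does not verify that the path can actually be opened for reading.

```cpp
#include <sys/stat.h>
#include <string>

// Hypothetical replacement for the opendir()/open() probe: stat() the path,
// so no descriptor is opened and none can leak.
// Assumption: only regular files and directories count as valid start paths.
bool is_file_or_directory(const std::string& path)
{
    struct stat st;
    if (stat(path.c_str(), &st) != 0)
        return false;   // Nonexistent or inaccessible.
    return S_ISDIR(st.st_mode) || S_ISREG(st.st_mode);
}
```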

Remove dependency on Boost

It's only used for the message queues, which would be simple enough to implement directly in C++11. Eliminates an extra download for builders. Maybe make Boost optional, for comparison of queue implementations.

Files with ~100k matches result in long periods of no output

The way the MatchList/OutputTask mechanism and/or the line-finding mechanism currently work can result in long periods without any output in certain cases. E.g. if you do this:

ucg 'endif' boost_dir > endifs.txt

endifs.txt ends up being about 4 MB / 46,000 lines, every line containing an endif. If you then do a:

ucg 'endif'

and your only files are files such as endifs.txt, there can be long periods (minutes) where there is no console output, and ucg appears hung.

TypeManager::notype() too simplistic, doesn't match ack behavior.

When "--noTYPE" is given to ack, all extensions for that type are no longer matched, even if they appear in another type. Because notype() only removes the entry for the type from the active type map, ucg doesn't do this. The ack behavior is more correct. This behavior should probably extend to all file filter types.
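A sketch of what the ack-like behavior could look like, assuming the active types are held as a map from type name to a set of extensions. That data layout, the `TypeMap` alias, and `remove_type_like_ack` are all illustrative assumptions; the real TypeManager internals are not shown in this report.

```cpp
#include <map>
#include <set>
#include <string>

// Assumed layout: type name -> set of extensions belonging to that type.
using TypeMap = std::map<std::string, std::set<std::string>>;

// Hypothetical ack-like --noTYPE: drop the type itself and also strip its
// extensions from every other active type. Returns false for an unknown type.
bool remove_type_like_ack(TypeMap& active_types, const std::string& type)
{
    auto it = active_types.find(type);
    if (it == active_types.end())
        return false;

    // Copy the extensions before erasing the entry that owns them.
    const std::set<std::string> removed_exts = it->second;
    active_types.erase(it);

    for (auto& entry : active_types)
    {
        for (const std::string& ext : removed_exts)
            entry.second.erase(ext);
    }
    return true;
}
```

If two types both list the `h` extension, removing one of them this way leaves the other without `h`, which is the behavior described above for ack.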

Configure: Detect bad std::regex lib (e.g. gcc 4.8.x)

Some std::regex libs, in particular the one shipping with gcc 4.8.x, are essentially just stubs and do not function correctly, even though the compiler ostensibly supports C++11. Detect this situation at configure (or possibly build) time and output an appropriate error message.
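A probe along the following lines might serve as the body of such a check. It is only a sketch: `std_regex_is_usable` is a hypothetical name, the pattern is an arbitrary choice, and how the result is wired into the build (for example through an Autoconf run test) is left open.

```cpp
#include <regex>
#include <string>

// Hypothetical probe: returns true only if std::regex can compile a small
// pattern and gives the expected answers for one match and one non-match.
bool std_regex_is_usable()
{
    try
    {
        const std::regex re("[a-z]+[0-9]+");
        const std::string should_match = "abc123";
        const std::string should_not_match = "abc";
        return std::regex_match(should_match, re)
            && !std::regex_match(should_not_match, re);
    }
    catch (const std::regex_error&)
    {
        // A stub implementation may throw instead of compiling the pattern.
        return false;
    }
}
```

A configure-time test program could simply return 0 from main() when this function returns true and 1 otherwise.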

Add --no-recurse.

Separate from --recurse issue because this one requires additional logic to implement.

Add Autotest

  • Compare performance to ack, ag
  • Determine performance vs. number of threads.

Coverity CID 53716,53717: ArgParse::GetProjectRCFilename(): if homedirname.empty() == false, open() could still fail, the home_fd would be invalid in subsequent uses.

Could happen, should be fixed:

std::string ArgParse::GetProjectRCFilename() const
{
    // Walk up the directory hierarchy from the cwd until we:
    // 1. Get to the user's $HOME dir, in which case we don't return an rc filename even if it exists.
    // 2. Find an rc file, which we'll then return the name of.
    // 3. Can't go up the hierarchy any more (i.e. we hit root).
    /// @todo We might want to reconsider if we want to start at cwd or rather at whatever
    /// paths may have been specified on the command line. cwd is what Ack is documented
    /// to do, and is easier.

    std::string retval;

    // Get a file descriptor to the user's home dir, if there is one.
    auto homedirname = GetUserHomeDir();
    int home_fd = -1;
    // Coverity: 1. Condition !homedirname.empty(), taking true branch
    if(!homedirname.empty())
    {
        // Coverity: 2. negative_return_fn: Function open(homedirname.c_str(), 65536) returns a negative number.
        // Coverity: 3. var_assign: Assigning: signed variable home_fd = open.
        home_fd = open(homedirname.c_str(), O_RDONLY | O_DIRECTORY);
    }

    // Get the current working directory's absolute pathname.
    /// @note GRVS - get_current_dir_name() under Cygwin will currently return a DOS path if this is started
    /// under the Eclipse gdb. This mostly doesn't cause problems, except for terminating the loop
    /// (see below).
    char *original_cwd = get_current_dir_name();

    //std::clog << "INFO: cwd = \"" << original_cwd << "\"" << std::endl;

    auto current_cwd = original_cwd;
    // Coverity: 4. Condition current_cwd != NULL, taking true branch
    // Coverity: 5. Condition current_cwd[0] != '.', taking true branch
    // Coverity: 13. Condition current_cwd != NULL, taking true branch
    // Coverity: 14. Condition current_cwd[0] != '.', taking true branch
    while((current_cwd != nullptr) && (current_cwd[0] != '.'))
    {
        // See if this is the user's $HOME dir.
        auto cwd_fd = open(current_cwd, O_RDONLY | O_DIRECTORY);
        // Coverity: CID 53716: Improper use of negative value (NEGATIVE_RETURNS) [select issue]
        // Coverity: 6. Condition is_same_file(cwd_fd, home_fd), taking false branch
        // Coverity: CID 53717 (#3-2 of 3): Improper use of negative value (NEGATIVE_RETURNS)
        // 15. negative_returns: home_fd is passed to a parameter that cannot be negative. [show details]
        if(is_same_file(cwd_fd, home_fd))
        {
            // We've hit the user's home directory without finding a config file.
            close(cwd_fd);
            break;
        }
        close(cwd_fd);

        // Try to open the config file.
        auto test_rc_filename = std::string(current_cwd);
        // Coverity: 7. Condition *std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::reverse_iterator(test_rc_filename.rbegin()) != '/', taking true branch
        if(*test_rc_filename.rbegin() != '/')
        {
            test_rc_filename += "/";
        }
        test_rc_filename += ".ucgrc";
        //std::clog << "INFO: checking for rc file \"" << test_rc_filename << "\"" << std::endl;
        auto rc_file = open(test_rc_filename.c_str(), O_RDONLY);
        // Coverity: 8. Condition rc_file != -1, taking false branch
        if(rc_file != -1)
        {
            // Found it. Return its name.
            //std::clog << "INFO: found rc file \"" << test_rc_filename << "\"" << std::endl;
            retval = test_rc_filename;
            close(rc_file);
            break;
        }

        /// @note GRVS - get_current_dir_name() under Cygwin will currently return a DOS path if this is started
        /// under the Eclipse gdb. This mostly doesn't cause problems, except for terminating the loop.
        /// The clause below after the || handles this.
        // Coverity: 9. Condition strlen(current_cwd) == 1, taking false branch
        // Coverity: 10. Condition strlen(current_cwd) <= 4, taking true branch
        // Coverity: 11. Condition current_cwd[1] == ':', taking false branch
        if((strlen(current_cwd) == 1) || (strlen(current_cwd) <= 4 && current_cwd[1] == ':'))
        {
            // We've hit the root and didn't find a config file.
            break;
        }

        // Go up one directory.
        current_cwd = dirname(current_cwd);
        // Coverity: 12. Jumping back to the beginning of the loop
    }

    // Free the cwd string.
    free(original_cwd);

    // Close the homedir we opened above.
    // Coverity: CID 53716: Argument cannot be negative (NEGATIVE_RETURNS) [select issue]
    // Coverity: CID 53717 (#1 of 3): Argument cannot be negative (NEGATIVE_RETURNS) [select issue]
    close(home_fd);

    return retval;
}
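Both findings come down to passing a possibly negative descriptor onward. A sketch of one way to guard against that follows. The real is_same_file() is not shown in this report, so `is_same_file_checked` is a hypothetical stand-in that compares device and inode numbers, and `close_if_open` is likewise a hypothetical wrapper.

```cpp
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical guarded comparison: two descriptors refer to the same file
// only if both are valid and their device and inode numbers agree.
bool is_same_file_checked(int fd_a, int fd_b)
{
    if (fd_a < 0 || fd_b < 0)
        return false;   // A failed open() can never be "the same file".
    struct stat a;
    struct stat b;
    if (fstat(fd_a, &a) != 0 || fstat(fd_b, &b) != 0)
        return false;
    return a.st_dev == b.st_dev && a.st_ino == b.st_ino;
}

// Hypothetical close() wrapper that skips descriptors that were never opened.
void close_if_open(int fd)
{
    if (fd >= 0)
        close(fd);
}
```

With guards like these, a home directory that fails to open simply never matches during the upward walk, and the final cleanup becomes a no-op for home_fd == -1.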

Add version to --help

Currently there's no version info printed with "--help". "--version" does contain the info.

Add optional pcre support

C++ std::regex implementations as of this writing (gcc 5.2, clang 3.7) are either buggy (gcc), causing SIGSEGVs due to their use of recursion, or non-existent (clang on Linux). Add configure-time support for libpcre and see if that's any better.

Add file types that ack detects as text

ack 2.x scans every file to see if it's text or not. At least as a temporary measure, add as many of these as practical as types; we can come up with a scan-for-binary algorithm later.

Gracefully handle regex compile errors

Currently FileScanner::FileScanner() throws an exception when regex compilation or study fails, and there's nothing to catch it. On Cygwin 64 (at least), this results in e.g.:
terminate called without an active exception
[2] 3448 abort (core dumped) ./ucg --noenv '[]' ~/src/boost_1_58_0
We should at least not dump core in this situation.
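The general shape of a fix is to catch the compile failure at a point where it can be turned into an error message and a non-zero exit status. The sketch below illustrates the idea with std::regex rather than the PCRE library ucg actually uses, purely to stay self-contained; `try_compile_pattern` is a hypothetical name, and which patterns are rejected depends on the regex engine.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: compile a pattern and report failure through the
// return value instead of letting an exception escape to terminate().
// Uses std::regex as a stand-in for the real regex engine.
bool try_compile_pattern(const std::string& pattern, std::string& error_message)
{
    try
    {
        const std::regex compiled(pattern);
        (void)compiled;   // Only the success of compilation matters here.
        error_message.clear();
        return true;
    }
    catch (const std::regex_error& e)
    {
        error_message = e.what();
        return false;
    }
}
```

A caller could print `error_message` to std::cerr and exit with a failure code when this returns false, rather than aborting.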

Improve logging

Currently we're just using std::clog and std::cerr. Come up with a better way, which can be at least somewhat controlled by command-line params.
