zot / microfts

Small and fast FTS (full text search)

License: MIT License

Go 82.95% Emacs Lisp 17.05%

microfts's Introduction

Microfts

A small full text indexing and search tool focusing on speed and space. Initial tests seem to indicate that the database takes about twice as much space as the files it indexes.

Microfts implements a trigram GIN (generalized inverted index) and relies on LMDB for storage. LMDB is an open source, embedded, NoSQL key-value store library, so it is linked into microfts rather than run as an external service. Microfts connects to it through AskAlexSharov’s fork of bmatsuo’s lmdb-go package.

LICENSE

Building

Note that building may generate warning messages from lmdb-go’s compilation of the LMDB C code.

go build -o microfts

Examples

Creating a database

./microfts create /tmp/bubba

Adding Text

This adds /tmp/tst to the database in /tmp/bubba

rm -rf /tmp/bubba
./microfts create /tmp/bubba
cat > /tmp/tst <<here
one
two three
four
four five
one two three
one three two
here
./microfts input -file /tmp/bubba /tmp/tst

Getting Info

./microfts info /tmp/bubba

Searching

./microfts search /tmp/bubba "one two"

Deleting a file’s information

./microfts delete /tmp/bubba /tmp/tst

Reclaiming space in the database (only really matters after deleting a large file)

./microfts compact /tmp/bubba

Finding grams for a string

./microfts grams "this is a test"
./microfts grams -gx "this is a test"

Finding candidates for grams

./microfts search -candidates -grams /tmp/bubba thi tes est

Usage

Exit Codes

misc error
file is missing
file has changed
file is unreadable
no entry for file in database
database missing

Help text

Usage:
   microfts info -groups DB
                   print information about each group in the database,
                   whether it is missing or changed
                   whether it is an org-mode entry
   microfts info [-chunks] DB GROUP
                   print info for a GROUP
                   -chunks also prints the chunks in GROUP if it has a corresponding file
   microfts info [-grams] DB
                   print info for database
                   displays any groups which do not exist as files
                   displays any groups which refer to files that have changed
                   -grams displays distribution information about the trigram index
   microfts create [-s GRAMSIZE] DB
                   create DATABASE if it does not exist
   microfts chunk [-nx | -data D | -dx] -d DELIM DB GROUP GRAMS
   microfts chunk [-nx | -data D | -dx] -gx DB GROUP GRAMS
                   ADD a chunk to GROUP with GRAMS.
                   -d means use DELIM to split GRAMS.
                   -gx means GRAMS is hex encoded with two bytes for each gram using base 37.
   microfts grams [-gx] CHUNK
                   output grams for CHUNK
   microfts input [-nx | -dx | -org] DB FILE...
                   For each FILE, create a group with its name and add a CHUNK for each chunk of input.
                   Chunk data is the line number, offset, and length for each chunk (starting at 1).
                   -org means chunks are org elements, otherwise chunks are lines
   microfts delete [-nx] DB GROUP
                   delete GROUP, its chunks, and tag entries.
                   NOTE: THIS DOES NOT RECLAIM SPACE! USE COMPACT FOR THAT
   microfts compact DB
                   Reclaim space for deleted groups
   microfts search [-n | -partial | -f | -limit N | -filter REGEXP | -u] DB TEXT
                   query with TEXT for objects
                   -f force search to skip changed and missing files instead of exiting
                   -filter makes search only return chunks that match the REGEXP
                   REGEXP syntax is here: https://golang.org/pkg/regexp/syntax/
   microfts search -candidates [-grams | -gx | -gd | -n | -f | -limit N | -dx | -u] DB TERM1 ...
                   display all candidates with the grams for TERMS without filtering
                   -grams indicates TERMS are grams, otherwise extract grams from TERMS
                   -gx: grams are in hex, -gd: grams are in decimal, otherwise they are 3-char strings
   microfts data [-nx | -dx] DB GROUP
                   get data for each doc in GROUP
   microfts update [-t] DB
                   reinput files that have changed
                   delete files that have been removed
                   -t means do a test run, printing what would have happened
   microfts empty DB GROUP...
                   Create empty GROUPs, ignoring existing ones

   microfts is targeted for groups of small documents, like lines in a file.

  -candidates
        return docs with grams for search
  -chunks
        info DB GROUP: display all of a group's chunks
  -comp string
        compression type to use when creating a database
  -d string
        delimiter for unicode tags (default ",")
  -data string
        data to define for object
  -dx
        use hex instead of unicode for object data
  -end-format string
        search: Go format string for the end of a group
        Arg to printf is the FILE
        The default value is ""
        if -sexp is provided and -end-format is not, the default is "\n"
        Not used with search -fuzzy -sort
  -f    search: skip changed and missing files instead of exiting
  -file
        search: display files rather than chunks
  -filter string
        search: filter results that match REGEXP
  -format string
        search: Go format string for each result
        Args to printf are FILE POSITION LINE OFFSET PERCENTAGE CHUNK
        FILE (string) is the name of the file
        POSITION (int) is the 1-based character position of the chunk in the file
        LINE (int) is the 1-based line of the chunk in the file
        OFFSET (int) is the 0-based offset of the first match in the chunk
        PERCENTAGE (float) is the percentage of a fuzzy match
        Note that you can place [ARGNUM] after the % to pick a particular arg to format
        The default format is %s:%[2]s:%[5]s\n
        -sexp sets format to (:filename "%s" :line %[3]d :offset %[4]d :text "%[6]s" :percent %[5]f)
          Note that this will cause all matches to be on one (potentially large) line of output (default "%[6]s:%[2]d:%[5]s\n")
  -fuzzy float
        search: specify a percentage fuzzy match
  -gd
        use decimal instead of unicode for grams
  -grams
        get: specify tags instead of text
        info: print gram coverage
        search: specify grams instead of search terms
  -groups
        info: display information for each group
  -gx
        use hex instead of unicode for grams
  -limit int
        search: limit the number of results (default 9223372036854775807)
  -n    only print line numbers for search
  -org
        index org-mode chunks instead of lines
  -partial
        search: allow partial matches in search
  -prof
        profile cpu
  -s int
        gram size
  -sep
        print candidates on separate lines
  -sexp
        search: output matches as an s-expression ((FILE (POS LINE OFFSET chunk) ... ) ... )
        POS is the 1-based character position of the chunk in the file
        LINE is the 1-based line of the chunk in the file
        OFFSET is the 0-based offset of the first match in the chunk
  -sort
        search -fuzzy: sort all matches
        This ignores start-format and end-format because it sorts all matches, regardless of
        which file they come from.
  -start-format string
        search: Go format string for the start of a group
        Arg to printf is the FILE
        The default value is ""
        Not used with search -fuzzy -sort
  -t    update: do a test run, printing what would have happened
  -u    search: update the database before searching
  -v    verbose
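
The -format strings above are ordinary Go format strings, so the [ARGNUM] notation behaves as it does in Go's fmt package. The following is a small sketch of that behavior, not code from microfts: the hit type, the formatHit helper, and the sample values are made up for illustration, and only the argument order (FILE POSITION LINE OFFSET PERCENTAGE CHUNK) comes from the help text.

```go
package main

import "fmt"

// hit is a hypothetical container for the six printf arguments that the
// help text lists: FILE POSITION LINE OFFSET PERCENTAGE CHUNK.
type hit struct {
	File       string
	Position   int
	Line       int
	Offset     int
	Percentage float64
	Chunk      string
}

// formatHit applies a user-supplied format string to one search result.
// An explicit index such as %[3]d selects an argument by its 1-based position.
func formatHit(format string, h hit) string {
	return fmt.Sprintf(format, h.File, h.Position, h.Line, h.Offset, h.Percentage, h.Chunk)
}

func main() {
	// Sample values are invented for this example.
	h := hit{File: "/tmp/tst", Position: 30, Line: 5, Offset: 0, Percentage: 100, Chunk: "one two three"}

	// File, line, and chunk text on one line.
	fmt.Print(formatHit("%s:%[3]d:%[6]s\n", h))

	// The format that the help text documents for -sexp.
	fmt.Println(formatHit(`(:filename "%s" :line %[3]d :offset %[4]d :text "%[6]s" :percent %[5]f)`, h))
}
```

Because the indexes are explicit, arguments can be skipped or reordered freely; Go does not report the unused POSITION argument as an error once any explicit index appears in the format.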

Notes

Grams

Only alphanumeric characters are represented faithfully in grams; all other characters are treated as whitespace and display as ‘.’. That gives 37 possible symbols per position (whitespace, 0-9, and A-Z), so a gram is a base-37 triple with 37^3 = 50653 possible values, which just fits into 2 bytes. That is a big deal spacewise. Grams for starts of words begin with two whitespaces, and grams for ends of words end with one whitespace. There are no grams that end with two whitespaces.
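
As a rough illustration of the packing described above, here is a minimal Go sketch. It is not taken from the microfts source: the function names are hypothetical, the digit assignment (0 for whitespace, 1-10 for digits, 11-36 for letters) is an assumption, and it only handles a single ASCII word at a time.

```go
package main

import "fmt"

// gramDigit maps one byte to a base-37 digit.
// ASSUMED mapping (not confirmed by the microfts source):
// 0 = whitespace or any other character, 1-10 = '0'-'9', 11-36 = 'A'-'Z'.
// Lowercase letters are folded to uppercase.
func gramDigit(c byte) uint16 {
	switch {
	case c >= '0' && c <= '9':
		return uint16(c-'0') + 1
	case c >= 'A' && c <= 'Z':
		return uint16(c-'A') + 11
	case c >= 'a' && c <= 'z':
		return uint16(c-'a') + 11
	default:
		return 0
	}
}

// encodeGram packs three characters into one base-37 number.
// The largest value is 36*37*37 + 36*37 + 36 = 50652, which fits in a uint16.
func encodeGram(a, b, c byte) uint16 {
	return (gramDigit(a)*37+gramDigit(b))*37 + gramDigit(c)
}

// wordGrams returns the grams for a single word, padded with two leading
// whitespaces and one trailing whitespace so that no gram ends with two
// whitespaces.
func wordGrams(word string) []uint16 {
	if word == "" {
		return nil
	}
	padded := "  " + word + " "
	grams := make([]uint16, 0, len(padded)-2)
	for i := 0; i+3 <= len(padded); i++ {
		grams = append(grams, encodeGram(padded[i], padded[i+1], padded[i+2]))
	}
	return grams
}

func main() {
	// "one" yields the grams "..o", ".on", "one", "ne."
	for _, g := range wordGrams("one") {
		fmt.Printf("%5d  0x%04x\n", g, g)
	}
}
```

Under this assumed mapping an all-whitespace triple encodes to 0, which fits the note further down that gram 0 is not a legal gram.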

Groups and chunks

The index consists of grams for chunks that belong to groups. Groups have names and the default is to use file names as group names.

Supported groups and chunks

Microfts supports using file names as groups and splitting files into chunks either by line or by org-mode element, with the chunk data being a triple of line, offset, chunk-length. Searching finds candidate chunks by intersecting gram entries and then consults the files named by the groups for the actual content.
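
The two-step search described above (intersect gram entries, then check the real text) can be sketched in Go as follows. This is a toy model under stated assumptions, not the microfts implementation: the index is an in-memory map rather than LMDB, grams are shown as plain 3-character strings, and the confirmation step is a simple case-insensitive substring test.

```go
package main

import (
	"fmt"
	"strings"
)

// intersect returns the IDs present in both sorted, duplicate-free lists.
func intersect(a, b []uint64) []uint64 {
	out := []uint64{}
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] < b[j]:
			i++
		case a[i] > b[j]:
			j++
		default:
			out = append(out, a[i])
			i++
			j++
		}
	}
	return out
}

// candidates intersects the chunk-ID lists of every query gram.
// A chunk that lacks any one of the grams cannot match.
func candidates(index map[string][]uint64, grams []string) []uint64 {
	if len(grams) == 0 {
		return nil
	}
	result := index[grams[0]]
	for _, g := range grams[1:] {
		result = intersect(result, index[g])
	}
	return result
}

// confirm keeps the candidates whose text contains every term.
// Sharing all grams is necessary but not sufficient, so this step
// weeds out false positives.
func confirm(ids []uint64, text map[uint64]string, terms []string) []uint64 {
	hits := []uint64{}
	for _, id := range ids {
		chunk := strings.ToLower(text[id])
		ok := true
		for _, t := range terms {
			if !strings.Contains(chunk, strings.ToLower(t)) {
				ok = false
				break
			}
		}
		if ok {
			hits = append(hits, id)
		}
	}
	return hits
}

func main() {
	// Hypothetical chunks and a hand-built gram index for them.
	text := map[uint64]string{1: "stone age", 2: "stop one ton", 3: "one"}
	index := map[string][]uint64{
		"sto": {1, 2},
		"ton": {1, 2},
		"one": {1, 2, 3},
	}
	cands := candidates(index, []string{"sto", "ton", "one"})
	fmt.Println("candidates:", cands)                               // chunks 1 and 2 share all grams
	fmt.Println("hits:", confirm(cands, text, []string{"stone"})) // only chunk 1 contains "stone"
}
```

Chunk 2 shows why the files are consulted at all: it contains every gram of "stone" without containing the word.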

Custom groups and chunks

If this is not sufficient, the command also supports custom usage: you can add chunks to a group, specifying data and grams. Searching can return candidate chunks for a set of grams.

Compressed representation for unsigned integers (lexicographically orderable)

7 bits	0 - 127	0xxxxxxx
12 bits	128 - 4095	1000xxxx X
20 bits	4096 - 1048575	1001xxxx X X
28 bits	1048576 - 268435455	1010xxxx X X X
36 bits	268435456 - 68719476735	1011xxxx X X X X
44 bits	68719476736 - 17592186044415	1100xxxx X X X X X
52 bits	17592186044416 - 4503599627370495	1101xxxx X X X X X X
60 bits	4503599627370496 - 1152921504606846975	1110xxxx X X X X X X X
64 bits	1152921504606846976 - 18446744073709551615	1111---- X X X X X X X X
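
The table above can be turned into an encoder and decoder along the following lines. This is a sketch written from the table alone, not the code microfts uses: the names encodeUint and decodeUint are hypothetical, and it assumes the trailing bytes are stored most significant first, which is what makes the byte strings sort in numeric order.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// encodeUint encodes v using the layout in the table: values below 128 take
// one byte; otherwise the first byte is 1nnnxxxx, where nnn+1 is the number
// of following bytes and xxxx holds the top 4 bits of the value. The widest
// form, 1111----, ignores the low nibble and is followed by 8 bytes.
func encodeUint(v uint64) []byte {
	if v < 1<<7 {
		return []byte{byte(v)}
	}
	for n := 1; n <= 7; n++ {
		if v < uint64(1)<<(uint(8*n)+4) {
			out := make([]byte, n+1)
			out[0] = 0x80 | byte(n-1)<<4 | byte(v>>(8*uint(n)))
			for i := 1; i <= n; i++ {
				out[i] = byte(v >> (8 * uint(n-i)))
			}
			return out
		}
	}
	out := make([]byte, 9)
	out[0] = 0xF0
	binary.BigEndian.PutUint64(out[1:], v)
	return out
}

// decodeUint reads one encoded value from the front of b and returns the
// value together with the number of bytes consumed.
func decodeUint(b []byte) (uint64, int, error) {
	if len(b) == 0 {
		return 0, 0, errors.New("empty input")
	}
	first := b[0]
	if first < 0x80 {
		return uint64(first), 1, nil
	}
	extra := int(first>>4&0x07) + 1 // number of following bytes, 1 to 8
	if len(b) < extra+1 {
		return 0, 0, errors.New("truncated input")
	}
	var v uint64
	if extra < 8 {
		v = uint64(first & 0x0F) // the low nibble is unused in the 64-bit form
	}
	for _, c := range b[1 : extra+1] {
		v = v<<8 | uint64(c)
	}
	return v, extra + 1, nil
}

func main() {
	for _, v := range []uint64{0, 127, 128, 4095, 4096, 1<<64 - 1} {
		enc := encodeUint(v)
		dec, n, err := decodeUint(enc)
		fmt.Printf("%d -> % x -> %d (%d bytes, err=%v)\n", v, enc, dec, n, err)
	}
}
```

Since longer encodings always start with a larger first byte and the remaining bytes are big-endian, comparing two encodings byte by byte gives the same result as comparing the integers, which is useful for keys in an ordered store such as LMDB.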

LMDB Trees

Grams: GRAM-> BLOCK

GRAM is a 2-byte value

OID LIST

OID LISTS

9 lists of oids: [9][]byte.

Note – this is probably too ornate and a simple byte array and a count might have the same performance and space.

OIDS
# 1-byte OIDS
# 2-byte OIDS
# 3-byte OIDS
# 4-byte OIDS
# 5-byte OIDS
# 6-byte OIDS
# 7-byte OIDS
# 8-byte OIDS
# 9-byte OIDS

Gram 0 holds the info since 0 is not a legal gram

free gids
next unused oid
next unused gid
free oids

Chunks: OID -> BLOCK

OIDS are compressed integers

gram count
GID
data (e.g. line number)

Groups: GID -> BLOCK

GIDS are compressed integers

org flag (whether -org was used)
NAME
oid count
last changed timestamp
validity (valid = 0, deleted = 1)

Group Names: NAME->GID

microfts's Issues

Run checkdoc/package-lint

Both of these tools will help bring your library in line with Elisp conventions.

checkdoc reports 20 errors:

 org-fts.el    25     info            White space found at end of line (emacs-lisp-checkdoc)
 org-fts.el    40     info            First sentence should end with punctuation (emacs-lisp-checkdoc)
 org-fts.el    45     info            First sentence should end with punctuation (emacs-lisp-checkdoc)
 org-fts.el    50     info            First sentence should end with punctuation (emacs-lisp-checkdoc)
 org-fts.el    60     info            First sentence should end with punctuation (emacs-lisp-checkdoc)
 org-fts.el    60     info            First line should be capitalized (emacs-lisp-checkdoc)
 org-fts.el    69     info            First sentence should end with punctuation (emacs-lisp-checkdoc)
 org-fts.el    69     info            First line should be capitalized (emacs-lisp-checkdoc)
 org-fts.el    69     info            Argument ‘item’ should appear (as ITEM) in the doc string (emacs-lisp-checkdoc)
 org-fts.el    76     info            First sentence should end with punctuation (emacs-lisp-checkdoc)
 org-fts.el    76     info            Lisp symbol ‘org-mode’ should appear in quotes (emacs-lisp-checkdoc)
 org-fts.el    86     info            First sentence should end with punctuation (emacs-lisp-checkdoc)
 org-fts.el    86     info            Lisp symbol ‘org-mode’ should appear in quotes (emacs-lisp-checkdoc)
 org-fts.el   103     info            First sentence should end with punctuation (emacs-lisp-checkdoc)
 org-fts.el   115     info            All variables and subroutines might as well have a documentation string (emacs-lisp-checkdoc)
 org-fts.el   136     info            All variables and subroutines might as well have a documentation string (emacs-lisp-checkdoc)
 org-fts.el   146     info            All variables and subroutines might as well have a documentation string (emacs-lisp-checkdoc)
 org-fts.el   163     info            First sentence should end with punctuation (emacs-lisp-checkdoc)
 org-fts.el   175     info            First sentence should end with punctuation (emacs-lisp-checkdoc)
 org-fts.el   192     info            First sentence should end with punctuation (emacs-lisp-checkdoc)

46 package-lint issues found:

46 issues found:

1:71: warning: You should depend on (emacs "24.1") if you need lexical-binding.
8:0: error: Expected (package-name "version-num"), but found cl-lib.
8:0: error: Expected (package-name "version-num"), but found executable.
8:0: error: Expected (package-name "version-num"), but found ivy.
8:0: error: Expected (package-name "version-num"), but found org.
8:0: error: Expected (package-name "version-num"), but found package.
11:0: error: Package should have a non-empty ;;; Commentary section.
15:10: error: You should depend on (emacs "24.3") or the cl-lib package if you need `cl-lib'.
18:10: error: You should depend on (emacs "24.1") if you need `package'.
25:0: error: `org-fts/microfts-url-alist' contains a non-standard separator `/', use hyphens instead (see Elisp Coding Conventions).
30:0: error: `org-fts/baseprogram-alist' contains a non-standard separator `/', use hyphens instead (see Elisp Coding Conventions).
35:0: error: `org-fts/baseprogram' contains a non-standard separator `/', use hyphens instead (see Elisp Coding Conventions).
39:0: error: `org-fts/program' contains a non-standard separator `/', use hyphens instead (see Elisp Coding Conventions).
44:0: error: `org-fts/db' contains a non-standard separator `/', use hyphens instead (see Elisp Coding Conventions).
49:0: error: `org-fts/search-args' contains a non-standard separator `/', use hyphens instead (see Elisp Coding Conventions).
54:0: error: `org-fts/hits' contains a non-standard separator `/', use hyphens instead (see Elisp Coding Conventions).
55:0: error: `org-fts/args' contains a non-standard separator `/', use hyphens instead (see Elisp Coding Conventions).
56:0: error: `org-fts/timer' contains a non-standard separator `/', use hyphens instead (see Elisp Coding Conventions).
57:0: error: `org-fts/actual-program' contains a non-standard separator `/', use hyphens instead (see Elisp Coding Conventions).
59:0: error: `org-fts/check-db' contains a non-standard separator `/', use hyphens instead (see Elisp Coding Conventions).
68:0: error: `org-fts/test' contains a non-standard separator `/', use hyphens instead (see Elisp Coding Conventions).
75:0: error: `org-fts/save-hook' contains a non-standard separator `/', use hyphens instead (see Elisp Coding Conventions).
85:0: error: `org-fts/open-hook' contains a non-standard separator `/', use hyphens instead (see Elisp Coding Conventions).
102:0: error: `org-fts/idle-task' contains a non-standard separator `/', use hyphens instead (see Elisp Coding Conventions).
114:0: error: `org-fts/microfts-search' contains a non-standard separator `/', use hyphens instead (see Elisp Coding Conventions).
131:47: error: You should depend on (emacs "24.3") if you need `file-name-base'.
133:14: warning: Closing parens should not be wrapped onto new lines.
135:0: error: `org-fts/found' contains a non-standard separator `/', use hyphens instead (see Elisp Coding Conventions).
139:38: error: You should depend on (emacs "27.1") if you need `org-show-all'.
142:0: error: `org-fts/history' contains a non-standard separator `/', use hyphens instead (see Elisp Coding Conventions).
143:0: error: `org-fts/file-history' contains a non-standard separator `/', use hyphens instead (see Elisp Coding Conventions).
145:0: error: `org-fts/display' contains a non-standard separator `/', use hyphens instead (see Elisp Coding Conventions).
146:18: error: You should depend on (emacs "24.3") or the cl-lib package if you need `cl-search'.
147:18: error: You should depend on (emacs "24.3") or the cl-lib package if you need `cl-search'.
152:5: error: You should depend on (emacs "24.3") or the cl-lib package if you need `cl-do'.
152:18: error: You should depend on (emacs "24.3") or the cl-lib package if you need `cl-incf'.
154:9: error: You should depend on (emacs "24.3") or the cl-lib package if you need `cl-do'.
154:20: error: You should depend on (emacs "24.3") or the cl-lib package if you need `cl-search'.
154:55: error: You should depend on (emacs "24.3") or the cl-lib package if you need `cl-search'.
162:0: error: `org-fts/search' contains a non-standard separator `/', use hyphens instead (see Elisp Coding Conventions).
174:0: error: `org-fts/find-org-file' contains a non-standard separator `/', use hyphens instead (see Elisp Coding Conventions).
184:20: error: You should depend on (emacs "25.1") or the seq package if you need `seq-filter'.
185:37: error: You should depend on (emacs "25.1") or the seq package if you need `seq-sort'.
186:49: error: You should depend on (emacs "25.1") if you need `string-collate-lessp'.
191:0: error: `org-fts/ensure-binary' contains a non-standard separator `/', use hyphens instead (see Elisp Coding Conventions).
217:20: error: You should depend on (emacs "24.4") if you need `zlib-decompress-region'.

Add flags for version and update checks

Need a way to tell what version you have.

Need a way to check if there's a newer version out.

DB never updated

I'm having the problem that the DB is never automatically updated. I can create and update the DB from the command line, but the function never gets called when I open and/or save an org file in Emacs.

offset in search -sexp seems to be the beginning of the line position, not the beginning of the match position

Whenever I use (goto-char offset) it appears that (point) is at the beginning of the line where the match occurred.

missing input files generate a panic

Hi,
If /tmp/tst is missing (I had created /tmp/test by mistake, following your README), I get a panic. The panic message explains the error, but then it is followed by a long
backtrace (truncated below).

./microfts input -file /tmp/bubba /tmp/tst
panic: Error: stat /tmp/tst: no such file or directory, args: [/tmp/bubba /tmp/tst]

goroutine 1 [running]:
main.check(0x59fe60, 0xc00007a720)
        /home/bill/work/microfts/fulltext.go:205 +0x150
main.(*lmdbConfigStruct).openInputFile(0x686980, 0x7fff9d4de550, 0x8, 0xc00013baa0, 0x4e740c, 0x7f9cec003ffb, 0x4)
        /home/bill/work/microfts/fts-lmdb.go:535 +0x79
main.(*lmdbConfigStruct).indexLines(0x686980, 0x7fff9d4de550, 0x8)
        /home/bill/work/microfts/fts-lmdb.go:579 +0x5d
main.(*lmdbConfigStruct).index(0x686980, 0x7fff9d4de550, 0x8)
        /home/bill/work/microfts/fts-lmdb.go:529 +0x6f
main.cmdInput.func1()

finer indexing of paragraphs

Don't merge org paragraphs together

microfts doesn't handle org source blocks

The org parser/chunker in microfts seems to get confused by source blocks. When searching for a term that appears in a file with source blocks, the search UI displays just one huge line with the first source block for the org file (and none of the other matches). The screen shot shows a situation, where many lines in the file match "sphinx". In this case, the source block doesn't even have a match for the search term (it looks like the org parser squeezed all text into the source block chunk).

A simple workaround obviously is to remove -org from org-fts-input-args, which I did (and it's still very useful then, but not quite what you intended, I guess).

Here is a shell transcript of reproducing this from the command line:

$ THIS IS WRONG
$ rm org-fts.db
$ ./microfts create org-fts.db
$ ./microfts input -org org-fts.db ~/org/links.org
$ ./microfts search org-fts.db sphinx | cut -c -80
~/org/links.org:29:    #+begin_src python\n      import sys, 

$ WITHOUT ORG PARSING (same input file)
$ rm org-fts.db
$ ./microfts create org-fts.db
$ ./microfts input org-fts.db ~/org/links.org
$ ./microfts search org-fts.db sphinx | cut -c -80
~/org/links.org:682:** Sphinx
~/org/links.org:684:*** Requirements, Bugs, Test cases, … ins
~/org/links.org:688:*** Why use reStructuredText and Sphinx s
~/org/links.org:695:    documents, then Sphinx (or any of Mar
~/org/links.org:698:    sphinx-static-site-generator-for-main
...

I looked into the code, but my Go fu is moot, so no patch, sorry ...

Searching just for org headings?

Hi,
thanks for this neat project.

Is there a way to restrict a search just for org mode headings rather than the entire file? I saw in #6 mention of searching by headlines?

Thanks.

Non Latin scripts

Hi. I wanted to ask whether this should work with non-Latin scripts. I've installed it and quickly tested, and it seems it will not find anything when searching for texts in Hebrew or Arabic. I only spent a short time testing, so sorry if I'm missing something here.

automatic filter options

Make combinable options that set the filter:

-select-paragraph
-select-headline
...
-invert-selection -- select everything you don't specify

Feature request - one line per search result for completion tools in Emacs

To use asynchronous search with completion tools like ivy/helm, we need the search command to output one complete line per result. I think you want something like:

(:filename path :line line-number :offset integer :text "matched chunk" :percent float)

I used a plist there, but other forms would work too, as long as you can "read" them in emacs.

for each match. Then you can use a transformer function in ivy to convert that to what you want to do completion on.

Note this is already supported in the vanilla search output which does output a line per result, but you have to parse that line to use it, and that isn't as reliable as using read in emacs. Also, it only outputs the line number so you can't jump to the match with offset.

search -files option

return a list of files containing the terms
set format to (:filename xxx)

runtime error on certain org files (minimal example included)

I get a runtime error on certain org files. A minimal org file that shows the error is the following:

#+STARTUP: hidestars

Reading
** Q
- a

And here is the backtrace

$ ./microfts input -org /tmp/db.db /tmp/a.org
panic: runtime error: slice bounds out of range [1:0]

goroutine 1 [running]:
main.orgPart(0x24, 0xc000020180, 0x2b, 0x23, 0xc00002019f, 0x4)
/home/dicosmo/code/microfts/fulltext.go:144 +0x5cf
main.forParts(0xc000020180, 0x2b, 0xc000143bc0)
/home/dicosmo/code/microfts/fulltext.go:108 +0xf7
main.(*lmdbConfigStruct).indexOrg(0x665de0, 0x7ffceb819254, 0xa)
/home/dicosmo/code/microfts/fts-lmdb.go:554 +0x18f
main.(*lmdbConfigStruct).index(0x665de0, 0x7ffceb819254, 0xa)
/home/dicosmo/code/microfts/fts-lmdb.go:527 +0x48
main.cmdInput.func1()
/home/dicosmo/code/microfts/fts-lmdb.go:518 +0x4f
main.(*lmdbConfigStruct).update.func1.1()
/home/dicosmo/code/microfts/fts-lmdb.go:1682 +0x2f
main.(*lmdbConfigStruct).runTxn(0x665de0, 0xc00002cc40, 0x0, 0xc000143d08)
/home/dicosmo/code/microfts/fts-lmdb.go:1720 +0x1d0
main.(*lmdbConfigStruct).update.func1(0xc00002cc40, 0xc00002cc40, 0x0)
/home/dicosmo/code/microfts/fts-lmdb.go:1681 +0x72
github.com/AskAlexSharov/lmdb-go/lmdb.(*Txn).runOpTerm(0xc00002cc40, 0xc000143df0, 0x0, 0x0)
/home/dicosmo/go/pkg/mod/github.com/!ask!alex!sharov/[email protected]/lmdb/txn.go:158 +0x6e
github.com/AskAlexSharov/lmdb-go/lmdb.(*Env).run(0xc00007a6f0, 0x1, 0x0, 0xc000143df0, 0x0, 0x0)
/home/dicosmo/go/pkg/mod/github.com/!ask!alex!sharov/[email protected]/lmdb/env.go:515 +0xbb
github.com/AskAlexSharov/lmdb-go/lmdb.(*Env).Update(...)
/home/dicosmo/go/pkg/mod/github.com/!ask!alex!sharov/[email protected]/lmdb/env.go:482
main.(*lmdbConfigStruct).update(0x665de0, 0xc000143e70)
/home/dicosmo/code/microfts/fts-lmdb.go:1680 +0x77
main.cmdInput(0x665de0)
/home/dicosmo/code/microfts/fts-lmdb.go:516 +0x19e
main.runLmdb(0xc000070180)
/home/dicosmo/code/microfts/fts-lmdb.go:210 +0x3be
main.main()
/home/dicosmo/code/microfts/fulltext.go:357 +0xc6f

Move Ivy support to its own package.

It's better to implement as much as possible using Emacs' default completion interface.
Ivy-specific (or any other completion-specific) code should be its own package, e.g. ivy-microfts or helm-microfts.

does search -partial work?

I have a db setup so that this command

./microfts search ../cache/fts.db hashtag

yields this result
: /Users/jkitchin/Dropbox/emacs/microfts/examples/one.org:7:And #one hashtag.

But this command yields no result

./microfts search ../cache/fts.db -partial hash

I thought it would give me the same match. Is this expected for -partial?

multiple search terms only match when they are in the same line

This may not be a bug. I use the default microfts input with no flags to index the files. The file one.org has the words "example" and "hashtag" in them, but not on the same line. So the first two searches below work where both words are in a single line. But the last one returns nothing, which was a surprise to me.

./microfts search -u ../cache/fts.db one example

#+RESULTS:
: /Users/jkitchin/Dropbox/emacs/microfts/examples/one.org:3:This is the first example with a one in it.

./microfts search -u ../cache/fts.db one hashtag

#+RESULTS:
: /Users/jkitchin/Dropbox/emacs/microfts/examples/one.org:7:And #one hashtag.

./microfts search -u ../cache/fts.db example hashtag