fist's Introduction

Fist - (F)ull-(t)ext (i)ndex (s)erver

Fist is a fast, lightweight, full-text search and index server. Fist stores all information in memory making lookups very fast while also persisting the index to disk. The index can be accessed over a TCP connection and all data returned is valid JSON.

Fist is still heavily under development. Not all features are implemented or stable yet.

Motivation

Most software that requires full-text search is not that complicated and does not need an overly complex solution. Using one often leads to headaches: setting up Elasticsearch when the application doesn't really need it costs more time and money to maintain.

This is where Fist comes in. Fist is intended to be extremely easy to deploy and integrate into your application. Just start the Fist server and start sending commands.

Build and start Fist server

make
./bin/fist
Fist started at localhost:5575

Run Tests

make test

Example Usage

Commands can be sent over a telnet connection.

Commands: INDEX, SEARCH, EXIT, VERSION, DELETE

telnet localhost 5575
Trying ::1...
telnet: connect to address ::1: Connection refused
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
INDEX document_1 Some text that I want to index
Text has been indexed
INDEX document_2 Some other text that I want to index
Text has been indexed
SEARCH I want to index
["document_1","document_2"]
DELETE I want to index
Key Deleted
SEARCH I want to index
[]
EXIT
Bye
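
Because the protocol is plain text over TCP, the same session can be driven from a program. The following is a minimal Python sketch, not an official client: the helper names `build_command`, `send_command`, and `search` are made up for illustration, the `\r\n` line terminator is an assumption, and a single `recv` call is assumed to return the whole reply, which may not hold for large result sets.

```python
import json
import socket


def build_command(verb, *args):
    # Join the verb and its arguments with spaces and terminate the line.
    # The CRLF terminator is an assumption about what the server expects.
    return (" ".join((verb,) + args) + "\r\n").encode("utf-8")


def send_command(sock, verb, *args, bufsize=4096):
    # Send one command and return the reply as text.
    # Assumes the reply fits in one recv() call (hypothetical simplification).
    sock.sendall(build_command(verb, *args))
    return sock.recv(bufsize).decode("utf-8").strip()


def search(sock, phrase):
    # SEARCH replies are JSON arrays of document names, so decode them.
    return json.loads(send_command(sock, "SEARCH", phrase))
```

With a server running at localhost:5575, opening a connection with `socket.create_connection(("localhost", 5575))` and calling `search(sock, "I want to index")` would be expected to return the list of matching document names.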

Docker Usage

# Build image
docker build . -t fist:latest

# Run tests
docker run --rm -it fist test

# Run server and make volume for database
docker run -d --init --rm -p 5575:5575 -v /var/local/lib/fist fist

Key Features

  • Full text indexing and searching
  • Persisting data to disk
  • Compression of index file
  • Accessible over TCP connection

Client Libraries

NodeJS

Python

Go

Ruby

fist's People

Contributors

00-matt, 0xflotus, andrerenaud, f-prime, palash25, stefanslehta, wanghenshui

fist's Issues

Database portability

@00-matt pointed out that uint32_t should be used instead of int when storing integers in the fist.db file. The size of int is not fixed by the C standard; on some systems it could be 2 bytes instead of the currently assumed 4 bytes, which would corrupt the db file.

So this needs to be fixed.
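
The fix described above comes down to writing integers with an explicit width instead of relying on the platform's `int`. Fist itself is written in C, where `uint32_t` does this; the sketch below shows the same idea in Python with the standard `struct` module. The little-endian byte order and the helper names are assumptions for illustration, not the actual fist.db layout.

```python
import struct

# "<I" means little-endian unsigned 32-bit integer: always 4 bytes,
# regardless of the platform's native int size.
U32 = struct.Struct("<I")


def pack_u32(value):
    # Serialize value as exactly 4 bytes; raises struct.error if it does not fit.
    return U32.pack(value)


def unpack_u32(data):
    # Read back a value written by pack_u32.
    return U32.unpack(data)[0]
```

Because the width and byte order are part of the format string, a file written on one machine can be read back on another without guessing how large an integer was.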

Closing server when index becomes very large is slow

Not sure if I would call this a bug, but when the index gets very large and the server is closed, the sdump function takes a very long time, making the server look like it's hanging.

A quick improvement would be to print a message stating that Fist is attempting to exit cleanly before sdump is run, but there is a deeper problem to be solved here, which is why I am opening this issue.

Why not in C++?

Hi,

This might be a stupid way to ask a question on GitHub, but I want to ask: why did you write it in C and not in C++?

Also I would like to contribute to it. How can I?

Unicode symbols cause a crash

fist λ ./bin/fist 
Database file has been loaded. Previous state restored.
Fist started at localhost:5575
14 'INDEX a 😀'
INDEX
TEXT: '😀'
INDEX SIZE: 1
Segmentation fault
fist λ telnet localhost 5575
Trying ::1...
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
INDEX a 😀
Connection closed by foreign host.

Limited to one connection at a time

Creating this issue to remind myself that it is an issue and also to create some conversation around this.

The current state of the server only allows for a single client to be connected at a time. So this means that if there are multiple applications that need to access the same Fist instance, one client will have to wait for the other to disconnect before it can start sending commands.

This was just easier to do at first but it will need to be addressed because it's not acceptable.
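
One common way to lift this restriction is to handle each client on its own thread. The sketch below illustrates that approach with Python's standard `socketserver` module; it is not Fist's implementation (Fist is written in C), and the `CommandHandler` class, the `start_server` and `ask` helpers, and the placeholder `OK` replies are hypothetical. A real server would also need to guard the shared index with a lock.

```python
import socket
import socketserver
import threading


class CommandHandler(socketserver.StreamRequestHandler):
    # ThreadingTCPServer runs each handler instance on its own thread,
    # so a slow or idle client does not block the others.
    def handle(self):
        for raw in self.rfile:  # one command per line
            line = raw.strip()
            if line == b"EXIT":
                self.wfile.write(b"Bye\r\n")
                break
            # Placeholder reply; a real server would dispatch INDEX, SEARCH, etc.
            self.wfile.write(b"OK " + line + b"\r\n")


def start_server(host="127.0.0.1", port=0):
    # Port 0 lets the OS pick a free port; the server runs in a background thread.
    server = socketserver.ThreadingTCPServer((host, port), CommandHandler)
    server.daemon_threads = True
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def ask(address, command):
    # Open a new connection, send one command line, and return the first reply.
    with socket.create_connection(address, timeout=5) as conn:
        conn.sendall(command + b"\r\n")
        return conn.recv(1024)
```

Thread-per-connection is the simplest option to reason about; an event loop built on `select` or `epoll` is the usual alternative when the number of clients grows large.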

Create proper documentation

With the addition of a configuration file and a few new commands it's time to properly document everything.

Delete an entry

There is currently no way to delete an entry from the database. A delete command needs to be implemented.

Indexing causes SEARCH to become very slow

While attempting to index a lot of documents, I also tried to send a few SEARCH commands to see how things would behave. While the indexing was still going on, SEARCH commands became very slow.

Help me understand max_phrase_length

I've enjoyed reading this source code. Thank you for sharing. I also understand that it is still in development so my question is really, did I miss something or are my expectations to be met with a future modification?

I see max_phrase_length is set to 10.

dstringa index = indexer(text, 10);

Is it expected that I could index a document longer than 10 words? I would expect it to index all phrases present in the document up to 10 words long. 10 would seem a generous number in this case. Maybe 3 would be sufficient?

A query then might search for all phrases up to max_phrase_length, favoring the longest matches. Again, with this behavior, I would expect 3 to offer good results.

I would expect dump and load to require consistent values of max_phrase_length. Or do you anticipate accommodating changes somehow?
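
To make the expectation in this question concrete, the sketch below enumerates every phrase of up to `max_phrase_length` consecutive words in a document. It illustrates the behavior the question describes and is not taken from Fist's `indexer`; the function name `phrases` and the whitespace tokenization are assumptions.

```python
def phrases(text, max_phrase_length=10):
    # Split on whitespace, then emit every run of 1..max_phrase_length
    # consecutive words, starting at each word position in turn.
    words = text.split()
    result = []
    for start in range(len(words)):
        for length in range(1, max_phrase_length + 1):
            if start + length > len(words):
                break  # the phrase would run past the end of the document
            result.append(" ".join(words[start:start + length]))
    return result
```

For a document of n words and a limit of k (with n at least k), this produces k*n - k*(k-1)/2 phrases, so the number of index entries grows roughly linearly with the limit. That is one reason a smaller value such as 3 would shrink the index considerably.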

Better Hashing Algorithm

Currently we are using a very primitive hashing algorithm for the hashmap. A better algorithm should be used to reduce collisions.
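
As one example of a more robust choice, FNV-1a is a small, widely used non-cryptographic hash. The sketch below is a 32-bit FNV-1a in Python for illustration only; the issue does not say which algorithm Fist currently uses or should adopt, and the `bucket_for` helper and its bucket count are hypothetical.

```python
FNV_OFFSET_BASIS = 0x811C9DC5  # 2166136261, the 32-bit FNV offset basis
FNV_PRIME = 0x01000193         # 16777619, the 32-bit FNV prime


def fnv1a_32(key):
    # FNV-1a: XOR each byte into the hash, then multiply by the prime,
    # keeping only the low 32 bits after every step.
    h = FNV_OFFSET_BASIS
    for byte in key.encode("utf-8"):
        h ^= byte
        h = (h * FNV_PRIME) & 0xFFFFFFFF
    return h


def bucket_for(key, bucket_count):
    # Map a key to a hashmap bucket index (hypothetical helper).
    return fnv1a_32(key) % bucket_count
```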

Server quits unexpectedly

This is a bug that is also in release version 0.0.1 and might have something to do with PR #15.

The script below pulls the Bee Movie transcript from Pastebin and attempts to index it. When run, the Fist server stops suddenly with no error message.

I used this code when I was originally testing indexing of large files and it worked fine (other than issue #18).

import socket
import string

import requests

# Fetch the transcript and lowercase it.
text = requests.get("https://pastebin.com/raw/Gb02THWc").content.decode().lower()

# Strip punctuation, then collapse every run of whitespace
# (including newlines and carriage returns) into a single space.
text = text.translate(str.maketrans("", "", string.punctuation))
data = " ".join(text.split())
print(data)

s = socket.socket()
s.connect(("localhost", 5575))
# sendall() keeps writing until the whole payload is sent;
# send() may stop early on a payload this large.
s.sendall("INDEX beemovie {}\r\n".format(data).encode())
print(s.recv(1024))
s.sendall(b"SEARCH bee movie\r\n")
print(s.recv(1024))
s.sendall(b"EXIT\r\n")
s.close()

Website

  • setup sass
  • choose fonts
  • find favicon
  • add code highlighter
  • add nav links to github, patreon, slack
  • style footer
  • add social media icons
  • add links to clients
  • fix mobile version
  • add media queries

Index Compression

Currently the index is stored as is in the fist.db file. As the index grows it will require a lot of space. Some kind of compression will need to be implemented to allow the index to get sufficiently large without consuming a huge amount of resources.

EDIT: As an example of how important this is, I attempted to index all of the Joe Rogan podcast transcripts found here (https://github.com/achendrick/jrescribe-transcripts). The index grew to over 1.5 GB. When compressed manually with gzip, the file size was brought down to 300 MB.
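
The manual gzip experiment can be reproduced programmatically. The sketch below uses Python's standard `gzip` module on an in-memory byte string; the sample data is made up, and nothing here reflects the actual fist.db format or how Fist itself would apply compression.

```python
import gzip


def compress_index(raw):
    # Compress the serialized index bytes with gzip.
    return gzip.compress(raw)


def decompress_index(blob):
    # Restore the original bytes from a gzip-compressed blob.
    return gzip.decompress(blob)


# Highly repetitive text, similar in spirit to an index that stores
# many overlapping phrases, compresses well.
sample = b"some text that i want to index " * 1000
packed = compress_index(sample)
```

Comparing `len(sample)` with `len(packed)` gives a rough feel for the savings on repetitive data; real ratios depend on the content of the index.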

Versioning

Hi,

It would be great to get a version number going so that we could track changes that break compatibility with the protocol or database file.

Configuration File

Settings such as max_phrase_length, host, port, and buffer_size should be configurable via a config file.
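
As a sketch of what such a file could look like, the example below parses an INI-style config with Python's standard `configparser`. The file format, the `[fist]` section name, and the `buffer_size` default are assumptions for illustration; the host, port, and phrase-length defaults mirror values that appear elsewhere in this document.

```python
import configparser

# Defaults used when a setting is missing from the file.
DEFAULTS = {
    "host": "localhost",
    "port": "5575",
    "max_phrase_length": "10",
    "buffer_size": "1024",  # hypothetical default, not taken from Fist
}


def load_config(text):
    # Load the defaults first, then let the file contents override them.
    parser = configparser.ConfigParser()
    parser.read_dict({"fist": DEFAULTS})
    parser.read_string(text)
    section = parser["fist"]
    return {
        "host": section.get("host"),
        "port": section.getint("port"),
        "max_phrase_length": section.getint("max_phrase_length"),
        "buffer_size": section.getint("buffer_size"),
    }


# Example file contents (assumed format).
EXAMPLE = """
[fist]
port = 6000
max_phrase_length = 3
"""
```

Calling `load_config(EXAMPLE)` overrides only the two settings present in the file and leaves the rest at their defaults.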
