f-prime / fist
A lightweight full-text index server with a focus on speed and efficiency.
License: MIT License
I've enjoyed reading this source code. Thank you for sharing. I also understand that it is still in development so my question is really, did I miss something or are my expectations to be met with a future modification?
I see max_phrase_length is set to 10.
Line 57 in 9049066
Is it expected that I could index a document longer than 10 words? I would expect it to index all phrases present in the document up to 10 words long. 10 would seem a generous number in this case. Maybe 3 would be sufficient?
A query then might search for all phrases up to max_phrase_length, favoring the longest matches. Again, with this behavior, I would expect 3 to offer good results.
I would expect dump and load to require consistent values of max_phrase_length. Or do you anticipate accommodating changes somehow?
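The indexing behavior described above can be sketched as follows. This is only an illustration, assuming whitespace tokenization; the function name phrases and its parameters are hypothetical and not taken from the Fist source.

```python
def phrases(text, max_phrase_length=3):
    """Yield every run of 1..max_phrase_length consecutive words in text."""
    words = text.split()
    for start in range(len(words)):
        # Stop at the end of the document or at the phrase length limit.
        last = min(start + max_phrase_length, len(words))
        for end in range(start + 1, last + 1):
            yield " ".join(words[start:end])


print(list(phrases("the quick brown fox", max_phrase_length=2)))
```

A document of n words yields at most n × max_phrase_length phrases, so the index grows roughly linearly with the limit, which is one argument for a smaller value such as 3.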
Allow the user to specify stop words that should not be indexed, such as "the" and "is".
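A minimal sketch of what stop-word filtering could look like before text is indexed. The STOP_WORDS set and the remove_stop_words name are hypothetical; how Fist would actually accept the list (config file, command, or otherwise) is left open.

```python
STOP_WORDS = {"the", "is", "and"}  # hypothetical user-supplied list


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Return text with stop words removed, comparing case-insensitively."""
    return " ".join(w for w in text.split() if w.lower() not in stop_words)


print(remove_stop_words("The bee is flying"))
```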
Hi,
This might be a stupid way to ask a question on GitHub, but I want to ask anyway: why did you write it in C and not in C++?
Also I would like to contribute to it. How can I?
Is the project continuing?
The signature of calloc looks like this:
void* calloc (size_t num, size_t size);
In the current hashmap allocation, the num of elements depends on the size of the structure, which doesn't seem correct.
Line 18 in b0dbed6
Indexing algorithm can (should) be able to be sped up by using a divide and conquer algorithm.
@00-matt pointed out that uint32_t should be used instead of int when storing integers in the fist.db file. The C standard does not fix the size of int, so on some systems an int could be 2 bytes instead of the currently assumed 4 bytes, which would corrupt the db file.
So this needs to be fixed.
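To illustrate the idea of a fixed-width on-disk integer, here is a small sketch using Python's struct module. The "<I" format is always a 4-byte unsigned integer regardless of platform, which is the role uint32_t (from stdint.h) plays in C. The little-endian byte order and the helper names are assumptions; the actual fist.db layout is not shown in this issue.

```python
import struct


def pack_u32(value):
    # "<I" = little-endian unsigned 32-bit integer, always exactly 4 bytes.
    return struct.pack("<I", value)


def unpack_u32(buf):
    # Inverse of pack_u32; expects exactly 4 bytes.
    return struct.unpack("<I", buf)[0]


print(len(pack_u32(5575)), unpack_u32(pack_u32(5575)))
```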
There is currently no way to delete an entry from the database. A delete command needs to be implemented.
This is a bug that is also in release version 0.0.1 and might have something to do with PR #15.
The script below pulls the Bee Movie transcript from Pastebin and attempts to index it. When run, the Fist server stops suddenly with no error message.
I used this code when I was originally testing indexing large files, and it worked fine (other than issue #18).
import socket
import string

import requests

# Fetch the transcript and normalize it: lowercase, strip punctuation,
# and collapse all whitespace (including newlines) to single spaces.
data = requests.get("https://pastebin.com/raw/Gb02THWc").content.decode().lower()
for p in string.punctuation:
    data = data.replace(p, '')
data = ' '.join(data.split())
print(data)

# Send the whole transcript to the Fist server as a single INDEX command.
s = socket.socket()
s.connect(("localhost", 5575))
s.sendall("INDEX beemovie {}\r\n".format(data).encode())
print(s.recv(1024))
s.sendall(b"SEARCH bee movie\r\n")
print(s.recv(1024))
s.sendall(b"EXIT\r\n")
Would you like to add more error handling for return values from functions like the following?
When indexing large blocks of text (e.g. a movie script), not all phrases get indexed properly. Some phrases/words get indexed while others, seemingly at random, do not.
As Matt said in #50, we need stop words, a limit on output, or a ranking option (which needs INDEX support).
This issue was discussed in more detail here:
Currently the index is stored as is in the fist.db file. As the index grows it will require a lot of space. Some kind of compression will need to be implemented to allow the index to get sufficiently large without consuming a huge amount of resources.
EDIT: As an example of how important this is, I attempted to index all of the Joe Rogan podcast transcripts found here (https://github.com/achendrick/jrescribe-transcripts). The index grew to over 1.5 GB. When compressed manually with gzip, the file size came down to 300 MB.
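A rough sketch of the gzip approach mentioned in the edit, using Python's standard gzip module. The sample bytes are made up and do not reflect the real fist.db format; the function names are hypothetical.

```python
import gzip


def compress_index(raw: bytes) -> bytes:
    """Compress serialized index bytes before writing them to disk."""
    return gzip.compress(raw)


def decompress_index(blob: bytes) -> bytes:
    """Restore the original index bytes when loading."""
    return gzip.decompress(blob)


# Made-up, highly repetitive stand-in for index data.
sample = b"bee movie\x00beemovie\x00" * 1000
packed = compress_index(sample)
print(len(sample), "->", len(packed))
```

Whole-file compression like this trades load and dump time for disk space, so it would not by itself help with memory use while the server is running.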
Settings such as max_phrase_length, host, port, and buffer_size should be configurable via a config file.
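One possible shape for such a config file, sketched with Python's configparser. The INI format, the [fist] section name, and the buffer_size default of 1024 are assumptions for illustration; the host, port, and max_phrase_length defaults are the values mentioned elsewhere in these issues.

```python
import configparser

# Defaults used when the file omits a setting (buffer_size is an assumed value).
DEFAULTS = {
    "host": "localhost",
    "port": "5575",
    "max_phrase_length": "10",
    "buffer_size": "1024",
}


def load_config(text):
    """Parse INI-style config text, falling back to DEFAULTS per key."""
    parser = configparser.ConfigParser()
    parser.read_dict({"fist": DEFAULTS})
    parser.read_string(text)
    section = parser["fist"]
    return {
        "host": section["host"],
        "port": section.getint("port"),
        "max_phrase_length": section.getint("max_phrase_length"),
        "buffer_size": section.getint("buffer_size"),
    }


print(load_config("[fist]\nport = 6000\nmax_phrase_length = 3\n"))
```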
Creating this issue to remind myself that it is an issue and also to create some conversation around this.
The current state of the server only allows for a single client to be connected at a time. So this means that if there are multiple applications that need to access the same Fist instance, one client will have to wait for the other to disconnect before it can start sending commands.
This was just easier to do at first but it will need to be addressed because it's not acceptable.
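For discussion, here is a sketch of the thread-per-client model using Python's socketserver. It is not how Fist is implemented (Fist is written in C, where threads or select/poll would be the equivalents), and the "OK" reply is a placeholder rather than the Fist protocol.

```python
import socketserver


class Handler(socketserver.StreamRequestHandler):
    def handle(self):
        # Runs in its own thread per client; read CRLF-terminated commands
        # until the client sends EXIT or disconnects.
        for line in self.rfile:
            command = line.decode(errors="replace").strip()
            if command == "EXIT":
                break
            # Placeholder reply; a real server would dispatch the command.
            self.wfile.write("OK {}\r\n".format(command).encode())


class ThreadedServer(socketserver.ThreadingMixIn, socketserver.TCPServer):
    allow_reuse_address = True
    daemon_threads = True


def make_server(host="localhost", port=0):
    """Create the server; port 0 lets the OS pick a free port."""
    return ThreadedServer((host, port), Handler)
```

Calling make_server("localhost", 5575).serve_forever() would accept several clients at once. Any shared index would then need a lock or similar protection around writes.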
fist λ ./bin/fist
Database file has been loaded. Previous state restored.
Fist started at localhost:5575
14 'INDEX a 😀'
INDEX
TEXT: '😀'
INDEX SIZE: 1
Segmentation fault
fist λ telnet localhost 5575
Trying ::1...
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
INDEX a 😀
Connection closed by foreign host.
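One observation on the log above: the leading 14 matches the byte length of the command including the trailing CRLF, not its character count, because 😀 takes four bytes in UTF-8. Whether that mismatch is related to the crash is an assumption; the snippet below only shows the difference.

```python
command = "INDEX a 😀\r\n"

chars = len(command)                    # number of Unicode code points
wire_bytes = len(command.encode("utf-8"))  # number of bytes sent over the socket

print(chars, wire_bytes)
```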
With the addition of a configuration file and a few new commands it's time to properly document everything.
Not sure if I would call this a bug, but when the index gets very large and the server is closed, the sdump function takes a very long time, causing the server to look like it's hanging.
A quick improvement would be to print a message stating that Fist is attempting to exit cleanly before sdump is run, but there is a deeper problem to be solved here, so that's why I am opening this issue.
Hi,
It would be great to get a version number going so that we could track changes that break compatibility with the protocol or database file.
While attempting to index a lot of documents, I also tried to send a few SEARCH commands to see how things would behave. While the indexing was still going on, SEARCH commands became very slow.
Currently we are using a very primitive hashing algorithm for the hashmap. A better algorithm should be used to reduce collisions.
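As one candidate, here is a sketch of 32-bit FNV-1a, a simple and widely used string hash. It is offered as an example only; the issue does not say which algorithm Fist uses or should adopt, and bucket_for is a hypothetical helper.

```python
FNV_OFFSET_BASIS = 0x811C9DC5  # 32-bit FNV offset basis
FNV_PRIME = 0x01000193         # 32-bit FNV prime


def fnv1a_32(data: bytes) -> int:
    """Return the 32-bit FNV-1a hash of data."""
    h = FNV_OFFSET_BASIS
    for byte in data:
        h ^= byte                         # mix in the next byte first (the "1a" order)
        h = (h * FNV_PRIME) & 0xFFFFFFFF  # multiply, keeping only 32 bits
    return h


def bucket_for(key: str, num_buckets: int) -> int:
    """Map a key to a hashmap bucket index."""
    return fnv1a_32(key.encode("utf-8")) % num_buckets


print(hex(fnv1a_32(b"a")), bucket_for("bee movie", 64))
```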
Need to write tests for the binary search tree implementation in bst.c and bst.h.