ZincSearch

This is a project I created for a job application. It is split into two executables and one small API. The backbone of the project is the ZincSearch search engine.

This project used technologies that were completely new to me in both the frontend and the backend.

The technologies used for the frontend were:

  • Vue
  • Tailwind (css)

And for the backend:

  • Go
  • ZincSearch

Enrin

Enrin (ENron INdexer) is a utility program to populate the ZincSearch backend with data from the Enron Email Database. It recursively traverses the Enron database and parses each of the mail files. Originally Enrin was single-threaded and took 30+ minutes to index the entirety of the Enron database. Now it is multithreaded and can index the whole database in around 5 minutes, but with a hefty memory requirement.
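
The traverse-and-parse step can be sketched roughly as follows. This is not Enrin's actual code: the Email type, its fields, and the parseEmail and walkMail helpers are hypothetical names, and it assumes the mail files are plain RFC 5322 style messages that Go's standard net/mail package can read.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"io/fs"
	"net/mail"
	"os"
	"path/filepath"
)

// Email is a hypothetical record type; the fields Enrin really indexes may differ.
type Email struct {
	MessageID string
	From      string
	To        string
	Subject   string
	Body      string
}

// parseEmail splits one raw mail file into headers and body.
func parseEmail(raw []byte) (Email, error) {
	msg, err := mail.ReadMessage(bytes.NewReader(raw))
	if err != nil {
		return Email{}, err
	}
	body, err := io.ReadAll(msg.Body)
	if err != nil {
		return Email{}, err
	}
	return Email{
		MessageID: msg.Header.Get("Message-ID"),
		From:      msg.Header.Get("From"),
		To:        msg.Header.Get("To"),
		Subject:   msg.Header.Get("Subject"),
		Body:      string(body),
	}, nil
}

// walkMail visits every regular file under root and passes each parsed
// message to handle. Files that do not parse as mail are skipped.
func walkMail(root string, handle func(path string, e Email)) error {
	return filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.IsDir() {
			return nil
		}
		data, err := os.ReadFile(path)
		if err != nil {
			return err
		}
		e, err := parseEmail(data)
		if err != nil {
			return nil // not a valid message, skip it
		}
		handle(path, e)
		return nil
	})
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: pass the root directory of the mail files")
		return
	}
	count := 0
	if err := walkMail(os.Args[1], func(string, Email) { count++ }); err != nil {
		fmt.Fprintln(os.Stderr, "walk failed:", err)
		return
	}
	fmt.Println("parsed", count, "messages")
}
```

Reading each file fully into memory keeps the parser simple, at the cost of holding one whole message per in-flight file.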

Envi

Envi (ENron VIsualizer) is a utility webserver to display the Enron mail data from ZincSearch.

It is responsive to multiple screen sizes, but IMO using it on a small screen would still not be a super enjoyable experience. The responsiveness does at least make it usable instead of outright unbearable.

API

This project exposes a simple API. The main way you interact with it is through an Authentication object created by the Authenticate function. You give it the address of the ZincSearch server, an index, a username, and a password. Using the Authentication object you can perform various actions, such as running search queries or ingesting new data.
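
To give a feel for that shape, here is a minimal sketch of what such an object could look like. The field names, the Search method, and its signature are guesses for illustration, not the project's real API. It also assumes ZincSearch accepts HTTP basic auth and a JSON "match" query on /api/<index>/_search, and that the server listens on port 4080.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/url"
	"strings"
)

// Authentication is a hypothetical stand-in for the project's type of the same name.
type Authentication struct {
	Address  string
	Index    string
	Username string
	Password string
}

// Authenticate bundles the connection details. It does not contact the server.
func Authenticate(address, index, username, password string) *Authentication {
	return &Authentication{
		Address:  strings.TrimRight(address, "/"),
		Index:    index,
		Username: username,
		Password: password,
	}
}

// searchURL builds the search endpoint (assumed path layout).
func (a *Authentication) searchURL() string {
	return a.Address + "/api/" + url.PathEscape(a.Index) + "/_search"
}

// Search sends a simple match query and returns the raw JSON response.
func (a *Authentication) Search(term string) ([]byte, error) {
	body, err := json.Marshal(map[string]any{
		"search_type": "match",
		"query":       map[string]string{"term": term},
	})
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, a.searchURL(), bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.SetBasicAuth(a.Username, a.Password)
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("search failed: %s", resp.Status)
	}
	return io.ReadAll(resp.Body)
}

func main() {
	// Placeholder credentials; nothing is sent over the network here.
	auth := Authenticate("http://localhost:4080/", "enron", "admin", "password")
	fmt.Println("would query:", auth.searchURL())
}
```

Keeping the credentials in one value means callers pass a single object around instead of four strings.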

Improvements

The project is serviceable as it stands, but it could still do with some improvements.

  • Pagination: Currently, whenever you search for a term from within Envi, it queries for all of the results at once. As a result, searching for a common term (e.g. "enron") can take a couple of seconds while the data is processed. With pagination, only one page of results would need to be fetched and processed at a time, which should fix this issue.

  • Deduplication: In the Enron database an email can appear multiple times. This is because if person A sends an email to person B, each of them will have one copy of the email in the Enron database. Currently Enrin does not account for this and will bloat the database with duplicates.

  • Multithreaded Indexing (since implemented, see Optimizations): The indexing of the Enron database was originally single-threaded, which made it take a long time to process the 600,000 emails in the Enron database. Thanks to Go's goroutines, it was straightforward to multithread the indexing and make the process much faster.

  • Better Parsing: The parsing that is currently implemented is very fast, but frankly it is not very accurate. The most glaring issue is that email bodies currently have erroneous newlines at the beginning.
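
Three of the items above (pagination, deduplication, and the leading newlines) could be approached along these lines. All names here are hypothetical and none of this is in the project yet. The request fields assume ZincSearch's search body accepts "from" and "max_results", and the duplicate check is only a heuristic that treats two messages as the same when sender, date, subject, and body all match.

```go
package main

import (
	"crypto/sha256"
	"encoding/json"
	"fmt"
	"strings"
)

// searchQuery and searchRequest mirror the assumed ZincSearch search body.
type searchQuery struct {
	Term string `json:"term"`
}

type searchRequest struct {
	SearchType string      `json:"search_type"`
	Query      searchQuery `json:"query"`
	From       int         `json:"from"`
	MaxResults int         `json:"max_results"`
}

// pageRequest builds a request for one zero-based page of results.
func pageRequest(term string, page, size int) searchRequest {
	if page < 0 {
		page = 0
	}
	if size < 1 {
		size = 1
	}
	return searchRequest{
		SearchType: "match",
		Query:      searchQuery{Term: term},
		From:       page * size,
		MaxResults: size,
	}
}

// Email is a hypothetical record type used only for this sketch.
type Email struct {
	From, Date, Subject, Body string
}

// fingerprint hashes the fields that identify a message in this heuristic.
func fingerprint(e Email) [32]byte {
	joined := strings.Join([]string{e.From, e.Date, e.Subject, e.Body}, "\x00")
	return sha256.Sum256([]byte(joined))
}

// dedupe keeps the first copy of each message and drops later ones.
func dedupe(emails []Email) []Email {
	seen := make(map[[32]byte]struct{}, len(emails))
	out := make([]Email, 0, len(emails))
	for _, e := range emails {
		key := fingerprint(e)
		if _, dup := seen[key]; dup {
			continue
		}
		seen[key] = struct{}{}
		out = append(out, e)
	}
	return out
}

// cleanBody strips the stray newlines at the start of a parsed body.
func cleanBody(body string) string {
	return strings.TrimLeft(body, "\r\n")
}

func main() {
	payload, err := json.Marshal(pageRequest("enron", 0, 20))
	if err != nil {
		fmt.Println("marshal failed:", err)
		return
	}
	fmt.Println(string(payload))
}
```

In a multithreaded indexer the seen map would need a mutex or a single owning goroutine, since Go maps are not safe for concurrent writes.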

Optimizations

The indexing of a single file is relatively fast, but most of the time (80%+) is spent waiting to be given the file handle. Because of this, there isn't much that can be done to speed up the process apart from multithreading the program. With multithreaded indexing (one goroutine per file, capped at 500 goroutines by default) the indexing time is greatly shortened: it goes from 30+ minutes (~333 files a second) to 5±1 minutes (~2000 files a second).
