jere283 / zincsearch-indexer-websearchtool

The Indexer crawls the Enron email dataset folders and indexes each file in the ZincSearch database. It also has a user interface built with Vue that lets you search the indexed files by keyword.

ZincSearch Indexer and Data Visualizer

Project Overview

The project combines an Indexer with an API that also hosts a Vue application, which serves as the user interface for searching the indexed files. The Indexer crawls a specified directory structure, extracts information from the email files, and indexes them through the ZincSearch API. The API, built with Go and Chi, lets users search for emails by keyword.

Technologies used in this project

Project Structure

  • ZincSearch-Indexer-WebSearchTool
    • api
      • main.go ## API main source code: search endpoint and static serving of the dist folder (the build of the Vue app)
      • dist ## minified build of the Vue app
      • go.mod
      • go.sum
      • api.exe ## API executable file, served on port 3000
    • frontend ## Vue 3 source code
    • profiling
      • proftests ## folder with the profiling tests of indexer V1 and V2
      • go.mod
      • profiling.go ## Go package with functions to control the profiling profiles
    • zincsearch
      • go.mod
      • zinc.go ## Go package with functions to interact with the ZincSearch API (createDocument, BulkCreateDocuments, Search)
    • go.mod
    • go.work
    • go.work.sum
    • improvementsV2.md ## Document with information about the improvements in v1 and v2
    • Indexer.exe ## Indexer executable file
    • main.go ## Indexer main source code
    • README.md

Configuration

The Config struct holds the configuration details, including the ZincSearch base URL, index name, username, and password.
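
As a rough illustration of what such a struct could look like, here is a minimal sketch. The field names, the Validate helper, the index name, and the credentials are all assumptions made for this example and may not match the repository's code; the URL assumes ZincSearch is running locally on its default port.

```go
package main

import (
	"errors"
	"fmt"
)

// Config mirrors the four settings described above. The field names are
// assumptions; the repository's struct may name them differently.
type Config struct {
	BaseURL   string // ZincSearch base URL, e.g. "http://localhost:4080"
	IndexName string // index that receives the emails
	Username  string
	Password  string
}

// Validate reports the first missing setting, if any.
func (c Config) Validate() error {
	switch {
	case c.BaseURL == "":
		return errors.New("config: BaseURL is empty")
	case c.IndexName == "":
		return errors.New("config: IndexName is empty")
	case c.Username == "" || c.Password == "":
		return errors.New("config: credentials are incomplete")
	}
	return nil
}

func main() {
	cfg := Config{
		BaseURL:   "http://localhost:4080",
		IndexName: "enron_emails", // hypothetical index name
		Username:  "admin",        // placeholder credentials
		Password:  "change-me",
	}
	if err := cfg.Validate(); err != nil {
		fmt.Println("invalid config:", err)
		return
	}
	fmt.Println("indexing into", cfg.IndexName)
}
```

Checking the configuration before the crawl starts avoids discovering a missing index name only after the first request fails.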

Installation

  1. Clone the git repository
  git clone https://github.com/Jere283/ZincSearch-Indexer-WebSearchTool
  2. Download ZincSearch and follow the ZincSearch Quick Start
  3. Download the Enron email dataset (you will need the path of this folder later)
  curl -L http://www.cs.cmu.edu/~enron/enron_mail_20110402.tgz -o enron_mail_20110402.tgz && tar -xf enron_mail_20110402.tgz
  4. Copy the path of the maildir folder inside enron_mail_20110402 and set it as the value of the path variable on line 186 of https://github.com/Jere283/ZincSearch-Indexer-WebSearchTool/blob/main/main.go.
var path string = "your_path" //here

With these steps you will be able to index the files inside of the maildir folder.

IMPORTANT NOTE: to serve the Vue.js app, copy the path of the dist folder and paste it into the fs variable on line 49 of https://github.com/Jere283/ZincSearch-Indexer-WebSearchTool/blob/main/api/main.go. This will be fixed in future versions.

  fs := http.FileServer(http.Dir("dist folder directory")) //Here

Indexer (Go Application)

The Indexer is a Go application that performs the following tasks:

  1. Folder Listing: Recursively lists all files and subfolders in the specified directory.
  2. Email Parsing: Reads email files, extracts relevant information (headers and body), and structures the data into a JSON format.
  3. Bulk Indexing: Utilizes the ZincSearch API to bulk index the parsed email data.
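
The first two tasks above can be sketched with the standard library alone. This is an illustration, not the repository's implementation: the Email fields and the listFiles and parseEmail names are assumptions, and it relies on the messages being plain RFC 822-style text that net/mail can read.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"io/fs"
	"net/mail"
	"path/filepath"
	"strings"
)

// Email is a hypothetical document shape; the real field set may differ.
type Email struct {
	From    string `json:"from"`
	To      string `json:"to"`
	Subject string `json:"subject"`
	Date    string `json:"date"`
	Body    string `json:"body"`
}

// listFiles returns every regular file under root, recursing into subfolders.
func listFiles(root string) ([]string, error) {
	var files []string
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.Type().IsRegular() {
			files = append(files, path)
		}
		return nil
	})
	return files, err
}

// parseEmail splits a raw message into selected headers and the body.
func parseEmail(raw string) (Email, error) {
	msg, err := mail.ReadMessage(strings.NewReader(raw))
	if err != nil {
		return Email{}, err
	}
	body, err := io.ReadAll(msg.Body)
	if err != nil {
		return Email{}, err
	}
	return Email{
		From:    msg.Header.Get("From"),
		To:      msg.Header.Get("To"),
		Subject: msg.Header.Get("Subject"),
		Date:    msg.Header.Get("Date"),
		Body:    strings.TrimSpace(string(body)),
	}, nil
}

func main() {
	files, err := listFiles(".")
	if err != nil {
		fmt.Println("walk failed:", err)
		return
	}
	fmt.Println("files found:", len(files))

	raw := "From: a@enron.com\nTo: b@enron.com\nSubject: Hello\n\nSee you at noon.\n"
	email, err := parseEmail(raw)
	if err != nil {
		fmt.Println("parse failed:", err)
		return
	}
	out, _ := json.Marshal(email)
	fmt.Println(string(out))
}
```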

Usage

  • The main function configures the ZincSearch connection, runs CPU and memory profiling, and processes the email files in the specified directory. Before running it, set the index name you want to use in the Config struct on line 172.
  • The ConvertEmailFileToJson function parses individual email files.
  • The ProcessFiles function handles the parallel processing of files and subfolders.
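
One common way to process files in parallel in Go is a fixed-size worker pool, sketched below. The README does not describe how ProcessFiles is structured internally, so this is only one possible shape; the processFiles signature and the handle callback are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// processFiles fans paths out to a fixed number of workers and collects one
// result per path. handle stands in for per-file parsing. Result order is
// not guaranteed.
func processFiles(paths []string, workers int, handle func(string) string) []string {
	if workers < 1 {
		workers = 1
	}
	jobs := make(chan string)
	results := make(chan string)

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for p := range jobs {
				results <- handle(p)
			}
		}()
	}

	// Feed the workers, then signal that no more jobs are coming.
	go func() {
		for _, p := range paths {
			jobs <- p
		}
		close(jobs)
	}()

	// Close results once every worker has finished.
	go func() {
		wg.Wait()
		close(results)
	}()

	var out []string
	for r := range results {
		out = append(out, r)
	}
	return out
}

func main() {
	paths := []string{"maildir/a/1", "maildir/a/2", "maildir/b/1"}
	parsed := processFiles(paths, 2, func(p string) string {
		return "parsed " + p
	})
	fmt.Println(len(parsed), "files processed")
}
```

Bounding the number of workers keeps the count of simultaneously open files under control, which matters for a dataset with a very large number of small files.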

Dependencies

  • ZincSearch: A lightweight full-text search engine, positioned as an alternative to Elasticsearch.

API (Go application)

The API is a Go application built with the Chi router. It provides basic CORS support and exposes endpoints for retrieving information from the ZincSearch index.

Endpoints

  • /api/v1: Welcome message.
  • /api/v1/search/{word}: Search endpoint that returns the emails matching a keyword as JSON.
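
To show the shape of these two routes, the sketch below uses the standard library's ServeMux as a stand-in for Chi, so it runs without third-party modules. The welcome text, the echoed JSON, the permissive CORS header, and the helper names are assumptions; the real handler would forward the word to ZincSearch.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

const searchPrefix = "/api/v1/search/"

// wordFromPath extracts {word} from /api/v1/search/{word}.
func wordFromPath(path string) string {
	return strings.TrimPrefix(path, searchPrefix)
}

// withCORS adds a permissive CORS header and answers preflight requests.
func withCORS(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Access-Control-Allow-Origin", "*")
		if r.Method == http.MethodOptions {
			w.WriteHeader(http.StatusNoContent)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// newRouter wires the two endpoints listed above.
func newRouter() http.Handler {
	mux := http.NewServeMux()
	mux.HandleFunc("/api/v1", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "Welcome to the search API") // hypothetical message
	})
	mux.HandleFunc(searchPrefix, func(w http.ResponseWriter, r *http.Request) {
		word := wordFromPath(r.URL.Path)
		if word == "" {
			http.Error(w, "missing search word", http.StatusBadRequest)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		// A real handler would query ZincSearch here; this echoes the word.
		json.NewEncoder(w).Encode(map[string]string{"word": word})
	})
	return withCORS(mux)
}

// probe sends one in-memory GET request and returns status and body.
func probe(h http.Handler, path string) (int, string) {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code, rec.Body.String()
}

func main() {
	code, body := probe(newRouter(), "/api/v1/search/meeting")
	fmt.Println(code, body)
	// To listen on the port the README mentions:
	// http.ListenAndServe(":3000", newRouter())
}
```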

Serving a Vue.js App

The API also serves a Vue.js dist folder to allow users to interact with the indexed data visually.
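
Static serving of a build folder can be sketched with http.FileServer, the same standard-library call the installation note refers to. The example uses an in-memory file system in place of the real dist folder so that it runs anywhere; staticHandler, demoDist, and fetch are hypothetical helper names.

```go
package main

import (
	"fmt"
	"io/fs"
	"net/http"
	"net/http/httptest"
	"testing/fstest"
)

// staticHandler serves files from fsys; a request for "/" resolves to
// index.html, which is how the built Vue app gets loaded.
func staticHandler(fsys fs.FS) http.Handler {
	return http.FileServer(http.FS(fsys))
}

// demoDist is an in-memory stand-in for the dist folder. On disk this would
// be os.DirFS("path/to/dist"), or http.Dir passed straight to FileServer.
func demoDist() fs.FS {
	return fstest.MapFS{
		"index.html": {Data: []byte(`<div id="app"></div>`)},
	}
}

// fetch runs one in-memory GET request and returns status and body.
func fetch(h http.Handler, path string) (int, string) {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code, rec.Body.String()
}

func main() {
	code, body := fetch(staticHandler(demoDist()), "/")
	fmt.Println(code, body)
}
```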

Dependencies

  • Chi: A lightweight, idiomatic HTTP router for Go.

ZincSearch Functions (Go package)

The Zinc package contains functions for interacting with the ZincSearch API. It includes functions for creating documents, bulk indexing, and searching.

Functions

  • CreateDocument: Creates a single document in the ZincSearch index.
  • BulkCreateDocument: Performs bulk indexing of multiple documents.
  • SearchDocument: Searches for documents in the ZincSearch index based on a specified word.
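
The sketch below shows how bulk and search requests to ZincSearch might be prepared. The endpoint paths (/api/_bulkv2 and /api/{index}/_search) and the payload fields reflect my reading of the ZincSearch documentation and should be checked against the version you run; the Config type and the function names are assumptions, not the package's actual signatures. No request is sent, so the example runs without a server.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// Config is a hypothetical stand-in for the project's configuration struct.
type Config struct {
	BaseURL, IndexName, Username, Password string
}

// bulkBody builds the payload for the bulk endpoint: an index name plus the
// records to insert.
func bulkBody(index string, records []map[string]any) ([]byte, error) {
	return json.Marshal(struct {
		Index   string           `json:"index"`
		Records []map[string]any `json:"records"`
	}{index, records})
}

// searchBody builds a match query for a single word.
func searchBody(word string, maxResults int) ([]byte, error) {
	return json.Marshal(map[string]any{
		"search_type": "match",
		"query":       map[string]any{"term": word},
		"from":        0,
		"max_results": maxResults,
	})
}

// newRequest prepares (but does not send) an authenticated JSON POST.
func newRequest(cfg Config, path string, body []byte) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodPost, cfg.BaseURL+path, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.SetBasicAuth(cfg.Username, cfg.Password)
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	cfg := Config{"http://localhost:4080", "enron_emails", "admin", "change-me"}

	bulk, _ := bulkBody(cfg.IndexName, []map[string]any{{"subject": "hi"}})
	req, err := newRequest(cfg, "/api/_bulkv2", bulk)
	if err != nil {
		fmt.Println("bulk request:", err)
		return
	}
	fmt.Println(req.Method, req.URL.String())

	search, _ := searchBody("meeting", 20)
	req, err = newRequest(cfg, "/api/"+cfg.IndexName+"/_search", search)
	if err != nil {
		fmt.Println("search request:", err)
		return
	}
	fmt.Println(req.Method, req.URL.String())
	// Sending is then http.DefaultClient.Do(req) followed by decoding the response.
}
```

Sending many records per request is the main reason to prefer the bulk call over creating documents one at a time when indexing a large dataset.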

Additional Notes

  • CPU and memory profiling have been implemented in the Indexer for performance analysis.
  • The project assumes a specific directory structure for email files.
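
For readers unfamiliar with Go profiling, the sketch below records a CPU profile around a piece of work and then takes a heap snapshot, using runtime/pprof. It writes to in-memory buffers so that it runs anywhere; the profile and profileSizes names are hypothetical and do not describe the project's profiling package.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"runtime"
	"runtime/pprof"
)

// profile records a CPU profile to cpu while work runs, then writes a heap
// profile to heap.
func profile(cpu, heap io.Writer, work func()) error {
	if err := pprof.StartCPUProfile(cpu); err != nil {
		return err
	}
	work()
	pprof.StopCPUProfile()

	runtime.GC() // refresh allocation statistics before the heap snapshot
	return pprof.WriteHeapProfile(heap)
}

// profileSizes runs work under profile and reports how many bytes each
// profile produced. With files, pass *os.File values to profile instead.
func profileSizes(work func()) (cpuBytes, heapBytes int, err error) {
	var cpu, heap bytes.Buffer
	err = profile(&cpu, &heap, work)
	return cpu.Len(), heap.Len(), err
}

func main() {
	cpu, heap, err := profileSizes(func() {
		total := 0
		for i := 0; i < 50_000_000; i++ {
			total += i % 7
		}
		_ = total
	})
	if err != nil {
		fmt.Println("profiling failed:", err)
		return
	}
	fmt.Printf("cpu profile: %d bytes, heap profile: %d bytes\n", cpu, heap)
}
```

Profiles written to files this way can be inspected with go tool pprof.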

Feel free to customize the code based on your specific requirements and directory structure. If you encounter any issues or have suggestions for improvement, please open an issue in the project repository.
