
intranetsearch's Introduction

IntranetSearch - IIT Guwahati Intranet Search Engine

Overview

Welcome to the IntranetSearch repository! This project aims to create a powerful search engine specifically tailored for the Intranet pages of the Indian Institute of Technology (IIT) Guwahati. IntranetSearch will help users efficiently search and retrieve information from the institute's internal web resources, enhancing productivity and ease of access to important information.

Getting Started

Before using IntranetSearch, you'll need to set up Elasticsearch; the backend is implemented in Node.js. The repository also contains a python folder with a requirements.txt for the Python API endpoints, along with a Dockerfile that can be used to host that part locally.

Prerequisites

Make sure you have the following prerequisites installed:

  • Git (to clone the repository)
  • Node.js and npm (for the backend)
  • Docker (for the Python API container)
  • A running Elasticsearch instance

Follow these steps:

  1. Clone the IntranetSearch repository to your local machine:

     git clone https://github.com/swciitg/IntranetSearch.git

  2. Make sure your Elasticsearch server is up and running. Note down its localhost endpoint, your username, and your password for step 4.

  3. Navigate to the project's backend folder:

     cd IntranetSearch/backend

  4. Create a .env file inside the backend folder, with the fields listed in .env.sample.

  5. Install the Node.js dependencies:

     npm install

  6. Start the backend server:

     npm start

  7. Go to the python folder, which contains the Python API endpoints for this project:

     cd IntranetSearch/python

  8. Open a new terminal window in this folder and build the Docker image:

     docker build -t intranetsearch .

  9. Run the Docker container locally:

     docker run -p 8080:80 -v "$(pwd)/../data:/app/data" intranetsearch

Remember to handle Elasticsearch security settings and access control as needed to protect sensitive Intranet content.

Usage

Contribution guidelines

Visit CONTRIBUTING.md for more insights on contributing to this repo.

Happy Hacking!

Thank you for participating in IntranetSearch's Hacktoberfest. We appreciate your contributions, and together we can make IntranetSearch even better for college sites everywhere. If you have any questions or need assistance, feel free to reach out to us via GitHub issues or our community chat.

Happy coding! 🚀🎉

intranetsearch's People

Contributors

dhaneshragu, botketan, vivekreddy049, shifat-ali, sofiyana-1811, atrichatterjee1, dependabot[bot]


intranetsearch's Issues

npm bugs and code refactoring

  • The current codebase, after adding the web crawler, is missing some Node dependencies. Resolve this so that npm i alone is enough to get it working.
  • Implement proper error handling for the web-crawler part.
  • Fully follow the response-code convention used by the other APIs for this controller.
  • Rename the controller file so that it has Controller as a suffix, to maintain uniformity with the other controller files.
  • Update csvController.js accordingly after adding the heading field. (Optionally, refactor the code so that changes to contentModel don't require changes to this controller.)
  • Migrate the route into the scrape.js file instead of a new file (i.e., scrape/web-crawler can be the endpoint).
  • Return the number of links scraped successfully, along with the failed links and any other statistics that might be useful, as part of a successful response.

MongoDB + Routes

Issue: Setting up MongoDB and Creating Content Model

Task Description

In order to enhance the functionality of the IntranetSearch project, we need to perform the following tasks:

  1. Setup MongoDB:

    • Configure MongoDB to store scraped content data.
    • Store the MongoDB database URL and other necessary configurations in a .env file inside the backend folder.
  2. Complete connectDB.js:

    • Write the code for MongoDB connection configuration in a connectDB.js file inside the configs folder in the backend.
  3. Create Content Model:

    • Create a MongoDB model for storing scraped content.
    • This model should have the following fields:
      • content: To store the textual content of the scraped page.
      • url: To store the URL of the scraped page.
      • embeddings: To store a vector of 384 dimensions, which will be used for various content analysis tasks.
        (For now, store some dummy data in these fields for testing purposes; the embeddings field can be made optional. A minimal sketch of such a model is given after this list.)
  4. Name the Content Model File:

    • Name the content model file as contentModel.js and place it inside the models folder (to be created) in the backend.
  5. Create a CSV Export Controller:

    • Create a controller inside the controllers/web-crawler directory as csvSaveController.js
    • This controller should be responsible for converting the documents stored in the content model to a CSV file.
    • The CSV file should have an appropriate header, as given in contentModel.
    • The name of the CSV file should be provided in req.body.
    • The fields to include in the CSV file should also come from req.body; this is basically an enum of [content, url, embeddings]
      (i.e., if the user specifies only content in req.body, then only the content column should appear in the CSV). A rough sketch of such a controller is given after the Note below.
    • The generated CSV file has to be saved inside the data folder, which has already been created.
  6. Update Routes:

    • Add appropriate routes and endpoints for the new functionalities in the routes folder.
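
For reference, here is a minimal sketch of the content model described in task 3. The field names come from this issue; the exact Mongoose schema options are assumptions, not a prescribed implementation.

  // models/contentModel.js (illustrative sketch only)
  const mongoose = require("mongoose");

  const contentSchema = new mongoose.Schema({
    content: { type: String, required: true },       // textual content of the scraped page
    url: { type: String, required: true },           // URL of the scraped page
    embeddings: { type: [Number], required: false }, // 384-dimensional vector; optional for now
  });

  module.exports = mongoose.model("Content", contentSchema);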

Expected Outcome

Upon completion of these tasks, we will have MongoDB configured to store scraped content data, a content model for structured data storage, and the ability to convert content to CSV format for analysis. This will enhance our project's capabilities significantly.

Note: Please make sure to create separate commits for each task, and include relevant documentation and comments in your code. The necessary folders and files have already been created. Also provide a Postman collection for testing these endpoints.
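
A rough sketch of what csvSaveController.js could look like, assuming the json2csv package is used to build the CSV and the content model sketched above; the package choice, paths, and variable names are assumptions rather than a prescribed implementation.

  // controllers/web-crawler/csvSaveController.js (illustrative sketch only)
  const fs = require("fs");
  const path = require("path");
  const { Parser } = require("json2csv");                // assumed CSV library
  const Content = require("../../models/contentModel");  // assumed model path

  const csvSaveController = async (req, res) => {
    try {
      const { fileName, fields } = req.body;             // e.g. fields = ["content", "url"]
      const docs = await Content.find({}, fields.join(" ")).lean();
      const csv = new Parser({ fields }).parse(docs);
      const outPath = path.join(__dirname, "../../../data", `${fileName}.csv`); // existing data folder; adjust as needed
      fs.writeFileSync(outPath, csv);
      return res.status(200).json({ message: "CSV saved", file: outPath });
    } catch (err) {
      return res.status(500).json({ error: err.message });
    }
  };

  module.exports = csvSaveController;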

Dockerize the backend part

  • Write the Dockerfile and docker-compose.yml for the backend part.
  • Write a docker-compose.yml for the Python endpoint as well, inside the python folder.
  • The Dockerfile might need to change later to include the frontend part too.
  • Dockerize the Elasticsearch part as well, and make sure it is secure (uses a username and password).
    Note: The codebase should be able to run in the Docker container. Test this locally.

Create UI

Create a dashing UI for the backend routes that have already been created inside the /backend folder.

Improve package size in docker container

The current Docker image built from the python folder takes up a lot of space because of the sentence-transformers library.
Reduce this footprint if possible by making changes to the Dockerfile.

Create Web Crawler

  • Create a working web crawler inside the controllers/web-crawler directory (a minimal sketch is given after this list).
  • Store the scraped data in the content and url fields of the contentModel (created in Issue #6).
  • Create a route for it (/web-crawler) in the routes folder. Name the file crawler.js.
  • Use this route in app.js and open a PR after proper testing with Postman.
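
A minimal sketch of such a crawler, assuming axios for HTTP requests and cheerio for HTML parsing; the library choices, file name, and depth limit are assumptions, not a prescribed implementation.

  // controllers/web-crawler/webCrawlerController.js (illustrative sketch only)
  const axios = require("axios");
  const cheerio = require("cheerio");
  const Content = require("../../models/contentModel"); // assumed model path

  const visited = new Set(); // avoid crawling the same URL twice

  async function crawl(url, depth = 1) {
    if (depth < 0 || visited.has(url)) return;
    visited.add(url);

    const { data: html } = await axios.get(url);
    const $ = cheerio.load(html);

    // store the page text and its URL in the content model
    await Content.create({ content: $("body").text(), url });

    // follow links on the page, staying on the same host
    const links = $("a[href]")
      .map((_, a) => new URL($(a).attr("href"), url).href)
      .get()
      .filter((link) => new URL(link).host === new URL(url).host);

    for (const link of links) {
      await crawl(link, depth - 1);
    }
  }

  module.exports = crawl;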

Make Admin Panel

Use AdminBro or something similar to build an admin panel for managing scraped content and all other relevant functions properly.

Non Readability

Update README.md to show the setup process and requirements.
Try to add comments to the code.

Setup github actions

GitHub Actions Setup

Tasks

Set up GitHub Actions to:

  1. Automatically add the hacktoberfest-accepted label to merged pull requests for an issue.
  2. Automatically add the hacktoberfest and hacktoberfest2023 labels to newly created issues.
  3. Comment "assigned", tagging the user, as soon as someone is assigned to an issue.

Context

GitHub Actions can help streamline our development and contribution processes by automating certain repetitive tasks. These tasks, as mentioned above, will be beneficial for tracking contributions related to Hacktoberfest.

Clean the crawled content before saving it to mongodb

  • The crawled content, especially headings, has a lot of newlines and leading/trailing whitespace, which might introduce unnecessary padding vectors in the word embeddings. Clean the content before saving it to the database (a cleaning sketch is given after this list).
  • Additionally, scrape extra content such as certain tags if it would be useful for searching in the Elasticsearch database.
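
One possible cleaning step, applied before the crawler saves a document; this is only a sketch, and the regex and function name are assumptions.

  // collapse newlines, tabs and repeated spaces, then trim leading/trailing whitespace
  function cleanText(raw) {
    return raw.replace(/\s+/g, " ").trim();
  }

  // e.g. inside the crawler, before saving:
  // await Content.create({ content: cleanText($("body").text()), url });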

Design User Interface for IntranetSearch Engine

Issue: Design User Interface for IntranetSearch Engine

Task Description

To improve the user experience of the IntranetSearch Engine, we need to design a clean and intuitive user interface. The UI should include a search bar and display search results with links and brief content, similar to popular search engines like Google.

UI Design Specifications

Design a UI that includes the following elements:

  1. Search Bar:

    • A prominent search bar at the top of the page for users to input search queries.
  2. Search Results:

    • Display search results in a structured and easy-to-read format.
    • Each search result should include:
      • Title or link to the intranet page.
      • A brief content snippet or description to provide context.
  3. Pagination:

    • If there are multiple search results, include a pagination feature to navigate through pages of results.
  4. Clean and Minimalist Design:

    • Use a clean and minimalist design with a focus on readability and usability.
    • Ensure responsive design for different screen sizes.

Provide Figma Design

Please create a Figma design that visually represents the proposed UI. The Figma design should include wireframes or mockups of the search results page, highlighting the search bar, search results, and any additional elements you think would improve the user experience.

Note: Please provide a link to the Figma design once it's ready, and ensure that the design is user-centric and visually appealing.

Write detailed instructions on how to set up Elasticsearch locally

Issue: Setup Elasticsearch Locally in Windows and Test Connection

Task Description

Setting up an Elasticsearch server locally in Windows can sometimes be challenging for newcomers. To address this, we need to provide comprehensive instructions and a video tutorial demonstrating the setup process. Additionally, we will connect to Elasticsearch using a client (e.g., JavaScript) to ensure it's working correctly.

Detailed Instructions and Video Tutorial

  1. Setup Elasticsearch Locally in Windows:

    • Provide detailed step-by-step instructions on how to download and install Elasticsearch on a Windows machine.
    • Mention any common pitfalls and troubleshooting tips.
  2. Video Tutorial:

    • Create a video tutorial that complements the written instructions. Keep it short and crisp.
    • Demonstrate the entire setup process, from downloading the Elasticsearch package to running the server.
  3. Connect to Elasticsearch Using a Client:

    • Utilize an Elasticsearch client, such as JavaScript (e.g., the Elasticsearch.js library), to establish a connection to the local Elasticsearch server.
    • Create a simple script that performs a basic query or index operation to showcase the connection (a minimal sketch is given after the Note below).

Note: Please make sure the video is short and crisp, and include your documentation along with the video in an ELASTICSEARCHSETUP.md file in the root directory.
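
For item 3, here is a minimal connection-test sketch using the official @elastic/elasticsearch Node.js client; the endpoint, credentials, and index name are placeholders, and in practice they should come from your .env file.

  // test-connection.js (illustrative sketch only)
  const { Client } = require("@elastic/elasticsearch");

  const client = new Client({
    node: "https://localhost:9200",                         // local Elasticsearch endpoint
    auth: { username: "elastic", password: "<password>" },  // placeholder credentials
    tls: { rejectUnauthorized: false },                     // only for a local self-signed certificate
  });

  async function main() {
    const info = await client.info();                       // basic connectivity check
    console.log("Connected to Elasticsearch", info.version.number);

    // a simple index + search round trip on a throwaway index
    await client.index({ index: "test-index", document: { content: "hello intranet", url: "http://example.com" } });
    await client.indices.refresh({ index: "test-index" });
    const result = await client.search({ index: "test-index", query: { match: { content: "intranet" } } });
    console.log("Hits:", result.hits.hits.length);
  }

  main().catch(console.error);

The request and response shapes above assume v8 of the client; older (v7) clients wrap requests in a body field and return responses under body.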

Upload Data from CSV to Elasticsearch

Issue: Upload Data from CSV to Elasticsearch

Task Description

To enhance the functionality of the project, we need to create a route in the main.py file (located inside the python/app folder) for uploading data from a CSV file into Elasticsearch. We'll use the initialized Elasticsearch client for this purpose. Additionally, we should refer to the create-index route in the backend folder for the structure of the Elasticsearch index.

Detailed Task

  1. Modify the store route in main.py:
    • Inside the store route of main.py (in python/app), write the code for uploading the data from the created df into the Elasticsearch index, using the initialised Elasticsearch client.
    • Refer to the create-index route in the backend folder for the structure of the index.
  2. Refer to the create-index Structure:
    • Refer to the structure of the Elasticsearch index used in the create-index route located in the backend folder.
    • Index structure (a sketch of how this mapping is applied when creating the index is given at the end of this issue):

      embeddings: {
        type: "dense_vector",
        dims: req.body.dim ? req.body.dim : 384,
        index: true,
        similarity: "cosine",
      },
      content: { type: "text" },
      url: { type: "text" },

  3. Hint - Bulk API for Data Upload:
    • Use the Elasticsearch Bulk API to efficiently upload the data from the CSV file into the Elasticsearch index.

Write comments wherever possible and write clean, efficient code.
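
For context, here is a sketch of how the mapping shown above is typically applied when the index is created from the Node.js backend. The index name is illustrative and the call assumes the v8 @elastic/elasticsearch client; the Python store route only needs to write documents that match this structure.

  // inside an async handler, with an initialised @elastic/elasticsearch client
  await client.indices.create({
    index: "intranet",   // illustrative index name
    mappings: {
      properties: {
        embeddings: { type: "dense_vector", dims: 384, index: true, similarity: "cosine" },
        content: { type: "text" },
        url: { type: "text" },
      },
    },
  });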
