profile-exposer

Team name: Bicameral Minds

Team members:

  1. Avik Kuthiala (101803116)
  2. Naman Tuli (101983040)

KINDLY READ THE ENTIRE README FILE. IT CONTAINS SOME IMPORTANT INFERENCES.

Video Link:

https://youtu.be/96deQshdM6g

Kindly view the video in 720p or above.

Presentation File Here.

Installing Dependencies

Make sure to use venv for installing dependencies; all the required dependencies should be installed inside the virtual environment you create. To make and activate an env (on Windows), run:

python -m venv myenv
myenv\Scripts\activate.bat
Replace myenv with your environment name.
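
On Linux/macOS, the standard venv equivalent of the activation step (not shown in the original instructions) is:

source myenv/bin/activate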

Run the following in your terminal:

pip install -r requirements.txt

Note: If we missed any dependency, kindly pip install whichever library is reported as missing.

Setting Up

First navigate to /Bicameral-Minds, the repository you cloned. After installing the dependencies, navigate to the Code/NER Models/ directory, go into each folder, and extract all 4 zip files in place (e.g. via Extract Here). Setup complete.
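
If you prefer to script the extraction rather than doing it by hand, a minimal Python sketch along these lines should work (it assumes the archives sit inside subfolders of Code/NER Models/, as described above):

import glob
import os
import zipfile

# Extract every zip archive found under Code/NER Models/ next to where it
# lives, which has the same effect as "Extract Here".
for path in glob.glob(os.path.join("Code", "NER Models", "**", "*.zip"), recursive=True):
    with zipfile.ZipFile(path) as archive:
        archive.extractall(os.path.dirname(path))
        print("Extracted", path)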

Running the Crawler

Navigate to the Code/ directory. To run the crawler with default settings, just use:

scrapy crawl mygovscraper

The default running time for the crawler is 3 minutes. To run the crawler for a specific amount of time, use:

scrapy crawl mygovscraper -s CLOSESPIDER_TIMEOUT=<time in secs>

Example (To run for 1800 seconds):

scrapy crawl mygovscraper -s CLOSESPIDER_TIMEOUT=1800

Note: Do not put a space after CLOSESPIDER_TIMEOUT=, as that will cause an error. Other settings for configuring the crawler can be found in the official Scrapy documentation.
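
As the note above suggests, the crawler can also be tuned through standard Scrapy settings. As a sketch only (the exact location of settings.py inside Code/ is assumed), the same timeout could be set permanently in the project's settings file instead of passing -s on every run:

# In the Scrapy project's settings.py
CLOSESPIDER_TIMEOUT = 1800  # stop the spider after 1800 seconds
# Other standard options such as CLOSESPIDER_PAGECOUNT or DOWNLOAD_DELAY
# work the same way; see the Scrapy documentation for the full list.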
A database will be created:

database.csv

Post Processing

Navigate to directory Code/ and run in terminal:

python postprocess.py

A new database will be created:

clean_database.csv
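
postprocess.py implements the project's actual cleanup; purely as an illustration of what this step does with the data (not the author's implementation), a cleanup pass of this kind could look like:

import pandas as pd

# Illustrative only: read the raw crawl output, drop fully empty and
# duplicate rows, and write the cleaned file.
df = pd.read_csv("database.csv")
df = df.dropna(how="all").drop_duplicates()
df.to_csv("clean_database.csv", index=False)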

Important Notes and Inferences:

  1. The problem statement mentioned that 15 countries would be considered for the hackathon. We therefore used only 3 countries to train our NLP models (training on all 15 countries and then producing results on the same 15 would not make sense). However, the FAQ in the e-mail sent by MSC on 17-09-2020 (for the deadline extension) said, "Try to train with as many countries' govt websites' HTML structures as possible". Since we trained with data for 3 countries and scaled the solution to 15, we believe we could train for 15 countries and scale the solution to 70-80 countries; it was simply not feasible to carry out this upscaling in under 4 days.
  2. The crawler visits all the sites, but in no particular order. You may have to wait a while before meaningful websites start appearing in the logs.
  3. Once the crawler is running, you can see the sites being crawled in the Code/log.txt file. It has currently been emptied out.
  4. The sites to be crawled are listed in starter_sites.txt; currently, the file contains all 14 sites to be considered. For testing purposes, we strongly advise trying only 1 site, since crawling govt sites is a very computationally costly process.
  5. The sample database we provide was built from a crawl that ran for a total of 16 hours.
  6. Results on news article pages:

(Image: profile extracted from a news article)

The above image is an example of a news article webpage from which Prefix, Name and Position Held were correctly extracted. The scraper is not designed to extract profiles from long paragraphs, yet we noticed a few profiles being successfully extracted from news articles as well.

High Level Diagram of Solution:

HLD

Low Level Diagram of Scraper (universal):

LLD_scraper

Low Level Diagram of Crawler:

LLD_crawler

API

The API has been built using Node.js and Express.js, and it works by fetching data from a MongoDB database.
Ensure that you have a working MongoDB setup before testing the API with Postman.

Navigate to the db_api directory and install the required modules using npm:
npm install express node body-parser mongoose nodemon

First, start the database server using:
mongod

To import the CSV database into MongoDB, run the following command:
mongoimport --type csv -d record_db -c records --headerline --drop final_db.csv
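
To verify the import, one quick check is a small pymongo script (this assumes MongoDB is running on the default localhost:27017; the database and collection names come from the mongoimport command above):

import pymongo

# Count how many documents ended up in record_db.records after the import.
client = pymongo.MongoClient("mongodb://localhost:27017/")
count = client["record_db"]["records"].count_documents({})
print(count, "records imported")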

Then make sure you are in db_api/ and run the command:
npm run start

Now the setup is running and you can use Postman to test your API. Parameters need to be passed to the request Body using x-www-form-urlencoded.

The API will be hosted on http://localhost:3000/
To get all records, use http://localhost:3000/records

(Screenshot: response for /records)

To get the details of a single person, use
http://localhost:3000/findrecord and pass the parameter name as explained above and in the accompanying video.

(Screenshot: response for /findrecord)
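
If you would rather test from a script than from Postman, here is a minimal sketch using Python's requests library (it assumes the server is running locally on port 3000 and that /findrecord accepts a POST with a form-encoded name field, as the Postman instructions above suggest; the example name is hypothetical):

import requests

# Fetch all records.
all_records = requests.get("http://localhost:3000/records")
print(all_records.status_code, all_records.text[:200])

# Look up a single person. requests sends data= as
# application/x-www-form-urlencoded, matching the Body setup described above.
one_record = requests.post("http://localhost:3000/findrecord", data={"name": "John Doe"})
print(one_record.status_code, one_record.text[:200])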

