We crawled over 12,000 research papers written by the UIUC Computer Science faculty and provide a search interface for the papers.
Note that, on first visit, our app will take a while to load because it is deployed on Heroku.
# install PhantomJS
wget https://bitbucket.org/ariya/phantomjs/downloads/phantomjs-2.1.1-linux-x86_64.tar.bz2
tar xvjf phantomjs-2.1.1-linux-x86_64.tar.bz2
# add phantomjs-2.1.1-linux-x86_64/bin to $PATH
# install scholarly
pip3 install scholarly --user
# install selenium
pip3 install selenium --user
# install PyPDF2
pip3 install pypdf2 --user
# install Flask
pip3 install flask --user
# inside website/ directory
FLASK_APP=app.py flask run
- Obtain a list of the names of all professors (see professors.txt)
- Use Google Scholar profiles, like Professor Zhai's, and the Python module scholarly to obtain the link for each paper
- For each link discovered, take the first 5000 characters, respecting word boundaries
- After each professor is crawled, save the JSON to a file so a crash cannot corrupt all of the data
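The save-after-each-professor step above can be sketched as follows. This is a minimal illustration, not the project's actual code: `crawl_professors` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for the real Google Scholar lookup.

```python
import json
import os
import tempfile

def crawl_professors(profs, save_path, fetch_papers):
    """Crawl each professor, rewriting the JSON file after every one
    so a crash never loses more than the current professor's papers."""
    results = {}
    for prof in profs:
        results[prof] = fetch_papers(prof)
        with open(save_path, "w") as f:
            json.dump(results, f)  # persist partial progress
    return results

# stand-in for the real Google Scholar lookup (illustrative only)
def fake_fetch(prof):
    return [{"title": "Paper by " + prof, "link": "http://example.com"}]

path = os.path.join(tempfile.gettempdir(), "papers_demo.json")
data = crawl_professors(["Chengxiang Zhai"], path, fake_fetch)
```

Writing the whole file after every professor is slightly wasteful, but it keeps the on-disk JSON valid at all times.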
The RPR module is our crawler for obtaining the data we needed to build our inverted index. Given a list of names to crawl, the RPR crawler uses Google Scholar profiles, like this one, to retrieve information on each paper on the profile. Specifically, the RPR crawler retrieves:
- The name of the research paper
- The link to the research paper
- Around 5000 characters, respecting word boundaries, of the page/PDF behind the link
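The word-boundary-respecting truncation could look like the sketch below; the function name is illustrative and the project's actual implementation may differ.

```python
def truncate_at_word_boundary(text, limit=5000):
    """Return at most `limit` characters of `text` without
    cutting the final word in half."""
    if len(text) <= limit:
        return text
    # find the last space at or before the limit
    cut = text.rfind(" ", 0, limit + 1)
    # no space found: fall back to a hard cut at the limit
    return text[:limit] if cut == -1 else text[:cut]
```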
The RPR module handles both links that point to web documents and links that point to PDF files. If the link points to an HTML file, the RPR module attempts to eliminate unhelpful HTML tags such as script and iframe. If the link points to a PDF file, it downloads the PDF into a temporary directory, scrapes around 5000 characters, and then deletes the file after saving the information it collected.
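The tag-elimination step for HTML pages can be sketched with only the standard library; the `TagStripper` class below is an illustrative stand-in, not the module's real parser.

```python
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Collect visible text, skipping content inside unhelpful tags."""
    SKIP = {"script", "style", "iframe"}

    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting depth inside skipped tags
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0:  # keep text only outside skipped tags
            self.chunks.append(data)

    def text(self):
        # collapse whitespace into single spaces
        return " ".join(" ".join(self.chunks).split())

stripper = TagStripper()
stripper.feed("<p>Deep learning</p><script>var x = 1;</script><p>for NLP</p>")
```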
The RPR module also supports incremental crawling via a history file. The history file is in the same format as the JSON file saved by the crawler. If the history file already contains sufficient information for a paper, the RPR crawler will not crawl it again. The history file is especially useful when the crawler runs into 503s, in which case it has to restart the crawl from a different IP.
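A minimal sketch of the history check, assuming papers are keyed by link and "sufficient information" means a minimum amount of scraped text; both assumptions, the function names, and the 100-character threshold are illustrative.

```python
import json
import os

def load_history(path):
    """Load a previous crawl's JSON, keyed by paper link; empty if absent."""
    if path is None or not os.path.exists(path):
        return {}
    with open(path) as f:
        return json.load(f)

def needs_crawl(paper, history, min_chars=100):
    """Crawl again only if the history lacks enough text for this paper.
    The 100-character threshold is illustrative, not the project's value."""
    old = history.get(paper["link"])
    return old is None or len(old.get("text", "")) < min_chars

history = {"http://a": {"text": "x" * 200}}  # pretend a prior crawl saved this
skip = not needs_crawl({"link": "http://a"}, history)  # enough text already
redo = needs_crawl({"link": "http://b"}, history)      # never seen this link
```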
from crawl.rpr import PaperRetriever
# TEST 1: retrieve papers for professors with no history file
save_as_file = "test1.json" # the file the data will be saved to
profs = ['Svetlana Lazebnik', 'Chengxiang Zhai'] # crawl the papers of these professors
paper_retriever = PaperRetriever(profs, save_as_file, None, num_threads=3) # delegate the work on 3 threads without history
paper_retriever.retrieve() # execute the crawl
To enable search, we decided to create an inverted index using metapy, a data science toolkit. The crawler returned data in JSON format. We needed to convert this to a .dat file that metapy could access. We created two .dat files: one with the professor data and a metadata file. Each line in the professor data file was a document that we crawled (we used a line corpus for the project). The metadata file contains information such as the title of the document and a link to the document.
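The JSON-to-.dat conversion for a metapy line corpus might look like the sketch below; the function name and the field names (`title`, `link`, `text`) are assumptions, not the project's exact schema.

```python
import json
import os
import tempfile

def json_to_dat(json_path, dat_path, meta_path):
    """Write one crawled document per line (a metapy line corpus),
    with titles and links in a parallel metadata file."""
    with open(json_path) as f:
        papers = json.load(f)
    with open(dat_path, "w") as dat, open(meta_path, "w") as meta:
        for paper in papers:
            # a newline inside a document would split it across corpus lines
            dat.write(paper["text"].replace("\n", " ") + "\n")
            meta.write(paper["title"] + "\t" + paper["link"] + "\n")

# tiny demo with a one-paper JSON file
tmp = tempfile.mkdtemp()
src = os.path.join(tmp, "papers.json")
with open(src, "w") as f:
    json.dump([{"title": "A", "link": "http://a", "text": "alpha\nbeta"}], f)

json_to_dat(src, os.path.join(tmp, "docs.dat"), os.path.join(tmp, "meta.dat"))
lines = open(os.path.join(tmp, "docs.dat")).read().splitlines()
```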
After the JSON was converted to .dat, the inverted index was built using metapy. We removed the Lemur stop words to avoid storing unnecessary terms. The OkapiBM25 ranker was used as the ranking function. Note that the data was processed as unigrams.
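For reference, a MeTA/metapy configuration for a unigram line corpus with the Lemur stop words typically looks like the following; the dataset name, index path, and file paths here are illustrative, not this project's actual config.

```toml
prefix = "."
dataset = "professors"            # illustrative dataset name
corpus = "line.toml"              # line corpus: one document per line
index = "professors-idx"          # illustrative index directory
stop-words = "lemur-stopwords.txt"

[[analyzers]]
method = "ngram-word"
ngram = 1                         # unigrams, as described above
filter = "default-unigram-chain"
```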
from query_retriever import get_matching_docs, print_results
best_docs = get_matching_docs("Zhai")
print_results(best_docs)
- Worked on Crawling the data (RPR Module)
- Worked on Backend Flask App
- Paging Results
- Description Algorithm
- Templating
- Worked on building the inverted index with the data and comparing rankers (using metapy)
- Worked on Flask App
- Methods to retrieve documents
- Added logic for app (e.g. stemming for queries to retrieve base words)
- Worked on design of app (implemented some aesthetic features).