Graph search for academic articles.

Authors: Aman Kapur and Jimmy Wu

Built at Olin College of Engineering as a part of an AI class.

Introduction

Academic literature is abundant, but rather opaque to the layperson. Usually only the researchers actively working in a given area know their way around the literature. A non-expert has a difficult time identifying the important authors, gauging the level of activity, or obtaining a complete survey of an area of research. Perhaps some of this information can be revealed programmatically. Papers which cite related papers can be represented as a graph. This graph can reveal many things - which authors hold the most authority in a field, which papers are most important, how many articles are clustered toward a given topic.

We built a small version of that graph and experimented with what this graph could provide for us. Our articles came from http://arxiv.org/, an online collection of research papers available to the public domain. Recursively scraping articles by their citations, we represented the articles in a graph database. We have around 2000 articles, all in the realm of high energy physics and quantum cosomology. Our scraping functions ensure a graph as complete as the information arxiv gives us.

The vision is to have an intuitive interface with which to search topics, view general information about that topic, and traverse articles within that topic along citations.

The current state of the tool is as follows:

We implemented search algorithms to allow the user to search for topics using different ranking criteria.

Front-end: a search box which allows the user to choose ranking functions

Simple boolean match, ranking by article id
Term frequency (TF) ranking
TFIDF ranking
PageRank
PageRank and TFIDF (most accurate and relevant)

Users can "explore" a topic by traversing our graph.

They can view related articles (generated by graph properties and ranked by Levenshtein distance).
They can also traverse by author in the following ways:
View all articles given by a given author.
View a list of authors related to a given author. (Determined by how related their publications are to each other)

Setting Up

Make sure you have the following dependencies:

Ruby version 1.9.3
Rails version 3.2.x

You can check your operating version of Ruby and Rails with the commands

ruby -v
rails -v

If you have the Ruby Version Manager (RVM), you can switch to the right version with

rvm use 1.9.3

Next, run these commands in order

git clone [email protected]:fireblade99/journalgraph.git
cd journalgraph/db
wget https://dl.dropboxusercontent.com/u/45072018/development.sqlite3
cd ..
bundle install
rake db migrate
rails s

To view the running app, open up a browser to the url http://localhost:3000

Database representation

Nodes are articles and authors. Edges are articles-article, article-author, and author-author relations.

Article nodes hold metadata such as publication date, and subject category. They also hold information to aid the search engine:

abstract keywords via automatic summarization
ranking factors, such as PageRank

PageRank

PageRank is a measure of article importance. Highly ranked articles tend to either have many incoming citations from other articles, or are cited by highly cited articles.

To compute PageRank, we first generate the transition probability matrix for our graph. Here is an example transition probability matrix, where each directed edge coming out of a node is weighted by 1/n, and n is that node's total number of outgoing edges.

     a   b   c   d
 a | 0   0  .33  0  |
 b | 1   0  .33  0  |
 c | 0   0   0   0  |
 d | 0   1  .33  0  |

This matrix represents the following graph:
 a ----> b
 ^     /^|
 |   /   |
 | /     |/
 c ----> d

Each row in the matrix corresponds to an article which is being cited. As you can see, article a cites article b, so its edge has weight 1. However, article c cites articles a, b, and c, so each edge has weight 1/3.

TFIDF

TFIDF is a query dependent method of ranking articles by relevance.

TFIDF = TF * IDF

, where TF is proportional to the number of times a term occurs in an article , and IDF is a measure of how special a term is to a given article relative to the other articles in the query.

Ranking

PageRank and TFIDF both return values from 0 to 1 for each article. By adding these values, we obtain a ranking function that considers both relevance and importance.

jwjimmy / journalgraph Goto Github PK

journalgraph's Introduction

Introduction

Setting Up

Database representation

PageRank

TFIDF

Ranking

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent