Git Product home page Git Product logo

journalgraph's Introduction

Graph search for academic articles.

Authors: Aman Kapur and Jimmy Wu

Built at Olin College of Engineering as a part of an AI class.

Graph visualization

Introduction

Academic literature is abundant, but rather opaque to the layperson. Usually only the researchers actively working in a given area know their way around the literature. A non-expert has a difficult time identifying the important authors, gauging the level of activity, or obtaining a complete survey of an area of research. Perhaps some of this information can be revealed programmatically. Papers which cite related papers can be represented as a graph. This graph can reveal many things - which authors hold the most authority in a field, which papers are most important, how many articles are clustered toward a given topic.

We built a small version of that graph and experimented with what this graph could provide for us. Our articles came from http://arxiv.org/, an online collection of research papers available to the public domain. Recursively scraping articles by their citations, we represented the articles in a graph database. We have around 2000 articles, all in the realm of high energy physics and quantum cosomology. Our scraping functions ensure a graph as complete as the information arxiv gives us.

The vision is to have an intuitive interface with which to search topics, view general information about that topic, and traverse articles within that topic along citations.

The current state of the tool is as follows:

We implemented search algorithms to allow the user to search for topics using different ranking criteria.

Front-end: a search box which allows the user to choose ranking functions

  • Simple boolean match, ranking by article id
  • Term frequency (TF) ranking
  • TFIDF ranking
  • PageRank
  • PageRank and TFIDF (most accurate and relevant)

Users can "explore" a topic by traversing our graph.

  • They can view related articles (generated by graph properties and ranked by Levenshtein distance).
  • They can also traverse by author in the following ways:
  • View all articles given by a given author.
  • View a list of authors related to a given author. (Determined by how related their publications are to each other)

Setting Up

Make sure you have the following dependencies:

  • Ruby version 1.9.3
  • Rails version 3.2.x

You can check your operating version of Ruby and Rails with the commands

ruby -v
rails -v

If you have the Ruby Version Manager (RVM), you can switch to the right version with

rvm use 1.9.3

Next, run these commands in order

git clone [email protected]:fireblade99/journalgraph.git
cd journalgraph/db
wget https://dl.dropboxusercontent.com/u/45072018/development.sqlite3
cd ..
bundle install
rake db migrate
rails s

To view the running app, open up a browser to the url http://localhost:3000

Database representation

Nodes are articles and authors. Edges are articles-article, article-author, and author-author relations.

Article nodes hold metadata such as publication date, and subject category. They also hold information to aid the search engine:

  • abstract keywords via automatic summarization
  • ranking factors, such as PageRank

PageRank

PageRank is a measure of article importance. Highly ranked articles tend to either have many incoming citations from other articles, or are cited by highly cited articles.

To compute PageRank, we first generate the transition probability matrix for our graph. Here is an example transition probability matrix, where each directed edge coming out of a node is weighted by 1/n, and n is that node's total number of outgoing edges.

     a   b   c   d
 a | 0   0  .33  0  |
 b | 1   0  .33  0  |
 c | 0   0   0   0  |
 d | 0   1  .33  0  |

This matrix represents the following graph:
 a ----> b
 ^     /^|
 |   /   |
 | /     |/
 c ----> d

Each row in the matrix corresponds to an article which is being cited. As you can see, article a cites article b, so its edge has weight 1. However, article c cites articles a, b, and c, so each edge has weight 1/3.

TFIDF

TFIDF is a query dependent method of ranking articles by relevance.

TFIDF = TF * IDF

, where TF is proportional to the number of times a term occurs in an article , and IDF is a measure of how special a term is to a given article relative to the other articles in the query.

Ranking

PageRank and TFIDF both return values from 0 to 1 for each article. By adding these values, we obtain a ranking function that considers both relevance and importance.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.