TextRank

Description

This is an implementation of the TextRank algorithm for keyword extraction from documents. It adapts the PageRank algorithm to documents and was originally published in this article.

Intuitively, it builds a graph of words which are linked by the number of times they appear in the same context (here, the same sentence). Then, it finds the words that are most central in this graph, i.e. those that appear in context with as many other words from separate parts of the graph as possible. To refine this further, it performs part-of-speech tagging on all the documents and takes into account only nouns, as these are known to be the most distinctive for summarization purposes. Then, a chunker identifies names like ‘Wall Street’ or ‘New York’ and collocations such as ‘ballistic missile’ or ‘coal miner’. Finally, it outputs lemmatized words in order to merge words with the same lemma, such as ‘republican’ and ‘republicans’.
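The graph-and-centrality idea described above can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the code in textrank.py: it uses a naive regex tokenizer in place of NLTK, skips the part-of-speech filtering, chunking and lemmatization steps, and the function names `build_cooccurrence_graph` and `textrank` as well as the `damping` and `iterations` defaults are hypothetical.

```python
import re
from collections import defaultdict
from itertools import combinations


def build_cooccurrence_graph(text):
    """Link words by the number of sentences in which they appear together."""
    graph = defaultdict(lambda: defaultdict(float))
    # Naive sentence splitting and tokenization (assumption: the real
    # script relies on NLTK for these steps).
    for sentence in re.split(r"[.!?]+", text.lower()):
        words = sorted(set(re.findall(r"[a-z]+", sentence)))
        for a, b in combinations(words, 2):
            graph[a][b] += 1.0
            graph[b][a] += 1.0
    return graph


def textrank(graph, damping=0.85, iterations=50):
    """Weighted PageRank by power iteration on an undirected graph."""
    nodes = list(graph)
    if not nodes:
        return {}
    scores = {n: 1.0 / len(nodes) for n in nodes}
    # Total edge weight leaving each node, used to normalize contributions.
    out_weight = {n: sum(graph[n].values()) for n in nodes}
    for _ in range(iterations):
        new_scores = {}
        for n in nodes:
            # Each neighbor m passes on a share of its score proportional
            # to the weight of the edge between m and n.
            rank = sum(scores[m] * w / out_weight[m] for m, w in graph[n].items())
            new_scores[n] = (1 - damping) / len(nodes) + damping * rank
        scores = new_scores
    return scores


if __name__ == "__main__":
    sample = "Coal miners protest. Coal prices rise. Miners want coal."
    ranked = textrank(build_cooccurrence_graph(sample))
    for word, score in sorted(ranked.items(), key=lambda kv: -kv[1]):
        print(f"{word}:{score:.4f}")
```

Words that share sentences with many otherwise unconnected words accumulate the highest scores, which is the notion of centrality described above.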

For the script to run, you need to install NLTK.

Usage

textrank.py

    python textrank.py folder

folder - the folder containing the documents from which to extract keywords

Output: a folder 'keywords-folder-textrank' with the keywords and their scores, one per line, separated by a colon. This format can be used to generate word clouds using Wordle.
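If you want to post-process the output yourself, the `keyword:score` lines are easy to read back in. The sketch below is only an illustration: the helper name `parse_keyword_lines` is hypothetical, the sample values are made up, and it assumes each non-empty line holds exactly one keyword followed by a colon and a numeric score.

```python
def parse_keyword_lines(lines):
    """Parse 'keyword:score' lines into a dict mapping keyword to score."""
    keywords = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the last colon so a keyword that itself contains ':'
        # is kept intact.
        word, _, score = line.rpartition(":")
        keywords[word] = float(score)
    return keywords


if __name__ == "__main__":
    # Made-up sample lines in the format described above.
    sample = ["wall street:0.12", "coal miner:0.08"]
    print(parse_keyword_lines(sample))
```

The function accepts any iterable of strings, so an open file object from the output folder can be passed in directly.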

Examples

Find the most central words from the US primary debate speeches.

    python textrank.py candidates

Bernie Sanders' primary debate speech keywords, generated using Wordle:

![Sanders' keywords](http://www.sas.upenn.edu/~danielpr/sanders-trsentw.png)

Find the most central words in the titles of papers accepted at NLP conferences.

    python textrank.py conferences

ACL 2015 titles:

![ACL 2015](http://www.sas.upenn.edu/~danielpr/acl15.png)

EMNLP 2015 titles:

![EMNLP 2015](http://www.sas.upenn.edu/~danielpr/emnlp15.png)

NAACL 2016 titles:

![NAACL 2016](http://www.sas.upenn.edu/~danielpr/naacl16.png)

ACL 2016 Short Paper titles:

![ACL 2016 Short Papers](http://www.sas.upenn.edu/~danielpr/acl16short.png)
