
davidadamojr / textrank


Python implementation of TextRank algorithm for automatic keyword extraction and summarization using Levenshtein distance as relation between text units. This project is based on the paper "TextRank: Bringing Order into Text" by Rada Mihalcea and Paul Tarau. https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf

Python 100.00%

textrank's Introduction

TextRank

This is a Python implementation of TextRank for automatic keyword and sentence extraction (summarization), as described in https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf. However, this implementation uses Levenshtein distance as the relation between text units.
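As a rough illustration of what "Levenshtein distance as the relation between text units" can mean, here is a minimal sketch in plain Python. The function names levenshtein and weighted_edges are hypothetical and are not taken from this repository; connecting every pair of units is an assumption based on the description in the issues further down, not a copy of the library's code.

```python
from itertools import combinations


def levenshtein(a, b):
    """Return the edit distance between strings a and b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def weighted_edges(units):
    """Pair every text unit with every other one, weighted by edit distance.

    Hypothetical helper: a fully connected graph is an assumption here.
    """
    return {(u, v): levenshtein(u, v) for u, v in combinations(units, 2)}
```

The resulting edge weights could then be handed to any PageRank implementation to score the text units.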

This implementation carries out automatic keyword and sentence extraction on 10 articles taken from http://theonion.com:

  • 100 word summary
  • Number of keywords extracted is relative to the size of the text (a third of the number of nodes in the graph)
  • Adjacent keywords in the text are concatenated into keyphrases
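The last two points above can be sketched as follows. This is an illustration under stated assumptions, not the library's code: top_third and merge_adjacent are hypothetical names, the scores dictionary stands in for whatever the ranking step produces, and keeping at least one keyword for very small graphs is an assumption.

```python
def top_third(scores):
    """Keep the highest-scoring third of the nodes (at least one: an assumption)."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:max(1, len(ranked) // 3)]


def merge_adjacent(tokens, keywords):
    """Join runs of consecutive keywords in the token list into keyphrases."""
    keywords = set(keywords)
    phrases, run = [], []
    for token in tokens + [None]:  # the None sentinel flushes the final run
        if token in keywords:
            run.append(token)
        elif run:
            phrase = " ".join(run)
            if phrase not in phrases:  # keep first occurrence, preserve order
                phrases.append(phrase)
            run = []
    return phrases
```

For example, with the tokens of "the security breach exposed personal information" and the keywords security, breach, personal and information, merge_adjacent would produce the two keyphrases "security breach" and "personal information".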

Usage

To install the library run the setup.py module located in the repository's root directory. Alternatively, if you have access to pip you may install the library directly from github:

pip install git+https://github.com/davidadamojr/TextRank.git

Use of the library requires downloading nltk resources. Use the textrank initialize command to fetch the required data. Once the data has finished downloading you may execute the following commands against the library:

textrank extract_summary <filename>
textrank extract_phrases <filename>

Contributing

Install the library as "editable" within a virtual environment.

pip install -e .

Dependencies

Dependencies are installed automatically with pip but can be installed separately.

textrank's People

Contributors

cmanallen, davidadamojr, erikqu, finafiskar, jessexoc, lidalei, suminb, vinayak-mehta, vvsxmja

textrank's Issues

Order of importance changes in set()

If set() is used in the extract_key_phrases() function, the keyphrases lose their order of importance, because a set does not preserve the order of its elements.

Result of running on the same article twice:
{'account', 'personal information', 'surprise', 'political', 'scandal-plagued', 'privacy', 'network', 'possible', 'access', 'company', 'Instagram', 'survey', 'Spotify', 'Tinder', 'platform', 'turkey', 'security breach', 'Cambridge Analytica', 'database', 'Facebook', 'security scandal', 'innocuous', 'Research', 'percent', 'Friday'}

{'security breach', 'Instagram', 'percent', 'privacy', 'Research', 'network', 'political', 'Spotify', 'scandal-plagued', 'account', 'possible', 'Tinder', 'Facebook', 'platform', 'personal information', 'database', 'company', 'access', 'Friday', 'surprise', 'security scandal', 'survey', 'turkey', 'innocuous', 'Cambridge Analytica'}
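One possible way to remove duplicates without losing the ranking is to deduplicate through an ordered dictionary instead of a set. This is only a sketch: unique_in_order is a hypothetical helper, and it assumes the keyphrases arrive as a list already sorted by importance.

```python
from collections import OrderedDict


def unique_in_order(items):
    """Drop duplicates while keeping the first occurrence of each item in place."""
    return list(OrderedDict.fromkeys(items))
```

Unlike set(), this returns the same order on every run for the same input.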

problem with textrank.py

When I try to build with the setup.py file, it is unable to find the textrank.py file. Can you please help?
Error:
running build
running build_py
file textrank.py (for module textrank) not found
file textrank.py (for module textrank) not found

Installation Issue

Installation aborted with the following output:

Cleaning up...
Command /usr/bin/python -c "import setuptools, tokenize;__file__='/tmp/pip_build_root/numpy/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /tmp/pip-HjUAJe-record/install-record.txt --single-version-externally-managed --compile failed with error code 1 in /tmp/pip_build_root/numpy
Traceback (most recent call last):
  File "/usr/bin/pip", line 9, in <module>
    load_entry_point('pip==1.5.4', 'console_scripts', 'pip')()
  File "/usr/lib/python2.7/dist-packages/pip/__init__.py", line 235, in main
    return command.main(cmd_args)
  File "/usr/lib/python2.7/dist-packages/pip/basecommand.py", line 161, in main
    text = '\n'.join(complete_log)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 72: ordinal not in range(128)

Why levenshtein distance?

The TextRank algorithm requires each node in the graph to be connected to its neighbours in the text. This is what makes it work with the PageRank algorithm. But your implementation connects each node to every other node and weights each edge with the Levenshtein distance (that is, the textual similarity of the words).

In the tests I ran, the resulting keywords were not really keywords, just random words that happen to be visually similar to the average of the other words in the text (because of the Levenshtein distance).

Maybe I am missing something? Can you explain the use of Levenshtein distance, or give some links where I can read about it? Thanks.
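For comparison, the neighbour-based graph described in this question can be sketched like this. It is a simplified reading of the paper's approach, not this repository's code: cooccurrence_graph and pagerank are hypothetical names, the window size of 2 (directly adjacent words only) and the 50 iterations are assumptions, and the part-of-speech filtering that the paper applies first is left out.

```python
from collections import defaultdict


def cooccurrence_graph(tokens, window=2):
    """Link each word to the words that follow it within the window."""
    graph = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1:i + window]:
            if other != word:
                graph[word].add(other)
                graph[other].add(word)
    return graph


def pagerank(graph, damping=0.85, iterations=50):
    """Score nodes of an undirected, unweighted graph by repeated updates."""
    scores = {node: 1.0 for node in graph}
    for _ in range(iterations):
        scores = {
            node: (1 - damping) + damping * sum(
                scores[n] / len(graph[n]) for n in graph[node])
            for node in graph
        }
    return scores
```

In such a graph a word scores highly because many other words occur next to it, not because it is spelled like them.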

UnicodeDecodeError

I can't seem to run this on a Mac. Are there any requirements not mentioned in setup.py?

python textrank.py summarize ./articles/3.txt
Traceback (most recent call last):
  File "textrank.py", line 219, in <module>
    cli()
  File "TextRank-master/virtualenv/lib/python2.7/site-packages/click/core.py", line 716, in __call__
    return self.main(*args, **kwargs)
  File "TextRank-master/virtualenv/lib/python2.7/site-packages/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "TextRank-master/virtualenv/lib/python2.7/site-packages/click/core.py", line 1060, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "TextRank-master/virtualenv/lib/python2.7/site-packages/click/core.py", line 889, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "TextRank-master/virtualenv/lib/python2.7/site-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "textrank.py", line 214, in summarize
    summary = extractSentences(text)
  File "textrank.py", line 163, in extractSentences
    sentenceTokens = sent_detector.tokenize(text.strip())
  File "TextRank-master/virtualenv/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1226, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "TextRank-master/virtualenv/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1274, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "TextRank-master/virtualenv/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1265, in span_tokenize
    return [(sl.start, sl.stop) for sl in slices]
  File "TextRank-master/virtualenv/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1304, in _realign_boundaries
    for sl1, sl2 in _pair_iter(slices):
  File "TextRank-master/virtualenv/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 310, in _pair_iter
    prev = next(it)
  File "TextRank-master/virtualenv/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1280, in _slices_from_text
    if self.text_contains_sentbreak(context):
  File "TextRank-master/virtualenv/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1325, in text_contains_sentbreak
    for t in self._annotate_tokens(self._tokenize_words(text)):
  File "TextRank-master/virtualenv/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1460, in _annotate_second_pass
    for t1, t2 in _pair_iter(tokens):
  File "TextRank-master/virtualenv/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 310, in _pair_iter
    prev = next(it)
  File "TextRank-master/virtualenv/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 577, in _annotate_first_pass
    for aug_tok in tokens:
  File "TextRank-master/virtualenv/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 542, in _tokenize_words
    for line in plaintext.split('\n'):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 9: ordinal not in range(128)
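A traceback like this usually means that, under Python 2, raw bytes reached the tokenizer and were implicitly decoded as ASCII; byte 0xe2 is the first byte of many UTF-8 punctuation characters such as curly quotes. A minimal sketch of one possible workaround is to decode the article explicitly before processing it. The helpers below are hypothetical, and UTF-8 is an assumption about the input file.

```python
import io


def decode_bytes(raw, encoding="utf-8"):
    """Turn raw bytes into text using an explicit encoding (UTF-8 assumed)."""
    return raw.decode(encoding)


def read_text(path, encoding="utf-8"):
    """Read a file as text; io.open behaves the same on Python 2 and 3."""
    with io.open(path, encoding=encoding) as handle:
        return handle.read()
```

The text returned by read_text can then be passed to the sentence tokenizer instead of the undecoded file contents.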

Slow or not working?

I don't know whether it is slow or not working. After setting the encoding and completing the setup, I ran the command to summarize a file, but nothing happens. I'm writing this after waiting more than 30 minutes for a summary. Is there anything else I should do, such as running it with sudo? Is there any restriction on the size of the text?
The command I ran is this:

textrank extract_summary ./BlackHotelRomeLazio.txt 

What about the license?

I was deciding which library to use and was pointed to this repository. I would like to know how the license should be handled.

How to run the code

Hi, I would like to know how to run this code to extract keywords only.
When I run the code, I get this error:
Traceback (most recent call last):
File "textrank.py", line 25, in
@click.group()
Please help me.
