davidadamojr / textrank

Python implementation of TextRank algorithm for automatic keyword extraction and summarization using Levenshtein distance as relation between text units. This project is based on the paper "TextRank: Bringing Order into Text" by Rada Mihalcea and Paul Tarau. https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf


textrank's Introduction

TextRank

This is a Python implementation of TextRank for automatic keyword and sentence extraction (summarization), as described in https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf. Unlike the paper, this implementation uses Levenshtein distance as the relation between text units.

This implementation carries out automatic keyword and sentence extraction on 10 articles taken from http://theonion.com

100 word summary
Number of keywords extracted is relative to the size of the text (a third of the number of nodes in the graph)
Adjacent keywords in the text are concatenated into keyphrases
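
The keyword rules listed above can be sketched as follows. This is only an illustration and not the repository's code: the function names `top_third_keywords` and `merge_adjacent`, the sample sentence, and the scores are all hypothetical, and the scores stand in for whatever ranking the graph produces.

```python
def top_third_keywords(scores):
    """Return the highest-scoring third of the graph nodes, best first."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[: max(1, len(ranked) // 3)]


def merge_adjacent(tokens, keywords):
    """Join runs of consecutive keyword tokens into keyphrases, in text order."""
    keyword_set = set(keywords)
    phrases, run = [], []
    for token in tokens + [None]:  # None is a sentinel that flushes the last run
        if token in keyword_set:
            run.append(token)
        elif run:
            phrase = " ".join(run)
            if phrase not in phrases:
                phrases.append(phrase)
            run = []
    return phrases


# Made-up scores for 12 graph nodes; a third of 12 gives 4 keywords.
scores = {
    "security": 0.30, "breach": 0.25, "personal": 0.20, "information": 0.18,
    "database": 0.10, "exposed": 0.05, "company": 0.04, "users": 0.03,
    "report": 0.02, "week": 0.02, "large": 0.01, "said": 0.01,
}
tokens = "the security breach exposed personal information in the database".split()

keywords = top_third_keywords(scores)
phrases = merge_adjacent(tokens, keywords)
print(phrases)
```

With these sample values the four selected keywords collapse into two keyphrases, because each pair appears side by side in the text.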

Usage

To install the library, run setup.py from the repository's root directory. Alternatively, if you have pip, you can install the library directly from GitHub:

pip install git+https://github.com/davidadamojr/TextRank.git

The library requires NLTK resources. Run the textrank initialize command to fetch the required data. Once the download has finished, you can run the following commands:

textrank extract_summary <filename>
textrank extract_phrases <filename>

Contributing

Install the library as "editable" within a virtual environment.

pip install -e .

Dependencies

Dependencies are installed automatically with pip but can be installed separately.

Networkx - https://pypi.python.org/pypi/networkx/
NLTK 3.0 - https://pypi.python.org/pypi/nltk/3.2.2
Numpy - https://pypi.python.org/pypi/numpy
Click - https://pypi.python.org/pypi/click

textrank's Issues

Order of importance changes in set()

If set() is used in the extract_key_phrases() function, the order of importance is lost, because a set does not preserve the order in which items were added.

Result of running on the same article twice:
{'account', 'personal information', 'surprise', 'political', 'scandal-plagued', 'privacy', 'network', 'possible', 'access', 'company', 'Instagram', 'survey', 'Spotify', 'Tinder', 'platform', 'turkey', 'security breach', 'Cambridge Analytica', 'database', 'Facebook', 'security scandal', 'innocuous', 'Research', 'percent', 'Friday'}

{'security breach', 'Instagram', 'percent', 'privacy', 'Research', 'network', 'political', 'Spotify', 'scandal-plagued', 'account', 'possible', 'Tinder', 'Facebook', 'platform', 'personal information', 'database', 'company', 'access', 'Friday', 'surprise', 'security scandal', 'survey', 'turkey', 'innocuous', 'Cambridge Analytica'}
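
One possible fix, sketched under the assumption that the keyphrases are available as a list already sorted by importance: remove duplicates with an ordered mapping instead of set(). The helper name `unique_in_order` and the sample list are hypothetical and not part of the library.

```python
from collections import OrderedDict


def unique_in_order(items):
    """Drop duplicates while keeping the first occurrence of each item."""
    return list(OrderedDict.fromkeys(items))


# Hypothetical ranked output that contains repeats, most important first.
ranked = ["Facebook", "security breach", "Facebook", "privacy", "security breach"]
print(unique_in_order(ranked))
```

Unlike a set, the result is the same on every run and keeps the most important phrase first.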

problem with textrank.py

When I try to build with setup.py, it cannot find the textrank.py file. Can you please help?
Error:
running build
running build_py
file textrank.py (for module textrank) not found
file textrank.py (for module textrank) not found

When running init.py, I get this message: "Process finished with exit code 0"

Installation Issue

Installation aborted with the following text:

Cleaning up...
Command /usr/bin/python -c "import setuptools, tokenize;__file__='/tmp/pip_build_root/numpy/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /tmp/pip-HjUAJe-record/install-record.txt --single-version-externally-managed --compile failed with error code 1 in /tmp/pip_build_root/numpy

Traceback (most recent call last):
  File "/usr/bin/pip", line 9, in <module>
    load_entry_point('pip==1.5.4', 'console_scripts', 'pip')()
  File "/usr/lib/python2.7/dist-packages/pip/__init__.py", line 235, in main
    return command.main(cmd_args)
  File "/usr/lib/python2.7/dist-packages/pip/basecommand.py", line 161, in main
    text = '\n'.join(complete_log)

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 72: ordinal not in range(128)

Why levenshtein distance?

The TextRank algorithm connects each node in the graph only to its neighbours in the text, which is what makes it work with the PageRank algorithm. Your implementation instead connects every node to every other node and weights each edge by Levenshtein distance (that is, how similar the words are in spelling).

In the tests I ran, the resulting keywords were not really keywords, just words that happen to be close in spelling to most of the other words in the text (a consequence of using Levenshtein distance).

Maybe I am misunderstanding something. Can you explain the use of Levenshtein distance, or give some links where I can read about it? Thanks.
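
To make the contrast in this issue concrete, here is a minimal sketch of the two relations. Neither function is taken from the repository: `levenshtein` is the standard edit-distance recurrence, and `cooccurrence_edges` is a hypothetical helper that links words appearing close together in the text, in the spirit of the co-occurrence window used in the paper.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions or substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def cooccurrence_edges(tokens, window=2):
    """Undirected edges between distinct words fewer than `window` positions apart."""
    edges = set()
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + window]:
            if other != word:
                edges.add(tuple(sorted((word, other))))
    return edges


# Edit distance depends only on spelling, not on where words occur.
print(levenshtein("privacy", "private"))

# Co-occurrence depends only on position in the text.
print(cooccurrence_edges(["security", "breach", "database"]))
```

Edit distance gives a weight for any pair of words, so it yields a fully connected graph, while the co-occurrence relation only links words that appear near each other.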

UnicodeDecodeError

I can't seem to run this on a Mac. Are there any requirements not mentioned in setup.py?

🍺  python textrank.py summarize ./articles/3.txt
Traceback (most recent call last):
  File "textrank.py", line 219, in <module>
    cli()
  File "TextRank-master/virtualenv/lib/python2.7/site-packages/click/core.py", line 716, in __call__
    return self.main(*args, **kwargs)
  File "TextRank-master/virtualenv/lib/python2.7/site-packages/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "TextRank-master/virtualenv/lib/python2.7/site-packages/click/core.py", line 1060, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "TextRank-master/virtualenv/lib/python2.7/site-packages/click/core.py", line 889, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "TextRank-master/virtualenv/lib/python2.7/site-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "textrank.py", line 214, in summarize
    summary = extractSentences(text)
  File "textrank.py", line 163, in extractSentences
    sentenceTokens = sent_detector.tokenize(text.strip())
  File "TextRank-master/virtualenv/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1226, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "TextRank-master/virtualenv/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1274, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "TextRank-master/virtualenv/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1265, in span_tokenize
    return [(sl.start, sl.stop) for sl in slices]
  File "TextRank-master/virtualenv/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1304, in _realign_boundaries
    for sl1, sl2 in _pair_iter(slices):
  File "TextRank-master/virtualenv/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 310, in _pair_iter
    prev = next(it)
  File "TextRank-master/virtualenv/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1280, in _slices_from_text
    if self.text_contains_sentbreak(context):
  File "TextRank-master/virtualenv/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1325, in text_contains_sentbreak
    for t in self._annotate_tokens(self._tokenize_words(text)):
  File "TextRank-master/virtualenv/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1460, in _annotate_second_pass
    for t1, t2 in _pair_iter(tokens):
  File "TextRank-master/virtualenv/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 310, in _pair_iter
    prev = next(it)
  File "TextRank-master/virtualenv/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 577, in _annotate_first_pass
    for aug_tok in tokens:
  File "TextRank-master/virtualenv/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 542, in _tokenize_words
    for line in plaintext.split('\n'):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 9: ordinal not in range(128)
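
The traceback shows Python 2.7 implicitly decoding the article as ASCII. Byte 0xe2 is the first byte of the UTF-8 encoding of characters such as curly quotes, so a likely workaround is to decode the file explicitly before tokenizing it. The sketch below assumes the article is UTF-8 encoded; `read_text` is a hypothetical helper, not a function from the library.

```python
import io
import os
import tempfile


def read_text(path, encoding="utf-8"):
    """Read a file as text, decoding it explicitly instead of assuming ASCII."""
    with io.open(path, encoding=encoding) as handle:
        return handle.read()


# Demo: a temporary file containing a right single quote (bytes e2 80 99).
fd, demo_path = tempfile.mkstemp(suffix=".txt")
os.write(fd, b"It\xe2\x80\x99s a test.")
os.close(fd)

text = read_text(demo_path)
os.remove(demo_path)
print(len(text))
```

`io.open` accepts the `encoding` argument on both Python 2.7 and Python 3, so the same helper works on either interpreter.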

Publish to PyPI

Could you publish this to the PyPI registry?

Slow or not working?

I don't know whether it is slow or not working. After setting the encoding and following all the steps, I ran the command to summarize a text, but nothing happens. I am writing this after waiting more than 30 minutes for a summary. Is there anything else I should do, such as running it with sudo? Is there any restriction on the size of the text?
The command I ran is:

textrank extract_summary ./BlackHotelRomeLazio.txt
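
One possible explanation, assuming the graph links every text unit to every other one (as the "Why levenshtein distance?" issue describes): the number of edges, and so the number of distance computations, grows quadratically with the size of the text. The sketch below only counts pairs. It does not measure the library, and `pair_count` is a hypothetical helper.

```python
from itertools import combinations


def pair_count(n):
    """Number of unordered pairs among n nodes in a fully connected graph."""
    return n * (n - 1) // 2


# Cross-check the formula against an explicit enumeration for a small case.
words = ["w%d" % i for i in range(200)]
enumerated = sum(1 for _ in combinations(words, 2))
print(enumerated, pair_count(200))

# A text with 5000 distinct units would need this many comparisons.
print(pair_count(5000))
```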

What about the license?

I was looking for a library to use and was pointed to this repository. I am curious about the license: which license does this project use?

Any workaround for finding the texts including symbols such as '(' and ')' ?

The algorithm breaks when it encounters these symbols. Here is the error it displays:
bash: syntax error near unexpected token '('
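
The message above comes from bash, which treats unquoted parentheses as shell syntax, so the command probably never reaches textrank. Quoting the file name should avoid it. The sketch below builds a safely quoted command string with `shlex.quote` (Python 3.3+); the file name is made up and `build_command` is a hypothetical helper.

```python
import shlex


def build_command(filename):
    """Return a shell-safe command line for summarizing `filename`."""
    return "textrank extract_summary " + shlex.quote(filename)


# Hypothetical file name containing parentheses and a space.
command = build_command("article (draft).txt")
print(command)
```

Typing the quoted form directly in the terminal has the same effect as building it in Python.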

Update readme for install instructions.

Add

python setup.py build
python setup.py install

How to run the code

Hi, I would like to know how to run this code to extract keywords only.
When I run the code, I get the following error:
Traceback (most recent call last):
File "textrank.py", line 25, in
@click.group()
Please help me.