License: MIT License

keyphraseextraction

Key phrases extraction and ranking in Python with support for POS (Part of Speech) and NER (Named Entities).

Updates

Added a Dockerfile and a simple module to run the provided APIs. 🆕

Disclaimer

The source code for the class KeyphrasesRanker has been adapted and ported to Python 3 from the original Python 2 source code available here:

Intro to Automatic Keyphrase Extraction

The source code for the classes EntitiesRanker and TextRank4Keyword has been adapted from the original source code available here:

Understand TextRank for Keyword Extraction by Python

and here: News Graph

Dependencies

To install all dependencies run pip install -r requirements.txt or use the provided Dockerfile.

numpy
nltk
gensim
spacy
networkx

How to build and run Docker

We build the provided Docker image and run the code inside the container; the source directory is mounted from the host, so it stays editable there.

docker build -f Dockerfile -t keyphrase_extraction_python .
docker run -v $(pwd):/app --rm -it keyphrase_extraction_python bash

How to prepare the document to analyse

First, you need to retrieve your document or paper's Title, Abstract and Text. To convert your paper to text, use a PDF converter like PDFElement. To copy the text into a string, use this tool. As an example we use the pre-print EXPLOITING SYNCHRONIZED LYRICS AND VOCAL FEATURES FOR MUSIC EMOTION DETECTION.
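Loading the converted text can be sketched with the standard library alone; the file name and the whitespace normalization below are illustrative assumptions, not part of this repository:

```python
import os
import re
import tempfile
from pathlib import Path

def load_document(path):
    """Read a converted .txt file, collapsing runs of spaces and tabs
    while keeping the newlines that mark section breaks."""
    raw = Path(path).read_text(encoding="utf-8")
    return re.sub(r"[ \t]+", " ", raw).strip()

# tiny round trip with a temporary file standing in for the converted PDF
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as fh:
    fh.write("EXPLOITING   SYNCHRONIZED\tLYRICS\nABSTRACT ")
    tmp = fh.name
doc = load_document(tmp)
os.unlink(tmp)
```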

How to extract and rank key phrases

The class KeyphrasesRanker provides the following APIs:

  • extract_candidate_chunks
  • extract_candidate_words
  • score_keyphrases_by_tfidf
  • score_keyphrases_by_textrank
  • extract_candidate_features
title = 'EXPLOITING SYNCHRONIZED LYRICS AND VOCAL FEATURES FOR MUSIC EMOTION DETECTION'
abstract = "One of the key points in music recommendation is authoring engaging playlists according to sentiment and emotions. While previous works were mostly based on audio for music discovery and playlists generation, we take advantage of our synchronized lyrics dataset to combine text representations and music features in a novel way; we therefore introduce the Synchronized Lyrics Emotion Dataset. Unlike other approaches that randomly exploited the audio samples and the whole text, our data is split according to the temporal information provided by the synchronization between lyrics and audio. This work shows a comparison between text-based and audio-based deep learning classification models using different techniques from Natural Language Processing and Music Information Retrieval domains. From the experiments on audio we conclude that using vocals only, instead of the whole audio data improves the overall performances of the audio classifier. In the lyrics experiments we exploit the state-ofthe-art word representations applied to the main Deep Learning architectures available in literature. In our benchmarks the results show how the Bilinear LSTM classifier with Attention based on fastText word embedding performs better than the CNN applied on audio. "
text = "EXPLOITING SYNCHRONIZED LYRICS AND VOCAL FEATURES FOR \nMUSIC EMOTION DETECTION \n\nABSTRACT \nOne of the key points in music recommendation is authoring engaging playlists according to sentiment and emotions. While previous works were mostly based on audio for music discovery and playlists generation, we take advantage of our synchronized lyrics dataset to combine text representations and music features in a novel way; we therefore introduce the Synchronized Lyrics Emotion Dataset. Unlike other approaches that randomly exploited the audio samples and the whole text, our data is split according to the temporal information provided by the synchronization between lyrics and audio. This work shows a comparison between text-based and audio-based deep learning classification models using different techniques from Natural Language Processing and Music Information Retrieval domains. From the experiments on audio we conclude that using vocals only, instead of the whole audio data improves the overall performances of the audio classifier. In the lyrics experiments we exploit the state-ofthe-art word representations applied to the main Deep Learning architectures available in literature. In our benchmarks the results show how the Bilinear LSTM classifier with Attention based on fastText word embedding performs better than the CNN applied on audio. \n\n1. INTRODUCTION \nMusic Emotion Recognition (MER) refers to the task of finding a relationship between music and human emotions [24,43]. Nowadays, this type of analysis is becoming more and more popular, music streaming providers are finding very helpful to present users with musical collections organized according to their feelings. The problem of Music Emotion Recognition was proposed for the first time in the Music Information Retrieval (MIR) community in 2007, during the annual Music Information Research Evaluation eXchange (MIREX) [14]"
  • extract_candidate_chunks
ranker = KeyphrasesRanker()
set(ranker.extract_candidate_chunks(text))
{'100-dimensional word2vec',
 '] achieves relevant results',
 '] exploit contextual information',
 '] speech',
 '] techniques',
 'account',
 'acculturation',
 'acoustical properties',
 'adjectives',
 'advantage',
...
  • extract_candidate_words
ranker = KeyphrasesRanker()
set(ranker.extract_candidate_words(text))
{'100-dimensional',
 'abstract',
 'account',
 'acculturation',
 'achieves',
 'acoustic',
 'acoustical',
 'adjective',
 'adjectives',
 'adjectives/tags/labels',
 'advantage',
 'aforementioned',
 'agents',
 'aim',
 'allocation',
 ...
  • score_keyphrases_by_tfidf
import json

ranker = KeyphrasesRanker()
texts = [title, abstract, text]
corpus, corpus_tfidf, dictionary = ranker.score_keyphrases_by_tfidf(texts, candidates='chunks')
d = {dictionary.get(id): value for doc in corpus_tfidf for id, value in doc}
print(json.dumps(d, indent=4))
{
   "exploiting synchronized lyrics and vocal features for music emotion detection": 1.0,
   "advantage": 0.02818991144288086,
   "audio": 0.12685460149296385,
   "audio classi\ufb01er": 0.01409495572144043,
   "audio for music discovery": 0.01409495572144043,
   "audio samples": 0.01409495572144043,
   "audio-based deep learning classi\ufb01cation models": 0.01409495572144043,
   "benchmarks": 0.01409495572144043,
   "bilinear lstm classi\ufb01er with attention": 0.01409495572144043,
   "cnn": 0.02818991144288086,
   "comparison": 0.01409495572144043,
   "data": 0.01409495572144043,
   "different techniques from natural language processing": 0.01409495572144043,
   "emotions": 0.098664690050083,
   "experiments on audio": 0.01409495572144043,
   "fasttext word": 0.01409495572144043,
...
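The ranker builds its TF-IDF scores on gensim; the underlying idea can be sketched with the standard library alone (candidate extraction is reduced here to plain word splitting, an illustrative simplification, not the repository's chunking):

```python
import math
from collections import Counter

def tfidf_scores(docs):
    """Score each term of each document by term frequency times inverse
    document frequency (stdlib sketch, not the library implementation)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: in how many documents does each term appear?
    df = Counter(term for doc in tokenized for term in set(doc))
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return scores

docs = ["music emotion detection", "audio features for music"]
scores = tfidf_scores(docs)
# "emotion" occurs in one document only, so it outscores "music",
# which occurs in both and is therefore down-weighted to zero here.
```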
  • score_keyphrases_by_textrank
ranker = KeyphrasesRanker()
ranker.score_keyphrases_by_textrank(text)
 [('audio', 0.020772356010959545),
 ('audio features', 0.015940980040515616),
 ('music', 0.015056242947485416),
 ('lyrics', 0.014441885556177585),
 ('music emotion', 0.014074422805978502),
 ('emotion', 0.013092602664471589),
 ('music features', 0.01308292350877855),
 ('audio data', 0.012708279325535445),
 ('music information', 0.01175624019437015),
 ('emotions', 0.011338641233589263),
 ('features', 0.011109604070071685),
 ('music emotion classi fication', 0.010623139495342838),
 ('musical lyrics', 0.010621753673476389),
 ...
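score_keyphrases_by_textrank follows the TextRank scheme: link candidate words that co-occur within a sliding window, then rank the nodes with PageRank. A compact stdlib sketch (window size, damping and iteration count are conventional defaults, not values taken from this repository):

```python
from collections import defaultdict
from itertools import combinations

def textrank(words, window=2, damping=0.85, iterations=50):
    """Rank words with PageRank over a co-occurrence graph (stdlib sketch)."""
    # link every pair of distinct words sharing a sliding window
    neighbors = defaultdict(set)
    for i in range(max(len(words) - window + 1, 1)):
        for a, b in combinations(words[i:i + window], 2):
            if a != b:
                neighbors[a].add(b)
                neighbors[b].add(a)
    # standard PageRank power iteration
    scores = {w: 1.0 for w in neighbors}
    for _ in range(iterations):
        scores = {
            w: (1 - damping) + damping * sum(
                scores[n] / len(neighbors[n]) for n in neighbors[w])
            for w in neighbors
        }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

words = "music emotion music audio music features lyrics".split()
ranking = textrank(words, window=2)
# "music" neighbours the most distinct words, so it ranks first
```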
  • extract_candidate_features
import json

ranker = KeyphrasesRanker()
candidates = ranker.extract_candidate_words(text)
candidates = candidates[0:5]
candidate_features = ranker.extract_candidate_features(candidates, text, abstract, title)
candidate_features = [{k[0]: k[1]} for k in sorted(ranker.candidate_features.items(), key=lambda item: item[1]['term_count'], reverse=True)]
print(json.dumps(candidate_features, indent=4))
[
    {
        "lyrics": {
            "term_count": 26,
            "term_length": 1,
            "max_word_length": 6,
            "spread": 0.9258158185840708,
            "lexical_cohesion": 0.0,
            "in_excerpt": 1,
            "in_title": 1,
            "abs_first_occurrence": 0.00165929203539823,
            "abs_last_occurrence": 0.9274751106194691
        }
    },
    {
        "features": {
            "term_count": 12,
            "term_length": 1,
            "max_word_length": 8,
            "spread": 0.9496681415929203,
            "lexical_cohesion": 0.0,
            "in_excerpt": 1,
            "in_title": 1,
            "abs_first_occurrence": 0.00283462389380531,
            "abs_last_occurrence": 0.9525027654867256
        }
    },
    {
        "synchronized": {
            "term_count": 11,
            "term_length": 1,
            "max_word_length": 12,
            "spread": 0.8889657079646018,
            "lexical_cohesion": 0.0,
            "in_excerpt": 1,
            "in_title": 1,
            "abs_first_occurrence": 0.0007605088495575221,
            "abs_last_occurrence": 0.8897262168141593
        }
    },
...
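Some of the reported features are easy to reproduce by hand: term_count is the number of occurrences, abs_first_occurrence and abs_last_occurrence are character positions normalized by document length, and spread is their difference (consistent with the sample output above, e.g. 0.92747 - 0.00166 ≈ 0.92582 for "lyrics"). A stdlib sketch under those assumptions:

```python
def positional_features(candidate, text):
    """Occurrence count, first/last occurrence (as a fraction of document
    length) and spread for one candidate (stdlib sketch, assumed semantics)."""
    text = text.lower()
    candidate = candidate.lower()
    positions = []
    start = text.find(candidate)
    while start != -1:
        positions.append(start)
        start = text.find(candidate, start + 1)
    if not positions:
        return None
    first = positions[0] / len(text)
    last = positions[-1] / len(text)
    return {
        "term_count": len(positions),
        "abs_first_occurrence": first,
        "abs_last_occurrence": last,
        "spread": last - first,  # matches abs_last - abs_first in the sample
    }

feats = positional_features("lyrics", "Lyrics and audio: lyrics carry emotion.")
```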

How to extract and rank keywords with POS (Part Of Speech)

Run textrank.py example.

python src/textrank.py 
lyrics - 6.243945103861534
audio - 5.414579362418893
text - 5.254783060976986
emotion - 4.95182499101852
music - 4.702983828505006
emotions - 4.5025596655266105
classification - 4.468243854418144
features - 4.433893761854115
song - 3.850250680927715
mood - 3.272875241280236
representations - 3.119033774085046
word - 3.0808253588512065
None

or use the class TextRank4Keyword:

textRank = TextRank4Keyword()
textRank.analyze(text, candidate_pos = ['NOUN', 'PROPN'], window_size=4, lower=False)
print(textRank.get_keywords(10))

How to extract and rank Entities

Run entities.py example.

python src/entities.py 
[['United', 'keyword'], ['States', 'keyword'], ['States', 'is'], ['States', 'is place'], ['States', 'been'], ['States', 'become one'], ['United', 'frequency'], ['States', 'frequency'], ['the United States', 'Location'], ['America', 'Location'], ['George Floyd', 'Person'], ['the National Action Network', 'Organization'], ['Al Sharpton', 'Person'], ['linda thomas Greenfield', 'Person'], ['Nikki Haley', 'Person'], ['the U. N.', 'Location'], ['thomas Greenfield', 'Person'], ["Al Sharpton's", 'Person'], ['Greenfield', 'Location'], ['Klan', 'Organization'], ['Frederick douglass', 'Person'], ['Iran', 'Location'], ['Brianna', 'Person'], ['derek chauvin', 'Person'], ['Brianna Taylor', 'Person'], ['the National Action Network', "Al Sharpton's"], ['the National Action Network', 'thomas Greenfield'], ["Al Sharpton's", 'the National Action Network'], ["Al Sharpton's", 'thomas Greenfield'], ['thomas Greenfield', 'the National Action Network'], ['thomas Greenfield', "Al Sharpton's"], ['the National Action Network', 'the United States'], ['the United States', 'the National Action Network'], ['the United States', 'Iran'], ['the United States', 'America'], ['Iran', 'the United States'], ['Iran', 'America'], ['America', 'the United States'], ['America', 'Iran'], ['Brianna', 'George Floyd'], ['George Floyd', 'Brianna'], ['derek chauvin', 'George Floyd'], ['George Floyd', 'derek chauvin']]

or use the class EntitiesRanker:

ranker = EntitiesRanker()
ranks = ranker.main(text)
print(ranks)
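The entity ranking mixes typed entities with co-occurrence pairs, as the rows in the output above show. Assuming entities have already been extracted per sentence (e.g. with spaCy's NER), emitting the ordered co-occurrence pairs can be sketched as:

```python
from itertools import permutations

def cooccurrence_pairs(sentences):
    """For each sentence, given as a list of already-extracted entity
    strings, emit ordered entity pairs, mirroring the paired rows above
    (stdlib sketch; entity extraction itself is assumed done upstream)."""
    pairs = []
    for entities in sentences:
        for a, b in permutations(entities, 2):
            pairs.append([a, b])
    return pairs

sentences = [["George Floyd", "derek chauvin"], ["Iran", "America"]]
pairs = cooccurrence_pairs(sentences)
# each sentence contributes both orderings of each entity pair
```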

Contributors

People who contributed their libraries or code to this project.
