
highway_star

Scrape biographies from Wikipedia categories and plot their life courses

The main goal of this project is to retrieve all biographies from a given Wikipedia category and to plot the life courses of those people with a Sankey diagram. These data can then be analyzed for social science purposes.
This project was made in partnership with the LEIRIS.

Installation


You can install the project via pip, or any other PyPI package manager.

pip install highway-star

Note: you may need additional spaCy models for natural language processing; missing them can cause errors during execution.

Please run these commands in your console:

pip install https://github.com/explosion/spacy-models/releases/download/fr_core_news_sm-2.0.0/fr_core_news_sm-2.0.0.tar.gz#egg=fr_core_news_sm==2.0.0
python -m spacy download fr

How to use


Scraping


The function below scrapes biographies from every page of the category you specify, as well as from all of its subcategories.

from highway_star.scrapping.wikipedia_scraper import scrap_wikipedia_structure_with_content

content = scrap_wikipedia_structure_with_content(
    root_category="Acteur_français",
    lang="fr")

Let's break down what this function does.
Suppose you want all biographies from the Wikipedia category Acteurs_français.
(figure: wikipedia_category)
The algorithm retrieves every page link in the orange rectangle and stores information about every subcategory in the red rectangle.
It then repeats this process for every subcategory until no categories are left.
For example, the subcategory Acteur_français_de_cinéma of the category Acteurs_français still has one subcategory and many new pages to scrape, as shown in the figure below.
(figure: wikipedia_subcategory)
When it reaches a page, it scrapes all the content between the tags

<span class="mw-headline" id="Biographie">Biographie</span>

and

</h2>

in order to select only the biography section, as in the image below.
(figure: biography_example)
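As a rough illustration of that extraction step, here is a hypothetical sketch (not the library's actual code) that pulls the text following the "Biographie" headline, stopping at the next section heading:

```python
import re

# Hypothetical sketch (not the library's actual implementation): extract the
# text that follows the "Biographie" headline, up to the next <h2> heading.
def extract_biography(html: str) -> str:
    pattern = (
        r'<span class="mw-headline" id="Biographie">Biographie</span>\s*</h2>'
        r'(.*?)(?:<h2>|$)'
    )
    match = re.search(pattern, html, flags=re.DOTALL)
    if match is None:
        return ""
    # Strip any remaining tags from the extracted section.
    return re.sub(r"<[^>]+>", "", match.group(1)).strip()

page = (
    '<h2><span class="mw-headline" id="Biographie">Biographie</span></h2>'
    "<p>Né à Paris, il débute au théâtre.</p>"
    '<h2><span class="mw-headline" id="Filmographie">Filmographie</span></h2>'
)
print(extract_biography(page))  # → Né à Paris, il débute au théâtre.
```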

The result of this function is a Python dict.
You can convert this dictionary to a dataframe using pandas:

import pandas as pd
pd.DataFrame.from_dict(content)

This gives an output like the following:
(figure: all_scrapped)
The resulting columns are:

  • page_links : links to the pages
  • pages_names : names of the pages
  • subcategory : category where the page was found
  • content : the content of the biography that has been scrapped
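To make the structure concrete, here is a hypothetical example of the returned dict and its conversion (the values are invented for illustration, not actually scraped):

```python
import pandas as pd

# Hypothetical example of the dict shape returned by
# scrap_wikipedia_structure_with_content (values invented for illustration).
content = {
    "page_links": ["https://fr.wikipedia.org/wiki/Jean_Gabin"],
    "pages_names": ["Jean Gabin"],
    "subcategory": ["Acteur_français_de_cinéma"],
    "content": ["Jean Gabin naît à Paris en 1904 ..."],
}

df = pd.DataFrame.from_dict(content)
print(df.columns.tolist())  # → ['page_links', 'pages_names', 'subcategory', 'content']
```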

Preprocessing


Once you have retrieved your data, you may need to preprocess them.

In order to do that, we have two functions, one simple, and the other more complex.

Simple but not customizable way

from highway_star.preprocessing.biography_preprocessor import sent_to_words
sent_to_words(biographies_column=dataframe_with_biographies["biographies"])


The result is a Python list of tokenized biographies.
Just add it to your dataframe:

content["biographies_tokenized"] = sent_to_words(biographies_column=dataframe_with_biographies["biographies"])
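To see what the tokenized output looks like, here is a rough stdlib approximation (the actual sent_to_words may use a different tokenizer, so details like accent and digit handling can differ):

```python
import re

# Rough illustration of tokenization: lowercase the text and keep only
# alphabetic words (digits and punctuation are dropped).
# This approximates what sent_to_words produces; the real function may differ.
def to_words(biographies):
    return [re.findall(r"[a-zà-ÿ]+", bio.lower()) for bio in biographies]

bios = ["Jean Gabin naît à Paris en 1904."]
print(to_words(bios))  # → [['jean', 'gabin', 'naît', 'à', 'paris', 'en']]
```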

Complex but custom way

Note: to run this function, make sure to install the following packages:

pip install https://github.com/explosion/spacy-models/releases/download/fr_core_news_sm-2.0.0/fr_core_news_sm-2.0.0.tar.gz#egg=fr_core_news_sm==2.0.0
python -m spacy download fr

from highway_star.preprocessing.biography_preprocessor import remove_stop_words_from_biographies
remove_stop_words_from_biographies(biographies_column=dataframe_with_biographies["biographies"], 
                                   custom_stop_words = ["ajouter", "oui", "être", "avoir"],
                                   use_lemmatization=True,
                                   allowed_postags=['NOUN', 'VERB'])

This function does the tokenization, but also:

  • allows you to choose custom stop words
  • filters biographies using the stop words of spacy.load('fr_core_news_sm')
  • lets you enable or disable lemmatization
  • lets you filter biographies by part of speech (e.g., 'NOUN', 'VERB')

The default invocation of this function is:

from highway_star.preprocessing.biography_preprocessor import remove_stop_words_from_biographies
remove_stop_words_from_biographies(biographies_column=dataframe_with_biographies["biographies"])

with the unspecified parameters set to their defaults:

  • custom_stop_words = None
  • use_lemmatization = False
  • allowed_postags = None

Visualizing


The visualization is done using a Sankey diagram and the PrefixSpan algorithm.

Prefixspan

PrefixSpan is a data mining algorithm that retrieves the most frequent sequential patterns in a dataset.
It was introduced in 2001 by Pei, Han et al. in Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth.
It can be used in Python via the PyPI library prefixspan.
Treating our dataset as a set of biographies, it retrieves the most frequent patterns in those biographies.
We can control the length of the patterns it searches for.
The higher the pattern length, the more likely the patterns cover a biography from start to end, but the fewer patterns you will find.
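The length/support trade-off can be illustrated with a simplified sketch that mines contiguous n-grams (real PrefixSpan mines general subsequences, not just contiguous ones, so this is only an approximation of its behavior):

```python
from collections import Counter

# Simplified illustration of the length/support trade-off using contiguous
# n-grams. Real PrefixSpan mines general (gapped) subsequences.
def frequent_ngrams(sequences, length, min_support):
    counts = Counter()
    for seq in sequences:
        seen = set()  # count each pattern at most once per sequence (support)
        for i in range(len(seq) - length + 1):
            seen.add(tuple(seq[i:i + length]))
        counts.update(seen)
    return {p: c for p, c in counts.items() if c >= min_support}

bios = [
    ["born", "alabama", "write", "song", "buy", "house"],
    ["born", "alabama", "buy", "house"],
    ["born", "europe", "write", "song", "buy", "house"],
]

# Short patterns are frequent; longer patterns cover more of a biography
# but are supported by fewer of them.
print(frequent_ngrams(bios, 2, 2))  # four bigrams reach support >= 2
print(frequent_ngrams(bios, 4, 2))  # → {('write', 'song', 'buy', 'house'): 2}
```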

Sankey Diagram

These are great data visualization tools for plotting relational data.
(figure: sankey)
An implementation can be found in JavaScript using Highcharts.

from highway_star.visualizing.visualizer import give_sankey_data_from_prefixspan
give_sankey_data_from_prefixspan(dataframe_with_biographies["content_tokenized"],
                                 prefixspan_minlen=15,
                                 prefixspan_topk=100)

This call will find the top 100 patterns of length 15.
The default invocation of this function is:

from highway_star.visualizing.visualizer import give_sankey_data_from_prefixspan
give_sankey_data_from_prefixspan(dataframe_with_biographies["content_tokenized"])

with:

  • prefixspan_minlen = 10
  • prefixspan_topk = 50

The output of this function is PrefixSpan output already preprocessed for the Sankey diagram.
It counts the number of occurrences of each couple of adjacent items.
For example, given the patterns:

born Alabama write song buy house
born Alabama buy house
born Europe write song buy house

the counts are:

  • born - Alabama = 2
  • buy - house = 3
  • write - song = 2

Note that:

  • Alabama - house

is not a valid couple, because the two items are not next to each other.
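The adjacent-pair counting described above can be sketched as follows (a hypothetical illustration, not the library's actual implementation):

```python
from collections import Counter

# Hypothetical sketch of the adjacent-pair counting described above
# (not the library's actual implementation).
def count_adjacent_pairs(patterns):
    counts = Counter()
    for pattern in patterns:
        counts.update(zip(pattern, pattern[1:]))  # only neighbouring items
    return counts

patterns = [
    ["born", "Alabama", "write", "song", "buy", "house"],
    ["born", "Alabama", "buy", "house"],
    ["born", "Europe", "write", "song", "buy", "house"],
]

counts = count_adjacent_pairs(patterns)
print(counts[("born", "Alabama")])   # → 2
print(counts[("buy", "house")])      # → 3
print(counts[("Alabama", "house")])  # → 0 (never adjacent)
```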

Then, execute :

from highway_star.visualizing.visualizer import sankey_diagram_with_prefixspan_output
sankey_diagram_with_prefixspan_output(sankey_data_from_prefixspan=sankey_data_from_prefixspan,
                                      js_filename="women",
                                      html_filename="women",
                                      title="Life course of Women French Actress")

where:

  • sankey_data_from_prefixspan : the output of the previous function give_sankey_data_from_prefixspan
  • js_filename : name of the js file
  • html_filename : name of the html file
  • title : title of the chart

The default invocation is:

from highway_star.visualizing.visualizer import sankey_diagram_with_prefixspan_output
sankey_diagram_with_prefixspan_output(sankey_data_from_prefixspan=sankey_data_from_prefixspan)

where:

  • js_filename = "data"
  • html_filename = "page"
  • title = None

This will save two files locally: an HTML file and a JavaScript file.
The data from give_sankey_data_from_prefixspan is stored in the JavaScript file.
Just open the HTML file to see your plot.
(figure: perso_sankey)
