jhroy / facebook-franco

Full code and most data (in accordance with CrowdTangle’s Terms of Service) supporting an article on what would remain on French-language Facebook if news content was removed

License: GNU General Public License v3.0

Languages: Python 57.26%, Jupyter Notebook 42.74%
Topics: facebook, french-speaking, francais, france, belgique, suisse, canada, quebec, python, nlp


Kittens 😸 and Jesus ✝️:
What Would Remain in a Newsless Facebook

This repository relates to an article published in the July 2022 issue of First Monday. A pre-print version was published on SSRN in November 2021.

For this article, I first extracted the 300,000 posts that garnered the most attention on pages administered mainly in Belgium, Canada, France and Switzerland for each month of 2020. After filtering this initial 13.4M-post sample, as described in the article, I kept a final sample of 3.3M posts in French.

One of the filtering steps involved determining the language of each post. This was done with the following Python script:
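The script itself is linked above; as a rough illustration of the step, here is a minimal sketch using the langdetect library (the actual script may use a different detector, and the column names are assumptions based on CrowdTangle exports):

```python
# Minimal sketch of the language-filtering step, assuming a CSV of posts
# with a "Message" column (as in CrowdTangle exports). The repository's
# actual script may use a different detector or column names.
import pandas as pd
from langdetect import detect, LangDetectException

def detect_language(text):
    """Return an ISO 639-1 language code, or None if detection fails."""
    try:
        return detect(text)
    except (LangDetectException, TypeError):
        return None  # empty or non-text posts cannot be classified

posts = pd.read_csv("posts.csv")
posts["lang"] = posts["Message"].apply(detect_language)
french_posts = posts[posts["lang"] == "fr"]
french_posts.to_csv("posts_fr.csv", index=False)
```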

The pages in this final sample were then manually classified into two categories (criteria described in the article): media and non-media. The following four CSV files show how pages were classified in each country, along with the number of posts and the sum of interactions from each page (counting only those posts included in my sample):

Those results are also summarized in the following graph.

CrowdTangle's ToS do not allow the sharing of raw data. However, a summary of interaction types by subcorpus (8 subcorpora in total: one per country and per type [media vs. non-media]) can be found in the following CSV file:

Step 1: n-gram extraction

To extract unigrams, bigrams and trigrams from each of the 8 subcorpora, I used this Python script:
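As a rough sketch of what this extraction does (the "Message" and "Interactions" column names are assumptions, not the repository's actual schema), each post is tokenized and one (term, interactions) row is emitted per n-gram:

```python
# Rough sketch of n-gram extraction: emit one (term, interactions) row per
# n-gram so that frequencies can later be weighted by interactions.
import csv
import re

def tokenize(text):
    """Lowercase and keep word-like tokens (accented letters included)."""
    return re.findall(r"[\w'-]+", text.lower())

def ngrams(tokens, n):
    """Yield n-grams as space-joined strings."""
    for i in range(len(tokens) - n + 1):
        yield " ".join(tokens[i:i + n])

with open("subcorpus.csv", newline="", encoding="utf-8") as src:
    for n, outfile in ((1, "unigrams.csv"), (2, "bigrams.csv"), (3, "trigrams.csv")):
        src.seek(0)  # re-read the source file for each n-gram size
        reader = csv.DictReader(src)
        with open(outfile, "w", newline="", encoding="utf-8") as dst:
            writer = csv.writer(dst)
            writer.writerow(["term", "interactions"])
            for row in reader:
                for gram in ngrams(tokenize(row["Message"]), n):
                    writer.writerow([gram, row["Interactions"]])
```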

All n-grams were then cleaned up (to remove residual punctuation or stray whitespace characters, for example) and standardized using this script:
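A hedged sketch of what such a clean-up pass might look like (the repository's script likely handles more cases):

```python
# Illustrative clean-up of extracted n-grams: strip residual punctuation,
# collapse unusual whitespace (non-breaking spaces, tabs) and normalize
# Unicode so that visually identical terms are counted together.
import re
import unicodedata

def clean_term(term: str) -> str:
    term = unicodedata.normalize("NFC", term)  # unify accented forms
    term = re.sub(r"[^\w\s'-]", " ", term)     # drop residual punctuation
    term = re.sub(r"\s+", " ", term)           # collapse any whitespace run
    return term.strip().lower()

assert clean_term("Chats\u00a0 mignons!!") == "chats mignons"
```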

The 24 CSV files (3 n-gram types × 2 categories × 4 countries) produced by these scripts were between 3.6M and 37.3M lines long. Each line contained a term and the interaction figure for the post in which it was found. To find the frequency of each term and to weight it by interactions, as described in the article, a pivot table was computed with pandas. An example of the code used for the Belgian corpus CSV files can be found in this notebook:
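The notebook is linked above; the core of such a pivot is short. A sketch, with "term" and "interactions" as assumed column names:

```python
# Sketch of the pandas pivot: count how often each term occurs and sum the
# interactions of the posts it appeared in. Column names are assumptions,
# not necessarily those of the repository's files.
import pandas as pd

grams = pd.read_csv("bigrams.csv")
pivot = grams.pivot_table(
    index="term",
    values="interactions",
    aggfunc=["count", "sum"],  # raw frequency and interaction-weighted figure
)
pivot.columns = ["frequency", "interactions"]
pivot = pivot.sort_values("interactions", ascending=False)
```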

Step 2: chi-squared (χ²) residuals

I then compared media and non-media unigrams, bigrams and trigrams for each country. This was done in a Jupyter notebook for each country, producing graphs with Plotly Express for Python. The raw notebooks are too big to be shared directly on GitHub, so they were placed on a personal server in HTML format:
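The notebooks themselves are linked above. As a sketch of the technique (not the notebooks' exact code): for each term, build a 2×2 contingency table of its frequency in the media and non-media subcorpora against all other terms, and compute the Pearson (chi-squared) residuals to see which subcorpus the term is most characteristic of:

```python
# Sketch of chi-squared (Pearson) residuals for one term: how much its
# observed frequency in each subcorpus deviates from what independence
# would predict. scipy returns the expected counts; the residual is
# (observed - expected) / sqrt(expected).
import numpy as np
from scipy.stats import chi2_contingency

def pearson_residuals(term_media, term_nonmedia, total_media, total_nonmedia):
    observed = np.array([
        [term_media, total_media - term_media],
        [term_nonmedia, total_nonmedia - term_nonmedia],
    ])
    _, _, _, expected = chi2_contingency(observed)
    return (observed - expected) / np.sqrt(expected)

# Hypothetical counts: a term far more frequent in the media subcorpus.
print(pearson_residuals(900, 100, 1_000_000, 1_000_000))
```

The residuals of the top terms can then be drawn as horizontal bars with plotly.express.bar, which is presumably how the notebooks' graphs were produced.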

This step is, in my opinion, the most relevant and revealing picture of a newsless Facebook.

For example, the most characteristic bigrams of the media and non-media Canadian subcorpora really show how different Facebook would be without news.

In the paper, compound graphs were published showing which terms were most characteristic in all four countries for media pages...

... and for non-media pages.

The numbers to the right of the bars show the number of countries (two or more) in which each term was found.

Step 3: topic modeling

The last step involved exploratory topic modeling on all 8 subcorpora with BERTopic, using the following script:
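The script is linked above. As a minimal BERTopic sketch, assuming a French sentence-transformers embedding model (the model name below is an illustration, not necessarily the one used, and the script may configure BERTopic quite differently):

```python
# Minimal BERTopic run on a list of French posts. The embedding model name
# is an assumption for illustration; UMAP/HDBSCAN parameters, vectorizer
# and topic reduction may differ in the repository's script.
from bertopic import BERTopic

docs = open("posts_fr.txt", encoding="utf-8").read().splitlines()

topic_model = BERTopic(
    embedding_model="dangvantuan/sentence-camembert-base",  # assumed model
    language="multilingual",
    nr_topics="auto",  # let BERTopic reduce the number of topics
)
topics, probs = topic_model.fit_transform(docs)
print(topic_model.get_topic_info().head(20))  # top topics with their terms
```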

I used BERTopic with three different models:

The four runs were performed on each month for all 8 subcorpora. Since topic modeling is extremely memory-intensive, some months with a very large amount of material had to be split in two (as in the case of the non-media French subcorpus). Below are examples of the topics obtained for the month of June for both media and non-media subcorpora, by country and by model.

Topics for media subcorpora (June 2020):

Topics for non-media subcorpora (June 2020):

I found that asking the models to provide either one or two lemmas per term (unigrams or bigrams) produced richer topics. I also found that CamemBERT produced much more coherent, robust and easy-to-interpret topics with French-language text.
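In BERTopic, the unigram/bigram choice is controlled by the vectorizer's n-gram range; a sketch of the bigram configuration (parameter values are illustrative):

```python
# Asking BERTopic for two-word terms: pass a CountVectorizer with
# ngram_range=(2, 2), or (1, 2) for a mix of unigrams and bigrams.
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(ngram_range=(2, 2), min_df=5)
topic_model = BERTopic(vectorizer_model=vectorizer, language="multilingual")
```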

The following figure, in the article, presents a compound of all 384 tables produced by my topic modeling runs, containing more than 5,000 topics in total.

I will gladly answer any questions from researchers wanting to reproduce these findings or replicate them in another context: [email protected]
