NLP Models and Data Collection Discussion

A Discussion on the Best NLP and Data Collection Approaches

This is a place we hope we can generate discussion, with both experts and non-experts, on how we're planning on moving forward in the immediate future towards a classification model for topic modeling and sentiment analysis. We've included data collection in this, as none of this can proceed until we have some labelled data.

The scope of this discussion can include:

How we are labelling our data
How we are collecting/scraping this data
Our plans for topic modeling
Our plans for sentiment analysis
How we will be classifying the resulting models

A Brief Overview of Our Current Plan

Labelling: We're going to label our data by selecting blogs and websites (or sections of websites) that have a consistent sentiment and a coherent topic in line with our chosen topics (Full List). These will be collected
Scraping: We're thinking of using Portia, Beautiful Soup or possibly Selenium. This aspect is still being discussed and we should have a final plan within the next few days.
Topic Modeling: Our current plan is to use the Doc2Vec algorithm (specifically the gensim Python library). Each topic would be used as a tag, in addition to a unique tag for each document (blog post/article). However, we're also looking into the use of labelled LDA for this stage.
Sentiment Analysis: This stage is pretty firmly decided as using Doc2Vec, as it's the state of the art for this sort of task. However, we have not decided between general sentiment detection (across all topics), topic-specific sentiment analysis (a separate sentiment model for each topic) or possibly a hybrid model. We will likely test all of the above and find what works best for our purposes.
Classification: Selecting the classification algorithm should be a fairly straightforward matter. Based on our research, we suspect an SVM will perform best, or otherwise Naive Bayes, but we'll try a broad range.
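
To make the tagging scheme in the Topic Modeling item concrete, here is a minimal sketch of how each document could carry both a topic tag and a unique per-document tag. It only builds the tagged corpus and does not train anything. TaggedDoc is a local stand-in that mirrors the (words, tags) shape of gensim's TaggedDocument, and the sample posts, the tag prefixes and the helper name tag_corpus are all hypothetical.

```python
from collections import namedtuple

# Local stand-in mirroring the (words, tags) shape of gensim's TaggedDocument.
TaggedDoc = namedtuple("TaggedDoc", ["words", "tags"])


def tag_corpus(posts):
    """Turn (topic, text) pairs into tagged documents.

    Each document gets two tags: a shared topic tag and a unique
    per-document tag (hypothetical "DOC_<n>" naming).
    """
    tagged = []
    for index, (topic, text) in enumerate(posts):
        words = text.lower().split()  # naive whitespace tokenisation
        tags = ["TOPIC_" + topic, "DOC_%d" % index]
        tagged.append(TaggedDoc(words=words, tags=tags))
    return tagged


# Made-up sample posts, for illustration only.
sample_posts = [
    ("climate_change", "Sea levels are rising faster than expected"),
    ("vaccines", "The new schedule was published this week"),
]
corpus = tag_corpus(sample_posts)
```

With gensim installed, the same word and tag lists could be wrapped in gensim.models.doc2vec.TaggedDocument and handed to Doc2Vec for training.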

We welcome questions and suggestions with regards to these topics, so please feel free to drop a comment.

Incentivization Brainstorming

A Discussion on how to Subvert Our Aversion to Dissenting Opinions

A primary problem with the proposed platform as it's conceptualized now is that most people will be extremely unwilling to engage with views they disagree with. Even if we manage to employ toxicity filtering to some extent to make the experience more palatable, it's a deeply ingrained defense mechanism that will be difficult to work around. It would be beneficial then to begin a conversation revolving around how this might be approached.

As a starting point, we were thinking of using positive feedback and reward systems, similar to those employed by many mobile games and social media sites, in order to create a cycle of positive feedback. A metric or score is usually a good place to start with this, and our current idea is to have that score be Viewpoint Variance. The idea behind this is that the greater the diversity of news sites you view and comments you read, the higher your score.
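
One possible way to make the Viewpoint Variance idea concrete is sketched below. It is only an assumption offered for discussion, not a settled design: it scores a reading history by the normalised Shannon entropy of the viewpoint labels, so a history drawn from a single viewpoint scores 0 and an even spread scores 1. The function name and the labels are hypothetical.

```python
import math
from collections import Counter


def viewpoint_variance(viewed_labels):
    """Return a 0..1 diversity score for a list of viewpoint labels.

    Assumption: diversity is measured as the Shannon entropy of the label
    distribution, divided by the maximum entropy for that many labels.
    """
    counts = Counter(viewed_labels)
    if len(counts) < 2:
        return 0.0  # nothing read, or only one viewpoint seen
    total = sum(counts.values())
    entropy = -sum((n / total) * math.log(n / total) for n in counts.values())
    return entropy / math.log(len(counts))


one_sided = viewpoint_variance(["pro", "pro", "pro", "pro"])
balanced = viewpoint_variance(["pro", "anti", "pro", "anti"])
lopsided = viewpoint_variance(["pro", "pro", "pro", "anti"])
```

Note that this version normalises by the number of viewpoints actually seen; normalising by a fixed list of known viewpoints per topic would reward breadth more directly and may be the better choice.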

There are several technical challenges that would need to be addressed and capabilities that the app would need to have to make this work, but for this discussion we should keep it theoretical to start. This is probably the greatest challenge involved in the project as it's an attempt to subvert human nature, but if we can meet this challenge, it greatly opens up the potential for more widespread impact.

We would particularly love anyone with a background in behavioural psychology, reward systems or belief change to contribute, but this discussion has no prerequisites for posting. If you think you have an interesting idea or novel approach, or believe you can build on what we've already discussed, please comment.

Working Open - How to get more contributors

Here are just some suggestions:

Web Scraping

This issue is primarily to ensure organization of any web scraping efforts. If you are going to try to scrape a URL, mention which one it is so others don't do the same.

Instructions

Sign up for Portia, a free, visual web scraping tool. Portia lets you set up simple rules for how the spider (aka web crawler) will navigate the site, and then lets you visually mark what content you want to scrape. This pattern will then be utilized on other pages. Multiple patterns can be given to ensure proper scraping across multiple page formats. There ARE likely more efficient and clever methods of scraping, but this is the most feasible I've found that people who don't have any specialized knowledge will be able to use. If you have any of that specialized knowledge, please feel free to speak up and make suggestions.

Tutorial

Tutorial Video
Portia Documentation

Important Note
Make SURE that when you have the text highlighted, it's scraping text and only text. This will mean you won't have to worry about it scraping images or other undesirable content.

Also, if you are able to get all your data with only one sample (you can add to the sample by clicking the little four square icon near the minus sign), do that and name it field1. This provides a standard and makes cleaning easier. If this isn't possible though, no worries.

Running the Scraper

It's hard to tell how long the process will run for. It can take several hours to scrape one site, depending on its size, so keep that in mind when deciding how many sites you'll scrape. Once the scraper is running, it's a good idea to check the log as soon as you can to make sure that, in general, the scraper is doing what you want it to.

Uploading data

One thing that wasn't mentioned in the tutorial (whoops) was how to upload. Click on the items number once the run has completed, and then go to the Export button in the top right. Select "JSONL" and download the file. Then upload it to the Data folder.
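
Once downloaded, the export can be sanity-checked before uploading. The sketch below assumes one JSON object per line with the scraped text under field1 (the name suggested above), and that a field value may be either a string or a list of strings; the exact layout of a real export may differ. The function name and the sample data are hypothetical.

```python
import io
import json


def read_field1(lines):
    """Collect the field1 text from each JSON Lines record."""
    texts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        value = record.get("field1")
        if value is None:
            continue  # record has no field1
        if isinstance(value, list):
            value = " ".join(value)  # assumed: lists hold text fragments
        texts.append(value)
    return texts


# Stand-in for an exported file; a real file would be opened with open(path).
sample_export = io.StringIO(
    '{"url": "http://example.org/a", "field1": ["First post.", "More text."]}\n'
    '{"url": "http://example.org/b", "field1": "Second post."}\n'
    '{"url": "http://example.org/c"}\n'
)
texts = read_field1(sample_export)
```

Records without field1 are skipped rather than treated as errors, which makes it easy to spot how many items the spider scraped without capturing any text.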

Thank you so much for your contribution!

Contributing to EchoBurst

How to Contribute Discussion and Questions

The README and CONTRIBUTING pages discuss how to get started contributing, but if you have any questions, comments or concerns regarding how to get started or even just about the project itself, post them here. If we get enough questions or recurring concerns, we'll add them to a FAQ page on the Wiki as well.

Topic Classification

Creating an Initial Topic Identification Model

We have created vector models in both Word2Vec and Doc2Vec, and we are now aiming to use these vectors to create features for a classification or topic model that will correctly identify when a topic from a predefined list is being discussed in a comment. We are looking at different possibilities, including custom though imperfect datasets that use subreddit names as labels (generalized into broader topics), or possibly using a classic dataset such as 20 Newsgroups as a proof of concept.
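
One common way to turn word vectors into a fixed-length feature for a classifier is to average the vectors of the words in a comment. The sketch below shows that idea in plain Python with a made-up three-dimensional vocabulary. In practice the vectors would come from the trained Word2Vec model, and this is not necessarily the approach the project will settle on.

```python
def average_vector(tokens, word_vectors, size):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    total = [0.0] * size
    known = 0
    for token in tokens:
        vector = word_vectors.get(token)
        if vector is None:
            continue
        known += 1
        for i, component in enumerate(vector):
            total[i] += component
    if known == 0:
        return total  # all-zero vector when no token is in the vocabulary
    return [component / known for component in total]


# Toy vectors, invented for illustration.
toy_vectors = {
    "climate": [1.0, 0.0, 0.0],
    "warming": [0.0, 1.0, 0.0],
}
features = average_vector(["climate", "warming", "hoax"], toy_vectors, size=3)
```

The resulting list can be used as one row of a feature matrix, with the topic label as the target.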

We will be using the gensim library to create the model and hope to have it completed by the end of the week.

Any expertise or advice on topic modeling would be appreciated.

Identification of Polarized Blog Posts

Labelling Blog Sites

We need labelled data for various topics and sentiments, and we need a lot of it. We have decided on a form of labelling called distant supervision, where we use heuristics and tags to classify far more text than we could possibly label manually, the idea being that the cost of potentially mislabelling some data is outweighed by the far greater volume. In order to do this we have targeted opinion blogs for three main reasons:

They contain far more text than a single social media comment
Posts on the same site should largely hold the same sentiment or point of view for a given topic
Unlike news articles, they should be very semantically similar to comments
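
The distant supervision idea can be sketched as follows: every post inherits the label of the site it came from, with no per-post manual labelling. The site-to-label table, the domains and the posts are hypothetical examples.

```python
from urllib.parse import urlparse

# Hypothetical site-level labels: domain -> (topic, sentiment).
SITE_LABELS = {
    "example-climate-blog.org": ("Climate Change", "IsReal"),
    "example-skeptic-blog.com": ("Climate Change", "Skeptic"),
}


def label_posts(posts):
    """Attach the site-level (topic, sentiment) label to each post.

    posts is a list of (url, text) pairs; posts from unlisted sites are dropped.
    """
    labelled = []
    for url, text in posts:
        domain = urlparse(url).netloc.lower()
        if domain.startswith("www."):
            domain = domain[len("www."):]
        label = SITE_LABELS.get(domain)
        if label is not None:
            topic, sentiment = label
            labelled.append({"text": text, "topic": topic, "sentiment": sentiment})
    return labelled


labelled = label_posts([
    ("http://www.example-climate-blog.org/2017/05/post", "Some post text"),
    ("http://unknown-site.net/post", "Ignored text"),
])
```

Any individual post may be mislabelled this way, which is the trade-off described above: some noise in exchange for far more labelled text.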

We will need to scrape this data, meaning we first need to label potential target sites. To do this we need people to pick a topic, such as global warming, vaccination, religion/atheism or some other polarizing topic. Once that topic is decided on, try to find blogs that deal more or less exclusively with this topic, and determine the dominant sentiment of official posts on the site (not comments). Check that the sentiment is fairly consistent between posts and authors (if there's more than one).

Once a site or domain is determined to be a good target, enter the URL into a text file. The text file should be named in the format: Topic of Blog Posts - Sentiment (e.g. Climate Change - Denial, Abortion - Pro Choice, etc.). Each file should contain only one leaning for the sake of easily running them through any automated scraper we create. Avoid ambiguously leaning sites (those that post from both sides) or those whose topic varies significantly.
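
Because the file name carries the label, an automated scraper could recover the topic and sentiment from it. A minimal sketch, assuming the name uses " - " as the separator exactly as in the examples above and that the files end in .txt (the extension is an assumption); the function name is hypothetical.

```python
import os


def parse_label_filename(path):
    """Split 'Topic - Sentiment.txt' into (topic, sentiment)."""
    name = os.path.splitext(os.path.basename(path))[0]
    # Split on the first " - " only, so a hyphen inside the sentiment
    # (for example "Pro-choice") is left intact.
    topic, separator, sentiment = name.partition(" - ")
    if not separator or not topic.strip() or not sentiment.strip():
        raise ValueError("expected 'Topic - Sentiment', got %r" % name)
    return topic.strip(), sentiment.strip()


label = parse_label_filename("URL Dump/Abortion - Pro Choice.txt")
```

A file whose name does not follow the format raises an error instead of being silently mislabelled.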

What should be in the file

The first is the domain of the website, which will be used to limit where a crawler can go and which links it can follow. It should not include 'http://' or 'www', but simply the domain name, such as realclimate.org.

The next is the URL pattern for the blog posts, meaning the longest URL prefix shared by all blog pages on that site. For example, on realclimate.org all of the blog posts can be found by year, e.g. http://www.realclimate.org/index.php/archives/2017/05/ or http://www.realclimate.org/index.php/archives/2016/03/. Thus, the common URL would be http://www.realclimate.org/index.php/archives/20. This is not itself a valid URL, but all valid blog post URLs MUST contain this sequence. This makes it easy for anyone scraping with Portia or other scrapers to simply enter this sequence into the regex section when designing a spider and then set it loose. Finally, if you want, you can add a subjective evaluation of how extreme you believe the site to be in its position, with 1 being centrist and 5 being extremist. A template is available in the URL Dump folder; remember to name your file with the topic and sentiment.
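
The two entries in the file translate directly into a crawl filter: the domain restricts where the spider may go, and the common URL sequence picks out the blog posts. The sketch below uses the realclimate.org values from the text. The function names are hypothetical, and a simple substring test stands in for the pattern field of whichever scraper is used.

```python
from urllib.parse import urlparse


def bare_domain(url):
    """Return the host without scheme or a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[len("www."):] if host.startswith("www.") else host


def is_blog_post(url, domain, pattern):
    """True when the URL is on the allowed domain and contains the pattern."""
    return bare_domain(url) == domain and pattern in url


DOMAIN = "realclimate.org"
PATTERN = "http://www.realclimate.org/index.php/archives/20"

post_ok = is_blog_post(
    "http://www.realclimate.org/index.php/archives/2017/05/", DOMAIN, PATTERN
)
other_page = is_blog_post(
    "http://www.realclimate.org/index.php/about/", DOMAIN, PATTERN
)
off_site = is_blog_post(
    "http://example.com/index.php/archives/2017/05/", DOMAIN, PATTERN
)
```

Here only the first URL passes: the second is on the right domain but lacks the archive sequence, and the third is on a different domain.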

The list of possible topics includes but is not limited to:

Climate Change - IsReal/Skeptic
Abortion - Pro-life/Pro-choice
Religion - Believers/Non-believers
Vaccines - Pro-vaccination/anti-vaccination
Guns - Pro-gun/Anti-gun
Drug Policy - Criminalization/Decriminalization and Legalization

We have deliberately stayed away from topics like Politics - Left/Right or Libertarian/Authoritarian for two reasons:

These sorts of categories are quite general and tend to encompass many of the above topics
Defining what is Left vs what is Right is more subjective and inconsistent person to person.

If you choose to create your own topic, please keep in mind that it should be clear and unambiguous as well as broad. For example, Yankees vs. Red Sox would not be a good topic as it's very specific. If you have any doubts, please comment on this issue with your suggested topic and we'll give you feedback. Also, while any self-directed initiative is encouraged, keep in mind that we'd rather have a lot of data for just a few topics than sparser data for many topics.

Thank you for your efforts and patience.

Compiling YouTube video playlists

We're looking to extend data collection from the captions of YouTube videos.

As a start, it would be useful to gather playlists for the different topics. Currently, the most effective approach would be to curate playlists that are consistent on both topic and position, i.e. a separate playlist for climate change videos vs climate change denial videos.

We are mainly interested in videos in which the captions are NOT autogenerated. However, because further down the line we might look into extracting useful data from autogenerated captions, it would also be useful to compile videos with autogenerated captions separately. So if you do come across them, just add them to a separate list (no need to thematically separate that at this point).

We're open to suggestions on the most effective way to centralise the resulting playlists. Let us know what you think. Otherwise, just drop a link to any playlists you create here for the time being.

Roadmap

This is an ideal set of steps we would take. What we focus on and when things are completed is subject to change.

March and April

Fix up the repo
Collect data, particularly social media data.
Read up on the latest NLP breakthroughs such as BERT, Transformers, etc.
Read up on some of the specific sub-problems such as text summarization and topic classification

May

Develop an effective political leaning classifier
Research methods of incorporating ML into web extensions, and how they should be structured to ensure they aren't resource intensive for the user
Develop an effective topic classifier
Create a dead simple testing platform and test the effectiveness of the combined leaning/topic models.

June

Develop an event classifier and determine the general feasibility of this segment of the project, as it's subject to external factors
Lay out a framework for the extension or application, determine server requirements
Continue testing real world performance of existing classifiers using local testing platform
Build up a prototype extension with the existing models for very basic functionality

July

Ideally, soft launch on the MVP, though this is likely not feasible
Develop a set of text summarizers using a variety of parameters, data subsets and techniques if necessary
Continue developing extension. I'm going to learn to hate web programming all over again this summer
Establish the framework for developing the fake news classifier. Owing to the potential politicized subjectivity of what counts as fake news, this is an important step before development for the credibility of the project

August

Develop toxicity classifier
Continue working on extension
Develop fake news classifier

code of conduct

Mind if I use your code of conduct as a template for Pi Reel?

Click here for more info on Pi Reel. It's still a work in progress.

echoburst's Introduction

EchoBurst

Table of Contents

Welcome

The Revival

The Problem

Our Solution

Why It Matters