tyjk / echoburst

A browser extension that uses sentiment analysis to find and highlight constructive comments on various social media platforms that oppose the user's worldview, in order to encourage users to break out of the echo chambers the internet has allowed us to construct.

License: MIT License

Python 100.00%
echo-chamber conversation social-media nlp python

echoburst's People

Contributors

annakrystalli, jelliotartz, tyjk

echoburst's Issues

Contributing to EchoBurst

How to Contribute Discussion and Questions

The README and CONTRIBUTING pages discuss how to get started contributing, but if you have any questions, comments or concerns about getting started, or even just about the project itself, post them here. If we get enough questions or recurring concerns, we'll add them to a FAQ page on the Wiki as well.

Identification of Polarized Blog Posts

Labelling Blog Sites

We need labelled data for various topics and sentiments, and we need a lot of it. We have decided on a form of labelling called distant supervision, where we use heuristics and tags to classify far more text than we could possibly label manually. The idea is that the cost of mislabelling some data is outweighed by the far greater volume. To do this we have targeted opinion blogs for 3 main reasons:

  • They contain far more text than a single social media comment
  • Posts on the same site should largely hold the same sentiment or point of view for a given topic
  • Unlike news articles, they should be very semantically similar to comments
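As a rough illustration of distant supervision, the sketch below gives every scraped post the label of the site it came from instead of labelling posts one by one. The `SITE_LABELS` table, its example domains and the `label_post` helper are hypothetical names made up for this example; they are not part of the project's code.

```python
from urllib.parse import urlparse

# Hypothetical site-level labels: one (topic, sentiment) pair per blog.
SITE_LABELS = {
    "example-climate-blog.org": ("Climate Change", "IsReal"),
    "example-skeptic-blog.com": ("Climate Change", "Skeptic"),
}


def label_post(url, text):
    """Give a post the label of its site, or None if the site is unlabelled."""
    host = urlparse(url).netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    label = SITE_LABELS.get(host)
    if label is None:
        return None
    topic, sentiment = label
    return {"text": text, "topic": topic, "sentiment": sentiment}
```

Some posts will get the wrong label this way, which is the trade-off described above: a little noise in exchange for much more data.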

We will need to scrape this data, meaning we first need to label potential target sites. To do this we need people to pick a topic, such as global warming, vaccination, religion/atheism or some other polarizing topic. Once that topic is decided on, try to find blogs that deal more or less exclusively with this topic, and determine the dominant sentiment of official posts on the site (not comments). Check that the sentiment is fairly consistent between posts and authors (if there's more than one).

Once a site or domain is determined to be a good target, enter the URL into a text file. The text file should be named in the format: Topic of Blog Posts - Sentiment (e.g. Climate Change - Denial, Abortion - Pro Choice, etc). Each file should contain only one leaning so that the files can easily be run through any automated scraper we create. Avoid ambiguously leaning sites (those that post from both sides) or those whose topic varies significantly.
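The naming convention can be checked mechanically. The sketch below splits a file name of the form "Topic - Sentiment" into its two parts; the `parse_label_filename` helper and the `.txt` extension are assumptions made for illustration.

```python
from pathlib import Path


def parse_label_filename(path):
    """Split a 'Topic - Sentiment' file name into (topic, sentiment)."""
    stem = Path(path).stem  # file name without directory or extension
    topic, sep, sentiment = stem.partition(" - ")
    if not sep or not topic.strip() or not sentiment.strip():
        raise ValueError(f"expected 'Topic - Sentiment', got {stem!r}")
    return topic.strip(), sentiment.strip()
```

Splitting on " - " with the surrounding spaces means a hyphenated sentiment such as Pro-choice stays in one piece.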

What should be in the file

The first entry is the domain of the website, which will be used to limit where a crawler can go and which links it can follow. It should not include 'http://' or 'www', just the domain name, such as realclimate.org.

The next is the URL pattern for the blog posts, meaning the longest prefix shared by the URLs of all blog pages on that site. For example, on realclimate.org all of the blog posts can be found by year, e.g. http://www.realclimate.org/index.php/archives/2017/05/ or http://www.realclimate.org/index.php/archives/2016/03/. Thus, the common URL would be http://www.realclimate.org/index.php/archives/20. This is not itself a valid URL, but all valid URLs MUST contain this sequence. This makes it easy for anyone scraping with Portia or other scrapers to simply enter this sequence into the regex section when designing a spider and then set it loose. Finally, if you want, you can add a subjective evaluation of how extreme you believe the site to be in its position, with 1 being centrist and 5 being extremist. A template is available in the URL Dump folder; remember to name your file with the topic and sentiment.
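To show how the two entries could be used together, here is a minimal sketch that filters candidate links with the realclimate.org values from the example above. The function names `in_domain` and `is_blog_post` are hypothetical, and this is not how Portia applies the pattern internally.

```python
import re
from urllib.parse import urlparse

DOMAIN = "realclimate.org"
URL_PREFIX = "http://www.realclimate.org/index.php/archives/20"

# Escape the prefix so characters such as '.' are matched literally.
POST_PATTERN = re.compile(re.escape(URL_PREFIX))


def in_domain(url, domain=DOMAIN):
    """True if the URL's host is the domain itself or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return host == domain or host.endswith("." + domain)


def is_blog_post(url):
    """True if the URL is on the site and contains the common post prefix."""
    return in_domain(url) and POST_PATTERN.search(url) is not None
```

The domain check keeps a crawler on the site, while the prefix check separates blog posts from other pages such as an about page.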

The list of possible topics includes but is not limited to:

  • Climate Change - IsReal/Skeptic
  • Abortion - Pro-life/Pro-choice
  • Religion - Believers/Non-believers
  • Vaccines - Pro-vaccination/anti-vaccination
  • Guns - Pro-gun/Anti-gun
  • Drug Policy - Criminalization/Decriminalization and Legalization

We have deliberately stayed away from topics like Politics - Left/Right or Libertarian/Authoritarian for two reasons:

  • These sorts of categories are quite general and tend to encompass many of the above topics
  • Defining what is Left vs what is Right is more subjective and inconsistent person to person.

If you choose to create your own topic, please keep in mind that it should be clear and unambiguous as well as broad. For example, Yankees vs. Red Sox would not be a good topic as it's very specific. If you have any doubts, please comment on this issue with your suggested topic and we'll give you feedback. Also, while any self-directed initiative is encouraged, keep in mind that we'd rather have a lot of data for just a few topics than sparser data for many topics.

Thank you for your efforts and patience.

Roadmap

This is an ideal set of steps we would take. What we focus on and when things are completed is subject to change.

March and April

  • Fix up the repo
  • Collect data, particularly social media data.
  • Read up on the latest NLP breakthroughs such as BERT, Transformers, etc.
  • Read up on some of the specific sub-problems such as text summarization and topic classification

May

  • Develop an effective political leaning classifier
  • Research methods of incorporating ML into web extensions, and how they should be structured to ensure they aren't resource intensive for the user
  • Develop an effective topic classifier
  • Create a dead simple testing platform and test the effectiveness of the combined leaning/topic models.

June

  • Develop an event classifier and determine the general feasibility of this segment of the project, as it's subject to external factors
  • Lay out a framework for the extension or application, determine server requirements
  • Continue testing real world performance of existing classifiers using local testing platform
  • Build up a prototype extension with the existing models for very basic functionality

July

  • Ideally, soft launch of the MVP, though this is likely not feasible
  • Develop a set of text summarizers using a variety of parameters, data subsets and techniques if necessary
  • Continue developing extension. I'm going to learn to hate web programming all over again this summer
  • Establish the framework for developing the fake news classifier. Owing to the potential politicized subjectivity of what counts as fake news, this is an important step before development for the credibility of the project

August

  • Develop toxicity classifier
  • Continue working on extension
  • Develop fake news classifier

Web Scraping

This issue is primarily to ensure organization of any web scraping efforts. If you are going to try to scrape a URL, mention which one it is so others don't do the same.

Instructions

Sign up for Portia, a free, visual web scraping tool. Portia lets you set up simple rules for how the spider (aka web crawler) will navigate the site, and then lets you visually mark what content you want to scrape. This pattern will then be utilized on other pages. Multiple patterns can be given to ensure proper scraping across multiple page formats. There ARE likely more efficient and clever methods of scraping, but this is the most feasible I've found that people who don't have any specialized knowledge will be able to use. If you have any of that specialized knowledge, please feel free to speak up and make suggestions.

Tutorial

Tutorial Video
Portia Documentation

Important Note
Make SURE that when you have the text highlighted, it's scraping text and only text. This will mean you won't have to worry about it scraping images or other undesirable content.

Also, if you are able to get all your data with only one sample (you can add to the sample by clicking the little four square icon near the minus sign), do that and name it field1. This provides a standard and makes cleaning easier. If this isn't possible though, no worries.

Running the Scraper

It's hard to tell how long the process will run for. It can take several hours to scrape one site, depending on its size, so keep that in mind when deciding how many sites you'll scrape. Once the scraper is running, it's a good idea to check the log as soon as you can to make sure that, in general, the scraper is doing what you want it to.

Uploading data

One thing that wasn't mentioned in the tutorial (whoops) was how to upload. Click on the job's item count once the job has completed, and then go to the Export button in the top right. Select "JSONL" and download the file. Then upload it to the Data folder when finished.
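For anyone cleaning the exported files, the sketch below reads a JSONL export (one JSON object per line) and pulls out the text stored under field1. The `read_field1` helper is hypothetical, and the handling of list values is an assumption about how the export may store a field rather than a documented guarantee.

```python
import json


def read_field1(lines):
    """Yield the text of field1 from each JSONL line that has it."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        item = json.loads(line)
        value = item.get("field1")
        if value is None:
            continue  # item was scraped without the standard field
        # Assumption: a field may be stored as a list of strings.
        if isinstance(value, list):
            value = " ".join(str(v) for v in value)
        yield str(value).strip()


# Typical use, with a hypothetical file name:
# with open("export.jsonl", encoding="utf-8") as f:
#     texts = list(read_field1(f))
```

This is where naming the sample field1 pays off: one reader works for every contributor's export.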

Thank you so much for your contribution!

NLP Models and Data Collection Discussion

A Discussion on the Best NLP and Data Collection Approaches

This is a place we hope we can generate discussion, with both experts and non-experts, on how we're planning on moving forward in the immediate future towards a classification model for topic modeling and sentiment analysis. We've included data collection in this, as none of this can proceed until we have some labelled data.

The scope of this discussion can include:

  • How we are labelling our data
  • How we are collecting/scraping this data
  • Our plans for topic modeling
  • Our plans for sentiment analysis
  • How we will be classifying the resulting models

A Brief Overview of Our Current Plan

  • Labelling: We're going to label our data by selecting blogs and websites (or sections of websites) that have a consistent sentiment and a coherent topic in line with our chosen topics (Full List). These will be collected
  • Scraping: We're thinking of using Portia, Beautiful Soup or possibly Selenium. This aspect is still being discussed and we should have a final plan within the next few days.
  • Topic Modeling: Our current plan is to use the Doc2Vec algorithm (specifically the gensim Python library). Each topic would be used as a tag, in addition to a unique tag for each document (blog post/article). However, we're also looking into the use of labelled LDA for this stage.
  • Sentiment Analysis: This stage is pretty firmly decided as using Doc2Vec, as it's the state of the art for this sort of task. However, we have not decided between general sentiment detection (across all topics), topic-specific sentiment analysis (a separate sentiment model for each topic) or possibly a hybrid model. We will likely test all of the above and find what works best for our purposes.
  • Classification: Selecting the classification algorithm should be a pretty trivial matter. Based on our research, we suspect an SVM will perform best, or else Naive Bayes, but we'll try a broad range.
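To make the classification step concrete, here is a minimal Naive Bayes sketch in plain Python. It is only an illustration of one of the candidate algorithms: it works on raw word counts rather than the Doc2Vec vectors described above, and the training sentences are made up for the example.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over whitespace tokens with add-one smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> token counts
        self.doc_counts = Counter(labels)        # label -> number of documents
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = text.lower().split()
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus smoothed log likelihood of each token.
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy, made-up training data; real input would be the scraped blog posts.
train_texts = [
    "warming trend confirmed by temperature record",
    "emissions drive the warming trend",
    "vaccine trial shows strong immunity",
    "immunity after the vaccine dose",
]
train_labels = ["Climate Change", "Climate Change", "Vaccines", "Vaccines"]

model = NaiveBayes().fit(train_texts, train_labels)
```

Calling `model.predict("vaccine dose immunity")` picks the label whose training posts best match the words in the comment. The same interface could be kept if the features were later swapped for document vectors and the model for an SVM.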

We welcome questions and suggestions with regards to these topics, so please feel free to drop a comment.

code of conduct

Mind if I use your code of conduct as a template for Pi Reel?

Click here for more info on Pi Reel. It's still a work in progress.

Working Open - How to get more contributors

Here are just some suggestions:

  • Add a CONTRIBUTING.md file linked in the README.md with clear instructions how people can contribute and contact you
  • Move all gathered resources to the repo Wiki
  • Create an IRC or Gitter.im channel to have an open discussion
  • Move meeting notes from Google Docs to a public etherpad
  • Put a link to the etherpad in the README.md
  • Don't forget these: mozillascience/WOW-2017#26
  • Maybe have the Roadmap as an issue instead of a file, so that people can discuss it
  • Put a link to the Roadmap issue in the README.md
  • Use a simple style for issue labels
  • Create a project board to manage issues and track progress with columns such as "To-Do", "Doing", "Done" (more about Kanban boards)

Incentivization Brainstorming

A Discussion on how to Subvert Our Aversion to Dissenting Opinions

A primary problem with the proposed platform as it's conceptualized now is that most people will be extremely unwilling to engage with views they disagree with. Even if we manage to employ toxicity filtering to some extent to make the experience more palatable, it's a deeply ingrained defense mechanism that will be difficult to work around. It would be beneficial then to begin a conversation revolving around how this might be approached.

As a starting point, we were thinking of using positive feedback and reward systems, similar to those employed by many mobile games and social media sites, in order to create a cycle of positive feedback. A metric or score is usually a good place to start with this, and our current idea is to have that score be Viewpoint Variance. The idea behind this is that the greater the diversity of news sites you view and comments you read, the higher your score.
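One possible way to turn this idea into a number, offered purely as a starting point for discussion, is the normalised Shannon entropy of the leanings a user has viewed. The formula, the `viewpoint_variance` function and its `n_positions` parameter are assumptions for this sketch; the project has not settled on a definition.

```python
import math
from collections import Counter


def viewpoint_variance(viewed_leanings, n_positions=2):
    """Score from 0 (one-sided reading) to 1 (evenly spread over positions).

    viewed_leanings: one leaning label per item the user viewed.
    n_positions: how many positions exist on the topic (assumed known).
    """
    counts = Counter(viewed_leanings)
    total = sum(counts.values())
    if total == 0 or n_positions < 2:
        return 0.0
    entropy = -sum((n / total) * math.log(n / total) for n in counts.values())
    # Clamp so rounding or unexpected extra labels never leave the 0..1 range.
    return min(1.0, max(0.0, entropy / math.log(n_positions)))
```

Under this definition, reading only one side scores 0, an even split scores 1, and a three-to-one split falls in between. That rewards balance rather than sheer volume.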

There are several technical challenges that would need to be addressed and capabilities that the app would need to have to make this work, but for this discussion we should keep it theoretical to start. This is probably the greatest challenge involved in the project as it's an attempt to subvert human nature, but if we can meet this challenge, it greatly opens up the potential for more widespread impact.

We would particularly love anyone with a background in behavioural psychology, reward systems or belief change to contribute, but this discussion has no prerequisites for posting. If you think you have an interesting idea or novel approach, or believe you can build on what we've already discussed, please comment.

Compiling YouTube video playlists

We're looking to extend data collection from the captions of YouTube videos.

As a start, it would be useful to gather playlists for the different topics. Currently, the most effective approach would be to curate playlists that are consistent on both topic and position, i.e. one playlist for videos that accept climate change and a separate one for videos that deny it.

We are mainly interested in videos in which the captions are NOT autogenerated. However, because further down the line we might look into extracting useful data from autogenerated captions, it would also be useful to compile videos with autogenerated captions separately. So if you do come across them, just add them to a separate list (no need to separate that list by theme at this point).

We're open to suggestions on the most effective way to centralise the resulting playlists. Let us know what you think. Otherwise, just drop a link to any playlists you create here for the time being.

Topic Classification

Creating an Initial Topic Identification Model

We have created vector models in both Word2Vec and Doc2Vec, and we now aim to use these vectors as features for a classification or topic model that will correctly identify when a topic from a predefined list is being discussed in a comment. We are looking at different possibilities, including custom though imperfect datasets that use subreddit names as labels (generalized into broader topics), or possibly using a classic dataset such as 20 Newsgroups as a proof of concept.
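The subreddit-as-label idea can be sketched as a simple lookup that generalizes subreddit names into broader topics. The `SUBREDDIT_TOPICS` mapping and the `label_comments` helper below are hypothetical examples, not the project's actual dataset.

```python
# Hypothetical mapping from subreddit names to broader topic labels.
SUBREDDIT_TOPICS = {
    "climate": "Climate Change",
    "climateskeptics": "Climate Change",
    "guns": "Guns",
    "gunpolitics": "Guns",
}


def label_comments(comments):
    """Turn (subreddit, text) pairs into (text, topic) training examples."""
    examples = []
    for subreddit, text in comments:
        topic = SUBREDDIT_TOPICS.get(subreddit.lower())
        if topic is not None:  # drop comments from unmapped subreddits
            examples.append((text, topic))
    return examples
```

The labels are imperfect because not every comment in a subreddit is on topic, which is why the dataset above is described as custom though imperfect.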

We will be using the gensim library to create the model and hope to have it completed by the end of the week.

Any expertise or advice on topic modeling would be appreciated.
