jannis-baum / weblock

An ad-blocker that secretly censors content at the deployer's discretion. HPI-based project aiming to raise awareness about possible adverse effects of AI.

License: MIT License

Languages: Python 80.46%, JavaScript 10.90%, Shell 7.73%, HTML 0.91%
Topics: btm, nlp, research, text-manipulation, text-matching, topic-modeling, short-text-semantic-similarity

weblock's Introduction

weBlock

weBlock claims to be an ad-blocker that runs on a server to save client-side processing power, but could secretly censor data at the deployer's discretion.

This is a collaborative project with lucasliebe and tiny-fish-T and is being developed within the scope of a course named DarkAI at HPI. The course's goal is to raise awareness about possible harms and threats of artificial intelligence and to provoke more critical thinking in handling software products.

Installation

This project consists of a client and a server, which can be installed and run separately. Both actors require a working installation of Mozilla Firefox and Zsh or Bash. In addition, the client requires Node.js and Yarn, and the server requires Python 3 together with python3-venv, Geckodriver (installed by downloading the operating system's respective executable and moving it into a directory on your $PATH), Make and GCC.

With the above requirements met, the install script can be run in Bash or Zsh, e.g. ./install.sh, to install both client and server, or with an argument client or server to install only the respective actor.
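
For example (arguments as described above):

    ./install.sh          # install both client and server
    ./install.sh client   # install only the client
    ./install.sh server   # install only the server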

Usage

Client

To run the client, go to the client/ directory and run yarn start. This will open an instance of Firefox with the ad-blocking / censoring extension loaded. The toolbar will show weBlock's icon, where you can set the address of the server on which weBlock's server side is deployed. By default, this is assumed to be localhost.
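
A minimal session, assuming you start from the repository root:

    cd client
    yarn start    # opens Firefox with the weBlock extension loaded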

Browsing with the extension loaded will behave normally, but you will notice a circle icon show up on the right side of the URL field for tabs with supported webpages (http(s), html). Clicking the icon once will put the ad-blocker to work, doing its best to remove advertisements and giving you a preview of what content will be censored by coloring it red. Clicking it a second time will engage censoring and replace the red text with content the censorer (server) deems friendly but still contextually relevant.

Server

Configure your censorship

After installation you should change the variables in server/.env to match your desired censoring configuration. By default, these are set to CHANGEME placeholder values for you to customize to your desired content.

NEGATIVE_QUERIES should hold a comma-separated list of Google News search queries for articles with negative opinions about your topic. For instance, a configuration to promote the opinions of the Flat Earth Society could be round earth site:news.com, earth globe.

Similarly, POSITIVE_QUERIES should hold a comma-separated list of Google News queries, this time for articles that support your view, e.g. site:flatearthsociety.com when:7d.

More details about queries can be found inside the detailed guide below.

To make your censoring more precise, it is also recommended to add a list of CENSOR_REQUIREMENTS. A paragraph will only be censored if it contains one of these words (or a synonym of one of them), e.g. earth, planet, sphere.
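
Putting this together, a server/.env for the flat-earth example above might look like the following sketch. The values are illustrative, and treating CENSOR_REQUIREMENTS as comma-separated is an assumption by analogy with the query lists:

    # illustrative values from the example above
    NEGATIVE_QUERIES=round earth site:news.com, earth globe
    POSITIVE_QUERIES=site:flatearthsociety.com when:7d
    # assumed comma-separated, analogous to the query lists
    CENSOR_REQUIREMENTS=earth, planet, sphere
    # TRAINING_* variables (see detailed guide below) left at their defaults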

Quick start with example

Run source server/activate, then server/scrape-positive -t && server/scrape-negative && server/run-backend, and use the client as described above once the scripts are ready, i.e. the message

setup done, waiting for connection

has shown up.
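
Spelled out step by step, the quick start amounts to the following (comments summarize each script's role, per the detailed guide below):

    source server/activate       # enter the server's virtual environment
    server/scrape-positive -t    # scrape positive examples and train the Biterm Topic Model
    server/scrape-negative       # scrape and cluster negative examples
    server/run-backend           # start the backend and wait for a client connection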

Detailed guide

Before using any of the server's functionality, source the activate file in server/ in your Bash or Zsh shell. This will load the virtual environment the server lives in and greet you with (weBlock-server) in your shell's prompt.

weBlock's server side is managed by three executable scripts in the server/ directory, namely scrape-positive, scrape-negative and run-backend.

Data collection & building models: scraping & training

For the collection of data, weBlock relies on Google News to scrape recently published articles. Those articles in turn are then scraped for information used to train its natural language processing models and build a database of paragraphs used to replace censored content.

scrape-negative is used to collect examples for what is undesired by the censorer. It searches Google News with the comma-separated queries defined in the environment variable NEGATIVE_QUERIES in server/.env.
scrape-negative also clusters the scraped summaries with an implementation of random search. Random search requires a number of iterations, set with -i (default 50), and a sample size, set with -s (default 5), i.e. the number of clusters that will be in the resulting database. The scraped articles are then used as negative examples in censoring, where the Word Mover's Distance of a paragraph to the scraped articles' summaries plays a role in determining whether that paragraph should be censored.
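
For example, clustering with more iterations and a larger resulting database (the values here are hypothetical):

    server/scrape-negative -i 100 -s 8   # 100 random-search iterations, 8 clusters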

scrape-positive is used to collect examples for what is desired by the censorer. It, analogously to scrape-negative, searches Google News with the comma-separated queries defined in the environment variable POSITIVE_QUERIES in server/.env.
If scrape-positive is run with the argument -t or --train, the articles resulting from this scraping are used to train a Biterm Topic Model with the parameters defined by the TRAINING_* environment variables in server/.env. Leave these parameters unchanged for fast but far-from-optimal results. Training is necessary on the first run, but can later be skipped to reuse the existing BTM.
The Biterm Topic Model makes the key decision in finding which of the scraped positive, desired examples in the database will be used to replace a paragraph that is marked for censorship.

Both scrape-positive and scrape-negative have an optional argument -n or --narticles that can be used to define an upper limit for how many articles are scraped per query. This argument defaults to 10 if omitted.
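
For instance, to cap both scripts at 20 articles per query (a hypothetical value) while also training the topic model:

    server/scrape-positive --train --narticles 20
    server/scrape-negative --narticles 20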

Note that Google search operators such as the site: or when: modifiers can strongly refine your search queries (e.g. when:7d constrains results to articles published in the past week). See this incomplete list of operators.

Running the server

With data collection and model training done, the server now has sufficient data to act as weBlock's backend. To run the backend, execute server/run-backend. Once it's ready, use the client as described above.
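
Assuming scraping and (on the first run) training have completed, this looks like:

    server/run-backend
    # ... once ready:
    # setup done, waiting for connection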

Scalability

Since this is a proof-of-concept prototype focused on the natural language processing side of the project, some features that would be significant for scalability and real-world use have been left unaddressed for the sake of ease of use, ease of installation, project size and prioritization of human resources. These include, but are not limited to:

  • database & RAM: this project does not use a real database but simple text files, and it is strongly constrained by available RAM (e.g. the entire database's contents may be loaded into RAM at times)
  • server: the current architecture uses a simple socket for serving the (singular) client
  • full performance optimization
  • censorship of non-textual & non-html content (i.e. images, videos, documents)

Disclaimer

As stated above, this project is aimed at raising awareness about possible harms and threats AI can pose and is therefore not intended for any malicious use or use diverging from this intention. This is also why censoring does not happen in secret and "behind the scenes" as it could, but is implemented as a two-step, manually triggered process on the client side, and why censored and modified paragraphs are colored red.

weblock's People

Contributors

dependabot[bot], jannis-baum, lucasliebe, tiny-fish-t


weblock's Issues

Assert being in `venv` when starting scripts

  • create function that checks if environment variable VIRTUAL_ENV is set to path of our venv
  • call function in beginning of run_backend, scrape_positive and scrape_negative
  • throw error and exit if not in venv

Polish scraping

  • add argparse to scrape_negative and offer option to specify number of articles analogous to scrape_positive
    • scraping with narticles = 0 doesn't terminate in scrape_positive
    • provide default narticles > 0
  • print progress updates (positive & negative)
  • clean / filter data before insertion into positive database
    • the paragraph \_ has recently appeared in the database; this leads to no words being left after normalization and a topic vector with nan components (nan > 1 → it will be the text matcher's favorite text to match)
    • "copyright", "published on date x", etc paragraphs are undesired

Todo-List

  • use Google News for summaries and clean the database
  • use summaries for similarity
  • check / make it more reliable on multiple sites
  • try a NoSQL DB and compare
  • prevent using one source twice
  • include OpenAI in sentence generation
  • maybe pre-generate sentences?

Clean up `NLProcessor`

  • reduce redundancy
  • add download for nltk components to install script to remove from nlp.py

Project website

  • example screenshots / video
  • take viewer through scenario, tell story
  • technical overview

Finish summarization, similarity & censorship

summarization

  • clean up articles before summarizing
  • prevent using same url twice
  • give option to scrape multiple queries
  • give option to set sources

similarity

  • give option to set censoring topic and constrain used summaries to corresponding database entries

censorship

  • find new threshold value (everybody)
  • improve performance
    • evaluating similarity to all summarizations in database takes too long (limit number of summarizations / pick newest or most relevant ones)
    • shorten summarizations significantly
  • if enough time: consider improving censorship formula (everybody)

Get project ready for submission

  • installation script(s)
    • mock local database
    • use Python venvs
    • requirements.txt
    • download nltk stuff (and remove downloads from nlp.py)
  • make client-server interaction work on localhost (@lucasliebe)
    • give choice to user in extension for server to use
  • documentation (@jannis-baum)
    • add document to describe what would have to be done to make this scalable (e.g. database & RAM, server setup, program for easy modification (instead of venv), etc)
    • instructions for how to use (how to run install, scraping workflow and how to start client / server)
  • clean out real database
  • check performance on generic websites (everybody)
  • code consistency (@lucasliebe)
    • kebab vs camel case in non-.py-filenames
    • consider linter

Find solution for `censoring-requirements`

info

  • currently hard-coded in run-backend
  • used by NLProcessor as constraint; a phrase has to have a word in common with synonyms of censoring-requirements to be considered similar

todo

  • do we still need this?
  • if so,
    • add it to .env or
    • infer it from provided search queries or
    • different idea

Text generation

  • use OpenAI

  • pre-generation
    • scrape desired sentences from positive media
    • train BTM on scraping results
    • infer topics for scraping results
    • infer topics for client page
    • match given text with desired scraped content
  • #10

Logging consistency

  • create dedicated helper module for server/backend/validator.py, e.g. server/helpers/validator.py
  • add logging.py
    • create functions such as bold that wrap strings for bold printing
    • single source of truth for all strings that are printed anywhere
    • etc
