chinese-keywords

This repository contains a set of sensitive Chinese keywords (that is, keywords related to the Chinese Communist Party, pornography, dissident material, violence/terrorism, censorship, etc.). These keywords may be helpful to researchers who are searching for sensitive content in Chinese or testing for network interference.
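As a minimal sketch of the "searching for sensitive content" use case, the snippet below checks a piece of text against a keyword list. It assumes plain substring matching is good enough, since Chinese text is not separated by spaces. The function name `find_keywords` and the sample terms are hypothetical placeholders, not entries from the actual lists.

```python
def find_keywords(text, keywords):
    """Return the keywords that occur in text as substrings, longest first."""
    # Chinese text has no spaces between words, so plain substring
    # matching is used here instead of tokenizing on whitespace.
    hits = [kw for kw in keywords if kw and kw in text]
    # Longest matches first, then alphabetical order for determinism.
    return sorted(hits, key=lambda kw: (-len(kw), kw))


# Hypothetical placeholder terms, not entries from the real lists.
sample_keywords = ["示例词", "占位符", "测试"]
sample_text = "这是一段包含示例词和测试的文本。"

print(find_keywords(sample_text, sample_keywords))
```

For a list of several thousand keywords and large volumes of text, a multi-pattern matcher would likely scale better than this linear scan, but the simple version is enough to show the idea.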

As of Dec 9, there are 9,054 sensitive keywords collected from 13 different lists (see below for detailed info on the lists). To get a sense of what data is included in these CSV files, you can view a Google Doc spreadsheet of these 9,054 keywords sorted by the number of lists they appear on: https://docs.google.com/spreadsheets/d/19eS47Dg086vR1jh9oo51pXstYVT2wft13JGCrnAeU7A/edit?usp=sharing
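The "sorted by the number of lists they appear on" ordering can be sketched as follows. This is an illustration only: the list names and placeholder terms are hypothetical, and the snippet assumes each source list has already been loaded into memory as a list of strings.

```python
from collections import Counter


def rank_by_list_count(lists):
    """Pair each keyword with the number of lists it appears on, most-listed first."""
    counts = Counter()
    for keywords in lists.values():
        # Use a set so a keyword repeated within one list counts once.
        counts.update(set(keywords))
    # Sort by descending list count, then by keyword for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical list names and placeholder terms for illustration only.
sample_lists = {
    "list_a": ["甲", "乙", "丙"],
    "list_b": ["乙", "丙", "乙"],
    "list_c": ["丙"],
}

ranked = rank_by_list_count(sample_lists)
print(ranked)
print(len(ranked), "unique keywords")
```

Keywords that appear on many independent lists are plausibly the most consistently filtered ones, which is one reason to look at the collection in this order.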

The CSV files contain machine translations (from Google) and human translations/notes for most of the keywords. Many also include theme and category variables, thanks to various sources that had previously tagged their keyword lists. Currently, there are three different versions.

The thirteen lists this collection contains are:

| Creator/source | Tested on/found from + original use of terms | # of keywords | Year | Method found + source |
|---|---|---|---|---|
| UNM/The Citizen Lab | Sina UC, triggers censorship of message in app | 1,818 | 2013 | reverse engineered from the client; more analysis here; download link |
| UNM/The Citizen Lab | Tom-Skype, triggers either surveillance or censorship in app | 2,574 | 2013 | reverse engineered from the client; more analysis here; download link |
| The Citizen Lab | LINE, triggers censorship of message in chat app | 673 | 2014 | reverse engineered from the client; more analysis here; download link |
| Jason Q. Ng (Blocked on Weibo) | Sina Weibo, triggers blocked searches on Weibo | 839 | 2013 | running Wikipedia China article titles through Sina Weibo search; more analysis and book |
| Xia Chu | Great Firewall, triggers blocking of webpages | 669 | 2014 | HTTP request scans of Wikipedia China articles to see if they'd trigger GFW block; more analysis here; download link (removed duplicates and keywords related to meta and user pages) |
| China Digital Times | Sina Weibo, triggers blocked searches on Weibo | 2,448 | 2014 | crowdsourced testing of suspected sensitive keywords on Sina Weibo; more analysis on CDT and in CDT's Grass Mud Horse Lexicon e-book; download link |
| GreatFire.org | Wikipedia, names of Wikipedia articles blocked from access | 488 | 2013 | testing to see if Wikipedia pages are available in China; more info; download link |
| Google/ATGFW.org | Google/Great Firewall, see 'Method found' | 456 | 2012 | ATGFW.org and GreatFire.org reverse engineered the keywords Google was using to warn users of potential censorship while using their service in China; download link |
| Jeffrey Knockel | Sina Show, keywords found in application's binary files | 910 | 2014 | extracted list from Sina Show app; of the 910 unique keywords, only 108 are used for censoring chat messages; download link |
| Unknown | 163.com, unclear | 376 | 2008 | archived by Nart Villeneuve; circulated on 163.com, a Chinese portal website; download link |
| Omnitalk BBS users? | Tencent QQ, keywords blocked by chat app in 2004 | 863 | 2004 | archived by Nart Villeneuve; extracted from Tencent QQ app; more info and analysis from CDT; download link |
| Jed Crandall et al. / "ConceptDoppler" | Great Firewall, "keywords found to be censored at the 'gateway' level" | 669 | 2008 | archived by Nart Villeneuve; "HTTP keyword filtering by Internet routers"; website; paper; download link |
| Unknown | a "blog provider", unclear | 844 | 2005 | archived by Nart Villeneuve; according to Villeneuve: "This is a keyword list from a blog provider in China."; download link |

This project was started at The Citizen Lab's 2014 Connaught Summer Institute workshop.
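As a rough consistency check on the figures above, the per-list counts from the table can be summed and compared with the 9,054 keywords quoted in the introduction. The sum comes out larger, which fits with some keywords appearing on more than one list. Treating 9,054 as a deduplicated total is an assumption here; the README does not say so explicitly.

```python
# Per-list keyword counts, copied from the table above in row order.
list_sizes = [1818, 2574, 673, 839, 669, 2448, 488, 456, 910, 376, 863, 669, 844]

total_entries = sum(list_sizes)
combined_keywords = 9054  # figure quoted in the introduction

print(len(list_sizes))                    # number of lists: 13
print(total_entries)                      # sum of per-list counts: 13627
print(total_entries - combined_keywords)  # difference: 4573
```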

Contributors

cioccolato, jasonqng
