Git Product home page Git Product logo

trump_campaign_corpus's Introduction

Trump Campaign Corpus

The Trump Campaign Corpus consists of Donald Trump's speeches, interviews, debates, town halls, press conferences, written statements and tweets. It covers the period from the announcement of his candidacy on June 16, 2015 through Election Day, November 8, 2016. So far, we've gathered all or part of over 1k communications on top of 4.4k tweets. Together, they represent nearly 3m Trump words as well as 1m words from interviewers and debate opponents. All transcript are human-made, though style and quality varies.

The corpus is intended to support text analytics and criticism, with a focus on temporal patterns. It is under active development, so your input, especially new or improved transcripts, is appreciated.

Formats

The corpus is available as a single json file or individual text files, except tweets, which are only in the json. Text files are named date-first (see more info in Metadata section below). Speakers are identified consistently by full name, uppercase, for example:

DONALD TRUMP: I would have a very, very good relationship with Putin ...

Paragraph breaks usually follow the original and we retain them more for legibility than meaning. Transcription notes, such as (CHANTING) or (INAUDIBLE), appear uppercase, in parentheses.

Metadata and document structure

The metadata, what little there is, should be largely self-explanatory.

Name Value In text filename?
date_published Date and time (EDT) the communication became public. For tweets this is precise. For written statements, only the day is known. Other genres fall in between. Yes
date_created Usually null. Present and, in a few cases, filled-in to account for pre-recorded interviews. No
genre debate, interview, press conference, speech, statement, or town hall Yes
people List of people in document: their name, role (speaker, interviewer, moderator, quoted), and number of speaking turns names, up to 100 chars
event Used mostly for speaking occasions (e.g., Rally, NRA conference) Yes
title Used exclusively for written communications that include title Yes
is_as_spoken Boolean. False for written statements and speech scripts. In some cases, we have both transcript and prepared script, in which case they appear separately but with otherwise similar metadata. Yes
completeness complete, almost complete, partial, or missing. Missing is used primarily for speeches, where there's a schedule to go off of. Yes, when not 'missing'
location venue, city, state, and country. Used primarily for speeches. Refers to Trump's location in the case of remote inteviews. city, state, country (when not US)
publisher show and network or newspaper. Used primarily for interviews. Yes
is_public_domain Almost always null. See copyright info below. No
text_filename Recorded in json for cross-referencing By definition
doc The actual text, structured as an array of speaking turns, each containing name of person and array of paragraphs. Paragraph consists of text or trancription notes. No
doc_orig Currently used only for original text of tweets which we sanitize relatively aggressively as docs, removing symbols and links, expanding abbreviations and entities. No
twitter_source Twitter Web Client, Twitter for Android, Twitter for Blackberry, Twitter for iPad, or Twitter for iPhone (other devices are filtered out) Yes

Sources and copyright

The vast majority of transcripts come from media sites, particularly CNN and Fox. Statements and prepared speeches are from donaldjtrump.com or its Internet Archive versions. The rest we've scoured from the likes of the WikiLeaks DNC email archive (ironically) and whatthefolly.com. Finally, in a handful of cases โ€” three Breitbart interviews, Alex Jones interview, Rochester and Houston rallies, etc. โ€” we've created the transcripts.

In sum, most of the material is copyrighted, with no specially permitted use beyond Fair Use. Please employ it solely for research in the public interest. Hooray!

trump_campaign_corpus's People

Contributors

unendin avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.