
The r/Jokes Dataset: a Large Scale Humor Collection

Code and Datasets from the paper, "The r/Jokes Dataset: a Large Scale Humor Collection" by Orion Weller and Kevin Seppi

Dataset files are located in data/{train/dev/test}.tsv for the regression task, while the full unsplit data can be found in data/preprocessed.tsv. These files will need to be unzipped after cloning the repo.
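Once unzipped, the splits can be read with any TSV reader. Below is a minimal loading sketch using pandas; no column names are assumed here, so inspect the printed header of each file for the available fields (the regression target is the upvote-based score described in the paper).

    import pandas as pd

    # Minimal sketch: load the unzipped splits from data/.
    # Inspect the printed columns for the actual field names;
    # the regression target is the upvote-based score from the paper.
    train = pd.read_csv("data/train.tsv", sep="\t")
    dev = pd.read_csv("data/dev.tsv", sep="\t")
    test = pd.read_csv("data/test.tsv", sep="\t")

    print(train.shape, dev.shape, test.shape)
    print(train.columns.tolist())
    print(train.head())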

For related projects, see our work on Humor Detection (separating humorous jokes from non-humorous ones) or on generating humor automatically.

** We do not endorse these jokes. Please view at your own risk **

License

The data is subject to the Reddit License and Terms of Service: users must follow the Reddit User Agreement and Privacy Policy, and must remove any post if asked to do so by the original poster. For more details, please see the link above.

Usage

Load the Required Packages

  1. Run pip3 install -r requirements.txt
  2. Gather the NLTK packages by running bash download_nltk_packages.sh. This downloads the averaged_perceptron_tagger, words, stopwords, and maxent_ne_chunker packages used for analysis and preprocessing (see the Python equivalent below).
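If you prefer to fetch the NLTK data from Python directly, the following sketch is roughly equivalent to what the shell script does (the script itself remains the source of truth for the exact package list):

    import nltk

    # Roughly what download_nltk_packages.sh fetches (a sketch, not the script itself).
    for pkg in ["averaged_perceptron_tagger", "words", "stopwords", "maxent_ne_chunker"]:
        nltk.download(pkg)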

Reproduce the current dataset (updated to Jan 1st 2020)

We split this process into three parts to avoid networking errors:

  1. Run cd prepare_data, then python3 gather_reddit_pushshift.py to gather the Reddit post IDs.
  2. Run python3 preprocess.py --update to expand the gathered post IDs into full posts.
  3. Run python3 preprocess.py --preprocess to preprocess the Reddit posts into the final datasets.

Reproduce plots and analysis from the paper

  1. Run cd analysis
  2. Run python3 time_statistics.py to gather the statistics over time
  3. Run python3 dataset_statistics.py to gather the overall dataset statistics
  4. See the resulting plots in the ./plots folder
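As a rough illustration of the over-time view these scripts produce, the sketch below counts jokes per year from the unsplit file. It is not the analysis code itself, and the created_utc column name is an assumption; check the header of data/preprocessed.tsv for the actual timestamp field.

    import pandas as pd
    import matplotlib.pyplot as plt

    # Illustrative sketch only: count jokes per year from the unsplit data.
    # "created_utc" (a Unix timestamp) is an assumed column name.
    df = pd.read_csv("data/preprocessed.tsv", sep="\t")
    df["year"] = pd.to_datetime(df["created_utc"], unit="s").dt.year

    df.groupby("year").size().plot(kind="bar", title="Jokes per year")
    plt.tight_layout()
    plt.show()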

Re-gather All Jokes and Extend With Newer Jokes

  1. Run the first two commands from the Reproduce section above.
  2. Edit the preprocess function in preprocess.py so that it does NOT remove jokes posted after 2020 (line 89; see the sketch below), then run python3 preprocess.py --preprocess.
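For illustration, the cutoff being removed looks roughly like the hypothetical filter below; the actual check around line 89 of preprocess.py may differ in detail, so treat this only as a sketch of the kind of edit required.

    from datetime import datetime, timezone

    # Hypothetical sketch of a post-2020 cutoff filter; the real check in
    # preprocess.py (around line 89) may look different.
    CUTOFF = datetime(2020, 1, 1, tzinfo=timezone.utc).timestamp()

    def keep_post(post: dict) -> bool:
        # Reddit posts carry a Unix "created_utc" timestamp.
        # Raise or remove this cutoff to keep jokes posted after Jan 1st, 2020.
        return float(post["created_utc"]) <= CUTOFF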

Reference

If you found this repository helpful, please cite the following paper:

@article{rjokesData2020,
  title   = {The r/Jokes Dataset: a Large Scale Humor Collection},
  author  = {Weller, Orion and Seppi, Kevin},
  journal = {Proceedings of the 2020 Conference of Language Resources and Evaluation},
  month   = may,
  year    = {2020}
}
