Git Product home page Git Product logo

cnn-dailymail's Introduction

This code produces the non-anonymized version of the CNN / Daily Mail summarization dataset, as used in the ACL 2017 paper Get To The Point: Summarization with Pointer-Generator Networks. It processes the dataset into the binary format expected by the code for the Tensorflow model.

Python 3 version: This code is in Python 2. If you want a Python 3 version, see @becxer's fork.

Option 1: download the processed data

User @JafferWilson has provided the processed data, which you can download here. (See discussion here about why we do not provide it ourselves).

Option 2: process the data yourself

1. Download data

Download and unzip the stories directories from here for both CNN and Daily Mail.

Warning: These files contain a few (114, in a dataset of over 300,000) examples for which the article text is missing - see for example cnn/stories/72aba2f58178f2d19d3fae89d5f3e9a4686bc4bb.story. The Tensorflow code has been updated to discard these examples.

2. Download Stanford CoreNLP

We will need Stanford CoreNLP to tokenize the data. Download it here and unzip it. Then add the following command to your bash_profile:

export CLASSPATH=/path/to/stanford-corenlp-full-2016-10-31/stanford-corenlp-3.7.0.jar

replacing /path/to/ with the path to where you saved the stanford-corenlp-full-2016-10-31 directory. You can check if it's working by running

echo "Please tokenize this text." | java edu.stanford.nlp.process.PTBTokenizer

You should see something like:

Please
tokenize
this
text
.
PTBTokenizer tokenized 5 tokens at 68.97 tokens per second.

3. Process into .bin and vocab files

Run

python make_datafiles.py /path/to/cnn/stories /path/to/dailymail/stories

replacing /path/to/cnn/stories with the path to where you saved the cnn/stories directory that you downloaded; similarly for dailymail/stories.

This script will do several things:

  • The directories cnn_stories_tokenized and dm_stories_tokenized will be created and filled with tokenized versions of cnn/stories and dailymail/stories. This may take some time. Note: you may see several Untokenizable: warnings from Stanford Tokenizer. These seem to be related to Unicode characters in the data; so far it seems OK to ignore them.
  • For each of the url lists all_train.txt, all_val.txt and all_test.txt, the corresponding tokenized stories are read from file, lowercased and written to serialized binary files train.bin, val.bin and test.bin. These will be placed in the newly-created finished_files directory. This may take some time.
  • Additionally, a vocab file is created from the training data. This is also placed in finished_files.
  • Lastly, train.bin, val.bin and test.bin will be split into chunks of 1000 examples per chunk. These chunked files will be saved in finished_files/chunked as e.g. train_000.bin, train_001.bin, ..., train_287.bin. This should take a few seconds. You can use either the single files or the chunked files as input to the Tensorflow code (see considerations here).

cnn-dailymail's People

Contributors

abisee avatar

Stargazers

 avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.