Git Product home page Git Product logo

degen's Introduction

Degen

The Official Repository for "The Curious Case of Neural Text Degeneration"

If you want to use Nucleus Sampling, you can use the implementation in Hugging Face Transformers with many pretrained models, including GPT-2!

Generations

All conditional and unconditional generations are available here.

Requirements

pytorch must be installed.

The other required modules are in requirements.txt and can be installed with:

pip install -r requirements.txt

Generating Your Own

Use gen.py to generate:

python gen.py --model_name gpt2-large --batch_size 10 -n 50 --output_path output.jsonl --gpu 0 -p 0.95 --seed 0

--help will show options for decoding strategies and parameters

Formatting Data

For conditional generations, you'll need to format contexts.

Use encode_jsonl.py to tokenize data from https://github.com/openai/gpt-2-output-dataset so it can be be used for conditional generation:

python encode_jsonl.py raw.jsonl tokenized.jsonl

--help will show more options

Use filter_for_conditional.py for creating contexts for conditional generations:

python filter_for_conditional.py tokenized.jsonl filtered.jsonl

--help will show more options

Now use sort_jsonl_by_length.py to sort things for more efficient batching:

python sort_jsonl_by_length.py filtered.jsonl sorted.jsonl

Finally, if you'd like to use Beam Search or Stochastic Beam Search. First create a cache file by generating for another algorithm for a non-beam decoding algorithm with the --cache flag:

python gen.py --model_name gpt2-large --batch_size 10 -n 50 --context_path sorted.jsonl --output_path output.jsonl --gpu 0 -k 40 --seed 0 --cache first.cache

Next, we'll reprocess the cache file for Beam Search:

python rebatch_inits_for_beamsearch.py first.cache --batch_size 4 --out bs_4.cache

Now we can decode with Beam Search:

python gen.py --model_name gpt2-large --batch_size 4 -n 40 --context_path sorted.jsonl --cache bs_4.cache --output_path output.jsonl --gpu 0 -w 4

MTurk

mturk_form.html is the Amazon Mechanical Turk template we used for experiments.

Use from_jsonl.py to extract strings (and other attributes from jsonl files):

python from_jsonl.py output.jsonl string output_string.txt

--help will show more options

chunk4turk.py can be used to batch texts to be used with the above MTurk template:

python chunk4turk.py output_string.txt output_mturk.csv

--help will show more options

degen's People

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.