npm's Introduction

Nonparametric Masked Language Modeling

This repo contains the original implementation of the paper "Nonparametric Masked Language Modeling".

@article{ min2022nonparametric,
    title={ Nonparametric Masked Language Modeling },
    author={ Min, Sewon and Shi, Weijia and Lewis, Mike and Chen, Xilun and Yih, Wen-tau and Hajishirzi, Hannaneh and Zettlemoyer, Luke },
    year={ 2022 }
}

Models are available from the Hugging Face Hub 🤗! Check out npm (for phrase retrieval) and npm-single (for token retrieval).

We are working on a simple demo that lets you download all the resources and deploy them on your own machine. Stay tuned!

Updates

  • 01/02/2023: The code for training is released. See train.md for instructions.
  • 12/22/2022: The code for inference is released. Stay tuned for the code for training.

Content

  1. Requirements
  2. Download Data
  3. Closed-set Experiments
  4. Open-set Experiments
  5. License
  6. Contact

Requirements

conda create -n npm python=3.7
conda activate npm
pip3 install -r requirements.txt --user

If you plan to run open-set tasks, make sure to install Java as well.

conda install -c conda-forge openjdk

Note that multi-GPU inference is not currently supported.

Download Data

Evaluation datasets and reference corpora can be downloaded via

# To run evaluation on closed-set tasks
bash scripts/download_data.sh closed
bash scripts/download_corpus.sh closed

# To run evaluation on open-set tasks
bash scripts/download_data.sh open
bash scripts/download_corpus.sh enwiki

# To run evaluation on TempLAMA (need Wikipedia 2022)
bash scripts/download_data.sh templama
bash scripts/download_corpus.sh new-enwiki

The corpus data is required for NPM and the retrieve-and-generate baselines. If you will only run parametric baselines, you can skip downloading the corpus.

All reference corpus files are saved under corpus/ and evaluation datasets are saved under data/.

Closed-set Experiments

Baselines on closed-set tasks

The following script runs the RoBERTa-large baseline on all 9 datasets used in the paper.

python -m scripts.prompt \
    --checkpoint_path roberta-large \
    --eval_dataset agn+yahoo+rte+subj+sst2+mr+rt+cr+amazon \
    --save_dir save/roberta \
    --single

NPM on closed-set tasks

# To run on AGN, Yahoo and RTE:
bash scripts/save_embeddings.sh npm enwiki-0 false 320
bash scripts/save_embeddings.sh npm cc_news false 320
python -m scripts.prompt \
    --corpus_data enwiki-0+cc_news \
    --checkpoint_path npm \
    --eval_dataset agn+yahoo+rte \
    --temperature 5.0 \
    --save_dir save/npm

# To run on Subj:
bash scripts/save_embeddings.sh npm subj false 320
python -m scripts.prompt \
    --corpus_data subj \
    --checkpoint_path npm \
    --eval_dataset subj \
    --temperature 5.0 \
    --save_dir save/npm

# To run on SST-2, MR, RT, CR and Amazon:
bash scripts/save_embeddings.sh npm imdb false 320
bash scripts/save_embeddings.sh npm amazon false 320
python -m scripts.prompt \
    --corpus_data imdb+amazon \
    --checkpoint_path npm \
    --eval_dataset sst2+mr+rt+cr+amazon \
    --temperature 5.0 \
    --save_dir save/npm

Note that scripts/save_embeddings.sh takes the following arguments:

  • model name (npm or npm-single)
  • corpus name
  • whether it is an open-set task (true or false)
  • batch size (320 is good for a 32GB GPU; if trainer.precision=16 is used, 400 is good for a 32GB GPU)

Embeddings are saved under save/{model_name}/dstore.
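
The pairing between evaluation datasets and reference corpora in the closed-set commands can be captured in a small lookup. The following is a minimal Python sketch, not part of the repository: `CORPORA_FOR_DATASET` and `closed_set_commands` are hypothetical names, and the function only composes command strings without running anything. The corpus groupings, flags, and argument order are copied from the commands in this README.

```python
# Hypothetical helper (not in the repo): compose the closed-set commands
# shown in this README for one group of evaluation datasets.

# Dataset -> reference corpora, as paired in the README's closed-set commands.
CORPORA_FOR_DATASET = {
    "agn": ("enwiki-0", "cc_news"),
    "yahoo": ("enwiki-0", "cc_news"),
    "rte": ("enwiki-0", "cc_news"),
    "subj": ("subj",),
    "sst2": ("imdb", "amazon"),
    "mr": ("imdb", "amazon"),
    "rt": ("imdb", "amazon"),
    "cr": ("imdb", "amazon"),
    "amazon": ("imdb", "amazon"),
}


def closed_set_commands(datasets, model="npm", batch_size=320):
    """Return the command strings for one closed-set run (nothing is executed)."""
    if not datasets:
        raise ValueError("need at least one dataset")
    groups = {CORPORA_FOR_DATASET[name] for name in datasets}
    if len(groups) != 1:
        raise ValueError("datasets use different corpora; run them separately")
    corpora = groups.pop()

    # One embedding dump per corpus; "false" marks a closed-set task.
    commands = [
        "bash scripts/save_embeddings.sh {} {} false {}".format(model, corpus, batch_size)
        for corpus in corpora
    ]
    prompt = [
        "python -m scripts.prompt",
        "--corpus_data " + "+".join(corpora),
        "--checkpoint_path " + model,
        "--eval_dataset " + "+".join(datasets),
        "--temperature 5.0",
    ]
    if model == "npm-single":
        prompt.append("--single")  # the README's npm-single commands pass --single
    prompt.append("--save_dir save/" + model)
    commands.append(" ".join(prompt))
    return commands


for line in closed_set_commands(["sst2", "mr", "rt", "cr", "amazon"]):
    print(line)
```

Mixing datasets from different groups raises an error rather than merging corpora, since the README only shows each group evaluated against its own corpora.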

NPM Single on closed-set tasks

# To run on AGN, Yahoo and RTE:
bash scripts/save_embeddings.sh npm-single enwiki-0 false 320
bash scripts/save_embeddings.sh npm-single cc_news false 320
python -m scripts.prompt \
    --corpus_data enwiki-0+cc_news \
    --checkpoint_path npm-single \
    --eval_dataset agn+yahoo+rte \
    --temperature 5.0 \
    --single \
    --save_dir save/npm-single

# To run on Subj:
bash scripts/save_embeddings.sh npm-single subj false 320
python -m scripts.prompt \
    --corpus_data subj \
    --checkpoint_path npm-single \
    --eval_dataset subj \
    --temperature 5.0 \
    --single \
    --save_dir save/npm-single

# To run on SST-2, MR, RT, CR and Amazon:
bash scripts/save_embeddings.sh npm-single imdb false 320
bash scripts/save_embeddings.sh npm-single amazon false 320
python -m scripts.prompt \
    --corpus_data imdb+amazon \
    --checkpoint_path npm-single \
    --eval_dataset sst2+mr+rt+cr+amazon \
    --temperature 5.0 \
    --single \
    --save_dir save/npm-single

Open-set Experiments

Baselines on open-set tasks

Run the following to run causal language model baselines (T5 baselines are TBA!).

python -m scripts.clm_prompt \
    --eval_dataset {lama-trex|lama-google_re|kamel|triviaqa|nq|entity_translation} \
    --model_name {j-6b|neo-1.3b|neo-2.7b|neox-20b|opt-1.3b|opt-2.7b|opt-6.7b|opt-13b|opt-30b|bloom-1b7|bloom-3b|bloom-7b1} \
    --save_dir save

By default, this does not use any passages from an external corpus. Specify --ret bm25 to use BM25 passages from Wikipedia 2019, and --ret bm25_2022 to use BM25 passages from Wikipedia 2022 (for TempLAMA).

NPM on open-set tasks

Please note that running open-set tasks requires around 70GB of RAM and 1.4TB of disk space. To reduce RAM usage, you can specify --keep_uint8 when running python -m scripts.prompt below, which reduces RAM usage from 70GB to 40GB at the cost of a longer datastore setup time. We will explore further optimizing RAM/disk usage in a future version of the code (PRs are also welcome!).

# Note that this can be executed in parallel with up to 20 GPUs. In total, it takes about 10 GPU hours and 1.4TB of disk space.
for i in {0..19} ; do
    bash scripts/save_embeddings.sh npm enwiki-${i} true 320
done

# Loading the model takes about 40min and 70GB of RAM (specify `--keep_uint8` to reduce RAM usage to 40GB, which increases the model loading time to 60-80min).
python -m scripts.prompt \
    --corpus_data enwiki \
    --checkpoint_path npm \
    --eval_dataset lama-trex+lama-google_re+kamel+triviaqa+nq+entity_translation \
    --save_dir save/npm \
    --remove_stopwords \
    --restricted \
    --open

To evaluate on TempLAMA, use new-enwiki instead of enwiki, and use --eval_dataset {templama|unchanged_templama}.
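
The embedding loop over the 20 enwiki shards earlier in this section can be spread over several GPUs. The sketch below is one hedged way to plan that: `shard_plan` is a hypothetical helper (not part of the repository) that assigns shards round-robin and returns command strings without running them. It assumes that scripts/save_embeddings.sh honours the standard `CUDA_VISIBLE_DEVICES` environment variable, which this README does not confirm.

```python
# Hypothetical helper (not in the repo): plan how to spread the 20 enwiki
# shards from the embedding loop across several GPUs. Nothing is executed.

def shard_plan(num_gpus, num_shards=20, model="npm", batch_size=320):
    """Map each GPU index to the embedding commands it should run, in order."""
    if not 1 <= num_gpus <= num_shards:
        raise ValueError("num_gpus must be between 1 and num_shards")
    plan = {gpu: [] for gpu in range(num_gpus)}
    for shard in range(num_shards):
        gpu = shard % num_gpus  # round-robin assignment
        # Assumption: the script respects CUDA_VISIBLE_DEVICES.
        plan[gpu].append(
            "CUDA_VISIBLE_DEVICES={} bash scripts/save_embeddings.sh {} enwiki-{} true {}".format(
                gpu, model, shard, batch_size
            )
        )
    return plan


for gpu, commands in shard_plan(num_gpus=4).items():
    print("# GPU {}: {} shards".format(gpu, len(commands)))
    for command in commands:
        print(command)
```

Each GPU's list can then be run sequentially in its own shell, so that the per-GPU jobs proceed in parallel.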

License

NPM is CC-BY-NC 4.0 licensed.

Contact

Please open a GitHub issue or contact Sewon Min [email protected] with any questions.

npm's Issues

dumping embeddings for closed-set experiments

Hello!

I've been trying to dump the embeddings for the closed-set experiments. Unfortunately, except for the enwiki-0 corpus, which has a 0_valid file, the other corpora seem to have a problem with the dumping process. The problem seems to occur while collating the 'is_valid' attribute in the datamodule (dimension errors). If I'm not mistaken, this particular line seems to be the source of the trouble:

is_valid = [i for i, _id in enumerate(input_ids) if _id not in [0, 2]]

The non-uniform lengths generated for each instance seem to be causing a dimension error.

At your earliest convenience, could you check whether this is a valid error that needs to be fixed?

Thank you
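
The collation problem described in this issue can be illustrated in isolation. The sketch below is not the repository's code or its fix: `valid_positions` wraps the quoted line, and `pad_positions` is a hypothetical helper showing one common remedy, padding the per-example index lists to a shared length so they can be stacked into a rectangular batch. The pad value -1 and the example token ids are assumptions chosen for illustration.

```python
def valid_positions(input_ids, special_ids=(0, 2)):
    """Indices of tokens whose id is not in special_ids (the line quoted in the issue)."""
    return [i for i, _id in enumerate(input_ids) if _id not in special_ids]


def pad_positions(batch, pad_value=-1):
    """Right-pad each index list to the longest one so the batch is rectangular."""
    max_len = max((len(positions) for positions in batch), default=0)
    return [positions + [pad_value] * (max_len - len(positions)) for positions in batch]


# Two examples of different lengths (token ids are made up for illustration).
batch_input_ids = [[0, 713, 16, 2], [0, 100, 657, 24, 2]]

ragged = [valid_positions(ids) for ids in batch_input_ids]
print(ragged)                 # [[1, 2], [1, 2, 3]]  (lengths differ)
print(pad_positions(ragged))  # [[1, 2, -1], [1, 2, 3]]
```

Downstream code would then need to ignore positions equal to the pad value.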

Question about NPM performance

Hi! Thank you for sharing such interesting research.
I have a few questions about NPM's performance.

  1. Is there a result that separates the effect of aligning the contexts within a batch from the effect of the contrastive learning setup? Learning similar documents together presumably helps, but I am curious how NPM performs without it.

  2. Is there any result for training NPM from scratch, without post-training from RoBERTa? I think it would be difficult to train NPM without a basic understanding of language.

  3. About NPM fine-tuning (which you said is future work):
    NPM shows amazing zero-shot performance, but I would not expect fine-tuning to improve either the part that retrieves similar syntax from the reference corpus or the part that matches the mask token against the retrieved references.
    What do you think?

Thanks~!

How to create corpus npy files from scratch

Hi!

I am working on your repository and have a few questions:

  1. Is the code that creates the corpus npy files (e.g., enwiki and new-enwiki) available in this repository?
  2. It looks like preprocess_wiki.py with --save_flatten_data outputs a similar JSON file with input_ids. Is it possible to create these npy files directly from this JSON file?
  3. The kilt_knowledgesource.json corresponding to the new-wiki corpus is unavailable in both this repository and the KILT repository. Furthermore, it seems that the preprocessing code is currently not available in the KILT repository. Is it possible to get the knowledge source file anywhere?

Thank you very much in advance!

requirements.txt is missing

Hi!
Thank you for sharing your amazing work!
I am now working on reproducing your experiments, but it seems that requirements.txt is currently missing in this repository.
I am sorry if you are still working on this.
