
dont-stop-pretraining's People

Contributors

amarasovic, kernelmachine


dont-stop-pretraining's Issues

Reproduce the result of Chemprot using RoBERTa

Has anyone tried to reproduce the Chemprot result using RoBERTa?

I used the command provided, but I only got about half of the F-score reported in the paper.

-----------Command I used----------------
python -m scripts.train \
        --config training_config/classifier.jsonnet \
        --serialization_dir model_logs/chemprot-ROBERTA_CLASSIFIER_BIG-202010271621 \
        --hyperparameters ROBERTA_CLASSIFIER_BIG \
        --dataset chemprot \
        --model roberta-base \
        --device 0 \
        --perf +f1 \
        --evaluate_on_test \
        --seed 0

-------------Result I got---------------------
2020-10-28 15:47:32,735 - INFO - allennlp.models.archival - archiving weights and vocabulary to model_logs/chemprot-ROBERTA_CLASSIFIER_BIG-202010271621/model.tar.gz
2020-10-28 15:48:00,526 - INFO - allennlp.common.util - Metrics: {
"best_epoch": 2,
"peak_cpu_memory_MB": 4431.752,
"peak_gpu_0_memory_MB": 13629,
"peak_gpu_1_memory_MB": 10,
"training_duration": "0:05:36.203710",
"training_start_epoch": 0,
"training_epochs": 2,
"epoch": 2,
"training_f1": 0.5388954075483176,
"training_accuracy": 0.8424082513792276,
"training_loss": 0.528517140297649,
"training_cpu_memory_MB": 4431.752,
"training_gpu_0_memory_MB": 13629,
"training_gpu_1_memory_MB": 10,
"validation_f1": 0.5084102337176983,
"validation_accuracy": 0.8026370004120313,
"validation_loss": 0.6763799888523001,
"best_validation_f1": 0.5084102337176983,
"best_validation_accuracy": 0.8026370004120313,
"best_validation_loss": 0.6763799888523001,
"test_f1": 0.4786599434625644,
"test_accuracy": 0.7999423464975497,
"test_loss": 0.679223679412495
}

-----The result shown on the paper---------
(screenshot of the paper's reported Chemprot results omitted)

Accessing data: 403 Forbidden

It seems that the program cannot retrieve the specified dataset.
I am not sure if it is an Amazon S3 problem.

botocore.exceptions.ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden

branch: allennlp-latest
command:

python -m scripts.train \
        --config training_config/classifier.jsonnet \
        --serialization_dir model_logs/citation-intent-base \
        --hyperparameters ROBERTA_CLASSIFIER_SMALL \
        --dataset citation_intent \
        --model roberta-base \
        --device 0 \
        --evaluate_on_test

Dataset for DAPT

In environments/datasets.py, it seems that datasets for DAPT are missing.

Can you provide external links to download datasets for DAPT?

How to preprocess the data ?

Hi, after downloading the datasets, I want to know whether any further preprocessing is needed.

The screenshot below shows the keys of each dataset. For the ag dataset, should the text and headline be concatenated for classification?
(screenshot of the dataset keys omitted)
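
For what it's worth, here is a minimal preprocessing sketch of the concatenation being asked about. The field names ("headline", "text", "label") and the file path are assumptions based on the question, not the repository's documented schema:

    import json

    def load_ag_examples(path="ag/train.jsonl"):
        """Illustrative only: merge headline and body text into one input string."""
        examples = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                # Concatenate headline and text when both are present.
                text = " ".join(part for part in [record.get("headline"), record.get("text")] if part)
                examples.append({"text": text, "label": record.get("label")})
        return examples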

MemoryError

When I run DAPT, a MemoryError occurs; it seems to run out of memory.
My data file is 48 GB (biomed, filtered by myself) and my machine has 128 GB of RAM.
Could you give me some hints for solving this problem?
Thanks!

`File "/import/home/X/dont-stop-pretraining/scripts/run_language_modeling.py", line 133, in load_and_cache_examples
return LineByLineTextDataset(tokenizer, args, file_path=file_path, block_size=args.block_size)

File "/import/home/X/dont-stop-pretraining/scripts/run_language_modeling.py", line 119, in init
lines = [line for line in f.read().splitlines() if (len(line) > 0 and not line.isspace())]

File "/home/X/.conda/envs/domains/lib/python3.7/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
MemoryError`
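
The traceback shows that LineByLineTextDataset calls f.read() on the entire 48 GB file at once, which is what exhausts memory. One possible workaround, sketched below, is to stream the corpus lazily instead of materializing every line; the class is illustrative (not part of the repository) and assumes you can swap it into scripts/run_language_modeling.py and use a DataLoader without a random sampler:

    import torch
    from torch.utils.data import IterableDataset

    class LazyLineByLineTextDataset(IterableDataset):
        """Illustrative sketch: yields one tokenized line at a time instead of
        reading the whole corpus into memory up front."""

        def __init__(self, tokenizer, file_path, block_size):
            self.tokenizer = tokenizer
            self.file_path = file_path
            self.block_size = block_size

        def __iter__(self):
            with open(self.file_path, encoding="utf-8") as f:
                for line in f:
                    line = line.strip()
                    if not line:
                        continue
                    # Truncate to block_size; exact truncation arguments vary by
                    # transformers version, so a plain slice is used here.
                    ids = self.tokenizer.encode(line)[: self.block_size]
                    yield torch.tensor(ids, dtype=torch.long)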

How to develop the code and add other tricks?

Hi, thank you for the interesting work!
I understand that an interface is provided by the conda packages and the conda environment, which is very convenient for reproducing results. Now I would like to extend the code and add other tricks to the model. Is there a natural way to do so? I could certainly dig into the conda packages and modify them directly, but that goes against common development practice...

About RCT dataset

I notice that in your paper you mention that "for RCT, we represent all sentences in one long sequence for simultaneous prediction". What do you mean by this? I did not find any special treatment of the RCT dataset in your code.
Looking forward to your reply, thanks!

What's the correct version of allennlp and vampire?

Hi,
When I was running the VAMPIRE training, I ran into some issues due to inconsistent versions of allennlp and vampire.
With allennlp==0.9.0, it says "no import module and submodule".
With allennlp==1.0.0, it says "from_files need 2 arguments but 4 are given".

Add readme for the mlm study

We want to report issues that could affect the reproducibility of the masked LM loss calculation at test time.

First, we do not get exactly the same results reported in Table 3 of the paper when we use the fairseq library instead of the transformers library, after converting the transformers checkpoint to a fairseq checkpoint.

A related pull request was opened and closed, but did not fix our problem. Second, the results in Table 3 are calculated with a batch size of 1. With batch sizes larger than 1, we do not get the same results. In particular, the results change for a sample of reviews. As we have already mentioned, reviews are much shorter than documents from other domains. Therefore, unlike documents in other domains, which usually reach the maximum length, reviews need to be padded to the maximum length. For this reason, we suspect that padding somehow influences the masked LM loss calculation. However, with a batch size of 1 we do not need to pad, and we therefore consider the results in Table 3 reliable.

Error due to "AllenNLP" library.

Hi, I was trying to run the command:

python -m scripts.train \
        --config training_config/classifier.jsonnet \
        --serialization_dir model_logs/citation_intent_base \
        --hyperparameters ROBERTA_CLASSIFIER_SMALL \
        --dataset citation_intent \
        --model $(pwd)/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688 \
        --device 0 \
        --perf +f1 \
        --evaluate_on_test

Before running this command, I did the following steps in the following order.
1.

     pip install pytorch-transformers
     pip install transformers
     pip install git+https://github.com/kernelmachine/allennlp.git@4ae123d2c3bfb1ea3ce7362cb6c5bca3d094ffa7

2.

     python -m scripts.download_model \
        --model allenai/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688 \
        --serialization_dir $(pwd)/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688

After these two steps, when I run the scripts.train command, I get the error shown below.

2020-07-29 10:13:11,360 - INFO - pytorch_pretrained_bert.modeling - Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .
2020-07-29 10:13:12,114 - INFO - pytorch_transformers.modeling_bert - Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .
2020-07-29 10:13:12,117 - INFO - pytorch_transformers.modeling_xlnet - Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .
2020-07-29 10:13:12,709 - INFO - allennlp.common.params - random_seed = 278011
2020-07-29 10:13:12,710 - INFO - allennlp.common.params - numpy_seed = 278011
2020-07-29 10:13:12,710 - INFO - allennlp.common.params - pytorch_seed = 278011
2020-07-29 10:13:12,780 - INFO - allennlp.common.checks - Pytorch version: 1.5.1+cu101
2020-07-29 10:13:12,782 - INFO - allennlp.common.params - evaluate_on_test = True
2020-07-29 10:13:12,782 - INFO - allennlp.common.from_params - instantiating class <class 'allennlp.data.dataset_readers.dataset_reader.DatasetReader'> from params {'lazy': False, 'max_sequence_length': 512, 'token_indexers': {'roberta': {'do_lowercase': False, 'model_name': '/content/dont-stop-pretraining/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688', 'type': 'pretrained_transformer'}}, 'tokenizer': {'do_lowercase': False, 'end_tokens': ['</s>'], 'model_name': '/content/dont-stop-pretraining/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688', 'start_tokens': ['<s>'], 'type': 'pretrained_transformer'}, 'type': 'text_classification_json_with_sampling'} and extras set()
2020-07-29 10:13:12,782 - INFO - allennlp.common.params - dataset_reader.type = text_classification_json_with_sampling
2020-07-29 10:13:12,782 - INFO - allennlp.common.from_params - instantiating class <class 'dont_stop_pretraining.data.dataset_readers.text_classification_json_reader_with_sampling.TextClassificationJsonReaderWithSampling'> from params {'lazy': False, 'max_sequence_length': 512, 'token_indexers': {'roberta': {'do_lowercase': False, 'model_name': '/content/dont-stop-pretraining/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688', 'type': 'pretrained_transformer'}}, 'tokenizer': {'do_lowercase': False, 'end_tokens': ['</s>'], 'model_name': '/content/dont-stop-pretraining/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688', 'start_tokens': ['<s>'], 'type': 'pretrained_transformer'}} and extras set()
2020-07-29 10:13:12,783 - INFO - allennlp.common.from_params - instantiating class allennlp.data.token_indexers.token_indexer.TokenIndexer from params {'do_lowercase': False, 'model_name': '/content/dont-stop-pretraining/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688', 'type': 'pretrained_transformer'} and extras set()
2020-07-29 10:13:12,783 - INFO - allennlp.common.params - dataset_reader.token_indexers.roberta.type = pretrained_transformer
2020-07-29 10:13:12,783 - INFO - allennlp.common.from_params - instantiating class allennlp.data.token_indexers.pretrained_transformer_indexer.PretrainedTransformerIndexer from params {'do_lowercase': False, 'model_name': '/content/dont-stop-pretraining/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688'} and extras set()
2020-07-29 10:13:12,783 - INFO - allennlp.common.params - dataset_reader.token_indexers.roberta.model_name = /content/dont-stop-pretraining/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688
2020-07-29 10:13:12,783 - INFO - allennlp.common.params - dataset_reader.token_indexers.roberta.do_lowercase = False
2020-07-29 10:13:12,784 - INFO - allennlp.common.params - dataset_reader.token_indexers.roberta.namespace = tags
2020-07-29 10:13:12,784 - INFO - allennlp.common.params - dataset_reader.token_indexers.roberta.token_min_padding_length = 0
2020-07-29 10:13:12,784 - INFO - pytorch_transformers.tokenization_utils - Model name '/content/dont-stop-pretraining/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688' not found in model shortcut name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese, bert-base-german-cased, bert-large-uncased-whole-word-masking, bert-large-cased-whole-word-masking, bert-large-uncased-whole-word-masking-finetuned-squad, bert-large-cased-whole-word-masking-finetuned-squad, bert-base-cased-finetuned-mrpc). Assuming '/content/dont-stop-pretraining/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688' is a path or url to a directory containing tokenizer files.
2020-07-29 10:13:12,784 - INFO - pytorch_transformers.tokenization_utils - Didn't find file /content/dont-stop-pretraining/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688/vocab.txt. We won't load it.
2020-07-29 10:13:12,784 - INFO - pytorch_transformers.tokenization_utils - Didn't find file /content/dont-stop-pretraining/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688/added_tokens.json. We won't load it.
2020-07-29 10:13:12,784 - INFO - pytorch_transformers.tokenization_utils - loading file None
2020-07-29 10:13:12,784 - INFO - pytorch_transformers.tokenization_utils - loading file None
2020-07-29 10:13:12,785 - INFO - pytorch_transformers.tokenization_utils - loading file /content/dont-stop-pretraining/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688/special_tokens_map.json
Traceback (most recent call last):
  File "/usr/local/bin/allennlp", line 8, in <module>
    sys.exit(run())
  File "/usr/local/lib/python3.6/dist-packages/allennlp/run.py", line 18, in run
    main(prog="allennlp")
  File "/usr/local/lib/python3.6/dist-packages/allennlp/commands/__init__.py", line 120, in main
    args.func(args)
  File "/usr/local/lib/python3.6/dist-packages/allennlp/commands/train.py", line 150, in train_model_from_args
    args.cache_prefix,
  File "/usr/local/lib/python3.6/dist-packages/allennlp/commands/train.py", line 199, in train_model_from_file
    cache_prefix,
  File "/usr/local/lib/python3.6/dist-packages/allennlp/commands/train.py", line 257, in train_model
    params, serialization_dir, recover, cache_directory, cache_prefix
  File "/usr/local/lib/python3.6/dist-packages/allennlp/training/trainer_pieces.py", line 45, in from_params
    all_datasets = training_util.datasets_from_params(params, cache_directory, cache_prefix)
  File "/usr/local/lib/python3.6/dist-packages/allennlp/training/util.py", line 169, in datasets_from_params
    dataset_reader = DatasetReader.from_params(dataset_reader_params)
  File "/usr/local/lib/python3.6/dist-packages/allennlp/common/from_params.py", line 377, in from_params
    return subclass.from_params(params=params, **extras)
  File "/usr/local/lib/python3.6/dist-packages/allennlp/common/from_params.py", line 398, in from_params
    kwargs = create_kwargs(cls, params, **extras)
  File "/usr/local/lib/python3.6/dist-packages/allennlp/common/from_params.py", line 140, in create_kwargs
    kwargs[name] = construct_arg(cls, name, annotation, param.default, params, **extras)
  File "/usr/local/lib/python3.6/dist-packages/allennlp/common/from_params.py", line 265, in construct_arg
    value_dict[key] = value_cls.from_params(params=value_params, **subextras)
  File "/usr/local/lib/python3.6/dist-packages/allennlp/common/from_params.py", line 377, in from_params
    return subclass.from_params(params=params, **extras)
  File "/usr/local/lib/python3.6/dist-packages/allennlp/common/from_params.py", line 400, in from_params
    return cls(**kwargs)  # type: ignore
  File "/usr/local/lib/python3.6/dist-packages/allennlp/data/token_indexers/pretrained_transformer_indexer.py", line 58, in __init__
    self.tokenizer = AutoTokenizer.from_pretrained(model_name, do_lower_case=do_lowercase)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_transformers/tokenization_auto.py", line 89, in from_pretrained
    return BertTokenizer.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_transformers/tokenization_bert.py", line 216, in from_pretrained
    return super(BertTokenizer, cls)._from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_transformers/tokenization_utils.py", line 327, in _from_pretrained
    tokenizer = cls(*inputs, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_transformers/tokenization_bert.py", line 128, in __init__
    if not os.path.isfile(vocab_file):
  File "/usr/lib/python3.6/genericpath.py", line 30, in isfile
    st = os.stat(path)
TypeError: stat: path should be string, bytes, os.PathLike or integer, not NoneType
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/content/dont-stop-pretraining/scripts/train.py", line 143, in <module>
    main()
  File "/content/dont-stop-pretraining/scripts/train.py", line 140, in main
    subprocess.run(" ".join(allennlp_command), shell=True, check=True)
  File "/usr/lib/python3.6/subprocess.py", line 438, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command 'allennlp train --include-package dont_stop_pretraining training_config/classifier.jsonnet -s model_logs/citation_intent_base' returned non-zero exit status 1.

I am running the code in Google Colab. I would be very grateful if anyone could help me understand where I am going wrong. Thanks.

Do more pretraining steps lead to a better encoder for downstream tasks?

Thank you for your contributions to pretraining. You trained the encoder for 12.5K steps per domain in the pretraining phase before applying it to supervised downstream tasks. Is it possible that the checkpoints most suitable for downstream tasks appear in the middle of the pretraining phase? This phenomenon is common in many real applications. In that case, we might not know whether it is a good choice to train the model for the maximum number of steps on the whole corpus and take the final checkpoint. Do you have any suggestions on that?

About the macro_f1 score

Hi,
I was wondering: when we calculate the macro_f1 score, is it the weighted or the unweighted macro F1?
'Weighted' here means whether we take the support of each class into account.

Thanks!
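
For reference (not the repository's code), scikit-learn makes the distinction explicit: average='macro' ignores class support, while average='weighted' weights each class's F1 by its number of true instances:

    from sklearn.metrics import f1_score

    y_true = [0, 0, 0, 0, 1, 2]   # toy labels with uneven class support
    y_pred = [0, 0, 1, 0, 1, 1]

    # Unweighted macro F1: every class counts equally, regardless of support.
    print(f1_score(y_true, y_pred, average="macro"))
    # Weighted F1: each class's score is weighted by its support.
    print(f1_score(y_true, y_pred, average="weighted"))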

About data selection

Thanks for your excellent work. I am selecting data following DATA_SELECTION.md. To extract VAMPIRE embeddings, one should run 'python -m scripts.run_vampire ...'. However, the file 'scripts/run_vampire.py' does not exist in the current VAMPIRE project (http://github.com/allenai/vampire). Has the file been removed in the current version? Any suggestions? Thanks.

When doing domain-adaptive pretraining, it seems the vocabulary cannot be extended?

After using my own corpus for domain-adaptive pretraining, the vocab.txt is the same size as the initial model's (BERT-base). In short, domain-adaptive pretraining does not extend the vocabulary to the new domain, so domain-specific vocabulary still does not appear in the resulting vocab.txt. Is that right?
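
Continued masked-LM pretraining reuses the original tokenizer, so vocab.txt keeps its original size; new domain-specific tokens have to be added explicitly before pretraining. A hedged sketch with the transformers API, where the token list is purely illustrative:

    from transformers import BertTokenizer, BertForMaskedLM

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertForMaskedLM.from_pretrained("bert-base-uncased")

    # Illustrative domain terms; in practice these would be mined from your corpus.
    num_added = tokenizer.add_tokens(["acetylcholinesterase", "immunohistochemistry"])

    # Resize the embedding matrix so the new ids get (randomly initialized) vectors,
    # then run domain-adaptive pretraining to learn them.
    model.resize_token_embeddings(len(tokenizer))
    print(f"added {num_added} tokens; vocab size is now {len(tokenizer)}")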

Pytorch-transformer and Allennlp Compatibility

As the authors mentioned in README.md,
pytorch-transformers 1.20 is not compatible with the specified branch in environment.yml.
I tried:

  • Downgrading pytorch-transformers to 1.10
  • Installing the master branch of allennlp with pytorch-transformers 1.20

However, I couldn't get this working.
Has anyone been able to run the basic model (RoBERTa) recently?


IMDB train/dev split

How did you split the IMDB dataset into train and dev sets (25,000 → 20,000 + 5,000)? Is this some kind of standard split, or did you split randomly?
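
In case it helps anyone reproducing the split: the sketch below shows one way to carve 25,000 labeled training reviews into 20,000 train / 5,000 dev with a fixed seed. This is an assumption made for illustration, not a statement of what the authors actually did, and the file names are placeholders:

    import json
    import random

    # Placeholder path; point this at the 25,000 labeled IMDB training reviews.
    with open("imdb/train_full.jsonl", encoding="utf-8") as f:
        examples = [json.loads(line) for line in f]

    random.Random(42).shuffle(examples)          # fixed seed for reproducibility
    dev, train = examples[:5000], examples[5000:]

    for name, split in [("train.jsonl", train), ("dev.jsonl", dev)]:
        with open(f"imdb/{name}", "w", encoding="utf-8") as out:
            for ex in split:
                out.write(json.dumps(ex) + "\n")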

How long does it take for the training process?

Hi,
I am doing DAPT on the CS domain with 38 GB of CS data on a single TPU v3-8.
It is estimated to take 20-24 hours for one epoch.
I see from the paper that you also used a TPU v3-8, but I could not find timing information in the paper. Could you share how much time you needed for pretraining?
Thanks!

This problem occurs when running the script: allennlp.common.checks.ConfigurationError: key 'data_loader' is required

Traceback (most recent call last):
File "/.virtualenvs/bort_test/lib/python3.6/site-packages/allennlp/common/params.py", line 239, in pop
value = self.params.pop(key)
KeyError: 'data_loader'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/.virtualenvs/bort_test/bin/allennlp", line 9, in
sys.exit(run())
File "/.virtualenvs/bort_test/lib/python3.6/site-packages/allennlp/main.py", line 34, in run
main(prog="allennlp")
File "/.virtualenvs/bort_test/lib/python3.6/site-packages/allennlp/commands/init.py", line 94, in main
args.func(args)
File "/.virtualenvs/bort_test/lib/python3.6/site-packages/allennlp/commands/train.py", line 112, in train_model_from_args
file_friendly_logging=args.file_friendly_logging,
File "/.virtualenvs/bort_test/lib/python3.6/site-packages/allennlp/commands/train.py", line 171, in train_model_from_file
file_friendly_logging=file_friendly_logging,
File "/.virtualenvs/bort_test/lib/python3.6/site-packages/allennlp/commands/train.py", line 232, in train_model
file_friendly_logging=file_friendly_logging,
File "/.virtualenvs/bort_test/lib/python3.6/site-packages/allennlp/commands/train.py", line 423, in _train_worker
params=params, serialization_dir=serialization_dir, local_rank=process_rank,
File "/.virtualenvs/bort_test/lib/python3.6/site-packages/allennlp/common/from_params.py", line 583, in from_params
**extras,
File "/.virtualenvs/bort_test/lib/python3.6/site-packages/allennlp/common/from_params.py", line 612, in from_params
kwargs = create_kwargs(constructor_to_inspect, cls, params, **extras)
File "/.virtualenvs/bort_test/lib/python3.6/site-packages/allennlp/common/from_params.py", line 182, in create_kwargs
cls.__name__, param_name, annotation, param.default, params, **extras
File "/.virtualenvs/bort_test/lib/python3.6/site-packages/allennlp/common/from_params.py", line 283, in pop_and_construct_arg
popped_params = params.pop(name, default) if default != _NO_DEFAULT else params.pop(name)
File "/.virtualenvs/bort_test/lib/python3.6/site-packages/allennlp/common/params.py", line 247, in pop
raise ConfigurationError(msg)
allennlp.common.checks.ConfigurationError: key "data_loader" is required
Traceback (most recent call last):
File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/administrator/dont-stop-pretraining/scripts/train.py", line 145, in
main()
File "/home/administrator/dont-stop-pretraining/scripts/train.py", line 141, in main
subprocess.run(" ".join(allennlp_command), shell=True, check=True)
File "/usr/lib/python3.6/subprocess.py", line 438, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command 'allennlp train --include-package dont_stop_pretraining training_config/classifier.jsonnet -s model_logs/citation_intent_base' returned non-zero exit status 1.

The script is:

python -m scripts.train \
        --config training_config/classifier.jsonnet \
        --serialization_dir model_logs/citation_intent_base \
        --hyperparameters ROBERTA_CLASSIFIER_SMALL \
        --dataset citation_intent \
        --model roberta-base \
        --device 0 \
        --perf +f1 \
        --evaluate_on_test

TypeError: stat: path should be string, bytes, os.PathLike or integer, not NoneType

Hi,

I am trying to train the biomed_roberta_base model on the chemprot dataset using the provided scripts.train command and encounter the issue below.

(screenshot of the error omitted)

The above dataset and models were downloaded as described in the README of the master branch. Also, since the specified environment wasn't working for me, I am using the conda environment below.

(screenshot of the conda environment omitted)

Please let me know how to solve this issue. It seems the tokenizer asks for a vocab file, but I am not sure how to provide one.

Error when downloading the data.

When I run the following command:
curl -Lo train.jsonl https://allennlp.s3-us-west-2.amazonaws.com/dont_stop_pretraining/data/chemprot/train.jsonl
I got this error:
<?xml version="1.0" encoding="UTF-8"?> <Error><Code>AccessDenied</Code><Message>Access Denied</Message><RequestId>A49C5B9A8511C5A7</RequestId><HostId>O3qjG2TZgfQV6vxVOsMVUz5rvnaOESnSnKNnZXuAcJ/JKYlGcdlS3azlOZgg2+V0bWq2sNx7wgY=</HostId></Error>

Can you see what might be wrong with the AWS server?

Is the code for pretraining available?

It seems that this repository only contains the code for finetuning the pretrained RoBERTa. Is the code for pretraining available now? Could you possibly add some example commands for doing TAPT? Any advice or explanation would be highly appreciated. Thanks in advance!
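
For context, TAPT here appears to be continued masked-LM training on the task's unlabeled text via scripts/run_language_modeling.py (the script referenced in the MemoryError issue above). A hedged example invocation, assuming that script keeps the standard Hugging Face run_language_modeling flags; the paths and hyperparameters below are placeholders, not values from the paper:

    python -m scripts.run_language_modeling \
            --train_data_file datasets/chemprot/train.txt \
            --output_dir roberta-tapt-chemprot \
            --model_type roberta \
            --model_name_or_path roberta-base \
            --tokenizer_name roberta-base \
            --mlm \
            --line_by_line \
            --do_train \
            --num_train_epochs 100 \
            --learning_rate 1e-4 \
            --per_gpu_train_batch_size 8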

ImportError SpacyTokenizer on vampire branch allennlp-1.0

Hi,

When following the instructions in DATA_SELECTION.md, upon running the command "python -m scripts.train --config training_config/vampire.jsonnet --serialization-dir model_logs/vampire-world --environment VAMPIRE --device 0 -o", I get the following error:

ImportError: cannot import name 'SpacyTokenizer' from 'allennlp.data.tokenizers' (/path/to/python3.7/site-packages/allennlp/data/tokenizers/__init__.py)

If I upgrade to allennlp==1.0, it states:

Something went wrong during jsonnet_evaluate_file, please report this: [json.exception.parse_error.101] parse error at line 1, column 1: syntax error while parsing value - invalid literal; last read: 'Z'

I notice scripts/train.py works on the vampire master branch, so this seems to be an issue related to the vampire branch the DSP DATA_SELECTION.md requires.

I've tried staying on master branch, copying 'run_vampire.py' from dont-stop-pretraining/scripts/tapt_selection and then running "parallel --ungroup python -m scripts.run_vampire ${VAMPIRE_DIR}/model_logs/vampire-world/model.tar.gz {1} --batch 64 --include-package vampire --predictor vampire --output-file ${ROOT_DIR}/task_emb/{1/.} --silent ::: ${ROOT_DIR}/task_shards/*". This gives:

ImportError: cannot import name 'import_module_and_submodules' from 'allennlp.common.util' (/home/mitarb/vdberg/anaconda3/envs/domains/lib/python3.7/site-packages/allennlp/common/util.py)

Could you perhaps help out with this? Thank you!

How to use pretrained TAPT with classification head(s)?

Hi, very impressive work. A quick question: how can I use a pretrained TAPT model (e.g., allenai/dsp_roberta_base_tapt_chemprot_4169) with the corresponding pretrained classification head? I can only get sentence embeddings via 'all_hidden_states = model(input_ids)'.
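
As far as I can tell, the released TAPT checkpoints are masked-LM encoders and do not ship with the fine-tuned task heads, so a fresh classification head has to be attached and fine-tuned on the labeled data. A hedged sketch with the transformers API; the number of labels is illustrative and should be set to your task's class count:

    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    name = "allenai/dsp_roberta_base_tapt_chemprot_4169"
    tokenizer = AutoTokenizer.from_pretrained(name)
    # Loads the TAPT encoder and adds a randomly initialized classification head
    # on top; the head must then be fine-tuned on the labeled chemprot data.
    model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=13)

    inputs = tokenizer("An example chemprot sentence.", return_tensors="pt")
    outputs = model(**inputs)  # logits are in outputs[0] (outputs.logits on newer versions)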

Regarding MLM Loss of Lrob and Ldapt

Hi, can you tell me how Lrob and Ldapt are calculated? The biomed, CS, news, and reviews domains have the same Lrob, as can be seen from the table below.

(screenshot of the table omitted)

About Datasets

Hi. First of all, thank you for your great work on task adaptation!

Since I want to do some research on task adaptation of language models,

I think it would be great if I could use the datasets that you used.

As far as I can see, the S3 link to the datasets is set to private, so other people cannot download them.

Am I missing something, or is there another way to download the datasets from the given link?

If not, do you have any plans to open the datasets you used to the public?

I suppose it may be difficult, since some of the datasets have copyright restrictions...

Thank you for reading my issue!

TAPT dataset

I am trying to understand the method for TAPT. For chemprot, for example, are you using the same training set that is used for fine-tuning, just augmented by "randomly masking different tokens across epochs, using the masking probability of 0.15"? Or is there some other unlabeled dataset used for chemprot during TAPT, with the open-sourced labeled chemprot data used only for fine-tuning on the downstream task?
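
On the masking mechanics: with a masked-LM data collator, a fresh random 15% of tokens is masked every time a batch is drawn, so the same task text gets different masks each epoch. A minimal illustration with the transformers API (not the repository's exact code; newer versions make the collator callable as shown):

    from transformers import RobertaTokenizer, DataCollatorForLanguageModeling

    tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
    collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

    example = tokenizer("Kinase inhibitors block phosphorylation.", return_tensors="pt")["input_ids"][0]
    # Collating the same example twice masks different positions, which is what
    # "randomly masking different tokens across epochs" refers to.
    print(collator([example])["input_ids"])
    print(collator([example])["input_ids"])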

Fail to reproduce the work

Could you please check the implementation steps you provided in the README file?

I followed your instructions but found it very hard to reproduce this work; some errors came up, such as version inconsistencies between allennlp and transformers, which then lead to errors like:

subprocess.CalledProcessError: Command 'allennlp train training_config/classifier.jsonnet --include-package dont_stop_pretraining -s model_logs\citation_intent_base' returned non-zero exit status 1.

Or were there some wrong steps in my setup? It is really confusing and frustrating.

allennlp.common.checks.ConfigurationError: Extra parameters passed to PretrainedTransformerIndexer: {'do_lowercase': False}

Hi, I have set up the conda environment and ran the scripts, i.e. first running

python -m scripts.download_model \
    --model allenai/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688 \
    --serialization_dir $(pwd)/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688

and then

python -m scripts.train \
    --config training_config/classifier.jsonnet \
    --serialization_dir model_logs/citation-intent-dapt-dapt \
    --hyperparameters ROBERTA_CLASSIFIER_SMALL \
    --dataset citation_intent \
    --model $(pwd)/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688 \
    --device 0 \
    --perf +f1 \
    --evaluate_on_test

but I am getting this error now:

/home/mikeleatila/anaconda3/envs/domains/bin/python /home/mikeleatila/dont_stop_pretraining_master/scripts/train.py --config training_config/classifier.jsonnet --serialization_dir model_logs/citation-intent-dapt-dapt --hyperparameters ROBERTA_CLASSIFIER_SMALL --dataset citation_intent --model /home/mikeleatila/dont_stop_pretraining_master/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688 --device 0 --perf +f1 --evaluate_on_test
2022-11-24 09:59:10,204 - INFO - transformers.file_utils - PyTorch version 1.13.0 available.
2022-11-24 09:59:10,816 - INFO - pytorch_pretrained_bert.modeling - Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .
Traceback (most recent call last):
File "/home/mikeleatila/anaconda3/envs/domains/bin/allennlp", line 8, in
sys.exit(run())
File "/home/mikeleatila/anaconda3/envs/domains/lib/python3.7/site-packages/allennlp/run.py", line 18, in run
main(prog="allennlp")
File "/home/mikeleatila/anaconda3/envs/domains/lib/python3.7/site-packages/allennlp/commands/init.py", line 93, in main
args.func(args)
File "/home/mikeleatila/anaconda3/envs/domains/lib/python3.7/site-packages/allennlp/commands/train.py", line 144, in train_model_from_args
dry_run=args.dry_run,
File "/home/mikeleatila/anaconda3/envs/domains/lib/python3.7/site-packages/allennlp/commands/train.py", line 203, in train_model_from_file
dry_run=dry_run,
File "/home/mikeleatila/anaconda3/envs/domains/lib/python3.7/site-packages/allennlp/commands/train.py", line 266, in train_model
dry_run=dry_run,
File "/home/mikeleatila/anaconda3/envs/domains/lib/python3.7/site-packages/allennlp/commands/train.py", line 450, in _train_worker
batch_weight_key=batch_weight_key,
File "/home/mikeleatila/anaconda3/envs/domains/lib/python3.7/site-packages/allennlp/common/from_params.py", line 555, in from_params
**extras,
File "/home/mikeleatila/anaconda3/envs/domains/lib/python3.7/site-packages/allennlp/common/from_params.py", line 583, in from_params
kwargs = create_kwargs(constructor_to_inspect, cls, params, **extras)
File "/home/mikeleatila/anaconda3/envs/domains/lib/python3.7/site-packages/allennlp/common/from_params.py", line 188, in create_kwargs
cls.__name__, param_name, annotation, param.default, params, **extras
File "/home/mikeleatila/anaconda3/envs/domains/lib/python3.7/site-packages/allennlp/common/from_params.py", line 294, in pop_and_construct_arg
return construct_arg(class_name, name, popped_params, annotation, default, **extras)
File "/home/mikeleatila/anaconda3/envs/domains/lib/python3.7/site-packages/allennlp/common/from_params.py", line 329, in construct_arg
return annotation.from_params(params=popped_params, **subextras)
File "/home/mikeleatila/anaconda3/envs/domains/lib/python3.7/site-packages/allennlp/common/from_params.py", line 555, in from_params
**extras,
File "/home/mikeleatila/anaconda3/envs/domains/lib/python3.7/site-packages/allennlp/common/from_params.py", line 583, in from_params
kwargs = create_kwargs(constructor_to_inspect, cls, params, **extras)
File "/home/mikeleatila/anaconda3/envs/domains/lib/python3.7/site-packages/allennlp/common/from_params.py", line 188, in create_kwargs
cls.__name__, param_name, annotation, param.default, params, **extras
File "/home/mikeleatila/anaconda3/envs/domains/lib/python3.7/site-packages/allennlp/common/from_params.py", line 294, in pop_and_construct_arg
return construct_arg(class_name, name, popped_params, annotation, default, **extras)
File "/home/mikeleatila/anaconda3/envs/domains/lib/python3.7/site-packages/allennlp/common/from_params.py", line 372, in construct_arg
**extras,
File "/home/mikeleatila/anaconda3/envs/domains/lib/python3.7/site-packages/allennlp/common/from_params.py", line 329, in construct_arg
return annotation.from_params(params=popped_params, **subextras)
File "/home/mikeleatila/anaconda3/envs/domains/lib/python3.7/site-packages/allennlp/common/from_params.py", line 555, in from_params
**extras,
File "/home/mikeleatila/anaconda3/envs/domains/lib/python3.7/site-packages/allennlp/common/from_params.py", line 583, in from_params
kwargs = create_kwargs(constructor_to_inspect, cls, params, **extras)
File "/home/mikeleatila/anaconda3/envs/domains/lib/python3.7/site-packages/allennlp/common/from_params.py", line 199, in create_kwargs
params.assert_empty(cls.__name__)
File "/home/mikeleatila/anaconda3/envs/domains/lib/python3.7/site-packages/allennlp/common/params.py", line 421, in assert_empty
"Extra parameters passed to {}: {}".format(class_name, self.params)
allennlp.common.checks.ConfigurationError: Extra parameters passed to PretrainedTransformerIndexer: {'do_lowercase': False}
2022-11-24 09:59:10,850 - INFO - allennlp.common.params - random_seed = 58860
2022-11-24 09:59:10,850 - INFO - allennlp.common.params - numpy_seed = 58860
2022-11-24 09:59:10,850 - INFO - allennlp.common.params - pytorch_seed = 58860
2022-11-24 09:59:10,851 - INFO - allennlp.common.checks - Pytorch version: 1.13.0
2022-11-24 09:59:10,851 - INFO - allennlp.common.from_params - instantiating class <class 'allennlp.commands.train.TrainModel'> from params {'validation_data_path': 'https://s3-us-west-2.amazonaws.com/allennlp/dont_stop_pretraining/data/citation_intent/dev.jsonl', 'evaluate_on_test': True, 'model': {'dropout': '0.1', 'feedforward_layer': {'activations': 'tanh', 'hidden_dims': 768, 'input_dim': 768, 'num_layers': 1}, 'seq2vec_encoder': {'embedding_dim': 768, 'type': 'cls_pooler_x'}, 'text_field_embedder': {'roberta': {'model_name': '/home/mikeleatila/dont_stop_pretraining_master/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688', 'type': 'pretrained_transformer'}}, 'type': 'basic_classifier_with_f1'}, 'iterator': {'batch_size': 16, 'sorting_keys': [['tokens', 'num_tokens']], 'type': 'bucket'}, 'validation_iterator': {'batch_size': 64, 'sorting_keys': [['tokens', 'num_tokens']], 'type': 'bucket'}, 'train_data_path': 'https://s3-us-west-2.amazonaws.com/allennlp/dont_stop_pretraining/data/citation_intent/train.jsonl', 'test_data_path': 'https://s3-us-west-2.amazonaws.com/allennlp/dont_stop_pretraining/data/citation_intent/test.jsonl', 'trainer': {'cuda_device': 0, 'gradient_accumulation_batch_size': 16, 'num_epochs': 10, 'num_serialized_models_to_keep': 0, 'optimizer': {'b1': 0.9, 'b2': 0.98, 'e': 1e-06, 'lr': '2e-05', 'max_grad_norm': 1, 'parameter_groups': [[['bias', 'LayerNorm.bias', 'LayerNorm.weight', 'layer_norm.weight'], {'weight_decay': 0}, []]], 'schedule': 'warmup_linear', 't_total': -1, 'type': 'bert_adam', 'warmup': 0.06, 'weight_decay': 0.1}, 'patience': 3, 'validation_metric': '+f1'}, 'validation_dataset_reader': {'lazy': False, 'max_sequence_length': 512, 'token_indexers': {'roberta': {'do_lowercase': False, 'model_name': '/home/mikeleatila/dont_stop_pretraining_master/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688', 'type': 'pretrained_transformer'}}, 'tokenizer': {'do_lowercase': False, 'end_tokens': [''], 'model_name': '/home/mikeleatila/dont_stop_pretraining_master/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688', 'start_tokens': [''], 'type': 'pretrained_transformer'}, 'type': 'text_classification_json_with_sampling'}, 'dataset_reader': {'lazy': False, 'max_sequence_length': 512, 'token_indexers': {'roberta': {'do_lowercase': False, 'model_name': '/home/mikeleatila/dont_stop_pretraining_master/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688', 'type': 'pretrained_transformer'}}, 'tokenizer': {'do_lowercase': False, 'end_tokens': [''], 'model_name': '/home/mikeleatila/dont_stop_pretraining_master/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688', 'start_tokens': [''], 'type': 'pretrained_transformer'}, 'type': 'text_classification_json_with_sampling'}} and extras {'batch_weight_key', 'local_rank', 'serialization_dir'}
2022-11-24 09:59:10,851 - INFO - allennlp.common.params - type = default
2022-11-24 09:59:10,851 - INFO - allennlp.common.from_params - instantiating class <class 'allennlp.commands.train.TrainModel'> from params {'validation_data_path': 'https://s3-us-west-2.amazonaws.com/allennlp/dont_stop_pretraining/data/citation_intent/dev.jsonl', 'evaluate_on_test': True, 'model': {'dropout': '0.1', 'feedforward_layer': {'activations': 'tanh', 'hidden_dims': 768, 'input_dim': 768, 'num_layers': 1}, 'seq2vec_encoder': {'embedding_dim': 768, 'type': 'cls_pooler_x'}, 'text_field_embedder': {'roberta': {'model_name': '/home/mikeleatila/dont_stop_pretraining_master/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688', 'type': 'pretrained_transformer'}}, 'type': 'basic_classifier_with_f1'}, 'iterator': {'batch_size': 16, 'sorting_keys': [['tokens', 'num_tokens']], 'type': 'bucket'}, 'validation_iterator': {'batch_size': 64, 'sorting_keys': [['tokens', 'num_tokens']], 'type': 'bucket'}, 'train_data_path': 'https://s3-us-west-2.amazonaws.com/allennlp/dont_stop_pretraining/data/citation_intent/train.jsonl', 'test_data_path': 'https://s3-us-west-2.amazonaws.com/allennlp/dont_stop_pretraining/data/citation_intent/test.jsonl', 'trainer': {'cuda_device': 0, 'gradient_accumulation_batch_size': 16, 'num_epochs': 10, 'num_serialized_models_to_keep': 0, 'optimizer': {'b1': 0.9, 'b2': 0.98, 'e': 1e-06, 'lr': '2e-05', 'max_grad_norm': 1, 'parameter_groups': [[['bias', 'LayerNorm.bias', 'LayerNorm.weight', 'layer_norm.weight'], {'weight_decay': 0}, []]], 'schedule': 'warmup_linear', 't_total': -1, 'type': 'bert_adam', 'warmup': 0.06, 'weight_decay': 0.1}, 'patience': 3, 'validation_metric': '+f1'}, 'validation_dataset_reader': {'lazy': False, 'max_sequence_length': 512, 'token_indexers': {'roberta': {'do_lowercase': False, 'model_name': '/home/mikeleatila/dont_stop_pretraining_master/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688', 'type': 'pretrained_transformer'}}, 'tokenizer': {'do_lowercase': False, 'end_tokens': ['
'], 'model_name': '/home/mikeleatila/dont_stop_pretraining_master/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688', 'start_tokens': [''], 'type': 'pretrained_transformer'}, 'type': 'text_classification_json_with_sampling'}, 'dataset_reader': {'lazy': False, 'max_sequence_length': 512, 'token_indexers': {'roberta': {'do_lowercase': False, 'model_name': '/home/mikeleatila/dont_stop_pretraining_master/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688', 'type': 'pretrained_transformer'}}, 'tokenizer': {'do_lowercase': False, 'end_tokens': [''], 'model_name': '/home/mikeleatila/dont_stop_pretraining_master/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688', 'start_tokens': [''], 'type': 'pretrained_transformer'}, 'type': 'text_classification_json_with_sampling'}} and extras {'batch_weight_key', 'local_rank', 'serialization_dir'}
2022-11-24 09:59:10,851 - INFO - allennlp.common.from_params - instantiating class <class 'allennlp.data.dataset_readers.dataset_reader.DatasetReader'> from params {'lazy': False, 'max_sequence_length': 512, 'token_indexers': {'roberta': {'do_lowercase': False, 'model_name': '/home/mikeleatila/dont_stop_pretraining_master/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688', 'type': 'pretrained_transformer'}}, 'tokenizer': {'do_lowercase': False, 'end_tokens': ['
'], 'model_name': '/home/mikeleatila/dont_stop_pretraining_master/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688', 'start_tokens': [''], 'type': 'pretrained_transformer'}, 'type': 'text_classification_json_with_sampling'} and extras {'batch_weight_key', 'local_rank', 'serialization_dir'}
2022-11-24 09:59:10,851 - INFO - allennlp.common.params - dataset_reader.type = text_classification_json_with_sampling
2022-11-24 09:59:10,851 - INFO - allennlp.common.from_params - instantiating class <class 'dont_stop_pretraining.data.dataset_readers.text_classification_json_reader_with_sampling.TextClassificationJsonReaderWithSampling'> from params {'lazy': False, 'max_sequence_length': 512, 'token_indexers': {'roberta': {'do_lowercase': False, 'model_name': '/home/mikeleatila/dont_stop_pretraining_master/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688', 'type': 'pretrained_transformer'}}, 'tokenizer': {'do_lowercase': False, 'end_tokens': ['
'], 'model_name': '/home/mikeleatila/dont_stop_pretraining_master/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688', 'start_tokens': [''], 'type': 'pretrained_transformer'}} and extras {'batch_weight_key', 'local_rank', 'serialization_dir'}
2022-11-24 09:59:10,851 - INFO - allennlp.common.from_params - instantiating class <class 'allennlp.data.token_indexers.token_indexer.TokenIndexer'> from params {'do_lowercase': False, 'model_name': '/home/mikeleatila/dont_stop_pretraining_master/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688', 'type': 'pretrained_transformer'} and extras {'batch_weight_key', 'local_rank', 'serialization_dir'}
2022-11-24 09:59:10,852 - INFO - allennlp.common.params - dataset_reader.token_indexers.roberta.type = pretrained_transformer
2022-11-24 09:59:10,852 - INFO - allennlp.common.from_params - instantiating class <class 'allennlp.data.token_indexers.pretrained_transformer_indexer.PretrainedTransformerIndexer'> from params {'do_lowercase': False, 'model_name': '/home/mikeleatila/dont_stop_pretraining_master/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688'} and extras {'batch_weight_key', 'local_rank', 'serialization_dir'}
2022-11-24 09:59:10,852 - INFO - allennlp.common.params - dataset_reader.token_indexers.roberta.token_min_padding_length = 0
2022-11-24 09:59:10,852 - INFO - allennlp.common.params - dataset_reader.token_indexers.roberta.model_name = /home/mikeleatila/dont_stop_pretraining_master/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688
2022-11-24 09:59:10,852 - INFO - allennlp.common.params - dataset_reader.token_indexers.roberta.namespace = tags
2022-11-24 09:59:10,852 - INFO - allennlp.common.params - dataset_reader.token_indexers.roberta.max_length = None
Traceback (most recent call last):
File "/home/mikeleatila/dont_stop_pretraining_master/scripts/train.py", line 142, in
main()
File "/home/mikeleatila/dont_stop_pretraining_master/scripts/train.py", line 139, in main
subprocess.run(" ".join(allennlp_command), shell=True, check=True)
File "/home/mikeleatila/anaconda3/envs/domains/lib/python3.7/subprocess.py", line 512, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command 'allennlp train --include-package dont_stop_pretraining training_config/classifier.jsonnet -s model_logs/citation-intent-dapt-dapt' returned non-zero exit status 1.

Many thanks in advance!
