allenai / dont-stop-pretraining
Code associated with the Don't Stop Pretraining ACL 2020 paper
It seems that the program cannot retrieve the specified dataset.
I am not sure whether this is an Amazon S3 problem.
botocore.exceptions.ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden
branch: allennlp-latest
command:
python -m scripts.train \
--config training_config/classifier.jsonnet \
--serialization_dir model_logs/citation-intent-base \
--hyperparameters ROBERTA_CLASSIFIER_SMALL \
--dataset citation_intent \
--model roberta-base \
--device 0 \
--evaluate_on_test
Has anyone tried to reproduce the result on ChemProt using RoBERTa?
I used the command provided, but I only got about half of the F-score shown in the paper.
-----------Command I used----------------
python -m scripts.train \
--config training_config/classifier.jsonnet \
--serialization_dir model_logs/chemprot-ROBERTA_CLASSIFIER_BIG-202010271621 \
--hyperparameters ROBERTA_CLASSIFIER_BIG \
--dataset chemprot \
--model roberta-base \
--device 0 \
--perf +f1 \
--evaluate_on_test \
--seed 0
-------------Result I got---------------------
2020-10-28 15:47:32,735 - INFO - allennlp.models.archival - archiving weights and vocabulary to model_logs/chemprot-ROBERTA_CLASSIFIER_BIG-202010271621/model.tar.gz
2020-10-28 15:48:00,526 - INFO - allennlp.common.util - Metrics: {
"best_epoch": 2,
"peak_cpu_memory_MB": 4431.752,
"peak_gpu_0_memory_MB": 13629,
"peak_gpu_1_memory_MB": 10,
"training_duration": "0:05:36.203710",
"training_start_epoch": 0,
"training_epochs": 2,
"epoch": 2,
"training_f1": 0.5388954075483176,
"training_accuracy": 0.8424082513792276,
"training_loss": 0.528517140297649,
"training_cpu_memory_MB": 4431.752,
"training_gpu_0_memory_MB": 13629,
"training_gpu_1_memory_MB": 10,
"validation_f1": 0.5084102337176983,
"validation_accuracy": 0.8026370004120313,
"validation_loss": 0.6763799888523001,
"best_validation_f1": 0.5084102337176983,
"best_validation_accuracy": 0.8026370004120313,
"best_validation_loss": 0.6763799888523001,
"test_f1": 0.4786599434625644,
"test_accuracy": 0.7999423464975497,
"test_loss": 0.679223679412495
}
Could you please check the implementation steps you provided in the README file?
I followed your instructions but found it very hard to reproduce this work. Some errors came up, such as a version inconsistency between allennlp and transformers, which then led to an error like:
subprocess.CalledProcessError: Command 'allennlp train training_config/classifier.jsonnet --include-package dont_stop_pretraining -s model_logs\citation_intent_base' returned non-zero exit status 1.
Or did I take some wrong steps in my setup? It is really confusing and frustrating.
Hi @kernelmachine / @kyleclo, I'm wondering how the News corpus is filtered from the RealNews dataset. I tried to extract docs from the RealNews dataset, but got 32.80M docs instead of the 11.90M docs mentioned in the paper. Is there any additional filtering applied? Thanks!
I noticed that in your paper you mention that "for RCT, we represent all sentences in one long sequence for simultaneous prediction". What do you mean by this? I did not find any specific treatment of the RCT dataset in your code.
Looking forward to your reply, Thanks!
Hi,
I was wondering: when calculating the macro_f1 score, is it the weighted macro_f1 or the unweighted macro_f1?
'Weighted' here means taking the support (number of instances) of each class into account.
Thanks!
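To make the distinction in the question concrete, here is a minimal pure-Python sketch of the two averages. It does not say which one the repository uses. The helper names (`per_class_f1`, `macro_f1`, `weighted_f1`) and the toy labels are made up for illustration.

```python
from collections import Counter

def per_class_f1(y_true, y_pred):
    """Return {label: (f1, support)} from two parallel lists of labels."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted p, but the gold label was something else
            fn[t] += 1  # gold label t was missed
    scores = {}
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        f1 = 2 * tp[label] / denom if denom else 0.0
        scores[label] = (f1, tp[label] + fn[label])  # support = gold count
    return scores

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1: every class counts equally."""
    scores = per_class_f1(y_true, y_pred)
    return sum(f1 for f1, _ in scores.values()) / len(scores)

def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 weighted by class support."""
    scores = per_class_f1(y_true, y_pred)
    total = sum(support for _, support in scores.values())
    return sum(f1 * support for f1, support in scores.values()) / total

# Toy, imbalanced example: the rare class "b" is never predicted.
y_true = ["a", "a", "a", "a", "b"]
y_pred = ["a", "a", "a", "a", "a"]
print("macro:", macro_f1(y_true, y_pred))
print("weighted:", weighted_f1(y_true, y_pred))
```

On imbalanced data the two can differ a lot, because the unweighted average is pulled down by a poorly predicted rare class while the weighted one is dominated by the frequent class.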
Any justifications?
Hi. First of all, thank you for your great work on task adaptation!
Since I want to do some research on task adaptation of language models,
it would be great if I could use the datasets that you used.
As far as I can see, the S3 link to the datasets is set to private, so other people cannot download them.
Am I missing something? Is there a way to download the datasets from the given link?
If not, do you have any plans to make the datasets you used public?
I suppose it may be difficult, since some datasets are copyrighted...
Thank you for reading my issue!
When I run DAPT, a MemoryError occurs.
It seems that the process runs out of memory.
My data file is 48 GB (biomed, filtered by myself) and my machine has 128 GB of memory.
Could you give me some hints for solving this problem?
Thanks!
`File "/import/home/X/dont-stop-pretraining/scripts/run_language_modeling.py", line 133, in load_and_cache_examples
return LineByLineTextDataset(tokenizer, args, file_path=file_path, block_size=args.block_size)
File "/import/home/X/dont-stop-pretraining/scripts/run_language_modeling.py", line 119, in __init__
lines = [line for line in f.read().splitlines() if (len(line) > 0 and not line.isspace())]
File "/home/X/.conda/envs/domains/lib/python3.7/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
MemoryError`
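The traceback points at `f.read().splitlines()`, which loads the whole file as one string and then builds a second full copy as a list of lines. Below is a minimal sketch of reading the same non-blank lines lazily instead. The names `iter_nonblank_lines` and `batched` are hypothetical helpers, not part of the repository, and the sketch assumes the downstream tokenization can consume lines in chunks.

```python
import itertools
from typing import Iterable, Iterator, List

def iter_nonblank_lines(file_path: str, encoding: str = "utf-8") -> Iterator[str]:
    """Yield non-empty, non-whitespace lines one at a time, without f.read()."""
    with open(file_path, encoding=encoding) as f:
        for line in f:  # the file object streams lines lazily
            line = line.rstrip("\n")
            if len(line) > 0 and not line.isspace():
                yield line

def batched(lines: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Group a stream of lines into lists of at most batch_size items."""
    it = iter(lines)
    while True:
        batch = list(itertools.islice(it, batch_size))
        if not batch:
            return
        yield batch
```

Each batch could then be tokenized and written to disk before the next one is read, so peak memory depends on the batch size rather than on the size of the corpus.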
Thank you for your contributions on pretraining. You trained the encoder for 12.5K steps per domain in the pretraining phase before applying it to supervised downstream tasks. Is it possible that the checkpoints most suitable for downstream tasks appear in the middle of the pretraining phase? This phenomenon is common in many real applications. In that case, we cannot know whether it is a good choice to train the model to the maximum step on the whole corpus and take the final checkpoint. Do you have any suggestions on that?
It seems that this repository only contains the code for fine-tuning a pretrained RoBERTa. Is the code for pretraining available now? Could you add some example commands for doing TAPT? Any advice or explanation will be highly appreciated. Thanks in advance!
It shows: /bin/sh: 1: allennlp: not found. "Command 'allennlp train --include-package dont_stop_pretraining training_config/classifier.jsonnet -s model_logs/citation-intent-base' returned non-zero exit status 127"
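Exit status 127 is the shell's conventional code for "command not found", so the message suggests that the `allennlp` executable is not on the PATH of the environment running the script. A small sketch for checking this from the same Python interpreter follows. `find_executable` is a hypothetical helper name, and the sketch does not diagnose why the executable is missing.

```python
import shutil
import sys

def find_executable(name: str):
    """Return the full path of `name` if it is on PATH, otherwise None."""
    return shutil.which(name)

if __name__ == "__main__":
    path = find_executable("allennlp")
    if path is None:
        print("allennlp is not on PATH for interpreter", sys.executable)
    else:
        print("allennlp found at", path)
```

If the executable is not found, one thing to check is whether the environment in which allennlp was installed is the one that is currently activated.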
Hi there, check the ADAPTIVE_PRETRAINING.md file for DAPT/TAPT commands
Originally posted by @kernelmachine in #10 (comment)
I cannot find the 'ADAPTIVE_PRETRAINING.md' file, thank you!
Hi, I have set up the conda environment and run the scripts, i.e. first running
python -m scripts.download_model \
--model allenai/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688 \
--serialization_dir $(pwd)/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688
and then
python -m scripts.train \
--config training_config/classifier.jsonnet \
--serialization_dir model_logs/citation-intent-dapt-dapt \
--hyperparameters ROBERTA_CLASSIFIER_SMALL \
--dataset citation_intent \
--model $(pwd)/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688 \
--device 0 \
--perf +f1 \
--evaluate_on_test
but I am getting this error now:
/home/mikeleatila/anaconda3/envs/domains/bin/python /home/mikeleatila/dont_stop_pretraining_master/scripts/train.py --config training_config/classifier.jsonnet --serialization_dir model_logs/citation-intent-dapt-dapt --hyperparameters ROBERTA_CLASSIFIER_SMALL --dataset citation_intent --model /home/mikeleatila/dont_stop_pretraining_master/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688 --device 0 --perf +f1 --evaluate_on_test
2022-11-24 09:59:10,204 - INFO - transformers.file_utils - PyTorch version 1.13.0 available.
2022-11-24 09:59:10,816 - INFO - pytorch_pretrained_bert.modeling - Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .
Traceback (most recent call last):
File "/home/mikeleatila/anaconda3/envs/domains/bin/allennlp", line 8, in <module>
sys.exit(run())
File "/home/mikeleatila/anaconda3/envs/domains/lib/python3.7/site-packages/allennlp/run.py", line 18, in run
main(prog="allennlp")
File "/home/mikeleatila/anaconda3/envs/domains/lib/python3.7/site-packages/allennlp/commands/__init__.py", line 93, in main
args.func(args)
File "/home/mikeleatila/anaconda3/envs/domains/lib/python3.7/site-packages/allennlp/commands/train.py", line 144, in train_model_from_args
dry_run=args.dry_run,
File "/home/mikeleatila/anaconda3/envs/domains/lib/python3.7/site-packages/allennlp/commands/train.py", line 203, in train_model_from_file
dry_run=dry_run,
File "/home/mikeleatila/anaconda3/envs/domains/lib/python3.7/site-packages/allennlp/commands/train.py", line 266, in train_model
dry_run=dry_run,
File "/home/mikeleatila/anaconda3/envs/domains/lib/python3.7/site-packages/allennlp/commands/train.py", line 450, in _train_worker
batch_weight_key=batch_weight_key,
File "/home/mikeleatila/anaconda3/envs/domains/lib/python3.7/site-packages/allennlp/common/from_params.py", line 555, in from_params
**extras,
File "/home/mikeleatila/anaconda3/envs/domains/lib/python3.7/site-packages/allennlp/common/from_params.py", line 583, in from_params
kwargs = create_kwargs(constructor_to_inspect, cls, params, **extras)
File "/home/mikeleatila/anaconda3/envs/domains/lib/python3.7/site-packages/allennlp/common/from_params.py", line 188, in create_kwargs
cls.__name__, param_name, annotation, param.default, params, **extras
File "/home/mikeleatila/anaconda3/envs/domains/lib/python3.7/site-packages/allennlp/common/from_params.py", line 294, in pop_and_construct_arg
return construct_arg(class_name, name, popped_params, annotation, default, **extras)
File "/home/mikeleatila/anaconda3/envs/domains/lib/python3.7/site-packages/allennlp/common/from_params.py", line 329, in construct_arg
return annotation.from_params(params=popped_params, **subextras)
File "/home/mikeleatila/anaconda3/envs/domains/lib/python3.7/site-packages/allennlp/common/from_params.py", line 555, in from_params
**extras,
File "/home/mikeleatila/anaconda3/envs/domains/lib/python3.7/site-packages/allennlp/common/from_params.py", line 583, in from_params
kwargs = create_kwargs(constructor_to_inspect, cls, params, **extras)
File "/home/mikeleatila/anaconda3/envs/domains/lib/python3.7/site-packages/allennlp/common/from_params.py", line 188, in create_kwargs
cls.__name__, param_name, annotation, param.default, params, **extras
File "/home/mikeleatila/anaconda3/envs/domains/lib/python3.7/site-packages/allennlp/common/from_params.py", line 294, in pop_and_construct_arg
return construct_arg(class_name, name, popped_params, annotation, default, **extras)
File "/home/mikeleatila/anaconda3/envs/domains/lib/python3.7/site-packages/allennlp/common/from_params.py", line 372, in construct_arg
**extras,
File "/home/mikeleatila/anaconda3/envs/domains/lib/python3.7/site-packages/allennlp/common/from_params.py", line 329, in construct_arg
return annotation.from_params(params=popped_params, **subextras)
File "/home/mikeleatila/anaconda3/envs/domains/lib/python3.7/site-packages/allennlp/common/from_params.py", line 555, in from_params
**extras,
File "/home/mikeleatila/anaconda3/envs/domains/lib/python3.7/site-packages/allennlp/common/from_params.py", line 583, in from_params
kwargs = create_kwargs(constructor_to_inspect, cls, params, **extras)
File "/home/mikeleatila/anaconda3/envs/domains/lib/python3.7/site-packages/allennlp/common/from_params.py", line 199, in create_kwargs
params.assert_empty(cls.__name__)
File "/home/mikeleatila/anaconda3/envs/domains/lib/python3.7/site-packages/allennlp/common/params.py", line 421, in assert_empty
"Extra parameters passed to {}: {}".format(class_name, self.params)
allennlp.common.checks.ConfigurationError: Extra parameters passed to PretrainedTransformerIndexer: {'do_lowercase': False}
2022-11-24 09:59:10,850 - INFO - allennlp.common.params - random_seed = 58860
2022-11-24 09:59:10,850 - INFO - allennlp.common.params - numpy_seed = 58860
2022-11-24 09:59:10,850 - INFO - allennlp.common.params - pytorch_seed = 58860
2022-11-24 09:59:10,851 - INFO - allennlp.common.checks - Pytorch version: 1.13.0
2022-11-24 09:59:10,851 - INFO - allennlp.common.from_params - instantiating class <class 'allennlp.commands.train.TrainModel'> from params {'validation_data_path': 'https://s3-us-west-2.amazonaws.com/allennlp/dont_stop_pretraining/data/citation_intent/dev.jsonl', 'evaluate_on_test': True, 'model': {'dropout': '0.1', 'feedforward_layer': {'activations': 'tanh', 'hidden_dims': 768, 'input_dim': 768, 'num_layers': 1}, 'seq2vec_encoder': {'embedding_dim': 768, 'type': 'cls_pooler_x'}, 'text_field_embedder': {'roberta': {'model_name': '/home/mikeleatila/dont_stop_pretraining_master/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688', 'type': 'pretrained_transformer'}}, 'type': 'basic_classifier_with_f1'}, 'iterator': {'batch_size': 16, 'sorting_keys': [['tokens', 'num_tokens']], 'type': 'bucket'}, 'validation_iterator': {'batch_size': 64, 'sorting_keys': [['tokens', 'num_tokens']], 'type': 'bucket'}, 'train_data_path': 'https://s3-us-west-2.amazonaws.com/allennlp/dont_stop_pretraining/data/citation_intent/train.jsonl', 'test_data_path': 'https://s3-us-west-2.amazonaws.com/allennlp/dont_stop_pretraining/data/citation_intent/test.jsonl', 'trainer': {'cuda_device': 0, 'gradient_accumulation_batch_size': 16, 'num_epochs': 10, 'num_serialized_models_to_keep': 0, 'optimizer': {'b1': 0.9, 'b2': 0.98, 'e': 1e-06, 'lr': '2e-05', 'max_grad_norm': 1, 'parameter_groups': [[['bias', 'LayerNorm.bias', 'LayerNorm.weight', 'layer_norm.weight'], {'weight_decay': 0}, []]], 'schedule': 'warmup_linear', 't_total': -1, 'type': 'bert_adam', 'warmup': 0.06, 'weight_decay': 0.1}, 'patience': 3, 'validation_metric': '+f1'}, 'validation_dataset_reader': {'lazy': False, 'max_sequence_length': 512, 'token_indexers': {'roberta': {'do_lowercase': False, 'model_name': '/home/mikeleatila/dont_stop_pretraining_master/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688', 'type': 'pretrained_transformer'}}, 'tokenizer': {'do_lowercase': False, 'end_tokens': [''], 
'model_name': '/home/mikeleatila/dont_stop_pretraining_master/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688', 'start_tokens': [''], 'type': 'pretrained_transformer'}, 'type': 'text_classification_json_with_sampling'}, 'dataset_reader': {'lazy': False, 'max_sequence_length': 512, 'token_indexers': {'roberta': {'do_lowercase': False, 'model_name': '/home/mikeleatila/dont_stop_pretraining_master/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688', 'type': 'pretrained_transformer'}}, 'tokenizer': {'do_lowercase': False, 'end_tokens': [''], 'model_name': '/home/mikeleatila/dont_stop_pretraining_master/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688', 'start_tokens': [''], 'type': 'pretrained_transformer'}, 'type': 'text_classification_json_with_sampling'}} and extras {'batch_weight_key', 'local_rank', 'serialization_dir'}'], 'model_name': '/home/mikeleatila/dont_stop_pretraining_master/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688', 'start_tokens': ['
2022-11-24 09:59:10,851 - INFO - allennlp.common.params - type = default
2022-11-24 09:59:10,851 - INFO - allennlp.common.from_params - instantiating class <class 'allennlp.commands.train.TrainModel'> from params {'validation_data_path': 'https://s3-us-west-2.amazonaws.com/allennlp/dont_stop_pretraining/data/citation_intent/dev.jsonl', 'evaluate_on_test': True, 'model': {'dropout': '0.1', 'feedforward_layer': {'activations': 'tanh', 'hidden_dims': 768, 'input_dim': 768, 'num_layers': 1}, 'seq2vec_encoder': {'embedding_dim': 768, 'type': 'cls_pooler_x'}, 'text_field_embedder': {'roberta': {'model_name': '/home/mikeleatila/dont_stop_pretraining_master/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688', 'type': 'pretrained_transformer'}}, 'type': 'basic_classifier_with_f1'}, 'iterator': {'batch_size': 16, 'sorting_keys': [['tokens', 'num_tokens']], 'type': 'bucket'}, 'validation_iterator': {'batch_size': 64, 'sorting_keys': [['tokens', 'num_tokens']], 'type': 'bucket'}, 'train_data_path': 'https://s3-us-west-2.amazonaws.com/allennlp/dont_stop_pretraining/data/citation_intent/train.jsonl', 'test_data_path': 'https://s3-us-west-2.amazonaws.com/allennlp/dont_stop_pretraining/data/citation_intent/test.jsonl', 'trainer': {'cuda_device': 0, 'gradient_accumulation_batch_size': 16, 'num_epochs': 10, 'num_serialized_models_to_keep': 0, 'optimizer': {'b1': 0.9, 'b2': 0.98, 'e': 1e-06, 'lr': '2e-05', 'max_grad_norm': 1, 'parameter_groups': [[['bias', 'LayerNorm.bias', 'LayerNorm.weight', 'layer_norm.weight'], {'weight_decay': 0}, []]], 'schedule': 'warmup_linear', 't_total': -1, 'type': 'bert_adam', 'warmup': 0.06, 'weight_decay': 0.1}, 'patience': 3, 'validation_metric': '+f1'}, 'validation_dataset_reader': {'lazy': False, 'max_sequence_length': 512, 'token_indexers': {'roberta': {'do_lowercase': False, 'model_name': '/home/mikeleatila/dont_stop_pretraining_master/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688', 'type': 'pretrained_transformer'}}, 'tokenizer': {'do_lowercase': False, 'end_tokens': [''], 
'type': 'pretrained_transformer'}, 'type': 'text_classification_json_with_sampling'}, 'dataset_reader': {'lazy': False, 'max_sequence_length': 512, 'token_indexers': {'roberta': {'do_lowercase': False, 'model_name': '/home/mikeleatila/dont_stop_pretraining_master/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688', 'type': 'pretrained_transformer'}}, 'tokenizer': {'do_lowercase': False, 'end_tokens': [''], 'model_name': '/home/mikeleatila/dont_stop_pretraining_master/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688', 'start_tokens': [''], 'type': 'pretrained_transformer'}, 'type': 'text_classification_json_with_sampling'}} and extras {'batch_weight_key', 'local_rank', 'serialization_dir'}'], 'model_name': '/home/mikeleatila/dont_stop_pretraining_master/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688', 'start_tokens': ['
2022-11-24 09:59:10,851 - INFO - allennlp.common.from_params - instantiating class <class 'allennlp.data.dataset_readers.dataset_reader.DatasetReader'> from params {'lazy': False, 'max_sequence_length': 512, 'token_indexers': {'roberta': {'do_lowercase': False, 'model_name': '/home/mikeleatila/dont_stop_pretraining_master/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688', 'type': 'pretrained_transformer'}}, 'tokenizer': {'do_lowercase': False, 'end_tokens': [''], 'type': 'pretrained_transformer'}, 'type': 'text_classification_json_with_sampling'} and extras {'batch_weight_key', 'local_rank', 'serialization_dir'}'], 'model_name': '/home/mikeleatila/dont_stop_pretraining_master/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688', 'start_tokens': ['
2022-11-24 09:59:10,851 - INFO - allennlp.common.params - dataset_reader.type = text_classification_json_with_sampling
2022-11-24 09:59:10,851 - INFO - allennlp.common.from_params - instantiating class <class 'dont_stop_pretraining.data.dataset_readers.text_classification_json_reader_with_sampling.TextClassificationJsonReaderWithSampling'> from params {'lazy': False, 'max_sequence_length': 512, 'token_indexers': {'roberta': {'do_lowercase': False, 'model_name': '/home/mikeleatila/dont_stop_pretraining_master/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688', 'type': 'pretrained_transformer'}}, 'tokenizer': {'do_lowercase': False, 'end_tokens': [''], 'type': 'pretrained_transformer'}} and extras {'batch_weight_key', 'local_rank', 'serialization_dir'}
2022-11-24 09:59:10,851 - INFO - allennlp.common.from_params - instantiating class <class 'allennlp.data.token_indexers.token_indexer.TokenIndexer'> from params {'do_lowercase': False, 'model_name': '/home/mikeleatila/dont_stop_pretraining_master/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688', 'type': 'pretrained_transformer'} and extras {'batch_weight_key', 'local_rank', 'serialization_dir'}
2022-11-24 09:59:10,852 - INFO - allennlp.common.params - dataset_reader.token_indexers.roberta.type = pretrained_transformer
2022-11-24 09:59:10,852 - INFO - allennlp.common.from_params - instantiating class <class 'allennlp.data.token_indexers.pretrained_transformer_indexer.PretrainedTransformerIndexer'> from params {'do_lowercase': False, 'model_name': '/home/mikeleatila/dont_stop_pretraining_master/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688'} and extras {'batch_weight_key', 'local_rank', 'serialization_dir'}
2022-11-24 09:59:10,852 - INFO - allennlp.common.params - dataset_reader.token_indexers.roberta.token_min_padding_length = 0
2022-11-24 09:59:10,852 - INFO - allennlp.common.params - dataset_reader.token_indexers.roberta.model_name = /home/mikeleatila/dont_stop_pretraining_master/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688
2022-11-24 09:59:10,852 - INFO - allennlp.common.params - dataset_reader.token_indexers.roberta.namespace = tags
2022-11-24 09:59:10,852 - INFO - allennlp.common.params - dataset_reader.token_indexers.roberta.max_length = None
Traceback (most recent call last):
File "/home/mikeleatila/dont_stop_pretraining_master/scripts/train.py", line 142, in <module>
main()
File "/home/mikeleatila/dont_stop_pretraining_master/scripts/train.py", line 139, in main
subprocess.run(" ".join(allennlp_command), shell=True, check=True)
File "/home/mikeleatila/anaconda3/envs/domains/lib/python3.7/subprocess.py", line 512, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command 'allennlp train --include-package dont_stop_pretraining training_config/classifier.jsonnet -s model_logs/citation-intent-dapt-dapt' returned non-zero exit status 1.
Many thanks in advance!
After using my own corpus for domain-adaptive pretraining, the vocab.txt is the same size as that of the initial model (BERT-base). In short, domain-adaptive pretraining does not extend the vocabulary with words from the new domain? So some domain-specific vocabulary still does not exist in the vocab.txt that results from domain-adaptive pretraining. Is that right?
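As background to the question: continued pretraining updates model weights while the tokenizer vocabulary stays fixed, so a word missing from the vocabulary is still represented as a sequence of subword pieces. The toy sketch below illustrates greedy longest-match WordPiece-style splitting. It is a simplification, not the tokenizer implementation of any library, and the vocabulary and the function name `wordpiece` are made up for illustration.

```python
from typing import List, Set

def wordpiece(word: str, vocab: Set[str], unk: str = "[UNK]") -> List[str]:
    """Split one word into the longest matching pieces from a fixed vocab.

    Pieces that continue a word carry the '##' prefix, as in BERT's vocab.txt.
    """
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # try a shorter candidate
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# Made-up toy vocabulary with no entry for the full domain word.
toy_vocab = {"chem", "##o", "##therapy", "the"}
print(wordpiece("chemotherapy", toy_vocab))
```

In this toy setting the domain word is never added to the vocabulary, yet it can still be encoded through pieces that are already there.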
Hi, thank you for the interesting work!
I understand that an interface is provided by the conda packages and the conda environment, which is very convenient for reproducing results. Now I would like to develop the code further and add other tricks to the model. Is there a natural way to do so? I could certainly dig into the conda packages and work on them, but that goes against common development practice...
I am trying to understand the method for TAPT. For chemprot, for example, are you using the same training set that is used for fine-tuning, just augmented by "randomly masking different tokens across epochs, using the masking probability of 0.15"? Or is some other UNLABELED dataset used for chemprot during TAPT, with the open-sourced labeled chemprot data used only for fine-tuning on the downstream task?
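The quoted phrase describes dynamic masking: a fresh set of positions is masked each time a text is seen. The sketch below illustrates that idea in plain Python, under simplifying assumptions. It always substitutes a mask token, whereas BERT-style masking also keeps or randomly replaces a fraction of the selected tokens, and it is not the repository's data collator. The names `mask_tokens` and `MASK` and the example sentence are made up.

```python
import random
from typing import List, Tuple

MASK = "<mask>"

def mask_tokens(tokens: List[str], mask_prob: float,
                rng: random.Random) -> Tuple[List[str], List[int]]:
    """Return a masked copy of tokens plus the list of masked positions."""
    masked = list(tokens)
    positions = [i for i in range(len(tokens)) if rng.random() < mask_prob]
    for i in positions:
        masked[i] = MASK
    return masked, positions

tokens = "the inhibitor binds to the kinase and blocks its activity".split()
rng = random.Random(0)  # fixed seed only so the demo is repeatable
for epoch in range(3):
    masked, positions = mask_tokens(tokens, 0.15, rng)
    # The same sentence can receive a different mask pattern in each epoch.
    print(epoch, positions, " ".join(masked))
```

Because the positions are redrawn on every pass, the same unlabeled text yields different prediction targets across epochs.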
I was wondering whether you use the unlabeled validation/test sets in TAPT?
@kyleclo
Further DAPT was run on each domain for 12.5K steps with unlabeled data from the target domain only. I am wondering whether not adding unlabeled data from the original LM domain leads to detrimental forgetting or overfitting.
When I run the following command:
curl -Lo train.jsonl https://allennlp.s3-us-west-2.amazonaws.com/dont_stop_pretraining/data/chemprot/train.jsonl
I got this error:
<?xml version="1.0" encoding="UTF-8"?> <Error><Code>AccessDenied</Code><Message>Access Denied</Message><RequestId>A49C5B9A8511C5A7</RequestId><HostId>O3qjG2TZgfQV6vxVOsMVUz5rvnaOESnSnKNnZXuAcJ/JKYlGcdlS3azlOZgg2+V0bWq2sNx7wgY=</HostId></Error>
Can you see what might be wrong with the AWS server?
How to extend this for T5 models? Any ideas?
Thanks for your excellent work. I selected data following DATA_SELECTION.md. To extract VAMPIRE embeddings, one should run 'python -m scripts.run_vampire ......'. However, the file 'scripts/run_vampire.py' does not exist in the current VAMPIRE project (http://github.com/allenai/vampire). Was the file removed in the current version? Any suggestions? Thanks.
Hi Team, I'm wondering how the CS/BioMed corpus is filtered from the S2ORC dataset. I didn't find details on this in the original paper. Could you shed some light on this? Thanks!
Hi,
I am doing DAPT on the CS domain with 38 GB of CS data on a single TPU v3-8.
One epoch is estimated to take 20-24 hours.
I see from the paper that you used a TPU v3-8 as well, but I could not find timing information in the paper. Could you share how much time you needed for pretraining?
Thanks!
In environments/datasets.py, it seems that datasets for DAPT are missing.
Can you provide external links to download datasets for DAPT?
We want to report issues that could affect the reproducibility of the masked LM loss calculation at test time.
First, we do not get exactly the same results as reported in Table 3 of the paper when we use the fairseq library instead of the transformers library, after converting the transformers checkpoint to a fairseq checkpoint.
A related pull request was opened and closed, but it did not fix our problem. Second, the results in Table 3 are calculated using a batch size of 1. With batch sizes larger than 1, we do not get the same results. In particular, the results change for a sample of reviews. As we have already mentioned, reviews are much shorter than documents from other domains. Therefore, unlike documents in other domains, which are usually of the maximum length, reviews need to be padded to the maximum length. For this reason, we suspect that padding somehow influences the masked LM loss calculation. However, with a batch size of 1 no padding is needed, and therefore we find the results in Table 3 reliable.
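The sketch below shows, with made-up numbers, two ways in which batching a short padded document with a long one could change a reported mean loss. It is an illustration of the suspicion above under stated assumptions, not a diagnosis of the transformers or fairseq code. The per-token losses, the masks and the helper name `mean_loss` are all hypothetical.

```python
from typing import List

def mean_loss(token_losses: List[List[float]], mask: List[List[int]]) -> float:
    """Average the loss over all positions where mask == 1, pooled over the batch."""
    total = sum(l * m
                for row_l, row_m in zip(token_losses, mask)
                for l, m in zip(row_l, row_m))
    count = sum(m for row in mask for m in row)
    return total / count

# Hypothetical per-token losses: a full-length document and a short review
# padded with two positions (whose loss entries are 0.0 placeholders).
losses = [[2.0, 2.0, 2.0, 2.0],
          [1.0, 1.0, 0.0, 0.0]]
real_tokens = [[1, 1, 1, 1],
               [1, 1, 0, 0]]
all_positions = [[1, 1, 1, 1],
                 [1, 1, 1, 1]]

# Batch size 1: average each document separately, then average the documents.
per_document = (mean_loss(losses[:1], real_tokens[:1])
                + mean_loss(losses[1:], real_tokens[1:])) / 2
# Batch size 2, padding ignored: longer documents contribute more tokens.
pooled = mean_loss(losses, real_tokens)
# Batch size 2, padding wrongly counted in the denominator.
padded_counted = mean_loss(losses, all_positions)

print(per_document, pooled, padded_counted)
```

All three averages differ in this toy case, so even a correct padding mask can give a different number at batch sizes above 1 if the loss is pooled over tokens rather than averaged per document.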
Hi, I was trying to run the command:
python -m scripts.train \
--config training_config/classifier.jsonnet \
--serialization_dir model_logs/citation_intent_base \
--hyperparameters ROBERTA_CLASSIFIER_SMALL \
--dataset citation_intent \
--model $(pwd)/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688 \
--device 0 \
--perf +f1 \
--evaluate_on_test
Before running this command, I did the following steps in the following order.
1.
pip install pytorch-transformers
pip install transformers
pip install git+https://github.com/kernelmachine/allennlp.git@4ae123d2c3bfb1ea3ce7362cb6c5bca3d094ffa7
2.
python -m scripts.download_model \
--model allenai/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688 \
--serialization_dir $(pwd)/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688
After these two steps, when I run the scripts.train command, I get the error shown below.
2020-07-29 10:13:11,360 - INFO - pytorch_pretrained_bert.modeling - Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .
2020-07-29 10:13:12,114 - INFO - pytorch_transformers.modeling_bert - Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .
2020-07-29 10:13:12,117 - INFO - pytorch_transformers.modeling_xlnet - Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .
2020-07-29 10:13:12,709 - INFO - allennlp.common.params - random_seed = 278011
2020-07-29 10:13:12,710 - INFO - allennlp.common.params - numpy_seed = 278011
2020-07-29 10:13:12,710 - INFO - allennlp.common.params - pytorch_seed = 278011
2020-07-29 10:13:12,780 - INFO - allennlp.common.checks - Pytorch version: 1.5.1+cu101
2020-07-29 10:13:12,782 - INFO - allennlp.common.params - evaluate_on_test = True
2020-07-29 10:13:12,782 - INFO - allennlp.common.from_params - instantiating class <class 'allennlp.data.dataset_readers.dataset_reader.DatasetReader'> from params {'lazy': False, 'max_sequence_length': 512, 'token_indexers': {'roberta': {'do_lowercase': False, 'model_name': '/content/dont-stop-pretraining/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688', 'type': 'pretrained_transformer'}}, 'tokenizer': {'do_lowercase': False, 'end_tokens': ['</s>'], 'model_name': '/content/dont-stop-pretraining/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688', 'start_tokens': ['<s>'], 'type': 'pretrained_transformer'}, 'type': 'text_classification_json_with_sampling'} and extras set()
2020-07-29 10:13:12,782 - INFO - allennlp.common.params - dataset_reader.type = text_classification_json_with_sampling
2020-07-29 10:13:12,782 - INFO - allennlp.common.from_params - instantiating class <class 'dont_stop_pretraining.data.dataset_readers.text_classification_json_reader_with_sampling.TextClassificationJsonReaderWithSampling'> from params {'lazy': False, 'max_sequence_length': 512, 'token_indexers': {'roberta': {'do_lowercase': False, 'model_name': '/content/dont-stop-pretraining/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688', 'type': 'pretrained_transformer'}}, 'tokenizer': {'do_lowercase': False, 'end_tokens': ['</s>'], 'model_name': '/content/dont-stop-pretraining/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688', 'start_tokens': ['<s>'], 'type': 'pretrained_transformer'}} and extras set()
2020-07-29 10:13:12,783 - INFO - allennlp.common.from_params - instantiating class allennlp.data.token_indexers.token_indexer.TokenIndexer from params {'do_lowercase': False, 'model_name': '/content/dont-stop-pretraining/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688', 'type': 'pretrained_transformer'} and extras set()
2020-07-29 10:13:12,783 - INFO - allennlp.common.params - dataset_reader.token_indexers.roberta.type = pretrained_transformer
2020-07-29 10:13:12,783 - INFO - allennlp.common.from_params - instantiating class allennlp.data.token_indexers.pretrained_transformer_indexer.PretrainedTransformerIndexer from params {'do_lowercase': False, 'model_name': '/content/dont-stop-pretraining/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688'} and extras set()
2020-07-29 10:13:12,783 - INFO - allennlp.common.params - dataset_reader.token_indexers.roberta.model_name = /content/dont-stop-pretraining/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688
2020-07-29 10:13:12,783 - INFO - allennlp.common.params - dataset_reader.token_indexers.roberta.do_lowercase = False
2020-07-29 10:13:12,784 - INFO - allennlp.common.params - dataset_reader.token_indexers.roberta.namespace = tags
2020-07-29 10:13:12,784 - INFO - allennlp.common.params - dataset_reader.token_indexers.roberta.token_min_padding_length = 0
2020-07-29 10:13:12,784 - INFO - pytorch_transformers.tokenization_utils - Model name '/content/dont-stop-pretraining/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688' not found in model shortcut name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese, bert-base-german-cased, bert-large-uncased-whole-word-masking, bert-large-cased-whole-word-masking, bert-large-uncased-whole-word-masking-finetuned-squad, bert-large-cased-whole-word-masking-finetuned-squad, bert-base-cased-finetuned-mrpc). Assuming '/content/dont-stop-pretraining/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688' is a path or url to a directory containing tokenizer files.
2020-07-29 10:13:12,784 - INFO - pytorch_transformers.tokenization_utils - Didn't find file /content/dont-stop-pretraining/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688/vocab.txt. We won't load it.
2020-07-29 10:13:12,784 - INFO - pytorch_transformers.tokenization_utils - Didn't find file /content/dont-stop-pretraining/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688/added_tokens.json. We won't load it.
2020-07-29 10:13:12,784 - INFO - pytorch_transformers.tokenization_utils - loading file None
2020-07-29 10:13:12,784 - INFO - pytorch_transformers.tokenization_utils - loading file None
2020-07-29 10:13:12,785 - INFO - pytorch_transformers.tokenization_utils - loading file /content/dont-stop-pretraining/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688/special_tokens_map.json
Traceback (most recent call last):
File "/usr/local/bin/allennlp", line 8, in <module>
sys.exit(run())
File "/usr/local/lib/python3.6/dist-packages/allennlp/run.py", line 18, in run
main(prog="allennlp")
File "/usr/local/lib/python3.6/dist-packages/allennlp/commands/__init__.py", line 120, in main
args.func(args)
File "/usr/local/lib/python3.6/dist-packages/allennlp/commands/train.py", line 150, in train_model_from_args
args.cache_prefix,
File "/usr/local/lib/python3.6/dist-packages/allennlp/commands/train.py", line 199, in train_model_from_file
cache_prefix,
File "/usr/local/lib/python3.6/dist-packages/allennlp/commands/train.py", line 257, in train_model
params, serialization_dir, recover, cache_directory, cache_prefix
File "/usr/local/lib/python3.6/dist-packages/allennlp/training/trainer_pieces.py", line 45, in from_params
all_datasets = training_util.datasets_from_params(params, cache_directory, cache_prefix)
File "/usr/local/lib/python3.6/dist-packages/allennlp/training/util.py", line 169, in datasets_from_params
dataset_reader = DatasetReader.from_params(dataset_reader_params)
File "/usr/local/lib/python3.6/dist-packages/allennlp/common/from_params.py", line 377, in from_params
return subclass.from_params(params=params, **extras)
File "/usr/local/lib/python3.6/dist-packages/allennlp/common/from_params.py", line 398, in from_params
kwargs = create_kwargs(cls, params, **extras)
File "/usr/local/lib/python3.6/dist-packages/allennlp/common/from_params.py", line 140, in create_kwargs
kwargs[name] = construct_arg(cls, name, annotation, param.default, params, **extras)
File "/usr/local/lib/python3.6/dist-packages/allennlp/common/from_params.py", line 265, in construct_arg
value_dict[key] = value_cls.from_params(params=value_params, **subextras)
File "/usr/local/lib/python3.6/dist-packages/allennlp/common/from_params.py", line 377, in from_params
return subclass.from_params(params=params, **extras)
File "/usr/local/lib/python3.6/dist-packages/allennlp/common/from_params.py", line 400, in from_params
return cls(**kwargs) # type: ignore
File "/usr/local/lib/python3.6/dist-packages/allennlp/data/token_indexers/pretrained_transformer_indexer.py", line 58, in __init__
self.tokenizer = AutoTokenizer.from_pretrained(model_name, do_lower_case=do_lowercase)
File "/usr/local/lib/python3.6/dist-packages/pytorch_transformers/tokenization_auto.py", line 89, in from_pretrained
return BertTokenizer.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/pytorch_transformers/tokenization_bert.py", line 216, in from_pretrained
return super(BertTokenizer, cls)._from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/pytorch_transformers/tokenization_utils.py", line 327, in _from_pretrained
tokenizer = cls(*inputs, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/pytorch_transformers/tokenization_bert.py", line 128, in __init__
if not os.path.isfile(vocab_file):
File "/usr/lib/python3.6/genericpath.py", line 30, in isfile
st = os.stat(path)
TypeError: stat: path should be string, bytes, os.PathLike or integer, not NoneType
Traceback (most recent call last):
File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/content/dont-stop-pretraining/scripts/train.py", line 143, in <module>
main()
File "/content/dont-stop-pretraining/scripts/train.py", line 140, in main
subprocess.run(" ".join(allennlp_command), shell=True, check=True)
File "/usr/lib/python3.6/subprocess.py", line 438, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command 'allennlp train --include-package dont_stop_pretraining training_config/classifier.jsonnet -s model_logs/citation_intent_base' returned non-zero exit status 1.
I am running the code in Google Colab. I will be very grateful if anyone can help me in understanding where I am going wrong. Thanks.
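Reading the traceback above: AutoTokenizer hands the path to BertTokenizer (tokenization_auto.py, line 89), the log reports that vocab.txt was not found, and the resulting vocab_file of None is what os.path.isfile rejects with the TypeError. A RoBERTa checkpoint normally ships vocab.json and merges.txt rather than vocab.txt, so one plausible cause is that the installed pytorch_transformers does not treat this path as a RoBERTa model. The sketch below is only a diagnostic, not a fix: it lists which tokenizer files a local model directory is missing for each family. The helper names (missing_tokenizer_files, report) are hypothetical and are not part of the repository or of pytorch_transformers.

```python
from pathlib import Path

# File names each tokenizer family looks for in a local model directory.
# BERT's WordPiece tokenizer reads vocab.txt; RoBERTa's byte-level BPE
# tokenizer reads vocab.json plus merges.txt.
EXPECTED_FILES = {
    "bert": ["vocab.txt"],
    "roberta": ["vocab.json", "merges.txt"],
}


def missing_tokenizer_files(model_dir, family):
    """Return the expected tokenizer files that are absent from model_dir."""
    directory = Path(model_dir)
    return [name for name in EXPECTED_FILES[family]
            if not (directory / name).is_file()]


def report(model_dir):
    """Print, per tokenizer family, which files are missing (hypothetical helper)."""
    for family in EXPECTED_FILES:
        missing = missing_tokenizer_files(model_dir, family)
        status = "complete" if not missing else "missing " + ", ".join(missing)
        print(f"{family}: {status}")
```

Calling report() on the pretrained_models directory from the log would show whether the download itself is incomplete (RoBERTa files missing) or whether the files are present and the library is simply looking for the wrong ones.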
How did you split the IMDB training set into train and dev parts (25,000 -> 20,000 + 5,000)? Is this a standard split, or did you split it randomly?
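The thread does not say how the authors produced this split. For anyone who needs a comparable 20,000 + 5,000 split in the meantime, here is a minimal sketch of a seeded random split. It assumes a plain random hold-out, which may not match what the paper used; split_train_dev and the seed value are illustrative choices, not taken from the repository.

```python
import random


def split_train_dev(examples, dev_size, seed=0):
    """Shuffle a copy of `examples` with a fixed seed and hold out `dev_size` items."""
    if not 0 <= dev_size <= len(examples):
        raise ValueError("dev_size must be between 0 and len(examples)")
    shuffled = list(examples)
    # A private Random instance keeps the split independent of global RNG state.
    random.Random(seed).shuffle(shuffled)
    return shuffled[dev_size:], shuffled[:dev_size]


# Stand-in IDs for the 25,000 IMDB training reviews (assumption for the demo).
train, dev = split_train_dev(range(25000), dev_size=5000, seed=0)
print(len(train), len(dev))
```

Fixing the seed makes the split reproducible across runs, so results from different experiments are evaluated on the same dev examples.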
In environments/datasets.py, it seems that datasets for DAPT are missing.
Can you provide external links to download datasets for DAPT?
Traceback (most recent call last):
File "/.virtualenvs/bort_test/lib/python3.6/site-packages/allennlp/common/params.py", line 239, in pop
value = self.params.pop(key)
KeyError: 'data_loader'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/.virtualenvs/bort_test/bin/allennlp", line 9, in <module>
sys.exit(run())
File "/.virtualenvs/bort_test/lib/python3.6/site-packages/allennlp/__main__.py", line 34, in run
main(prog="allennlp")
File "/.virtualenvs/bort_test/lib/python3.6/site-packages/allennlp/commands/__init__.py", line 94, in main
args.func(args)
File "/.virtualenvs/bort_test/lib/python3.6/site-packages/allennlp/commands/train.py", line 112, in train_model_from_args
file_friendly_logging=args.file_friendly_logging,
File "/.virtualenvs/bort_test/lib/python3.6/site-packages/allennlp/commands/train.py", line 171, in train_model_from_file
file_friendly_logging=file_friendly_logging,
File "/.virtualenvs/bort_test/lib/python3.6/site-packages/allennlp/commands/train.py", line 232, in train_model
file_friendly_logging=file_friendly_logging,
File "/.virtualenvs/bort_test/lib/python3.6/site-packages/allennlp/commands/train.py", line 423, in _train_worker
params=params, serialization_dir=serialization_dir, local_rank=process_rank,
File "/.virtualenvs/bort_test/lib/python3.6/site-packages/allennlp/common/from_params.py", line 583, in from_params
**extras,
File "/.virtualenvs/bort_test/lib/python3.6/site-packages/allennlp/common/from_params.py", line 612, in from_params
kwargs = create_kwargs(constructor_to_inspect, cls, params, **extras)
File "/.virtualenvs/bort_test/lib/python3.6/site-packages/allennlp/common/from_params.py", line 182, in create_kwargs
cls.__name__, param_name, annotation, param.default, params, **extras
File "/.virtualenvs/bort_test/lib/python3.6/site-packages/allennlp/common/from_params.py", line 283, in pop_and_construct_arg
popped_params = params.pop(name, default) if default != _NO_DEFAULT else params.pop(name)
File "/.virtualenvs/bort_test/lib/python3.6/site-packages/allennlp/common/params.py", line 247, in pop
raise ConfigurationError(msg)
allennlp.common.checks.ConfigurationError: key "data_loader" is required
Traceback (most recent call last):
File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/administrator/dont-stop-pretraining/scripts/train.py", line 145, in <module>
main()
File "/home/administrator/dont-stop-pretraining/scripts/train.py", line 141, in main
subprocess.run(" ".join(allennlp_command), shell=True, check=True)
File "/usr/lib/python3.6/subprocess.py", line 438, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command 'allennlp train --include-package dont_stop_pretraining training_config/classifier.jsonnet -s model_logs/citation_intent_base' returned non-zero exit status 1.
The command I ran is:
python -m scripts.train \
--config training_config/classifier.jsonnet \
--serialization_dir model_logs/citation_intent_base \
--hyperparameters ROBERTA_CLASSIFIER_SMALL \
--dataset citation_intent \
--model roberta-base \
--device 0 \
--perf +f1 \
--evaluate_on_test
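The final error, key "data_loader" is required, is raised by AllenNLP's config parsing. AllenNLP 1.0 replaced the iterator section of a training config with data_loader, so this message typically appears when a config written for a pre-1.0 release is run under allennlp 1.0 or later. The sketch below shows one way to tell which style a config uses once it has been evaluated to a plain dict. The function name config_style and the two example fragments are made up for illustration and are not the repository's classifier.jsonnet.

```python
def config_style(config):
    """Classify a training config (already evaluated to a dict) by its batching key."""
    has_iterator = "iterator" in config
    has_data_loader = "data_loader" in config
    if has_data_loader and not has_iterator:
        return "data_loader (allennlp >= 1.0 style)"
    if has_iterator and not has_data_loader:
        return "iterator (pre-1.0 style)"
    if has_iterator and has_data_loader:
        return "both keys present"
    return "neither key present"


# Illustrative, made-up config fragments; not the repository's classifier.jsonnet.
old_style = {"iterator": {"type": "basic", "batch_size": 16}}
new_style = {"data_loader": {"batch_size": 16, "shuffle": True}}

print(config_style(old_style))
print(config_style(new_style))
```

If the config turns out to be iterator-style, the likely options are to install the AllenNLP version the branch was written for, or to switch to a branch whose config targets the installed version.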
Hi,
I am trying to train the biomed_roberta_base model on the chemprot dataset using the provided scripts.train Python command and encounter the issues below.
The dataset and models above were downloaded as described in the README on the master branch. Also, since the specified environment wasn't working for me, I am using the conda environment below.
Please let me know how to solve the above issue. It seems like the tokenizer asks for a vocab file but I am not sure how to provide one.
Hi, very impressive work. A quick question: how can I use the pretrained TAPT model (e.g., allenai/dsp_roberta_base_tapt_chemprot_4169) with the corresponding pretrained classification head? So far I can only get a sentence embedding via 'all_hidden_states = model(input_ids)'.
Hi,
When I was running the vampire training, I ran into issues caused by inconsistent versions of allennlp and vampire.
If allennlp==0.9.0, it says "no import module and submodule"
If allennlp==1.0.0, it says "from_files need 2 arguments but 4 are given"
As the authors mentioned in README.md,
pytorch-transformers 1.20 is not compatible with the specified branch in environment.yml.
I tried:
However, I couldn't get this working.
Has anyone been able to run the basic model (RoBERTa) recently?
Hi,
When following the instructions in DATA_SELECTION.md, upon running the command "python -m scripts.train --config training_config/vampire.jsonnet --serialization-dir model_logs/vampire-world --environment VAMPIRE --device 0 -o", I get the following error:
ImportError: cannot import name 'SpacyTokenizer' from 'allennlp.data.tokenizers' (/path/to/python3.7/site-packages/allennlp/data/tokenizers/__init__.py)
If I upgrade to allennlp==1.0, it states:
Something went wrong during jsonnet_evaluate_file, please report this: [json.exception.parse_error.101] parse error at line 1, column 1: syntax error while parsing value - invalid literal; last read: 'Z'
I notice scripts/train.py works on the vampire master branch, so this seems to be an issue related to the vampire branch the DSP DATA_SELECTION.md requires.
I've tried staying on master branch, copying 'run_vampire.py' from dont-stop-pretraining/scripts/tapt_selection and then running "parallel --ungroup python -m scripts.run_vampire ${VAMPIRE_DIR}/model_logs/vampire-world/model.tar.gz {1} --batch 64 --include-package vampire --predictor vampire --output-file ${ROOT_DIR}/task_emb/{1/.} --silent ::: ${ROOT_DIR}/task_shards/*". This gives:
ImportError: cannot import name 'import_module_and_submodules' from 'allennlp.common.util' (/home/mitarb/vdberg/anaconda3/envs/domains/lib/python3.7/site-packages/allennlp/common/util.py)
Could you perhaps help out with this? Thank you!
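Both ImportErrors above say that a name expected by the code is absent from the installed allennlp, which points to a version mismatch rather than a broken install. A quick way to test a candidate environment before launching a long run is to probe for the exact names from the error messages. This is a minimal sketch using only the standard library; can_import is a hypothetical helper, and the two entries in CHECKS are simply the names quoted in the errors above.

```python
import importlib


def can_import(module_name, attribute):
    """Return True if `module_name` imports and exposes `attribute`."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # Covers ModuleNotFoundError too, e.g. when allennlp is not installed.
        return False
    return hasattr(module, attribute)


# Names taken from the two ImportError messages in this thread.
CHECKS = [
    ("allennlp.data.tokenizers", "SpacyTokenizer"),
    ("allennlp.common.util", "import_module_and_submodules"),
]

for module_name, attribute in CHECKS:
    print(f"{module_name}.{attribute}: {can_import(module_name, attribute)}")
```

An environment in which both lines print True has the names the vampire training and run_vampire steps ask for. That alone does not guarantee the rest of the pipeline is compatible.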
Dear authors,
I was wondering how to get the pretraining corpus?
Is it possible to open-source them?
Thanks!