Comments (4)
Hi @aahad5555 ,
Could you specify which code version you were working with (fairseq or huggingface) and provide the full error log? We have not encountered similar errors before, so it was probably due to a problem with your package versions.
Thanks,
Yu
from coco-lm.
Thanks for replying @yumeng5
I am trying to reproduce results on the fairseq code and here is the error:
RuntimeError: Internal: src/sentencepiece_processor.cc(890) [model_proto->ParseFromArray(serialized.data(), serialized.size())]
Process ForkPoolWorker-380:
Traceback (most recent call last):
  File "/usr/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/usr/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.7/multiprocessing/pool.py", line 105, in worker
    initializer(*initargs)
  File "/content/COCO-LM/fairseq/preprocess/glue/multiprocessing_sp_encoder.py", line 42, in initializer
    bpe = SentencepieceBPE(self.args)
  File "/usr/local/lib/python3.7/dist-packages/fairseq/data/encoders/sentencepiece_bpe.py", line 25, in __init__
    self.sp.Load(sentencepiece_model)
  File "/usr/local/lib/python3.7/dist-packages/sentencepiece/__init__.py", line 367, in Load
    return self.LoadFromFile(model_file)
  File "/usr/local/lib/python3.7/dist-packages/sentencepiece/__init__.py", line 171, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
and upon running the script from Huggingface, the following error occurs:
ModuleNotFoundError: No module named 'cocolm'
even though I fulfilled every requirement. I might be missing something; please point out my mistake.
Thanks.
Jamal
Hi @aahad5555 ,
I've run the preprocessing code (the fairseq version) on my machine and it worked well for me. The command I used and the terminal outputs are shown below (~/glue_data/ is the GLUE raw data directory and ~/test_combo_dict/ is the directory containing the two files downloaded from here). I would suggest double-checking your sentencepiece package version (I was using 0.1.95) and your downloaded dictionary files.
~/COCO-LM/fairseq/preprocess/glue$ bash process.sh ~/glue_data/ CoLA ~/test_combo_dict/ glue_processed
Preprocessing CoLA
Raw data as downloaded from glue website: /home/yumeng5/glue_data//CoLA
BPE encoding train/input0
BPE encoding dev/input0
BPE encoding test/input0
2022-03-24 11:45:05 | INFO | fairseq_cli.preprocess | Namespace(no_progress_bar=False, log_interval=100, log_format=None, tensorboard_logdir=None, wandb_project=None, azureml_logging=False, seed=1, cpu=False, tpu=False, bf16=False, memory_efficient_bf16=False, fp16=False, memory_efficient_fp16=False, fp16_no_flatten_grads=False, fp16_init_scale=128, fp16_scale_window=None, fp16_scale_tolerance=0.0, min_loss_scale=0.0001, threshold_loss_scale=None, user_dir=None, empty_cache_freq=0, all_gather_list_size=16384, model_parallel_size=1, quantization_config_path=None, profile=False, reset_logging=False, suppress_crashes=False, use_plasma_view=False, plasma_path='/tmp/plasma', criterion='cross_entropy', simul_type=None, tokenizer=None, bpe=None, optimizer=None, lr_scheduler='fixed', scoring='bleu', task='translation', source_lang=None, target_lang=None, trainpref='/home/yumeng5/glue_data//CoLA/processed/train.input0', validpref='/home/yumeng5/glue_data//CoLA/processed/dev.input0', testpref='/home/yumeng5/glue_data//CoLA/processed/test.input0', align_suffix=None, destdir='glue_processed/CoLA-bin/input0', thresholdtgt=0, thresholdsrc=0, tgtdict=None, srcdict='/home/yumeng5/test_combo_dict//dict.txt', nwordstgt=-1, nwordssrc=-1, alignfile=None, dataset_impl='mmap', joined_dictionary=False, only_source=True, padding_factor=8, workers=8)
2022-03-24 11:45:05 | INFO | fairseq_cli.preprocess | [None] Dictionary: 64000 types
2022-03-24 11:45:05 | INFO | fairseq_cli.preprocess | [None] /home/yumeng5/glue_data//CoLA/processed/train.input0: 8551 sents, 87105 tokens, 0.0% replaced by <unk>
2022-03-24 11:45:05 | INFO | fairseq_cli.preprocess | [None] Dictionary: 64000 types
2022-03-24 11:45:05 | INFO | fairseq_cli.preprocess | [None] /home/yumeng5/glue_data//CoLA/processed/dev.input0: 1043 sents, 11052 tokens, 0.0% replaced by <unk>
2022-03-24 11:45:05 | INFO | fairseq_cli.preprocess | [None] Dictionary: 64000 types
2022-03-24 11:45:05 | INFO | fairseq_cli.preprocess | [None] /home/yumeng5/glue_data//CoLA/processed/test.input0: 1063 sents, 11119 tokens, 0.0% replaced by <unk>
2022-03-24 11:45:05 | INFO | fairseq_cli.preprocess | Wrote preprocessed data to glue_processed/CoLA-bin/input0
2022-03-24 11:45:07 | INFO | fairseq_cli.preprocess | Namespace(no_progress_bar=False, log_interval=100, log_format=None, tensorboard_logdir=None, wandb_project=None, azureml_logging=False, seed=1, cpu=False, tpu=False, bf16=False, memory_efficient_bf16=False, fp16=False, memory_efficient_fp16=False, fp16_no_flatten_grads=False, fp16_init_scale=128, fp16_scale_window=None, fp16_scale_tolerance=0.0, min_loss_scale=0.0001, threshold_loss_scale=None, user_dir=None, empty_cache_freq=0, all_gather_list_size=16384, model_parallel_size=1, quantization_config_path=None, profile=False, reset_logging=False, suppress_crashes=False, use_plasma_view=False, plasma_path='/tmp/plasma', criterion='cross_entropy', simul_type=None, tokenizer=None, bpe=None, optimizer=None, lr_scheduler='fixed', scoring='bleu', task='translation', source_lang=None, target_lang=None, trainpref='/home/yumeng5/glue_data//CoLA/processed/train.label', validpref='/home/yumeng5/glue_data//CoLA/processed/dev.label', testpref=None, align_suffix=None, destdir='glue_processed/CoLA-bin/label', thresholdtgt=0, thresholdsrc=0, tgtdict=None, srcdict=None, nwordstgt=-1, nwordssrc=-1, alignfile=None, dataset_impl='mmap', joined_dictionary=False, only_source=True, padding_factor=8, workers=8)
2022-03-24 11:45:07 | INFO | fairseq_cli.preprocess | [None] Dictionary: 8 types
2022-03-24 11:45:07 | INFO | fairseq_cli.preprocess | [None] /home/yumeng5/glue_data//CoLA/processed/train.label: 8551 sents, 17102 tokens, 0.0% replaced by <unk>
2022-03-24 11:45:07 | INFO | fairseq_cli.preprocess | [None] Dictionary: 8 types
2022-03-24 11:45:07 | INFO | fairseq_cli.preprocess | [None] /home/yumeng5/glue_data//CoLA/processed/dev.label: 1043 sents, 2086 tokens, 0.0% replaced by <unk>
2022-03-24 11:45:07 | INFO | fairseq_cli.preprocess | Wrote preprocessed data to glue_processed/CoLA-bin/label
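Before rerunning the script, it may help to rule out the most common causes of that ParseFromArray failure (a missing or truncated model file, or a missing package). Below is a minimal, hypothetical sanity check; the sp.model filename and the ~/test_combo_dict/ location are assumptions, so adjust them to wherever you placed the downloaded files:

```python
import importlib.util
import os

def diagnose_sp_model(path):
    """Return a short diagnosis for a sentencepiece model file."""
    # Check the package is importable at all before touching the file.
    if importlib.util.find_spec("sentencepiece") is None:
        return "sentencepiece is not installed"
    if not os.path.isfile(path):
        return "model file missing: " + path
    if os.path.getsize(path) == 0:
        return "model file is empty (truncated download?)"
    return "looks ok; if Load() still fails, re-download the model file"

# Assumed location/filename of the downloaded dictionary files -- adjust as needed.
print(diagnose_sp_model(os.path.expanduser("~/test_combo_dict/sp.model")))
```

If this reports a missing or empty file, re-downloading the two dictionary files is the likely fix; otherwise, pinning sentencepiece to 0.1.95 (the version that worked above) is worth a try.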
Regarding the error ModuleNotFoundError: No module named 'cocolm' from the Huggingface version: the module cocolm simply refers to the cocolm subdirectory, so if you are running the code under the huggingface directory, Python should automatically detect the module and you should be able to get rid of the error.
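The sys.path mechanism behind this can be sketched with a throwaway package (cocolm_demo below is a stand-in created on the fly, not the real cocolm module): Python resolves a bare import by searching the directories on sys.path, which is why running from inside the huggingface directory makes the cocolm subdirectory importable.

```python
import os
import sys
import tempfile

# Build a throwaway package standing in for the cocolm/ subdirectory.
repo_root = tempfile.mkdtemp()
pkg_dir = os.path.join(repo_root, "cocolm_demo")
os.makedirs(pkg_dir)
with open(os.path.join(pkg_dir, "__init__.py"), "w") as f:
    f.write("NAME = 'cocolm_demo'\n")

# Without repo_root on sys.path the import fails...
try:
    import cocolm_demo  # noqa: F401
    found_before = True
except ModuleNotFoundError:
    found_before = False

# ...and succeeds once the repo root is on sys.path, which is what running
# the scripts from inside the huggingface/ directory gives you implicitly.
sys.path.insert(0, repo_root)
import cocolm_demo

print(found_before, cocolm_demo.NAME)  # prints: False cocolm_demo
```

The same effect can be had without changing directories by exporting PYTHONPATH to point at your checkout's huggingface directory.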
Best,
Yu
Hi @yumeng5 ,
Thanks a lot for your help. I was making a few mistakes that kept me from reproducing the results; I've figured them out now and the code runs just fine.
Thanks,
Jamal