Comments (4)
Hi @aahad5555 ,
Could you specify which code version you were working with (fairseq or huggingface) and provide the full error log? We have not encountered similar errors before, so it was probably due to a problem with your package versions.
Thanks,
Yu
from coco-lm.
Thanks for replying @yumeng5
I am trying to reproduce results on the fairseq code and here is the error:
RuntimeError: Internal: src/sentencepiece_processor.cc(890) [model_proto->ParseFromArray(serialized.data(), serialized.size())]
Process ForkPoolWorker-380:
Traceback (most recent call last):
  File "/usr/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/usr/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.7/multiprocessing/pool.py", line 105, in worker
    initializer(*initargs)
  File "/content/COCO-LM/fairseq/preprocess/glue/multiprocessing_sp_encoder.py", line 42, in initializer
    bpe = SentencepieceBPE(self.args)
  File "/usr/local/lib/python3.7/dist-packages/fairseq/data/encoders/sentencepiece_bpe.py", line 25, in __init__
    self.sp.Load(sentencepiece_model)
  File "/usr/local/lib/python3.7/dist-packages/sentencepiece/__init__.py", line 367, in Load
    return self.LoadFromFile(model_file)
  File "/usr/local/lib/python3.7/dist-packages/sentencepiece/__init__.py", line 171, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
and upon running the script from Huggingface, the following error occurs:
ModuleNotFoundError: No module named 'cocolm'
even though I fulfilled every requirement. I might be missing something; please point out my mistake.
Thanks.
Jamal
Hi @aahad5555 ,
I've run the preprocessing code (the fairseq version) on my machine and it worked well for me. The command I used and the terminal outputs are shown below (~/glue_data/ is the GLUE raw data directory and ~/test_combo_dict/ is the directory containing the two files downloaded from here). I would suggest double-checking your sentencepiece package version (I was using 0.1.95) and your downloaded dictionary files.
~/COCO-LM/fairseq/preprocess/glue$ bash process.sh ~/glue_data/ CoLA ~/test_combo_dict/ glue_processed
Preprocessing CoLA
Raw data as downloaded from glue website: /home/yumeng5/glue_data//CoLA
BPE encoding train/input0
BPE encoding dev/input0
BPE encoding test/input0
2022-03-24 11:45:05 | INFO | fairseq_cli.preprocess | Namespace(no_progress_bar=False, log_interval=100, log_format=None, tensorboard_logdir=None, wandb_project=None, azureml_logging=False, seed=1, cpu=False, tpu=False, bf16=False, memory_efficient_bf16=False, fp16=False, memory_efficient_fp16=False, fp16_no_flatten_grads=False, fp16_init_scale=128, fp16_scale_window=None, fp16_scale_tolerance=0.0, min_loss_scale=0.0001, threshold_loss_scale=None, user_dir=None, empty_cache_freq=0, all_gather_list_size=16384, model_parallel_size=1, quantization_config_path=None, profile=False, reset_logging=False, suppress_crashes=False, use_plasma_view=False, plasma_path='/tmp/plasma', criterion='cross_entropy', simul_type=None, tokenizer=None, bpe=None, optimizer=None, lr_scheduler='fixed', scoring='bleu', task='translation', source_lang=None, target_lang=None, trainpref='/home/yumeng5/glue_data//CoLA/processed/train.input0', validpref='/home/yumeng5/glue_data//CoLA/processed/dev.input0', testpref='/home/yumeng5/glue_data//CoLA/processed/test.input0', align_suffix=None, destdir='glue_processed/CoLA-bin/input0', thresholdtgt=0, thresholdsrc=0, tgtdict=None, srcdict='/home/yumeng5/test_combo_dict//dict.txt', nwordstgt=-1, nwordssrc=-1, alignfile=None, dataset_impl='mmap', joined_dictionary=False, only_source=True, padding_factor=8, workers=8)
2022-03-24 11:45:05 | INFO | fairseq_cli.preprocess | [None] Dictionary: 64000 types
2022-03-24 11:45:05 | INFO | fairseq_cli.preprocess | [None] /home/yumeng5/glue_data//CoLA/processed/train.input0: 8551 sents, 87105 tokens, 0.0% replaced by <unk>
2022-03-24 11:45:05 | INFO | fairseq_cli.preprocess | [None] Dictionary: 64000 types
2022-03-24 11:45:05 | INFO | fairseq_cli.preprocess | [None] /home/yumeng5/glue_data//CoLA/processed/dev.input0: 1043 sents, 11052 tokens, 0.0% replaced by <unk>
2022-03-24 11:45:05 | INFO | fairseq_cli.preprocess | [None] Dictionary: 64000 types
2022-03-24 11:45:05 | INFO | fairseq_cli.preprocess | [None] /home/yumeng5/glue_data//CoLA/processed/test.input0: 1063 sents, 11119 tokens, 0.0% replaced by <unk>
2022-03-24 11:45:05 | INFO | fairseq_cli.preprocess | Wrote preprocessed data to glue_processed/CoLA-bin/input0
2022-03-24 11:45:07 | INFO | fairseq_cli.preprocess | Namespace(no_progress_bar=False, log_interval=100, log_format=None, tensorboard_logdir=None, wandb_project=None, azureml_logging=False, seed=1, cpu=False, tpu=False, bf16=False, memory_efficient_bf16=False, fp16=False, memory_efficient_fp16=False, fp16_no_flatten_grads=False, fp16_init_scale=128, fp16_scale_window=None, fp16_scale_tolerance=0.0, min_loss_scale=0.0001, threshold_loss_scale=None, user_dir=None, empty_cache_freq=0, all_gather_list_size=16384, model_parallel_size=1, quantization_config_path=None, profile=False, reset_logging=False, suppress_crashes=False, use_plasma_view=False, plasma_path='/tmp/plasma', criterion='cross_entropy', simul_type=None, tokenizer=None, bpe=None, optimizer=None, lr_scheduler='fixed', scoring='bleu', task='translation', source_lang=None, target_lang=None, trainpref='/home/yumeng5/glue_data//CoLA/processed/train.label', validpref='/home/yumeng5/glue_data//CoLA/processed/dev.label', testpref=None, align_suffix=None, destdir='glue_processed/CoLA-bin/label', thresholdtgt=0, thresholdsrc=0, tgtdict=None, srcdict=None, nwordstgt=-1, nwordssrc=-1, alignfile=None, dataset_impl='mmap', joined_dictionary=False, only_source=True, padding_factor=8, workers=8)
2022-03-24 11:45:07 | INFO | fairseq_cli.preprocess | [None] Dictionary: 8 types
2022-03-24 11:45:07 | INFO | fairseq_cli.preprocess | [None] /home/yumeng5/glue_data//CoLA/processed/train.label: 8551 sents, 17102 tokens, 0.0% replaced by <unk>
2022-03-24 11:45:07 | INFO | fairseq_cli.preprocess | [None] Dictionary: 8 types
2022-03-24 11:45:07 | INFO | fairseq_cli.preprocess | [None] /home/yumeng5/glue_data//CoLA/processed/dev.label: 1043 sents, 2086 tokens, 0.0% replaced by <unk>
2022-03-24 11:45:07 | INFO | fairseq_cli.preprocess | Wrote preprocessed data to glue_processed/CoLA-bin/label
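Before rerunning the script, it may help to rule out the most common causes of that ParseFromArray failure (a missing or truncated model file, or a missing package). Below is a minimal, hypothetical sanity check; the sp.model filename and the ~/test_combo_dict/ location are assumptions, so adjust them to wherever you placed the downloaded files:

```python
import importlib.util
import os

def diagnose_sp_model(path):
    """Return a short diagnosis for a sentencepiece model file."""
    # Check the package is importable at all before touching the file.
    if importlib.util.find_spec("sentencepiece") is None:
        return "sentencepiece is not installed"
    if not os.path.isfile(path):
        return "model file missing: " + path
    if os.path.getsize(path) == 0:
        return "model file is empty (truncated download?)"
    return "looks ok; if Load() still fails, re-download the model file"

# Assumed location/filename of the downloaded dictionary files -- adjust as needed.
print(diagnose_sp_model(os.path.expanduser("~/test_combo_dict/sp.model")))
```

If this reports a missing or empty file, re-downloading the two dictionary files is the likely fix; otherwise, pinning sentencepiece to 0.1.95 (the version that worked above) is worth a try.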
Regarding the error ModuleNotFoundError: No module named 'cocolm' from the Huggingface version: the module cocolm simply refers to the cocolm subdirectory, so if you are running the code under the huggingface directory, Python should automatically detect the module and you should be able to get rid of the error.
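The sys.path mechanism behind this can be sketched with a throwaway package (cocolm_demo below is a stand-in created on the fly, not the real cocolm module): Python resolves a bare import by searching the directories on sys.path, which is why running from inside the huggingface directory makes the cocolm subdirectory importable.

```python
import os
import sys
import tempfile

# Build a throwaway package standing in for the cocolm/ subdirectory.
repo_root = tempfile.mkdtemp()
pkg_dir = os.path.join(repo_root, "cocolm_demo")
os.makedirs(pkg_dir)
with open(os.path.join(pkg_dir, "__init__.py"), "w") as f:
    f.write("NAME = 'cocolm_demo'\n")

# Without repo_root on sys.path the import fails...
try:
    import cocolm_demo  # noqa: F401
    found_before = True
except ModuleNotFoundError:
    found_before = False

# ...and succeeds once the repo root is on sys.path, which is what running
# the scripts from inside the huggingface/ directory gives you implicitly.
sys.path.insert(0, repo_root)
import cocolm_demo

print(found_before, cocolm_demo.NAME)  # prints: False cocolm_demo
```

The same effect can be had without changing directories by exporting PYTHONPATH to point at your checkout's huggingface directory.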
Best,
Yu
Hi @yumeng5 ,
Thanks a lot for your help. I was making a few mistakes that kept me from reproducing the results; I've figured them out now and the code runs just fine.
Thanks,
Jamal