
Comments (4)

yumeng5 commented on July 4, 2024

Hi @aahad5555 ,

Could you specify which code version you were working with (fairseq or huggingface) and provide the full error log? We have not encountered similar errors before, so it was probably due to a problem with your package versions.

Thanks,
Yu

from coco-lm.

aahad5555 commented on July 4, 2024

Thanks for replying @yumeng5

I am trying to reproduce the results with the fairseq code, and here is the error:

Process ForkPoolWorker-380:
Traceback (most recent call last):
  File "/usr/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/usr/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.7/multiprocessing/pool.py", line 105, in worker
    initializer(*initargs)
  File "/content/COCO-LM/fairseq/preprocess/glue/multiprocessing_sp_encoder.py", line 42, in initializer
    bpe = SentencepieceBPE(self.args)
  File "/usr/local/lib/python3.7/dist-packages/fairseq/data/encoders/sentencepiece_bpe.py", line 25, in __init__
    self.sp.Load(sentencepiece_model)
  File "/usr/local/lib/python3.7/dist-packages/sentencepiece/__init__.py", line 367, in Load
    return self.LoadFromFile(model_file)
  File "/usr/local/lib/python3.7/dist-packages/sentencepiece/__init__.py", line 171, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
RuntimeError: Internal: src/sentencepiece_processor.cc(890) [model_proto->ParseFromArray(serialized.data(), serialized.size())]

and when running the script from the Huggingface version, the following error occurs:

ModuleNotFoundError: No module named 'cocolm'

even though I fulfilled every requirement. I might be missing something; please point out my mistake.

Thanks.
Jamal

yumeng5 commented on July 4, 2024

Hi @aahad5555 ,

I've run the preprocessing code (the fairseq version) on my machine and it worked well for me. The command I used and the terminal outputs are shown below (~/glue_data/ is the GLUE raw data directory and ~/test_combo_dict/ is the directory containing the two files downloaded from here). I would suggest double-checking your sentencepiece package version (I was using 0.1.95) and your downloaded dictionary files.

~/COCO-LM/fairseq/preprocess/glue$ bash process.sh ~/glue_data/ CoLA ~/test_combo_dict/ glue_processed
Preprocessing CoLA
Raw data as downloaded from glue website: /home/yumeng5/glue_data//CoLA
BPE encoding train/input0
BPE encoding dev/input0
BPE encoding test/input0
2022-03-24 11:45:05 | INFO | fairseq_cli.preprocess | Namespace(no_progress_bar=False, log_interval=100, log_format=None, tensorboard_logdir=None, wandb_project=None, azureml_logging=False, seed=1, cpu=False, tpu=False, bf16=False, memory_efficient_bf16=False, fp16=False, memory_efficient_fp16=False, fp16_no_flatten_grads=False, fp16_init_scale=128, fp16_scale_window=None, fp16_scale_tolerance=0.0, min_loss_scale=0.0001, threshold_loss_scale=None, user_dir=None, empty_cache_freq=0, all_gather_list_size=16384, model_parallel_size=1, quantization_config_path=None, profile=False, reset_logging=False, suppress_crashes=False, use_plasma_view=False, plasma_path='/tmp/plasma', criterion='cross_entropy', simul_type=None, tokenizer=None, bpe=None, optimizer=None, lr_scheduler='fixed', scoring='bleu', task='translation', source_lang=None, target_lang=None, trainpref='/home/yumeng5/glue_data//CoLA/processed/train.input0', validpref='/home/yumeng5/glue_data//CoLA/processed/dev.input0', testpref='/home/yumeng5/glue_data//CoLA/processed/test.input0', align_suffix=None, destdir='glue_processed/CoLA-bin/input0', thresholdtgt=0, thresholdsrc=0, tgtdict=None, srcdict='/home/yumeng5/test_combo_dict//dict.txt', nwordstgt=-1, nwordssrc=-1, alignfile=None, dataset_impl='mmap', joined_dictionary=False, only_source=True, padding_factor=8, workers=8)
2022-03-24 11:45:05 | INFO | fairseq_cli.preprocess | [None] Dictionary: 64000 types
2022-03-24 11:45:05 | INFO | fairseq_cli.preprocess | [None] /home/yumeng5/glue_data//CoLA/processed/train.input0: 8551 sents, 87105 tokens, 0.0% replaced by <unk>
2022-03-24 11:45:05 | INFO | fairseq_cli.preprocess | [None] Dictionary: 64000 types
2022-03-24 11:45:05 | INFO | fairseq_cli.preprocess | [None] /home/yumeng5/glue_data//CoLA/processed/dev.input0: 1043 sents, 11052 tokens, 0.0% replaced by <unk>
2022-03-24 11:45:05 | INFO | fairseq_cli.preprocess | [None] Dictionary: 64000 types
2022-03-24 11:45:05 | INFO | fairseq_cli.preprocess | [None] /home/yumeng5/glue_data//CoLA/processed/test.input0: 1063 sents, 11119 tokens, 0.0% replaced by <unk>
2022-03-24 11:45:05 | INFO | fairseq_cli.preprocess | Wrote preprocessed data to glue_processed/CoLA-bin/input0
2022-03-24 11:45:07 | INFO | fairseq_cli.preprocess | Namespace(no_progress_bar=False, log_interval=100, log_format=None, tensorboard_logdir=None, wandb_project=None, azureml_logging=False, seed=1, cpu=False, tpu=False, bf16=False, memory_efficient_bf16=False, fp16=False, memory_efficient_fp16=False, fp16_no_flatten_grads=False, fp16_init_scale=128, fp16_scale_window=None, fp16_scale_tolerance=0.0, min_loss_scale=0.0001, threshold_loss_scale=None, user_dir=None, empty_cache_freq=0, all_gather_list_size=16384, model_parallel_size=1, quantization_config_path=None, profile=False, reset_logging=False, suppress_crashes=False, use_plasma_view=False, plasma_path='/tmp/plasma', criterion='cross_entropy', simul_type=None, tokenizer=None, bpe=None, optimizer=None, lr_scheduler='fixed', scoring='bleu', task='translation', source_lang=None, target_lang=None, trainpref='/home/yumeng5/glue_data//CoLA/processed/train.label', validpref='/home/yumeng5/glue_data//CoLA/processed/dev.label', testpref=None, align_suffix=None, destdir='glue_processed/CoLA-bin/label', thresholdtgt=0, thresholdsrc=0, tgtdict=None, srcdict=None, nwordstgt=-1, nwordssrc=-1, alignfile=None, dataset_impl='mmap', joined_dictionary=False, only_source=True, padding_factor=8, workers=8)
2022-03-24 11:45:07 | INFO | fairseq_cli.preprocess | [None] Dictionary: 8 types
2022-03-24 11:45:07 | INFO | fairseq_cli.preprocess | [None] /home/yumeng5/glue_data//CoLA/processed/train.label: 8551 sents, 17102 tokens, 0.0% replaced by <unk>
2022-03-24 11:45:07 | INFO | fairseq_cli.preprocess | [None] Dictionary: 8 types
2022-03-24 11:45:07 | INFO | fairseq_cli.preprocess | [None] /home/yumeng5/glue_data//CoLA/processed/dev.label: 1043 sents, 2086 tokens, 0.0% replaced by <unk>
2022-03-24 11:45:07 | INFO | fairseq_cli.preprocess | Wrote preprocessed data to glue_processed/CoLA-bin/label

Regarding the ModuleNotFoundError: No module named 'cocolm' from the Huggingface version: the module cocolm simply refers to the cocolm subdirectory, so if you run the code from within the huggingface directory, Python will detect the module automatically and the error should go away.
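To illustrate the mechanism in general terms (a generic sketch with a throwaway package, not COCO-LM-specific code): Python imports any directory containing an __init__.py as a package once its parent directory is on sys.path, and running a script from a directory puts that directory on sys.path automatically — which is why import cocolm only works from inside the huggingface directory:

```python
import importlib
import os
import sys
import tempfile

# Build a throwaway package named 'cocolm' in a temp directory.
parent = tempfile.mkdtemp()
pkg_dir = os.path.join(parent, "cocolm")
os.makedirs(pkg_dir)
with open(os.path.join(pkg_dir, "__init__.py"), "w") as f:
    f.write("NAME = 'cocolm'\n")

# Putting the parent directory on sys.path has the same effect as
# running a script from that directory.
sys.path.insert(0, parent)
cocolm = importlib.import_module("cocolm")
print(cocolm.NAME)
```

The same reasoning applies to any "No module named X" error where X is a source subdirectory rather than a pip-installed package: check the working directory (or PYTHONPATH) rather than the installed requirements.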

Best,
Yu

aahad5555 commented on July 4, 2024

Hi @yumeng5 ,

Thank you so much for your help. I was making a few mistakes that prevented me from reproducing the results. I have figured them out now, and the code runs just fine.

Thanks,
Jamal
