
hmnet's Issues

How to solve CUDA out of memory error?

I have encountered errors like this
" RuntimeError: CUDA out of memory. Tried to allocate 2.40 GiB (GPU 3; 15.78 GiB total capacity; 12.06 GiB already allocated; 2.39 GiB free; 212.27 MiB cached) "
when trying to fine-tune the model on both datasets. The same error occurs if I try to evaluate the model with the fine-tuned weights downloaded from the link given in the repo. Can you specify the hardware required to reproduce this project?

tokenizer.convert_ids_to_tokens not generating special tokens with predefined position offset

self.tokenizer = self.tokenizer_class.from_pretrained(self.pretrained_tokenizer_path)
special_tokens_tuple_list = [("eos_token", 128), ("unk_token", 129), ("pad_token", 130), ("bos_token", 131)]
for special_token_name, special_token_id_offset in special_tokens_tuple_list:
    if getattr(self.tokenizer, special_token_name) == None:
        setattr(self.tokenizer, special_token_name, self.tokenizer.convert_ids_to_tokens(len(self.tokenizer)-special_token_id_offset))
        self.config[special_token_name] = self.tokenizer.convert_ids_to_tokens(len(self.tokenizer)-special_token_id_offset)
        self.config[special_token_name+'_id'] = len(self.tokenizer)-special_token_id_offset

In this snippet of code, a default offset is set for each special token name; the special tokens that do not exist in the pretrained tokenizer (pad_token, bos_token) then need to be added to it. I loaded the pretrained tokenizer from transfo-xl-wt103 under ExampleInitModel and generated tokens from the ids based on the predefined offsets.

tokenizer.convert_ids_to_tokens(len(self.tokenizer) - special_token_id_offset)

The returned tokens turn out to be specific words, not '<pad>' or '<bos>' tokens.

When the token_name is "pad_token" or "bos_token" with offsets 130 and 131, the returned tokens are: 'Islahul 267605, McShan 267604'.

May I ask how you set up the offset values for these special tokens? Is it expected that 'transfo-xl-wt103' doesn't need a pad_token and bos_token, or should these special tokens actually be set up somewhere else?
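
For context, here is how I would expect the missing special tokens to be registered explicitly through the Hugging Face tokenizer API, rather than mapped onto existing vocabulary ids by an offset. This is only my sketch (the token strings "<pad>" and "<bos>" are my own choice), not the repository's code:

from transformers import TransfoXLTokenizer

# Sketch only: register the missing special tokens explicitly so that
# pad_token / bos_token get dedicated entries instead of reusing existing words.
tokenizer = TransfoXLTokenizer.from_pretrained("transfo-xl-wt103")
num_added = tokenizer.add_special_tokens({"pad_token": "<pad>", "bos_token": "<bos>"})
print(num_added)                                    # how many tokens were actually added
print(tokenizer.pad_token, tokenizer.pad_token_id)  # the new <pad> token and its id
print(tokenizer.bos_token, tokenizer.bos_token_id)  # the new <bos> token and its id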

The order of token_attn and sent_attn in decoder is different between the code and the paper, in MeetingNet_Transformer.py

@xrc10
In the paper, src-tgt attention on sentences is after the src-tgt attention on tokens. However, in the code, the order is opposite.
At line 1000 in MeetingNet_Transformer.py,

def forward(self, y, token_enc_key, token_enc_value, sent_enc_key, sent_enc_value):
    query, key, value = self.decoder_splitter(y)
    # batch x len x n_state

    # self-attention
    a = self.attn(query, key, value, None, one_dir_visible=True)
    # batch x len x n_state

    n = self.ln_1(y + a) # residual

    if 'NO_HIERARCHY' in self.opt:
        q = y
        r = n
    else:
        # src-tgt attention on sentences
        q = self.sent_attn(n, sent_enc_key, sent_enc_value, None)
        r = self.ln_3(n + q) # residual
        # batch x len x n_state

    # src-tgt attention on tokens
    o = self.token_attn(r, token_enc_key, token_enc_value, None)
    p = self.ln_2(r + o) # residual
    # batch x len x n_state

    m = self.mlp(p)
    h = self.ln_4(p + m)
    return h

I would like to confirm whether this ordering is intended or not.
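
For reference, here is how I would expect the forward pass to look if it matched the paper's ordering (token-level attention before sentence-level attention). This is only my own sketch reusing the attributes from the code above, not a proposed patch:

def forward(self, y, token_enc_key, token_enc_value, sent_enc_key, sent_enc_value):
    query, key, value = self.decoder_splitter(y)

    # self-attention
    a = self.attn(query, key, value, None, one_dir_visible=True)
    n = self.ln_1(y + a)  # residual

    # src-tgt attention on tokens first, as described in the paper
    o = self.token_attn(n, token_enc_key, token_enc_value, None)
    p = self.ln_2(n + o)  # residual

    if 'NO_HIERARCHY' in self.opt:
        r = p
    else:
        # src-tgt attention on sentences second
        q = self.sent_attn(p, sent_enc_key, sent_enc_value, None)
        r = self.ln_3(p + q)  # residual

    m = self.mlp(r)
    h = self.ln_4(r + m)
    return h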

Docker building, Tensor Size issues, may be related to package versions.

Hi, I've tried to build your Docker container using the provided Dockerfile and it fails at python -m spacy download en: it couldn't link to libcuda.so.1. To fix this, I changed the Dockerfile to link against the stub at compile time:

ENV LC_ALL=C.UTF-8
ENV LANG=C.UTF-8
RUN export LD_LIBRARY_PATH=/usr/local/cuda/lib64/stubs:$LD_LIBRARY_PATH && \
        ln -s /usr/local/cuda/lib64/stubs/libcuda.so /usr/local/cuda/lib64/stubs/libcuda.so.1 && \
        python -m spacy download en

The build then works. The next issue comes in Models/Networks/MeetingNet_Transformer.py, where
spacy.load('en', parser = False) fails because the parser keyword has been removed. I fixed it by changing the call to
nlp = spacy.load('en_core_web_sm', exclude=['parser']), which also fixed the warning that the en shortcut is deprecated.

The last thing I had to change to get things working was that the Language object from spaCy no longer has tagger and entity fields. I had to access the pipeline to retrieve them, as below.

tagger = [x[1] for x in nlp.pipeline if x[0] == 'tagger']
assert len(tagger) == 1
tagger = tagger[0]

entity = [x[1] for x in nlp.pipeline if x[0] == 'ner']
assert len(entity) == 1
entity = entity[0]

POS = {w: i for i, w in enumerate([''] + list(tagger.labels))}
ENT = {w: i for i, w in enumerate([''] + list(entity.move_names))}
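
For what it's worth, the same components can also be fetched through spaCy's pipeline API (assuming spaCy >= 3.0):

# Equivalent lookup via spaCy's pipeline API (assumes spaCy >= 3.0)
tagger = nlp.get_pipe("tagger")
entity = nlp.get_pipe("ner")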

Finally, once the code was able to execute, it ran into a tensor size issue with the linked fine-tuned AMI model, shown below:

Error(s) in loading state_dict for MeetingNet_Transformer:
	size mismatch for encoder.pos_embed.weight: copying a param with shape torch.Size([51, 16]) from checkpoint, the shape in current model is torch.Size([50, 16])

I think this may be due to a spaCy model change, since the checkpoint was presumably created with a different spaCy version than the one I have installed.

Could you provide a requirements.txt with versions or tell me if I'm wrong and the tensor size error is unrelated to the spacy tags?

Thanks!

Problems while building docker

When I ran sudo docker build . -t hmnet, errors occurred in Step 10/35 : RUN apt-get update && apt-get install -y --allow-change-held-packages --no-install-recommends software-properties-common openssh-client openssh-server pdsh curl sudo net-tools vim iputils-ping wget perl libxml-parser-perl libcudnn7=${CUDNN_VERSION} libnccl2=${NCCL_VERSION} libnccl-dev=${NCCL_VERSION} --allow-downgrades:

E: Unable to locate package libcudnn7
E: Version '2.4.7-1+cuda10.0' for 'libnccl2' was not found
E: Version '2.4.7-1+cuda10.0' for 'libnccl-dev' was not found

It seems that apt inside the Docker build can't find these packages. Is this a mistake on my side, or is there a bug in the Dockerfile?

Module versions are not specified

Hello. I am trying to run the experiments. Unfortunately, since the pip modules' versions are not specified in the Dockerfile, I am getting one error after another. Could you please pin the versions in the Dockerfile? For example, your code is not compatible with the latest spaCy version (3.0.50). I guess you used version 2.3.5 in your code.

How to build a new data set with the same format

Hi, I have successfully run your code, and the results are quite good. Would it be convenient for you to open-source the pre-training data? Or can you tell me how to obtain the POS_ID and ENT_ID values? Thanks.
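
In case it helps, here is my guess at how the POS/ENT id maps could be built and applied with spaCy, mirroring the label maps quoted in the Docker issue above. This is a sketch under my own assumptions (model name, sample sentence), not the authors' preprocessing script:

import spacy

# Sketch only: build label-to-id maps like the ones in MeetingNet_Transformer.py
# and tag one transcript sentence. The exact key format for entity ids (e.g. BILUO
# move names) should follow whatever the repo's preprocessing actually uses.
nlp = spacy.load("en_core_web_sm")
tagger = nlp.get_pipe("tagger")
ner = nlp.get_pipe("ner")

POS = {w: i for i, w in enumerate([''] + list(tagger.labels))}
ENT = {w: i for i, w in enumerate([''] + list(ner.move_names))}

doc = nlp("Okay, let's kick off the design meeting.")
pos_ids = [POS.get(tok.tag_, 0) for tok in doc]  # fine-grained POS tag id per token
print(list(zip([tok.text for tok in doc], pos_ids)))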

CUDA out of memory

Hello. I am trying to reproduce the paper results. I am currently running the code on 2 Tesla V100 GPUs, each with 16GB of memory, but I am still getting an out-of-memory error. I also tried to decrease MAX_TRANSCRIPT_WORD to 1000, but it did not help. Could you please let me know what hardware and GPUs it requires to run?

cublas runtime error

I'm following the readme to try to fine-tune HMNet on the AMI dataset. My only modification to the instructions is that I have only 1 visible device (my full command thus becomes CUDA_VISIBLE_DEVICES="0" mpirun -np 1 --allow-run-as-root python PyLearn.py train ExampleConf/conf_hmnet_AMI).

The process exits with an error.

Here's the full output.

{'MODEL': 'MeetingNet_Transformer', 'TASK': 'HMNet', 'CRITERION': 'MLECriterion', 'SEED': 1033, 'RESUME': True, 'MAX_NUM_EPOCHS': 20, 'SAVE_PER_UPDATE_NUM': 400, 'UPDATES_PER_EPOCH': 2000, 'OPTIMIZER': 'RAdam', 'NO_AUTO_LR_SCALING': True, 'START_LEARNING_RATE': 0.001, 'LR_SCHEDULER': 'LnrWrmpInvSqRtDcyScheduler', 'WARMUP_STEPS': 16000, 'WARMUP_INIT_LR': 0.0001, 'WARMUP_END_LR': 0.001, 'GRADIENT_ACCUMULATE_STEP': 20, 'GRAD_CLIPPING': 2, 'USE_REL_DATA_PATH': True, 'TRAIN_FILE': '../ExampleRawData/meeting_summarization/AMI_proprec/train_ami.json', 'DEV_FILE': '../ExampleRawData/meeting_summarization/AMI_proprec/valid_ami.json', 'TEST_FILE': '../ExampleRawData/meeting_summarization/AMI_proprec/test_ami.json', 'ROLE_DICT_FILE': '../ExampleRawData/meeting_summarization/role_dict_ext.json', 'MINI_BATCH': 1, 'MAX_PADDING_RATIO': 1, 'BATCH_READ_AHEAD': 10, 'DOC_SHUFFLE_BUF_SIZE': 10, 'SAMPLE_SHUFFLE_BUFFER_SIZE': 10, 'BATCH_SHUFFLE_BUFFER_SIZE': 10, 'MAX_TRANSCRIPT_WORD': 8300, 'MAX_SENT_LEN': 30, 'MAX_SENT_NUM': 300, 'DROPOUT': 0.1, 'VOCAB_DIM': 512, 'ROLE_SIZE': 32, 'ROLE_DIM': 16, 'POS_DIM': 16, 'ENT_DIM': 16, 'USE_ROLE': True, 'USE_POSENT': True, 'USE_BOS_TOKEN': True, 'USE_EOS_TOKEN': True, 'TRANSFORMER_EMBED_DROPOUT': 0.1, 'TRANSFORMER_RESIDUAL_DROPOUT': 0.1, 'TRANSFORMER_ATTENTION_DROPOUT': 0.1, 'TRANSFORMER_LAYER': 6, 'TRANSFORMER_HEAD': 8, 'TRANSFORMER_POS_DISCOUNT': 80, 'PRE_TOKENIZER': 'TransfoXLTokenizer', 'PRE_TOKENIZER_PATH': '../ExampleInitModel/transfo-xl-wt103', 'PYLEARN_MODEL': '../ExampleInitModel/HMNet-pretrained', 'EXTRA_IDS': 1000, 'BEAM_WIDTH': 6, 'MAX_GEN_LENGTH': 512, 'MIN_GEN_LENGTH': 320, 'EVAL_TOKENIZED': True, 'EVAL_LOWERCASE': True, 'NO_REPEAT_NGRAM_SIZE': 3, 'cuda': True, 'confFile': 'ExampleConf/conf_hmnet_AMI', 'datadir': 'ExampleConf', 'basename': 'conf_hmnet_AMI', 'command': 'train', 'conf_file': 'ExampleConf/conf_hmnet_AMI', 'cluster': 'local', 'dist_init_path': './tmp', 'fp16': False, 'fp16_opt_level': 'O1', 'no_cuda': False}
Using Cuda

Saving logs, model, checkpoint, and evaluation in ExampleConf/conf_hmnet_AMI_conf~/run_2
 1.2.0  is high
Number of GPUs is  1 
Effective batch size is increased from  1  to  1 
Gradient accumulation steps =  20 
Effective batch size =  20 
[9d66c296629d:03515] pml_ucx.c:285  Error: UCP worker does not support MPI_THREAD_MULTIPLE
Select command: train
train on rank 0
-----------------------------------------------
Initializing model...
Loading Tokenizer from ExampleConf/../ExampleInitModel/transfo-xl-wt103...
Using pad_token, but it is not set yet.
Using bos_token, but it is not set yet.
Use POS and ENT
USE_ROLE

Total trainable parameters: 204488240
Loaded data on rank 0.
Using custom optimizer: RAdam
Optimizer parameters: {'lr': 0.001}
Using custom lr scheduler: LnrWrmpInvSqRtDcyScheduler
Lr scheduler parameters: {'warmup_steps': 16000, 'warmup_init_lr': 0.0001, 'warmup_end_lr': 0.001}
Cannot find checkpoint path from conf_hmnet_AMI_resume_checkpoint.json.
Make sure ExampleConf/conf_hmnet_AMI_resume_checkpoint.json exists.
Continue without loading checkpoint
Epoch 0
Traceback (most recent call last):
  File "PyLearn.py", line 71, in <module>
    trainer.train()
  File "/root/HMNet/Models/Trainers/HMNetTrainer.py", line 273, in train
    self.update(batch)
  File "/root/HMNet/Models/Trainers/HMNetTrainer.py", line 358, in update
    loss = self.network(batch)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/HMNet/Models/Trainers/HMNetTrainer.py", line 38, in forward
    output = self.model(batch)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/HMNet/Models/Networks/MeetingNet_Transformer.py", line 100, in forward
    outputs = self._forward(**batch)
  File "/root/HMNet/Models/Networks/MeetingNet_Transformer.py", line 125, in _forward
    token_encoder_outputs, sent_encoder_outputs = self.encoder(encoder_input_ids, encoder_input_roles, encoder_input_pos, encoder_input_ent)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/HMNet/Models/Networks/MeetingNet_Transformer.py", line 1130, in forward
    embedded = self.embedder(vocab_x.view(batch_size, -1))
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/HMNet/Models/Networks/Transformer.py", line 387, in forward
    x_pos = self.pos_emb(torch.arange(x_len).type(torch.cuda.FloatTensor)) # len x n_state
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/HMNet/Models/Networks/Transformer.py", line 86, in forward
    sinusoid_inp = torch.ger(pos_seq, self.inv_freq)
RuntimeError: cublas runtime error : the GPU program failed to execute at /pytorch/aten/src/THC/THCBlas.cu:120
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[42501,1],0]
  Exit code:    1
--------------------------------------------------------------------------
