microsoft / hmnet
Official Implementation of "A Hierarchical Network for Abstractive Meeting Summarization with Cross-Domain Pretraining"
License: Other
I have encountered errors like this
" RuntimeError: CUDA out of memory. Tried to allocate 2.40 GiB (GPU 3; 15.78 GiB total capacity; 12.06 GiB already allocated; 2.39 GiB free; 212.27 MiB cached) "
when trying to fine-tune the model on both datasets. The same error occurs if I try to evaluate the model with the fine-tuned weights downloaded from the link given in the repo. Can you specify the hardware required to reproduce this project?
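For context while the hardware requirements are unspecified, a quick way to check what your own GPUs expose (standard PyTorch calls, nothing from this repo) is:
import torch

# Print the total memory of each visible GPU, to compare against the
# ~16 GiB card that hit the OOM error quoted above.
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(i, props.name, round(props.total_memory / 1024 ** 3, 1), 'GiB')
If the cards are no larger than that, it may also be worth lowering MAX_TRANSCRIPT_WORD in the conf file, as another reporter below tried.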
HMNet/Models/Networks/MeetingNet_Transformer.py
Lines 50 to 58 in 1f5a24d
tokenizer.convert_ids_to_tokens(len(self.tokenizer)-special_token_id_offset))
The returned tokens turn out to be specific words, not '<pad>' or '<bos>' tokens.
When the token_name is "pad_token" or "bos_token" with offset of "130", "131":
The returned tokens are 'Islahul' (id 267605) and 'McShan' (id 267604).
May I ask how you set up the offset values for these special tokens? Is it normal that 'transfo-xl-wt103' doesn't need pad_token and bos_token, or should these special tokens actually be set up somewhere else?
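For what it's worth, a minimal sketch (plain transformers API, not code from this repo) that reproduces the observation: transfo-xl-wt103 ships with no pad_token or bos_token, so ids near the end of its vocabulary map to ordinary words:
from transformers import TransfoXLTokenizer

tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')
print(len(tokenizer))                            # vocabulary size
print(tokenizer.pad_token, tokenizer.bos_token)  # both unset by default
# Ids at len(tokenizer) - 130 and len(tokenizer) - 131 are regular vocabulary entries.
print(tokenizer.convert_ids_to_tokens(len(tokenizer) - 130))
print(tokenizer.convert_ids_to_tokens(len(tokenizer) - 131))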
@xrc10
In the paper, the src-tgt attention on sentences comes after the src-tgt attention on tokens. However, in the code, the order is reversed.
At line 1000 in MeetingNet_Transformer.py,
def forward(self, y, token_enc_key, token_enc_value, sent_enc_key, sent_enc_value):
    query, key, value = self.decoder_splitter(y)
    # batch x len x n_state

    # self-attention
    a = self.attn(query, key, value, None, one_dir_visible=True)
    # batch x len x n_state

    n = self.ln_1(y + a)  # residual

    if 'NO_HIERARCHY' in self.opt:
        q = y
        r = n
    else:
        # src-tgt attention on sentences
        q = self.sent_attn(n, sent_enc_key, sent_enc_value, None)
        r = self.ln_3(n + q)  # residual
        # batch x len x n_state

    # src-tgt attention on tokens
    o = self.token_attn(r, token_enc_key, token_enc_value, None)
    p = self.ln_2(r + o)  # residual
    # batch x len x n_state

    m = self.mlp(p)
    h = self.ln_4(p + m)
    return h
I would like to confirm: is this the intended code or not?
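For comparison, here is a self-contained toy (standard PyTorch modules with made-up names and sizes, not HMNet's real layers) that shows the two orderings side by side, the repo order quoted above versus my reading of the paper:
import torch
import torch.nn as nn

d, heads = 16, 4
self_attn = nn.MultiheadAttention(d, heads, batch_first=True)
token_attn = nn.MultiheadAttention(d, heads, batch_first=True)
sent_attn = nn.MultiheadAttention(d, heads, batch_first=True)
ln = nn.LayerNorm(d)  # a single LayerNorm reused here for brevity

y = torch.randn(2, 5, d)            # decoder states
token_mem = torch.randn(2, 40, d)   # token-level encoder outputs
sent_mem = torch.randn(2, 8, d)     # sentence-level encoder outputs

n = ln(y + self_attn(y, y, y)[0])   # self-attention (causal mask omitted)

# Order as quoted from the repo: sentences first, then tokens.
r_code = ln(n + sent_attn(n, sent_mem, sent_mem)[0])
p_code = ln(r_code + token_attn(r_code, token_mem, token_mem)[0])

# Order as I read the paper: tokens first, then sentences.
p_paper = ln(n + token_attn(n, token_mem, token_mem)[0])
r_paper = ln(p_paper + sent_attn(p_paper, sent_mem, sent_mem)[0])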
How can I train models with my own datasets?
Hi, I've tried to build your Docker container using the provided Dockerfile and it fails on python -m spacy download en: it couldn't link to libcuda.so.1. To fix this, I changed the Dockerfile to link against the stub at build time:
ENV LC_ALL=C.UTF-8
ENV LANG=C.UTF-8
RUN export LD_LIBRARY_PATH=/usr/local/cuda/lib64/stubs:$LD_LIBRARY_PATH && \
ln -s /usr/local/cuda/lib64/stubs/libcuda.so /usr/local/cuda/lib64/stubs/libcuda.so.1 && \
python -m spacy download en
The build then works. The next issue comes in Models/Networks/MeetingNet_Transformer.py, where
spacy.load('en', parser = False)
fails because the parser keyword has been removed. I fixed it by changing the call to
nlp = spacy.load('en_core_web_sm', exclude=['parser'])
which also fixed the warning that the en shortcut is deprecated.
The last thing I had to change to get things working: the Language object from spaCy no longer has tagger and entity fields, so I had to go through the pipeline to get them, as below.
tagger = [x[1] for x in nlp.pipeline if x[0] == 'tagger']
assert len(tagger) == 1
tagger = tagger[0]
entity = [x[1] for x in nlp.pipeline if x[0] == 'ner']
assert len(entity) == 1
entity = entity[0]
POS = {w: i for i, w in enumerate([''] + list(tagger.labels))}
ENT = {w: i for i, w in enumerate([''] + list(entity.move_names))}
Finally, once the code was able to execute, it ran into a tensor size issue with the linked fine-tuned AMI model, shown below:
Error(s) in loading state_dict for MeetingNet_Transformer:
size mismatch for encoder.pos_embed.weight: copying a param with shape torch.Size([51, 16]) from checkpoint, the shape in current model is torch.Size([50, 16])
I think this may be due to a spaCy model change, since the code was written against a different version.
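If it helps, a hedged way to test that hypothesis: the repo builds the POS vocabulary as [''] + tagger.labels (see the snippet above), so the POS embedding table should have len(tagger.labels) + 1 rows for whichever spaCy model is installed:
import spacy

nlp = spacy.load('en_core_web_sm')
tagger = nlp.get_pipe('tagger')
# POS = {w: i for i, w in enumerate([''] + list(tagger.labels))} in the repo,
# so the POS embedding table gets len(tagger.labels) + 1 rows.
print(len(tagger.labels) + 1)  # compare with the 50 vs 51 in the error above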
Could you provide a requirements.txt with pinned versions, or tell me if I'm wrong and the tensor size error is unrelated to the spaCy tags?
Thanks!
When I ran sudo docker build . -t hmnet, errors occurred at Step 10/35: RUN apt-get update && apt-get install -y --allow-change-held-packages --no-install-recommends software-properties-common openssh-client openssh-server pdsh curl sudo net-tools vim iputils-ping wget perl libxml-parser-perl libcudnn7=${CUDNN_VERSION} libnccl2=${NCCL_VERSION} libnccl-dev=${NCCL_VERSION} --allow-downgrades:
E: Unable to locate package libcudnn7
E: Version '2.4.7-1+cuda10.0' for 'libnccl2' was not found
E: Version '2.4.7-1+cuda10.0' for 'libnccl-dev' was not found
It seems that inside the Docker build apt can't find these packages. Is it something I'm doing wrong, or might the Dockerfile have a bug?
Hello. I am trying to run the experiments. Unfortunately, since the pip modules' versions are not specified in the Dockerfile, I am running into one error after another. Could you please pin the versions in the Dockerfile? For example, your code is not compatible with the latest spaCy version (3.0.50). I guess you used version 2.3.5 in your code.
Hi, I have successfully run your code, and the results are quite good. Would it be convenient for you to open-source the pretraining data? Or could you tell me how to obtain POS_ID and ENT_ID? Thanks.
Hello. I am trying to reproduce the paper results. I am currently running the code on 2 Tesla V100 GPUs, each with 16GB of memory, but I am still getting an out-of-memory error. I also tried decreasing MAX_TRANSCRIPT_WORD to 1000, but it did not help. Could you please let me know what hardware and GPUs it requires to run?
Could you please share the steps and the files needed to preprocess my own data for the model?
I'm following the README to try to fine-tune HMNet on the AMI dataset. My only modification to the instructions is that I have only 1 visible device (my full command thus becomes CUDA_VISIBLE_DEVICES="0" mpirun -np 1 --allow-run-as-root python PyLearn.py train ExampleConf/conf_hmnet_AMI).
The process exits with an error.
Here's the full output.
{'MODEL': 'MeetingNet_Transformer', 'TASK': 'HMNet', 'CRITERION': 'MLECriterion', 'SEED': 1033, 'RESUME': True, 'MAX_NUM_EPOCHS': 20, 'SAVE_PER_UPDATE_NUM': 400, 'UPDATES_PER_EPOCH': 2000, 'OPTIMIZER': 'RAdam', 'NO_AUTO_LR_SCALING': True, 'START_LEARNING_RATE': 0.001, 'LR_SCHEDULER': 'LnrWrmpInvSqRtDcyScheduler', 'WARMUP_STEPS': 16000, 'WARMUP_INIT_LR': 0.0001, 'WARMUP_END_LR': 0.001, 'GRADIENT_ACCUMULATE_STEP': 20, 'GRAD_CLIPPING': 2, 'USE_REL_DATA_PATH': True, 'TRAIN_FILE': '../ExampleRawData/meeting_summarization/AMI_proprec/train_ami.json', 'DEV_FILE': '../ExampleRawData/meeting_summarization/AMI_proprec/valid_ami.json', 'TEST_FILE': '../ExampleRawData/meeting_summarization/AMI_proprec/test_ami.json', 'ROLE_DICT_FILE': '../ExampleRawData/meeting_summarization/role_dict_ext.json', 'MINI_BATCH': 1, 'MAX_PADDING_RATIO': 1, 'BATCH_READ_AHEAD': 10, 'DOC_SHUFFLE_BUF_SIZE': 10, 'SAMPLE_SHUFFLE_BUFFER_SIZE': 10, 'BATCH_SHUFFLE_BUFFER_SIZE': 10, 'MAX_TRANSCRIPT_WORD': 8300, 'MAX_SENT_LEN': 30, 'MAX_SENT_NUM': 300, 'DROPOUT': 0.1, 'VOCAB_DIM': 512, 'ROLE_SIZE': 32, 'ROLE_DIM': 16, 'POS_DIM': 16, 'ENT_DIM': 16, 'USE_ROLE': True, 'USE_POSENT': True, 'USE_BOS_TOKEN': True, 'USE_EOS_TOKEN': True, 'TRANSFORMER_EMBED_DROPOUT': 0.1, 'TRANSFORMER_RESIDUAL_DROPOUT': 0.1, 'TRANSFORMER_ATTENTION_DROPOUT': 0.1, 'TRANSFORMER_LAYER': 6, 'TRANSFORMER_HEAD': 8, 'TRANSFORMER_POS_DISCOUNT': 80, 'PRE_TOKENIZER': 'TransfoXLTokenizer', 'PRE_TOKENIZER_PATH': '../ExampleInitModel/transfo-xl-wt103', 'PYLEARN_MODEL': '../ExampleInitModel/HMNet-pretrained', 'EXTRA_IDS': 1000, 'BEAM_WIDTH': 6, 'MAX_GEN_LENGTH': 512, 'MIN_GEN_LENGTH': 320, 'EVAL_TOKENIZED': True, 'EVAL_LOWERCASE': True, 'NO_REPEAT_NGRAM_SIZE': 3, 'cuda': True, 'confFile': 'ExampleConf/conf_hmnet_AMI', 'datadir': 'ExampleConf', 'basename': 'conf_hmnet_AMI', 'command': 'train', 'conf_file': 'ExampleConf/conf_hmnet_AMI', 'cluster': 'local', 'dist_init_path': './tmp', 'fp16': False, 'fp16_opt_level': 'O1', 'no_cuda': False}
Using Cuda
Saving logs, model, checkpoint, and evaluation in ExampleConf/conf_hmnet_AMI_conf~/run_2
1.2.0 is high
Number of GPUs is 1
Effective batch size is increased from 1 to 1
Gradient accumulation steps = 20
Effective batch size = 20
[9d66c296629d:03515] pml_ucx.c:285 Error: UCP worker does not support MPI_THREAD_MULTIPLE
Select command: train
train on rank 0
-----------------------------------------------
Initializing model...
Loading Tokenizer from ExampleConf/../ExampleInitModel/transfo-xl-wt103...
Using pad_token, but it is not set yet.
Using bos_token, but it is not set yet.
Use POS and ENT
USE_ROLE
Total trainable parameters: 204488240
Loaded data on rank 0.
Using custom optimizer: RAdam
Optimizer parameters: {'lr': 0.001}
Using custom lr scheduler: LnrWrmpInvSqRtDcyScheduler
Lr scheduler parameters: {'warmup_steps': 16000, 'warmup_init_lr': 0.0001, 'warmup_end_lr': 0.001}
Cannot find checkpoint path from conf_hmnet_AMI_resume_checkpoint.json.
Make sure ExampleConf/conf_hmnet_AMI_resume_checkpoint.json exists.
Continue without loading checkpoint
Epoch 0
Traceback (most recent call last):
File "PyLearn.py", line 71, in <module>
trainer.train()
File "/root/HMNet/Models/Trainers/HMNetTrainer.py", line 273, in train
self.update(batch)
File "/root/HMNet/Models/Trainers/HMNetTrainer.py", line 358, in update
loss = self.network(batch)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "/root/HMNet/Models/Trainers/HMNetTrainer.py", line 38, in forward
output = self.model(batch)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "/root/HMNet/Models/Networks/MeetingNet_Transformer.py", line 100, in forward
outputs = self._forward(**batch)
File "/root/HMNet/Models/Networks/MeetingNet_Transformer.py", line 125, in _forward
token_encoder_outputs, sent_encoder_outputs = self.encoder(encoder_input_ids, encoder_input_roles, encoder_input_pos, encoder_input_ent)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "/root/HMNet/Models/Networks/MeetingNet_Transformer.py", line 1130, in forward
embedded = self.embedder(vocab_x.view(batch_size, -1))
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "/root/HMNet/Models/Networks/Transformer.py", line 387, in forward
x_pos = self.pos_emb(torch.arange(x_len).type(torch.cuda.FloatTensor)) # len x n_state
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "/root/HMNet/Models/Networks/Transformer.py", line 86, in forward
sinusoid_inp = torch.ger(pos_seq, self.inv_freq)
RuntimeError: cublas runtime error : the GPU program failed to execute at /pytorch/aten/src/THC/THCBlas.cu:120
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[42501,1],0]
Exit code: 1
--------------------------------------------------------------------------
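For reference, the call that actually crashes in the traceback above is the outer product used to build the sinusoidal position embeddings (Transformer.py, line 86). A minimal, hedged reproduction of just that call, independent of the rest of the repo (the sizes are arbitrary, and it assumes a CUDA-enabled PyTorch build):
import torch

pos_seq = torch.arange(512).type(torch.cuda.FloatTensor)  # same dtype cast as in the repo
inv_freq = torch.ones(8, device='cuda')
print(torch.ger(pos_seq, inv_freq).shape)  # expected: torch.Size([512, 8])
If this standalone snippet also fails with the same cublas error, the problem is likely the CUDA/cuBLAS/PyTorch combination in the image rather than the model code.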
There are important files that Microsoft projects should all have that are not present in this repository. A pull request has been opened to add the missing file(s). When the PR is merged, this issue will be closed automatically.
Microsoft teams can learn more about this effort and share feedback within the open source guidance available internally.