bertwithpretrained's Issues

env

Hi, torch==1.5.0 requires at least Python 3.7;
torch==1.5.0 cannot be installed in a Python 3.6 environment.

ValueError: num_samples should be a positive integer value, but got num_samples=0

Machine setup: torch 1.12.0 + Python 3.8.
Running the original repository's pretraining task fails with the following error:
```
Traceback (most recent call last):
  File "/root/Tasks/TaskForPretraining.py", line 305, in <module>
    train(config)
  File "/root/Tasks/TaskForPretraining.py", line 108, in train
    data_loader.load_train_val_test_data(test_file_path=config.test_file_path,
  File "/root/utils/create_pretraining_data.py", line 334, in load_train_val_test_data
    train_iter = DataLoader(train_data, batch_size=self.batch_size,
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 347, in __init__
    sampler = RandomSampler(dataset, generator=generator)  # type: ignore[arg-type]
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/sampler.py", line 107, in __init__
    raise ValueError("num_samples should be a positive integer "
ValueError: num_samples should be a positive integer value, but got num_samples=0
```
Could this be a torch version problem? The RTX 3090 only supports CUDA 11, and that seems to require at least torch 1.9.0.
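Whatever the root cause turns out to be, a minimal guard sketch (the function name and error message are mine, not the repo's) surfaces the actual problem, an empty dataset after preprocessing, before the DataLoader is built:

```python
from torch.utils.data import DataLoader

def make_train_iter(train_data, batch_size, generate_batch):
    # Fail early with an actionable message instead of letting
    # RandomSampler raise "num_samples=0" deep inside DataLoader.
    if len(train_data) == 0:
        raise RuntimeError("train_data is empty: check train_file_path and "
                           "delete any stale cached .pt files so the data "
                           "is regenerated")
    return DataLoader(train_data, batch_size=batch_size, shuffle=True,
                      collate_fn=generate_batch)
```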

Problems training the SQuAD task on Chinese data

Hello, and many thanks for open-sourcing this! I have been carefully studying your articles and code, but I ran into some problems training the SQuAD task on Chinese data:

1. I process the Chinese data exactly as your code does, feeding it in character by character, but the tokenizer strips whitespace, which shifts the final answer positions; as a result, in the show_result stage the "True answer" often fails to match the correct answer.

Which parts of the Chinese data-processing code mainly need to be changed? I am also not sure whether I made a mistake elsewhere. Thanks a lot!
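One workaround sketch (the helper is hypothetical, not from the repo): tokenize the Chinese context character by character while recording each kept character's index in the raw string, so answer spans can be mapped back even after whitespace is removed.

```python
def char_tokenize_with_offsets(context):
    """Split a Chinese context into characters, keeping original offsets."""
    tokens, offsets = [], []
    for i, ch in enumerate(context):
        if ch.isspace():
            continue  # whitespace is dropped, but recorded offsets stay aligned
        tokens.append(ch)
        offsets.append(i)  # position of this token in the raw context string
    return tokens, offsets
```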

Can single-machine multi-GPU training be supported? The following problem occurred while I was modifying the code

```
Traceback (most recent call last):
  File "/home/yons/workfiles/codes/opencodes/BertWithPretrained/Tasks/TaskForChineseNER.py", line 315, in <module>
    train(config)
  File "/home/yons/workfiles/codes/opencodes/BertWithPretrained/Tasks/TaskForChineseNER.py", line 132, in train
    loss, logits = model(input_ids=token_ids,  # [src_len, batch_size]
  File "/home/yons/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/yons/anaconda3/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/yons/anaconda3/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/yons/anaconda3/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/home/yons/anaconda3/lib/python3.9/site-packages/torch/_utils.py", line 434, in reraise
    raise exception
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/yons/anaconda3/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/home/yons/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/yons/workfiles/codes/opencodes/BertWithPretrained/Tasks/../model/DownstreamTasks/BertForTokenClassification.py", line 32, in forward
    _, all_encoder_outputs = self.bert(input_ids=input_ids,
  File "/home/yons/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/yons/workfiles/codes/opencodes/BertWithPretrained/Tasks/../model/BasicBert/Bert.py", line 290, in forward
    all_encoder_outputs = self.bert_encoder(embedding_output,
  File "/home/yons/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/yons/workfiles/codes/opencodes/BertWithPretrained/Tasks/../model/BasicBert/Bert.py", line 190, in forward
    layer_output = layer_module(layer_output,
  File "/home/yons/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/yons/workfiles/codes/opencodes/BertWithPretrained/Tasks/../model/BasicBert/Bert.py", line 162, in forward
    attention_output = self.bert_attention(hidden_states, attention_mask)
  File "/home/yons/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/yons/workfiles/codes/opencodes/BertWithPretrained/Tasks/../model/BasicBert/Bert.py", line 93, in forward
    self_outputs = self.self(hidden_states,
  File "/home/yons/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/yons/workfiles/codes/opencodes/BertWithPretrained/Tasks/../model/BasicBert/Bert.py", line 56, in forward
    return self.multi_head_attention(query, key, value, attn_mask=attn_mask, key_padding_mask=key_padding_mask)
  File "/home/yons/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/yons/workfiles/codes/opencodes/BertWithPretrained/Tasks/../model/BasicBert/MyTransformer.py", line 296, in forward
    return multi_head_attention_forward(query, key, value, self.num_heads,
  File "/home/yons/workfiles/codes/opencodes/BertWithPretrained/Tasks/../model/BasicBert/MyTransformer.py", line 360, in multi_head_attention_forward
    attn_output_weights = attn_output_weights.masked_fill(
RuntimeError: The size of tensor a (367) must match the size of tensor b (184) at non-singleton dimension 3
```
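A hedged reading of this error (not confirmed by the maintainer): the model takes inputs shaped [src_len, batch_size], but nn.DataParallel scatters along dim 0 by default, so it splits the sequence dimension across GPUs rather than the batch, which produces exactly this kind of mask-size mismatch. A sketch of one fix (device IDs illustrative):

```python
import torch.nn as nn

# Scatter along dim 1 (the batch dimension of [src_len, batch_size] inputs)
# instead of the default dim 0 (the sequence dimension).
model = nn.DataParallel(model, device_ids=[0, 1], dim=1)
# Caveat: every tensor passed to forward() must then carry the batch on the
# same dimension (including the padding mask), or the scatter will still
# split some tensors along the wrong axis.
```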

Questions about modeling the QA task

Hello, your current approach to modeling QA splits the context into individual sentences.
Question 1: With this approach, how do you ensure the question can read sufficient sentence-level information?
Question 2: What do the numbers circled in the figure mean?
[figure omitted: screenshot with several circled numbers]
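For question 1, the usual answer is a sliding window: the context is split into overlapping chunks so every answer span appears in at least one chunk with enough surrounding text. A minimal sketch (parameter values illustrative, not necessarily the repo's):

```python
def sliding_windows(context_tokens, max_len=384, doc_stride=128):
    """Split a long context into overlapping windows of at most max_len tokens."""
    windows, start = [], 0
    while start < len(context_tokens):
        windows.append(context_tokens[start:start + max_len])
        if start + max_len >= len(context_tokens):
            break
        start += doc_stride  # consecutive windows overlap by max_len - doc_stride
    return windows
```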

With the songci dataset, pretraining on wiki2 raises an error; the generated masked .pt file wiki_train_mlNone_rs2022_mr15_mtr8_mtur5.pt is only 1 KB

Note: the log shows the local MyMultiHeadAttention implementation from MyTransformer is being used.

```
[2022-11-27 15:03:35] - INFO: ## Using the weight matrix from the token embedding as the output layer weights! torch.Size([30522, 768])
[2022-11-27 15:03:38] - INFO: Cache file /home/********/博一/my_explore/BERT_learn/BertWithPretrained-main/data/WikiText/wiki_test_mlNone_rs2022_mr15_mtr8_mtur5.pt does not exist; reprocessing and caching!
Reading raw data: 100%|██████████████| 4358/4358 [00:00<00:00, 11122.89it/s]
Constructing NSP and MLM samples (test): 100%|██| 1847/1847 [00:00<00:00, 1681180.44it/s]
[2022-11-27 15:03:38] - INFO: Cache file /home/********/博一/my_explore/BERT_learn/BertWithPretrained-main/data/WikiText/wiki_train_mlNone_rs2022_mr15_mtr8_mtur5.pt does not exist; reprocessing and caching!
Reading raw data: 100%|████████████| 36718/36718 [00:03<00:00, 11100.30it/s]
Constructing NSP and MLM samples (train): 100%|█| 15496/15496 [00:00<00:00, 1615704.25it/s]
Traceback (most recent call last):
  File "TaskForPretraining.py", line 300, in <module>
    train(config)
  File "TaskForPretraining.py", line 105, in train
    val_file_path=config.val_file_path)
  File "../utils/create_pretraining_data.py", line 334, in load_train_val_test_data
    collate_fn=self.generate_batch)
  File "/home/pgrad/.conda/envs/wmc_transformer/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 213, in __init__
    sampler = RandomSampler(dataset)
  File "/home/pgrad/.conda/envs/wmc_transformer/lib/python3.7/site-packages/torch/utils/data/sampler.py", line 94, in __init__
    "value, but got num_samples={}".format(self.num_samples))
ValueError: num_samples should be a positive integer value, but got num_samples=0
```
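A quick diagnostic sketch (the path is copied from the log above; what the cache holds depends on the repo's serialization format): load the generated cache and check how many samples it actually contains. A 1 KB file almost certainly serialized an empty collection, which then triggers num_samples=0 in RandomSampler.

```python
import torch

data = torch.load("data/WikiText/wiki_train_mlNone_rs2022_mr15_mtr8_mtur5.pt")
print(type(data), len(data) if hasattr(data, "__len__") else "no length")
```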

Question about sentence-pair classification during MLM pretraining

Hello, I would like to ask about sentence-pair pretraining. I read Task/TaskForPretraining.py, which combines the MLM and NSP tasks. Inspired by this, I want to ask: for sentence-pair classification (i.e., judging whether sentence a and sentence b belong to the same class), is it enough to adjust the sentence-pair processing accordingly (i.e., change the model input token_type_ids to [0] * (len(token_a_ids) + 2) + [1] * (len(token_b_ids) + 1)) and replace nsp_label with the sentence-pair label? Or is there another approach?
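A sketch of the adjustment the question describes (assumptions: token_a_ids and token_b_ids exclude special tokens; cls_id, sep_id, and pair_label are placeholder names):

```python
# Build the standard [CLS] a [SEP] b [SEP] pair input with matching segments.
input_ids = [cls_id] + token_a_ids + [sep_id] + token_b_ids + [sep_id]
token_type_ids = [0] * (len(token_a_ids) + 2) + [1] * (len(token_b_ids) + 1)
assert len(input_ids) == len(token_type_ids)
label = pair_label  # stands in for nsp_label; the NSP head becomes a pair classifier
```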

Question about training the MLM task from scratch

Hello, thanks for the code you provide! I have a question about pretraining. The TaskForPretraining.py you provide actually continues pretraining from an already-trained model. If I want to pretrain entirely from random initialization, do the training strategies need adjusting, e.g., the initial learning rate, the decay schedule, and so on?
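For reference, a minimal sketch (my own, not the repo's) of the warmup-plus-linear-decay schedule used in the original BERT pretraining recipe, which is the usual starting point when training from random initialization:

```python
from torch.optim.lr_scheduler import LambdaLR

def warmup_linear_decay(optimizer, warmup_steps, total_steps):
    # Linear warmup from 0 to the peak LR, then linear decay back to 0.
    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return LambdaLR(optimizer, lr_lambda)
```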

Questions about the attention mask

In the code (e.g., TaskForSingleSentenseClassification.py), the attention_mask seems to take the form [0, 0, 0, ..., 1, 1, 1]:
padding_mask = (sample == data_loader.PAD_IDX).transpose(0, 1)

I have seen other implementations use the form [1, 1, 1, ..., 0, 0, 0]. Is there any difference between the two? Thanks!
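The two forms encode the same information with inverted polarity: PyTorch's key_padding_mask marks positions to ignore (padding) with True/1, while the Hugging Face-style attention_mask marks positions to keep with 1. A conversion sketch (toy values, not repo code):

```python
import torch

key_padding_mask = torch.tensor([[False, False, False, True, True]])  # last 2 are padding
attention_mask = (~key_padding_mask).long()  # -> tensor([[1, 1, 1, 0, 0]])
```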

About attention_mask

Why is it that in the attention_mask valid tokens are False while padding tokens are True?

Pretrained model parameters

Hi, I appreciate your detailed tutorials on GitHub and WeChat; they helped me understand a lot about the Transformer and BERT models. If we want to implement the downstream tasks you provide, where are we supposed to place the pretrained parameters for the model? Are these parameters the ones downloaded from Hugging Face's website?

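A sketch of the usual workflow (the folder name is illustrative): download the PyTorch checkpoint files (pytorch_model.bin, config.json, vocab.txt) for a model such as bert-base-chinese from Hugging Face, place them in a local directory, and point the task's config at that directory. The raw parameters can then be loaded as a plain state dict:

```python
import torch

state_dict = torch.load("bert_base_chinese/pytorch_model.bin",
                        map_location="cpu")
print(list(state_dict.keys())[:5])  # inspect the pretrained parameter names
```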

IndexError: list index out of range when running the TaskForSQuADQuestionAnswering training task

When running the TaskForSQuADQuestionAnswering training task I frequently hit the following error. What could be causing it?

```
Iterating over each question (sample): 76%|████████████▉ | 16/21 [05:20<01:40, 20.01s/it]
Traceback (most recent call last):
  File "TaskForSQuADQuestionAnswering_Train.py", line 210, in <module>
    train(config=model_config)
  File "TaskForSQuADQuestionAnswering_Train.py", line 81, in train
    train_iter, val_iter = data_loader.load_train_data(train_file_path=config.train_file_path)
  File "../utils/data_helpers.py", line 766, in load_train_data
    postfix=postfix)  # obtain all processed samples
  File "../utils/data_helpers.py", line 97, in wrapper
    data = func(*args, **kwargs)
  File "../utils/data_helpers.py", line 649, in data_process
    token_to_orig_map = self.get_token_to_orig_map(input_tokens, example[3], self.tokenizer)
  File "../utils/data_helpers.py", line 561, in get_token_to_orig_map
    token = tokenizer(origin_context_tokens[value_start])
IndexError: list index out of range
```
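A defensive sketch (the helper name is mine; variable names come from the traceback, and whether skipping such samples is acceptable depends on the dataset): guard against mapped start indices that fall outside the original context tokens.

```python
def safe_token(tokenizer, origin_context_tokens, value_start):
    # The traceback shows value_start indexing past the end of
    # origin_context_tokens; return None so the caller can skip the sample.
    if value_start >= len(origin_context_tokens):
        return None
    return tokenizer(origin_context_tokens[value_start])
```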
