liuqidong07 / leader-pytorch Goto Github PK
View Code? Open in Web Editor NEW[arXiv'24] The official implementation code of LEADER.
Home Page: https://arxiv.org/abs/2402.02803
License: MIT License
[arXiv'24] The official implementation code of LEADER.
Home Page: https://arxiv.org/abs/2402.02803
License: MIT License
您好,我在复现您代码的时候,出现错误:
RuntimeError: CUDA error: device-side assert triggered
Compile with TORCH_USE_CUDA_DSA
to enable device-side assertions.
错误地点是pooled_logits = logits[torch.arange(batch_size, device=logits.device), sequence_lengths]
其中我的batch_size:1; logits.shape是[121,1,131], sequence_lengths:tensor[120]
请问是怎么回事呀
作者您好,拜读您文章后进行试验复现时出现一些问题,希望您给予帮助。由于内存有限,我们使用zero3策略训练模型后,在测试阶段遇到问题如下:
train()
File "main_llm_cls.py", line 78, in train
model = PeftModelForCLS.from_pretrained(model, model_args.peft_path, is_trainable=False)
File "/root/autodl-tmp/LEADER/llm/lora_cls.py", line 94, in from_pretrained
model.load_adapter(model_id, adapter_name, **kwargs)
File "/root/autodl-tmp/LEADER/llm/lora_cls.py", line 130, in load_adapter
set_peft_model_state_dict(self, adapters_weights, adapter_name=adapter_name)
File "/root/autodl-tmp/LEADER/llm/lora_cls.py", line 282, in set_peft_model_state_dict
model.load_state_dict(peft_model_state_dict, strict=False)
File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1671, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for PeftModelForCLS:
size mismatch for base_model.model.model.layers.0.mlp.gate_proj.lora_B.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([11008, 8]).
size mismatch for base_model.model.model.layers.0.mlp.up_proj.lora_B.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([11008, 8]).
size mismatch for base_model.model.model.layers.0.mlp.down_proj.lora_A.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([8, 11008]).
size mismatch for base_model.model.model.layers.1.mlp.gate_proj.lora_B.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([11008, 8]).
size mismatch for base_model.model.model.layers.1.mlp.up_proj.lora_B.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([11008, 8]).
size mismatch for base_model.model.model.layers.1.mlp.down_proj.lora_A.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([8, 11008]).
size mismatch for base_model.model.model.layers.2.mlp.gate_proj.lora_B.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([11008, 8]).
size mismatch for base_model.model.model.layers.2.mlp.up_proj.lora_B.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([11008, 8]).
作者您好,我尝试复现您的代码时,出现以下的错误,Pytorch版本我用的也是1.12.0+cu102,但是torch显示是可用的,
import torch
if __name__ == "__main__":
print("Cuda support:", torch.cuda.is_available(),":", torch.cuda.device_count(), "devices")
accelerator = Accelerator()
print(accelerator.state)
输出:
Cuda support: True : 1 devices
Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda
Mixed precision type: no
具体报错如下,您能帮我解决一下问题吗?
另外requirements.txt中trl==0.7.6需要transformers>=4.31.0,但是在文件中使用的是4.28.1的transformers包,这是否会有问题呢?
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/heyichen/LEADER-pytorch/main_llm_cls.py:216 in │
│ │
│ 213 │
│ 214 if name == "main": │
│ 215 │ │
│ ❱ 216 │ train() │
│ 217 │
│ 218 │
│ 219 │
│ │
│ /home/heyichen/LEADER-pytorch/main_llm_cls.py:60 in train │
│ │
│ 57 def train(): │
│ 58 │ │
│ 59 │ parser = HfArgumentParser((ModelArguments, DataTrainingArguments, Seq2SeqTrainingArg │
│ ❱ 60 │ model_args, data_args, training_args = parser.parse_args_into_dataclasses() │
│ 61 │ device_map = "auto" │
│ 62 │ │
│ 63 │ # load diag, proc, med word2id tokenizer │
│ │
│ /home/heyichen/.conda/envs/LEADER/lib/python3.9/site-packages/transformers/hf_argparser.py:332 │
│ in parse_args_into_dataclasses │
│ │
│ 329 │ │ │ inputs = {k: v for k, v in vars(namespace).items() if k in keys} │
│ 330 │ │ │ for k in keys: │
│ 331 │ │ │ │ delattr(namespace, k) │
│ ❱ 332 │ │ │ obj = dtype(**inputs) │
│ 333 │ │ │ outputs.append(obj) │
│ 334 │ │ if len(namespace.dict) > 0: │
│ 335 │ │ │ # additional namespace. │
│ in init:115 │
│ │
│ /home/heyichen/.conda/envs/LEADER/lib/python3.9/site-packages/transformers/training_args.py:1259 │
│ in post_init │
│ │
│ 1256 │ │ if ( │
│ 1257 │ │ │ self.framework == "pt" │
│ 1258 │ │ │ and is_torch_available() │
│ ❱ 1259 │ │ │ and (self.device.type != "cuda") │
│ 1260 │ │ │ and (get_xla_device_type(self.device) != "GPU") │
│ 1261 │ │ │ and (self.fp16 or self.fp16_full_eval) │
│ 1262 │ │ ): │
│ │
│ /home/heyichen/.conda/envs/LEADER/lib/python3.9/site-packages/transformers/training_args.py:1694 │
│ in device │
│ │
│ 1691 │ │ The device used by this process. │
│ 1692 │ │ """ │
│ 1693 │ │ requires_backends(self, ["torch"]) │
│ ❱ 1694 │ │ return self._setup_devices │
│ 1695 │ │
│ 1696 │ @Property │
│ 1697 │ def n_gpu(self): │
│ │
│ /home/heyichen/.conda/envs/LEADER/lib/python3.9/site-packages/transformers/utils/generic.py:54 │
│ in get │
│ │
│ 51 │ │ attr = "_cached" + self.fget.name │
│ 52 │ │ cached = getattr(obj, attr, None) │
│ 53 │ │ if cached is None: │
│ ❱ 54 │ │ │ cached = self.fget(obj) │
│ 55 │ │ │ setattr(obj, attr, cached) │
│ 56 │ │ return cached │
│ 57 │
│ │
│ /home/heyichen/.conda/envs/LEADER/lib/python3.9/site-packages/transformers/training_args.py:1679 │
│ in _setup_devices │
│ │
│ 1676 │ │ │ │ if self.xpu_backend and self.xpu_backend in ("mpi", "gloo"): │
│ 1677 │ │ │ │ │ torch.distributed.init_process_group(backend=self.xpu_backend, timeo │
│ 1678 │ │ │ │ else: │
│ ❱ 1679 │ │ │ │ │ torch.distributed.init_process_group(backend="nccl", timeout=self.dd │
│ 1680 │ │ │ device = torch.device("cuda", self.local_rank) │
│ 1681 │ │ │ self._n_gpu = 1 │
│ 1682 │
│ │
│ /home/heyichen/.conda/envs/LEADER/lib/python3.9/site-packages/torch/distributed/distributed_c10d │
│ .py:602 in init_process_group │
│ │
│ 599 │ │ │ # different systems (e.g. RPC) in case the store is multi-tenant. │
│ 600 │ │ │ store = PrefixStore("default_pg", store) │
│ 601 │ │ │
│ ❱ 602 │ │ default_pg = _new_process_group_helper( │
│ 603 │ │ │ world_size, │
│ 604 │ │ │ rank, │
│ 605 │ │ │ [], │
│ │
│ /home/heyichen/.conda/envs/LEADER/lib/python3.9/site-packages/torch/distributed/distributed_c10d │
│ .py:738 in _new_process_group_helper │
│ │
│ 735 │ │ │ │ pg_options.is_high_priority_stream = False │
│ 736 │ │ │ │ pg_options._timeout = timeout │
│ 737 │ │ │ │
│ ❱ 738 │ │ │ pg = ProcessGroupNCCL(prefix_store, rank, world_size, pg_options) │
│ 739 │ │ │ # In debug mode and if GLOO is available, wrap in a wrapper PG that │
│ 740 │ │ │ # enables enhanced collective checking for debugability. │
│ 741 │ │ │ if get_debug_level() == DebugLevel.DETAIL: │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!
Could you please tell me your Python version? I tried to reproduce your code as described in the paper using Python 3.6.5. However, when I run 'pip install -r requirements.txt', I encounter numerous version errors, such as: 'ERROR: Could not find a version that satisfies the requirement accelerate==0.18.0'.
Hi Liu, Nice work.
May I know how you got these files :
./auxiliary/RXCUI2atc4.csv
./auxiliary/drug-atc.csv
./auxiliary/ndc2RXCUI.txt
./auxiliary/drugbank_drugs_info.csv
./auxiliary/drug-DDI.csv
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.