leader-pytorch's People

Contributors

liuqidong07

Forkers

data-designer

leader-pytorch's Issues

Training problem

Hello, while reproducing your code I ran into the following error:
RuntimeError: CUDA error: device-side assert triggered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
The error occurs at pooled_logits = logits[torch.arange(batch_size, device=logits.device), sequence_lengths]
In my run, batch_size is 1, logits.shape is [121, 1, 131], and sequence_lengths is tensor([120]).
What could be causing this?
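
One possible reading of the shapes reported above: the expression `logits[torch.arange(batch_size), sequence_lengths]` indexes dimension 0 with the batch index and dimension 1 with the sequence position, so it expects a `(batch, seq_len, num_labels)` layout. With `logits.shape == [121, 1, 131]`, dimension 1 has size 1, so position 120 is out of range, and on CUDA an out-of-range index surfaces as a device-side assert. The sketch below reproduces that bounds check in plain Python. `check_pooling_index` is a hypothetical helper, not part of the repository, and the `(seq_len, batch, num_labels)` layout is an assumption inferred from the reported shape.

```python
def check_pooling_index(logits_shape, sequence_lengths):
    """Return a list of problems for logits[arange(batch), sequence_lengths].

    Hypothetical helper: mirrors the bounds rules of the two-index lookup,
    where dim 0 is indexed by 0..batch-1 and dim 1 by each sequence position.
    """
    dim0, dim1 = logits_shape[0], logits_shape[1]
    problems = []
    batch_size = len(sequence_lengths)
    if batch_size > dim0:
        problems.append(f"batch index {batch_size - 1} exceeds dim 0 of size {dim0}")
    for row, pos in enumerate(sequence_lengths):
        if not -dim1 <= pos < dim1:
            problems.append(
                f"row {row}: position {pos} is out of range for dim 1 of size {dim1}"
            )
    return problems


# Shapes from the report: logits is [121, 1, 131], sequence_lengths is [120].
reported = check_pooling_index((121, 1, 131), [120])

# Same data viewed as (batch, seq_len, num_labels), e.g. after logits.transpose(0, 1).
transposed = check_pooling_index((1, 121, 131), [120])

print(reported)
print(transposed)
```

If this is the cause, transposing the logits to batch-first before pooling would be one fix to try. Running the same step on CPU should raise a readable IndexError instead of the CUDA assert, which would confirm or rule out this explanation.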

Model testing problem

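Hello, after reading your paper I ran into some problems while trying to reproduce the experiments and would appreciate your help. Because of limited memory, we trained the model with the ZeRO-3 strategy and then hit the following error at test time: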

 train()
 File "main_llm_cls.py", line 78, in train
   model = PeftModelForCLS.from_pretrained(model, model_args.peft_path, is_trainable=False)
 File "/root/autodl-tmp/LEADER/llm/lora_cls.py", line 94, in from_pretrained
   model.load_adapter(model_id, adapter_name, **kwargs)
 File "/root/autodl-tmp/LEADER/llm/lora_cls.py", line 130, in load_adapter
   set_peft_model_state_dict(self, adapters_weights, adapter_name=adapter_name)
 File "/root/autodl-tmp/LEADER/llm/lora_cls.py", line 282, in set_peft_model_state_dict
   model.load_state_dict(peft_model_state_dict, strict=False)
 File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1671, in load_state_dict
   raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for PeftModelForCLS:
       size mismatch for base_model.model.model.layers.0.mlp.gate_proj.lora_B.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([11008, 8]).
       size mismatch for base_model.model.model.layers.0.mlp.up_proj.lora_B.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([11008, 8]).
       size mismatch for base_model.model.model.layers.0.mlp.down_proj.lora_A.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([8, 11008]).
       size mismatch for base_model.model.model.layers.1.mlp.gate_proj.lora_B.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([11008, 8]).
       size mismatch for base_model.model.model.layers.1.mlp.up_proj.lora_B.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([11008, 8]).
       size mismatch for base_model.model.model.layers.1.mlp.down_proj.lora_A.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([8, 11008]).
       size mismatch for base_model.model.model.layers.2.mlp.gate_proj.lora_B.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([11008, 8]).
       size mismatch for base_model.model.model.layers.2.mlp.up_proj.lora_B.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([11008, 8]).

The model files after training are shown in the screenshots:
[Image: model files]
[Image: check-point]
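
The `torch.Size([0])` shapes in the checkpoint are consistent with weights that were saved while still partitioned under ZeRO stage 3, where a rank holds only an empty placeholder for each parameter. This is a guess from the log, not a confirmed diagnosis. A quick way to check is to list the zero-element entries of the saved adapter state dict. The sketch below works on any mapping whose values expose `.shape`; `find_empty_params` is a hypothetical helper, and the stand-in entries and their shapes are illustrative only.

```python
from types import SimpleNamespace


def find_empty_params(state_dict):
    """Return names of entries whose shape contains a zero-sized dimension."""
    return [name for name, tensor in state_dict.items() if 0 in tuple(tensor.shape)]


# Stand-ins for tensors (illustrative shapes); with a real checkpoint,
# pass the loaded state dict instead.
fake_state_dict = {
    "layers.0.mlp.gate_proj.lora_A.default.weight": SimpleNamespace(shape=(8, 4096)),
    "layers.0.mlp.gate_proj.lora_B.default.weight": SimpleNamespace(shape=(0,)),
    "layers.0.mlp.down_proj.lora_A.default.weight": SimpleNamespace(shape=(0,)),
}

empty = find_empty_params(fake_state_dict)
print(empty)
```

If entries come back empty, the weights need to be gathered before saving. DeepSpeed provides the `stage3_gather_16bit_weights_on_model_save` config option and writes a `zero_to_fp32.py` script into the checkpoint directory for this purpose, though whether either fits this repository's saving code is untested here.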

Runtime problem: RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!

Hello, I hit the error below when trying to reproduce your code. I am also using PyTorch 1.12.0+cu102, and torch reports that CUDA is available:

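import torch
from accelerate import Accelerator

if __name__ == "__main__":
    # Report whether CUDA is visible and how many devices torch can see.
    print("Cuda support:", torch.cuda.is_available(),":", torch.cuda.device_count(), "devices")
    # The Accelerator state shows the distributed setup that accelerate detects.
    accelerator = Accelerator()
    print(accelerator.state)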

Output:
Cuda support: True : 1 devices
Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda

Mixed precision type: no

The full error is below. Could you help me resolve it?
Also, trl==0.7.6 in requirements.txt requires transformers>=4.31.0, but the file pins transformers 4.28.1. Could this cause a problem?

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/heyichen/LEADER-pytorch/main_llm_cls.py:216 in │
│ │
│ 213 │
│ 214 if __name__ == "__main__": │
│ 215 │ │
│ ❱ 216 │ train() │
│ 217 │
│ 218 │
│ 219 │
│ │
│ /home/heyichen/LEADER-pytorch/main_llm_cls.py:60 in train │
│ │
│ 57 def train(): │
│ 58 │ │
│ 59 │ parser = HfArgumentParser((ModelArguments, DataTrainingArguments, Seq2SeqTrainingArg │
│ ❱ 60 │ model_args, data_args, training_args = parser.parse_args_into_dataclasses() │
│ 61 │ device_map = "auto" │
│ 62 │ │
│ 63 │ # load diag, proc, med word2id tokenizer │
│ │
│ /home/heyichen/.conda/envs/LEADER/lib/python3.9/site-packages/transformers/hf_argparser.py:332 │
│ in parse_args_into_dataclasses │
│ │
│ 329 │ │ │ inputs = {k: v for k, v in vars(namespace).items() if k in keys} │
│ 330 │ │ │ for k in keys: │
│ 331 │ │ │ │ delattr(namespace, k) │
│ ❱ 332 │ │ │ obj = dtype(**inputs) │
│ 333 │ │ │ outputs.append(obj) │
│ 334 │ │ if len(namespace.__dict__) > 0: │
│ 335 │ │ │ # additional namespace. │
│ in __init__:115 │
│ │
│ /home/heyichen/.conda/envs/LEADER/lib/python3.9/site-packages/transformers/training_args.py:1259 │
│ in __post_init__ │
│ │
│ 1256 │ │ if ( │
│ 1257 │ │ │ self.framework == "pt" │
│ 1258 │ │ │ and is_torch_available() │
│ ❱ 1259 │ │ │ and (self.device.type != "cuda") │
│ 1260 │ │ │ and (get_xla_device_type(self.device) != "GPU") │
│ 1261 │ │ │ and (self.fp16 or self.fp16_full_eval) │
│ 1262 │ │ ): │
│ │
│ /home/heyichen/.conda/envs/LEADER/lib/python3.9/site-packages/transformers/training_args.py:1694 │
│ in device │
│ │
│ 1691 │ │ The device used by this process. │
│ 1692 │ │ """ │
│ 1693 │ │ requires_backends(self, ["torch"]) │
│ ❱ 1694 │ │ return self._setup_devices │
│ 1695 │ │
│ 1696 │ @property │
│ 1697 │ def n_gpu(self): │
│ │
│ /home/heyichen/.conda/envs/LEADER/lib/python3.9/site-packages/transformers/utils/generic.py:54 │
│ in __get__ │
│ │
│ 51 │ │ attr = "__cached_" + self.fget.__name__ │
│ 52 │ │ cached = getattr(obj, attr, None) │
│ 53 │ │ if cached is None: │
│ ❱ 54 │ │ │ cached = self.fget(obj) │
│ 55 │ │ │ setattr(obj, attr, cached) │
│ 56 │ │ return cached │
│ 57 │
│ │
│ /home/heyichen/.conda/envs/LEADER/lib/python3.9/site-packages/transformers/training_args.py:1679 │
│ in _setup_devices │
│ │
│ 1676 │ │ │ │ if self.xpu_backend and self.xpu_backend in ("mpi", "gloo"): │
│ 1677 │ │ │ │ │ torch.distributed.init_process_group(backend=self.xpu_backend, timeo │
│ 1678 │ │ │ │ else: │
│ ❱ 1679 │ │ │ │ │ torch.distributed.init_process_group(backend="nccl", timeout=self.dd │
│ 1680 │ │ │ device = torch.device("cuda", self.local_rank) │
│ 1681 │ │ │ self._n_gpu = 1 │
│ 1682 │
│ │
│ /home/heyichen/.conda/envs/LEADER/lib/python3.9/site-packages/torch/distributed/distributed_c10d │
│ .py:602 in init_process_group │
│ │
│ 599 │ │ │ # different systems (e.g. RPC) in case the store is multi-tenant. │
│ 600 │ │ │ store = PrefixStore("default_pg", store) │
│ 601 │ │ │
│ ❱ 602 │ │ default_pg = _new_process_group_helper( │
│ 603 │ │ │ world_size, │
│ 604 │ │ │ rank, │
│ 605 │ │ │ [], │
│ │
│ /home/heyichen/.conda/envs/LEADER/lib/python3.9/site-packages/torch/distributed/distributed_c10d │
│ .py:738 in _new_process_group_helper │
│ │
│ 735 │ │ │ │ pg_options.is_high_priority_stream = False │
│ 736 │ │ │ │ pg_options._timeout = timeout │
│ 737 │ │ │ │
│ ❱ 738 │ │ │ pg = ProcessGroupNCCL(prefix_store, rank, world_size, pg_options) │
│ 739 │ │ │ # In debug mode and if GLOO is available, wrap in a wrapper PG that │
│ 740 │ │ │ # enables enhanced collective checking for debugability. │
│ 741 │ │ │ if get_debug_level() == DebugLevel.DETAIL: │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!
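
The traceback shows `TrainingArguments` reaching `torch.distributed.init_process_group(backend="nccl")`, so the training process took the distributed branch, while the standalone check above ran as a single non-distributed process. One thing worth comparing, offered as an assumption rather than a confirmed cause, is the environment the training process actually sees. The sketch below collects the launcher-related variables; `summarize_launch_env` is a hypothetical helper, not part of the repository.

```python
import os

# Environment variables commonly set by distributed launchers or used to mask GPUs.
KEYS = ("CUDA_VISIBLE_DEVICES", "LOCAL_RANK", "RANK", "WORLD_SIZE")


def summarize_launch_env(env=None):
    """Return the launcher-related variables and a list of warnings."""
    env = os.environ if env is None else env
    seen = {key: env.get(key) for key in KEYS}
    warnings = []
    if seen["CUDA_VISIBLE_DEVICES"] == "":
        warnings.append("CUDA_VISIBLE_DEVICES is empty, so no GPU is visible")
    if seen["LOCAL_RANK"] not in (None, "-1"):
        warnings.append("LOCAL_RANK is set, so a distributed run may be assumed")
    return seen, warnings


# Example with an explicit mapping instead of the real environment.
seen, warnings = summarize_launch_env({"CUDA_VISIBLE_DEVICES": "", "LOCAL_RANK": "0"})
print(seen)
print(warnings)
```

Calling `summarize_launch_env()` with no argument at the top of `train()` would print what the failing process sees, which can then be compared against the shell where the standalone check succeeded.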

Environment problem

Could you please tell me your Python version? I tried to reproduce your code as described in the paper using Python 3.6.5. However, when I run 'pip install -r requirements.txt', I encounter numerous version errors, such as: 'ERROR: Could not find a version that satisfies the requirement accelerate==0.18.0'.
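
A "Could not find a version" error from pip can appear when no release matching the pin supports the running interpreter, and Python 3.6.5 is likely too old for these pins. The tracebacks in the other issues on this page show Python 3.8 and 3.9 environments, so a guard like the one below could fail early with a clearer signal. The `(3, 8)` minimum is an assumption taken from those paths, not a requirement stated by the author, and `python_is_supported` is a hypothetical helper.

```python
import sys

# Assumed minimum, based on the python3.8 and python3.9 paths in the tracebacks above.
MIN_PYTHON = (3, 8)


def python_is_supported(version_info=None, minimum=MIN_PYTHON):
    """Return True if the interpreter's major.minor version is at least `minimum`."""
    version_info = sys.version_info if version_info is None else version_info
    return tuple(version_info[:2]) >= minimum


print(python_is_supported((3, 6, 5)))
print(python_is_supported((3, 9, 0)))
```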

dataset process in construction.ipynb

Hi Liu, nice work.
May I ask how you obtained these files:
./auxiliary/RXCUI2atc4.csv
./auxiliary/drug-atc.csv
./auxiliary/ndc2RXCUI.txt
./auxiliary/drugbank_drugs_info.csv
./auxiliary/drug-DDI.csv
