
Comments (8)

sania96 commented on September 12, 2024

I have been testing the repo on my laptop and on Google Colab. Here is the system information for both environments.

My local system:

Memory: 16GB
CPU: AMD Ryzen 9 5900HX with Radeon Graphics
GPU: NVIDIA GeForce RTX 3060 Mobile / Max-Q 

Google Colab:

CPU: Intel Xeon (2) @ 2.199GHz 
GPU: NVIDIA Tesla T4 

Command to reproduce

!python MinimumExample/Example_ONNX_LlamaV2.py \
--onnx_file 7B_float16/ONNX/LlamaV2_7B_float16.onnx \
--embedding_file 7B_float16/embeddings.pth \
--tokenizer_path tokenizer.model \
--prompt "What is the lightest element?"

Output on my local system

python3 MinimumExample/Example_ONNX_LlamaV2.py --onnx_file 7B_float16/ONNX/LlamaV2_7B_float16.onnx --embedding_file 7B_float16/embeddings.pth --tokenizer_path tokenizer.model --prompt "hello"
/home/anindyadeep/anaconda3/envs/llm/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py:65: UserWarning: Specified provider 'DmlExecutionProvider' is not in available provider names.Available providers: 'TensorrtExecutionProvider, CUDAExecutionProvider, CPUExecutionProvider'
  warnings.warn(
2023-08-27 12:25:33.996863660 [E:onnxruntime:, inference_session.cc:1644 operator()] Exception during initialization: /onnxruntime_src/onnxruntime/core/framework/bfc_arena.cc:368 void* onnxruntime::BFCArena::AllocateRawInternal(size_t, bool, onnxruntime::Stream*, bool, onnxruntime::WaitNotificationFn) Failed to allocate memory for requested buffer of size 33554432

Traceback (most recent call last):
  File "/home/anindyadeep/workspace/llama2-onnx/Llama-2-Onnx/MinimumExample/Example_ONNX_LlamaV2.py", line 166, in <module>
    response = run_onnx_llamav2(
  File "/home/anindyadeep/workspace/llama2-onnx/Llama-2-Onnx/MinimumExample/Example_ONNX_LlamaV2.py", line 47, in run_onnx_llamav2
    llm_session = onnxruntime.InferenceSession(
  File "/home/anindyadeep/anaconda3/envs/llm/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 383, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "/home/anindyadeep/anaconda3/envs/llm/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 435, in _create_inference_session
    sess.initialize_session(providers, provider_options, disabled_optimizers)
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Exception during initialization: /onnxruntime_src/onnxruntime/core/framework/bfc_arena.cc:368 void* onnxruntime::BFCArena::AllocateRawInternal(size_t, bool, onnxruntime::Stream*, bool, onnxruntime::WaitNotificationFn) Failed to allocate memory for requested buffer of size 33554432

Output on Google Colab

/usr/local/lib/python3.10/dist-packages/onnxruntime/capi/onnxruntime_inference_collection.py:65: UserWarning: Specified provider 'DmlExecutionProvider' is not in available provider names.Available providers: 'CPUExecutionProvider'
  warnings.warn(
^C

This probably means the process is being killed automatically.
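One plausible explanation (my assumption, the logs do not confirm it) is that the process simply runs out of RAM: the float16 7B weights alone are roughly 7B parameters × 2 bytes ≈ 13-14 GB, which is more than a free Colab instance (about 12-13 GB of RAM) provides. A quick sanity check, using psutil (preinstalled on Colab) and the paths from the command above:

import os
import psutil  # preinstalled on Colab

# Rough check: weights for a 7B-parameter float16 model are ~13-14 GB on their own.
model_bytes = os.path.getsize("7B_float16/ONNX/LlamaV2_7B_float16.onnx")
print(f"model file: {model_bytes / 1e9:.1f} GB")  # may undercount if weights live in external data files
print(f"free RAM:   {psutil.virtual_memory().available / 1e9:.1f} GB")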

So now I have two questions here:

  1. What might be the root cause here? Even though CUDA and everything else is installed, the script still asks for DmlExecutionProvider and then errors out (see the provider-selection sketch below).
  2. The execution time is also long. Even though the run only ends in an error or a killed process, reaching that state takes around 52-60 seconds on Google Colab (after which ^C kills the process) and 10-15 seconds on my local machine (after which it errors out).
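For question 1, here is a quick way to see what this onnxruntime build can actually use, and to request CUDA with a CPU fallback instead of DmlExecutionProvider (the provider list below is my guess at a fix, not what the example script ships with):

import onnxruntime

# See which execution providers this onnxruntime build actually offers.
available = onnxruntime.get_available_providers()
print(available)

# Prefer CUDA and fall back to CPU, skipping anything that is not installed.
providers = [p for p in ("CUDAExecutionProvider", "CPUExecutionProvider") if p in available]

llm_session = onnxruntime.InferenceSession(
    "7B_float16/ONNX/LlamaV2_7B_float16.onnx",  # the onnx_file from the command above
    providers=providers,
)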

Update:

I made some changes in the example code, just to force the CPUExecutionProvider:

options = onnxruntime.SessionOptions()
llm_session = onnxruntime.InferenceSession(
    onnx_file,
    sess_options=options,
    providers=[
        "CPUExecutionProvider",
    ],
)

I then ran the same command; it took more than 2.5 minutes and the process was finally killed. It seems like I might not have a compatible CUDA / onnxruntime pairing, which could be what generates the error.

cuda version: 12.2
onnx version: 1.15.1
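A minimal sketch for checking whether the installed pieces actually line up (these are all standard torch / onnxruntime attributes, printed from whatever the current environment has):

import onnxruntime
import torch

# Versions of the two runtimes in this environment.
print("torch:", torch.__version__)
print("onnxruntime:", onnxruntime.__version__)

# The CUDA version torch was built against, and whether it can see a GPU.
print("torch built for CUDA:", torch.version.cuda)
print("CUDA visible to torch:", torch.cuda.is_available())

# "GPU" here means the GPU-enabled onnxruntime package is installed.
print("onnxruntime device:", onnxruntime.get_device())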

Hi, did you resolve the issue? I am having the same issue here.


Anindyadeep commented on September 12, 2024

Nope, I didn't get any response, so I left the thread. But it is worth checking out again.


raffaeleterribile commented on September 12, 2024

I was able to run the minimum example with Python 3.10.13 and NO CUDA: I'm using the CPU for inference because my GPU has limited memory. So instead of installing ONNXRuntime with "pip install torch onnxruntime-gpu", I installed it with "pip install torch onnxruntime".
I get the same warning about DirectML and the loading takes a long time, but I finally saw the response. I don't know exactly how long it took, because it was late and I went to sleep, so I saw the results in the morning.


Anindyadeep commented on September 12, 2024

> I was able to run the minimum example with Python 3.10.13 and NO CUDA: I'm using the CPU for inference because my GPU has limited memory. So instead of installing ONNXRuntime with "pip install torch onnxruntime-gpu", I installed it with "pip install torch onnxruntime". I get the same warning about DirectML and the loading takes a long time, but I finally saw the response. I don't know exactly how long it took, because it was late and I went to sleep, so I saw the results in the morning.

That's awesome, but that time gap is too large to assess how well the runtime works. I will check out the same setup too.


raffaeleterribile commented on September 12, 2024

Yes, it's slow. And I had to delete and recreate the Python virtual environment several times. Initially I installed "onnxruntime-gpu", then uninstalled it and installed "onnxruntime" (the CPU version), but I got other errors, so I deleted and recreated the virtual environment.
To use CUDA (if you have a GPU with enough memory), you have to use version 11.8: version 12 is not compatible with onnxruntime.


Anindyadeep commented on September 12, 2024

> Yes, it's slow. And I had to delete and recreate the Python virtual environment several times. Initially I installed "onnxruntime-gpu", then uninstalled it and installed "onnxruntime" (the CPU version), but I got other errors, so I deleted and recreated the virtual environment. To use CUDA (if you have a GPU with enough memory), you have to use version 11.8: version 12 is not compatible with onnxruntime.

Wow, that's a lot of ifs and buts, but yeah, got it. Thanks for the workaround.


merveermann commented on September 12, 2024

Hello all, I actually came across the same problem but with the 7B_FT_float32 model. I have two GPUs that have 24 GB of GPU memory, but as far as I understand, to run the 7B_FT_float32 model, a minimum of 25 GB of GPU memory is needed. So, is there a way to run this on my device? Is it possible to run ONNXRuntime on multiple GPUs?


avsanjay commented on September 12, 2024

> Hello all, I actually came across the same problem but with the 7B_FT_float32 model. I have two GPUs that have 24 GB of GPU memory, but as far as I understand, to run the 7B_FT_float32 model, a minimum of 25 GB of GPU memory is needed. So, is there a way to run this on my device? Is it possible to run ONNXRuntime on multiple GPUs?

Yes, running on multiple GPUs would be very useful.
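As far as I know, a plain onnxruntime InferenceSession runs the whole model on a single device; the CUDA provider's device_id option only selects which GPU that is. So a sketch like the one below can target a specific card, but it will not shard the 25 GB model across two of them (the model path is an assumption, following the 7B_float16 layout above):

import onnxruntime

# Pin this session to GPU 1 instead of the default GPU 0. This chooses a
# device; it does not split one model across multiple GPUs.
llm_session = onnxruntime.InferenceSession(
    "7B_FT_float32/ONNX/LlamaV2_7B_FT_float32.onnx",  # assumed path
    providers=[("CUDAExecutionProvider", {"device_id": 1})],
)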

