
Comments (8)

sania96 commented on September 12, 2024

I have been testing the repo on my laptop and on Google Colab. Here is the system information for both environments.

My local system:

Memory: 16GB
CPU: AMD Ryzen 9 5900HX with Radeon Graphics
GPU: NVIDIA GeForce RTX 3060 Mobile / Max-Q 

Google Colab:

CPU: Intel Xeon (2) @ 2.199GHz 
GPU: NVIDIA Tesla T4 

Command to reproduce

!python MinimumExample/Example_ONNX_LlamaV2.py \
--onnx_file 7B_float16/ONNX/LlamaV2_7B_float16.onnx \
--embedding_file 7B_float16/embeddings.pth \
--tokenizer_path tokenizer.model \
--prompt "What is the lightest element?"

Output on my local system

python3 MinimumExample/Example_ONNX_LlamaV2.py --onnx_file 7B_float16/ONNX/LlamaV2_7B_float16.onnx --embedding_file 7B_float16/embeddings.pth --tokenizer_path tokenizer.model --prompt "hello"
/home/anindyadeep/anaconda3/envs/llm/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py:65: UserWarning: Specified provider 'DmlExecutionProvider' is not in available provider names.Available providers: 'TensorrtExecutionProvider, CUDAExecutionProvider, CPUExecutionProvider'
  warnings.warn(
2023-08-27 12:25:33.996863660 [E:onnxruntime:, inference_session.cc:1644 operator()] Exception during initialization: /onnxruntime_src/onnxruntime/core/framework/bfc_arena.cc:368 void* onnxruntime::BFCArena::AllocateRawInternal(size_t, bool, onnxruntime::Stream*, bool, onnxruntime::WaitNotificationFn) Failed to allocate memory for requested buffer of size 33554432

Traceback (most recent call last):
  File "/home/anindyadeep/workspace/llama2-onnx/Llama-2-Onnx/MinimumExample/Example_ONNX_LlamaV2.py", line 166, in <module>
    response = run_onnx_llamav2(
  File "/home/anindyadeep/workspace/llama2-onnx/Llama-2-Onnx/MinimumExample/Example_ONNX_LlamaV2.py", line 47, in run_onnx_llamav2
    llm_session = onnxruntime.InferenceSession(
  File "/home/anindyadeep/anaconda3/envs/llm/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 383, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "/home/anindyadeep/anaconda3/envs/llm/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 435, in _create_inference_session
    sess.initialize_session(providers, provider_options, disabled_optimizers)
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Exception during initialization: /onnxruntime_src/onnxruntime/core/framework/bfc_arena.cc:368 void* onnxruntime::BFCArena::AllocateRawInternal(size_t, bool, onnxruntime::Stream*, bool, onnxruntime::WaitNotificationFn) Failed to allocate memory for requested buffer of size 33554432

Output on Google Colab

/usr/local/lib/python3.10/dist-packages/onnxruntime/capi/onnxruntime_inference_collection.py:65: UserWarning: Specified provider 'DmlExecutionProvider' is not in available provider names.Available providers: 'CPUExecutionProvider'
  warnings.warn(
^C

This probably means the process is being killed automatically.
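One plausible explanation (my assumption, the logs do not confirm it) is that the process simply runs out of RAM: the float16 7B weights alone are roughly 7B parameters × 2 bytes ≈ 13-14 GB, which is more than a free Colab instance (about 12-13 GB of RAM) provides. A quick sanity check, using psutil (preinstalled on Colab) and the paths from the command above:

import os
import psutil  # preinstalled on Colab

# Rough check: weights for a 7B-parameter float16 model are ~13-14 GB on their own.
model_bytes = os.path.getsize("7B_float16/ONNX/LlamaV2_7B_float16.onnx")
print(f"model file: {model_bytes / 1e9:.1f} GB")  # may undercount if weights live in external data files
print(f"free RAM:   {psutil.virtual_memory().available / 1e9:.1f} GB")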

So now I have two questions here:

  1. What might be the root cause here? Even though CUDA and everything else is installed, the script still asks for DmlExecutionProvider and then errors out (see the provider-selection sketch below).
  2. The execution time is also long. Even though the run only ends in an error or a killed process, reaching that state takes around 52-60 seconds on Google Colab (after which ^C kills the process) and 10-15 seconds on my local machine (after which it errors out).
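For question 1, here is a quick way to see what this onnxruntime build can actually use, and to request CUDA with a CPU fallback instead of DmlExecutionProvider (the provider list below is my guess at a fix, not what the example script ships with):

import onnxruntime

# See which execution providers this onnxruntime build actually offers.
available = onnxruntime.get_available_providers()
print(available)

# Prefer CUDA and fall back to CPU, skipping anything that is not installed.
providers = [p for p in ("CUDAExecutionProvider", "CPUExecutionProvider") if p in available]

llm_session = onnxruntime.InferenceSession(
    "7B_float16/ONNX/LlamaV2_7B_float16.onnx",  # the onnx_file from the command above
    providers=providers,
)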

Update:

I made some changes in the example code, just to force the CPUExecutionProvider:

options = onnxruntime.SessionOptions()
llm_session = onnxruntime.InferenceSession(
    onnx_file,
    sess_options=options,
    providers=[
        "CPUExecutionProvider",
    ],
)

I then ran the same command; it took more than 2.5 minutes and the process was finally killed. It seems like I might not have a compatible CUDA / onnxruntime pairing, which could be what generates the error.

cuda version: 12.2
onnx version: 1.15.1
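A minimal sketch for checking whether the installed pieces actually line up (these are all standard torch / onnxruntime attributes, printed from whatever the current environment has):

import onnxruntime
import torch

# Versions of the two runtimes in this environment.
print("torch:", torch.__version__)
print("onnxruntime:", onnxruntime.__version__)

# The CUDA version torch was built against, and whether it can see a GPU.
print("torch built for CUDA:", torch.version.cuda)
print("CUDA visible to torch:", torch.cuda.is_available())

# "GPU" here means the GPU-enabled onnxruntime package is installed.
print("onnxruntime device:", onnxruntime.get_device())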

Hi, did you resolve the issue? I am having the same issue here.


Anindyadeep commented on September 12, 2024

Nope, I didn't get any response, so I left the thread. But it is worth checking out again.


raffaeleterribile commented on September 12, 2024

I was able to run the minimum example with Python 3.10.13 and NO CUDA: I'm using the CPU for inference because my GPU has limited memory. So instead of installing ONNXRuntime with "pip install torch onnxruntime-gpu", I installed it with "pip install torch onnxruntime".
I get the same warning about DirectML and the loading takes a long time, but I finally saw the response. I don't know exactly how long it took, because it was late and I went to sleep, so I saw the results in the morning.


Anindyadeep commented on September 12, 2024

> I was able to run the minimum example with Python 3.10.13 and NO CUDA: I'm using the CPU for inference because my GPU has limited memory. So instead of installing ONNXRuntime with "pip install torch onnxruntime-gpu", I installed it with "pip install torch onnxruntime". I get the same warning about DirectML and the loading takes a long time, but I finally saw the response. I don't know exactly how long it took, because it was late and I went to sleep, so I saw the results in the morning.

That's awesome, but that time gap is too large to assess how well the runtime works. I will check out the same setup too.


raffaeleterribile commented on September 12, 2024

Yes, it's slow. And I had to delete and recreate the Python virtual environment several times. Initially I installed "onnxruntime-gpu", then uninstalled it and installed "onnxruntime" (the CPU version), but I got other errors, so I deleted and recreated the virtual environment.
To use CUDA (if you have a GPU with enough memory), you have to use version 11.8: version 12 is not compatible with onnxruntime.


Anindyadeep commented on September 12, 2024

> Yes, it's slow. And I had to delete and recreate the Python virtual environment several times. Initially I installed "onnxruntime-gpu", then uninstalled it and installed "onnxruntime" (the CPU version), but I got other errors, so I deleted and recreated the virtual environment. To use CUDA (if you have a GPU with enough memory), you have to use version 11.8: version 12 is not compatible with onnxruntime.

Wow, that's a lot of ifs and buts, but yeah, got it. Thanks for the workaround.


merveermann commented on September 12, 2024

Hello all, I actually came across the same problem but with the 7B_FT_float32 model. I have two GPUs that have 24 GB of GPU memory, but as far as I understand, to run the 7B_FT_float32 model, a minimum of 25 GB of GPU memory is needed. So, is there a way to run this on my device? Is it possible to run ONNXRuntime on multiple GPUs?


avsanjay commented on September 12, 2024

> Hello all, I actually came across the same problem but with the 7B_FT_float32 model. I have two GPUs that have 24 GB of GPU memory, but as far as I understand, to run the 7B_FT_float32 model, a minimum of 25 GB of GPU memory is needed. So, is there a way to run this on my device? Is it possible to run ONNXRuntime on multiple GPUs?

Yes, running on multiple GPUs would be very useful.
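As far as I know, a plain onnxruntime InferenceSession runs the whole model on a single device; the CUDA provider's device_id option only selects which GPU that is. So a sketch like the one below can target a specific card, but it will not shard the 25 GB model across two of them (the model path is an assumption, following the 7B_float16 layout above):

import onnxruntime

# Pin this session to GPU 1 instead of the default GPU 0. This chooses a
# device; it does not split one model across multiple GPUs.
llm_session = onnxruntime.InferenceSession(
    "7B_FT_float32/ONNX/LlamaV2_7B_FT_float32.onnx",  # assumed path
    providers=[("CUDAExecutionProvider", {"device_id": 1})],
)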

