I am having a go at running inference and evaluation for this model, and running into

Hi, I think it's because this RMSNorm is being set to None <div class="Box Box

That line you pointed out requires this to be installed: <a href="https://github.com/D

Hi! I got a similar problem while running the sample code: <div class="highlight h

Type Error in GPTLMHeadModel about based HOT 7 OPEN

axelmagn commented on May 26, 2024

Type Error in GPTLMHeadModel

from based.

Comments (7)

axelmagn commented on May 26, 2024 2

After some tweaking, I think I've got it working. I ended up using the HazyReasearch/flash-attention fork. For others trying via docker, this is the dockerfile I used:

FROM nvidia/cuda:11.8.0-devel-ubuntu22.04

ARG TORCH_CUDA_ARCH_LIST="8.0+PTX"

RUN apt-get update && apt-get install -y \
    build-essential \
    apt-utils \
    python3.10 \
    python3-pip \
    git \
    && rm -rf /var/lib/apt/lists/*

RUN pip install --upgrade pip
RUN pip install \
    torch==2.1.2 \
    torchvision==0.16.2 \
    torchaudio==2.1.2 \
    --index-url https://download.pytorch.org/whl/cu118 # due to observed causal-conv1d dependency

RUN pip install \
    jupyter==1.0.0 \
    hydra-core==1.3.2 \
    packaging==23.2 \
    ninja==1.11.1.1 


# install apex
RUN pip install -v \
    --disable-pip-version-check \
    --no-cache-dir \
    --no-build-isolation \
    --config-settings "--build-option=--cpp_ext" \
    --config-settings "--build-option=--cuda_ext" \
    'git+https://github.com/NVIDIA/apex@b496d85'

RUN pip install 'git+https://github.com/HazyResearch/[email protected]' --no-build-isolation
RUN pip install 'git+https://github.com/HazyResearch/[email protected]#subdirectory=csrc/fused_dense_lib'  --no-build-isolation
RUN pip install 'git+https://github.com/HazyResearch/[email protected]#subdirectory=csrc/layer_norm' --no-build-isolation

# install based
RUN mkdir -p /app
WORKDIR /app
COPY . .
RUN pip install .

CMD python3 test_script.py

It requires NVIDIA docker tookit to run, with the command:

docker run --rm --runtime=nvidia --gpus all based

from based.

simran-arora commented on May 26, 2024

Hi,
I think it's because this RMSNorm is being set to None

based/based/models/gpt.py

Line 371 in e8de564

norm_cls = partial(

Due to the import structure here:

based/based/models/gpt.py

Line 52 in e8de564

try:

The options are to

install the norm from flash attention here: https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
set it to import RMSNorm from here: https://github.com/HazyResearch/based/blob/master/based/ops/triton/layer_norm.py, by uncommenting this line:

based/based/models/gpt.py

Line 57 in e8de564

# from based.ops.triton.layer_norm import RMSNorm

Sorry for the difficulty -- we will fix the install / instructions for this

from based.

axelmagn commented on May 26, 2024

No worries, and thanks for the speedy reply. Your guidance helped me get past the above error by installing the norm from flash-attn, but there seem to be more undocumented dependency issues:

root@d75213223120:/app# python3 test_script.py 
tokenizer_config.json: 100%|██████████████████████████████████████████████| 26.0/26.0 [00:00<00:00, 149kB/s]
config.json: 100%|█████████████████████████████████████████████████████████| 665/665 [00:00<00:00, 8.40MB/s]
vocab.json: 100%|██████████████████████████████████████████████████████| 1.04M/1.04M [00:00<00:00, 9.97MB/s]
merges.txt: 100%|████████████████████████████████████████████████████████| 456k/456k [00:00<00:00, 28.5MB/s]
tokenizer.json: 100%|██████████████████████████████████████████████████| 1.36M/1.36M [00:00<00:00, 9.81MB/s]
config.json: 100%|█████████████████████████████████████████████████████| 2.86k/2.86k [00:00<00:00, 35.0MB/s]
No module named 'causal_attention_cuda'
Traceback (most recent call last):
  File "/app/test_script.py", line 6, in <module>
    model = GPTLMHeadModel.from_pretrained_hf("hazyresearch/based-360m").to("cuda", dtype=torch.float16)
  File "/app/based/models/gpt.py", line 468, in from_pretrained_hf
    model = cls(config, device=device, **kwargs)
  File "/app/based/models/gpt.py", line 741, in __init__
    self.transformer = GPTModel(config, process_group=process_group, **factory_kwargs)
  File "/app/based/models/gpt.py", line 585, in __init__
    [
  File "/app/based/models/gpt.py", line 586, in <listcomp>
    create_block(config, layer_idx=i, process_group=process_group, **factory_kwargs)
  File "/app/based/models/gpt.py", line 382, in create_block
    block = Block(
  File "/app/based/models/block.py", line 86, in __init__
    self.mixer = mixer_cls(dim)
  File "/app/based/models/mixers/slide_attention.py", line 357, in __init__
    if fused_bias_fc and FusedDense is None: raise ImportError("fused_dense is not installed")
ImportError: fused_dense is not installed

I'm a little baffled, since it seems like FusedDense is being imported from flash_attn here:

based/based/models/mixers/slide_attention.py

Line 29 in e8de564

 from flash_attn.ops.fused_dense import ColumnParallelLinear, FusedDense, RowParallelLinear 

Are there additional subpackages within flash-attn that need to be installed?

For reference, here is my updated dockerfile:

FROM nvidia/cuda:11.8.0-devel-ubuntu22.04

RUN apt-get update && apt-get install -y \
    apt-utils \
    python3.10 \
    python3-pip \
    git \
    && rm -rf /var/lib/apt/lists/*

RUN pip install --upgrade pip
RUN pip install \
    torch==2.1.2 \
    torchvision==0.16.2 \
    torchaudio==2.1.2 \
    --index-url https://download.pytorch.org/whl/cu118 # due to observed causal-conv1d dependency

RUN pip install \
    jupyter==1.0.0 \
    hydra-core==1.3.2 \
    packaging==23.2 \
    ninja==1.11.1.1 

# RUN pip install 'git+https://github.com/Dao-AILab/flash-attention.git@6c9e60d' 
RUN pip install 'git+https://github.com/Dao-AILab/flash-attention.git@6c9e60d#subdirectory=csrc/layer_norm'

# install apex
RUN pip install -v \
    --disable-pip-version-check \
    --no-cache-dir \
    --no-build-isolation \
    --config-settings "--build-option=--cpp_ext" \
    --config-settings "--build-option=--cuda_ext" \
    'git+https://github.com/NVIDIA/apex@b496d85'

# install based
RUN mkdir -p /app
WORKDIR /app
COPY . .
RUN pip install .

CMD python3 test_script.py

from based.

simran-arora commented on May 26, 2024

That line you pointed out requires this to be installed: https://github.com/Dao-AILab/flash-attention/tree/main/csrc/fused_dense_lib

Would recommend cloning flash-attention and python setup.py install within this directory
An alternative workaround, without the install, is to, in the config, set fused_bias_fc = False

from based.

melisa-writer commented on May 26, 2024

Hi! I got a similar problem while running the sample code:

import torch
from transformers import AutoTokenizer
from based.models.gpt import GPTLMHeadModel

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = GPTLMHeadModel.from_pretrained_hf("hazyresearch/based-360m").to("cuda", dtype=torch.float16)

input = tokenizer.encode("If I take one more step, it will be", return_tensors="pt").to("cuda")
output = model.generate(input, max_length=20)
print(tokenizer.decode(output[0]))

Error:

Traceback (most recent call last):
  File "/home/melisarussak/based/inference_test.py", line 6, in <module>
    model = GPTLMHeadModel.from_pretrained_hf("hazyresearch/based-360m").to("cuda", dtype=torch.float16)
  File "/home/melisarussak/based/based/models/gpt.py", line 470, in from_pretrained_hf
    model = cls(config, device=device, **kwargs)
  File "/home/melisarussak/based/based/models/gpt.py", line 743, in __init__
    self.transformer = GPTModel(config, process_group=process_group, **factory_kwargs)
  File "/home/melisarussak/based/based/models/gpt.py", line 587, in __init__
    [
  File "/home/melisarussak/based/based/models/gpt.py", line 588, in <listcomp>
    create_block(config, layer_idx=i, process_group=process_group, **factory_kwargs)
  File "/home/melisarussak/based/based/models/gpt.py", line 373, in create_block
    norm_cls = partial(
TypeError: the first argument must be callable

so I used the Dockerfile given by @axelmagn and now I get:

No module named 'causal_attention_cuda'
Successfully imported the causal dot product kernel!
Could not import the FLA triton kernels...
Traceback (most recent call last):
  File "/app/inference_test.py", line 9, in <module>
    output = model.generate(input, max_length=20)
  File "/app/based/generation.py", line 573, in generate
    output = decode(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/app/based/generation.py", line 194, in decode
    scores.append(get_logits(sequences[-1], inference_params))
  File "/app/based/generation.py", line 155, in get_logits
    logits = model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/app/based/models/gpt.py", line 806, in forward
    hidden_states = self.transformer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/app/based/models/gpt.py", line 674, in forward
    hidden_states, residual = layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/app/based/models/block.py", line 189, in forward
    hidden_states = self.mixer(hidden_states, position_ids=position_ids, decay=decay, **mixer_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/app/based/models/mixers/linear_attention.py", line 127, in forward
    return self.recurrent_forward(hidden_states, kv_state, k_state, q, k, v)
  File "/app/based/models/mixers/linear_attention.py", line 195, in recurrent_forward
    kv_state += k[:, :, -1:] * v[:, :, -1:]
RuntimeError: The size of tensor a (16) must match the size of tensor b (273) at non-singleton dimension 4

Is this due to code changes 2 days ago or I am missing some steps?

from based.

simran-arora commented on May 26, 2024

yes that was due to the changes, please try again and let me know if you run into issues

from based.

melisa-writer commented on May 26, 2024

it works now! 🎉 thank you!

from based.

Type Error in GPTLMHeadModel about based HOT 7 OPEN

Comments (7)

Related Issues (10)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent