h3's Introduction

Hungry Hungry Hippos (H3)

This repository provides the official implementation of H3 from the following paper.

Hungry Hungry Hippos: Towards Language Modeling with State Space Models
Tri Dao*, Daniel Y. Fu*, Khaled K. Saab, Armin W. Thomas, Atri Rudra, Christopher Ré
International Conference on Learning Representations, 2023. Notable top-25% (spotlight).
Paper: https://arxiv.org/abs/2212.14052

Code & model release

You can find model weights on the Hugging Face Hub under "Files and Versions" for each model, e.g. https://huggingface.co/danfu09/H3-125M.

Loading weights and running inference

Examples of how to load the weights and run inference are given in benchmarks/benchmark_generation.py and examples/generate_text_h3.py.

Here's an example of how to download and run our 125M model (you may need to install FlashAttention):

git lfs install
git clone https://huggingface.co/danfu09/H3-125M

git clone https://github.com/HazyResearch/H3.git

PYTHONPATH=$(pwd)/H3 python H3/examples/generate_text_h3.py --ckpt H3-125M/model.pt --prompt "Hungry Hungry Hippos: Towards Language Modeling With State Space Models is a new language model that" --dmodel 768 --nlayer 12 --attn-layer-idx 6 --nheads=12

You should get an output like this (the exact text may vary due to sampling during generation):

Hungry Hungry Hippos: Towards Language Modeling With State Space Models is a new language model that uses state-space models to create a human-like vocabulary that can help improve human understanding and judgment of language. It takes a human's past experience of language, and tries to capture their cognitive patterns. State Space Models helps the researchers make sense of language in its own terms, which helps users learn about their language of choice. State Space Models is used to develop a set of languages for researchers in an effort to help them develop more intelligent language models. The goal is to increase and develop a human-like language model using state space models. It is hoped that it will aid people to do more work to develop a language that is more

Here's a summary of the model sizes:

Model dmodel nlayer nheads
125M 768 12 12
355M 1024 24 16
1.3B 2048 24 16
2.7B 2560 32 20

See examples/README.md for examples of how to load and run all of these models!

Acknowledgments

Some of the files related to S4D and HiPPO initialization are adapted from https://github.com/HazyResearch/state-spaces.

Citation

If you use this codebase, or otherwise find our work valuable, please cite:

@inproceedings{dao2023hungry,
  title={Hungry {H}ungry {H}ippos: Towards Language Modeling with State Space Models},
  author={Dao, Tri and Fu, Daniel Y. and Saab, Khaled K. and Thomas, Armin W.
  and Rudra, Atri and R{\'e}, Christopher},
  booktitle={International Conference on Learning Representations},
  year={2023}
}

h3's People

Contributors

danfu09, eltociear, kashif, tridao

h3's Issues

2.7B Evaluations

Hi, great work!

Very excited to try out the models.

Curious if you have a more detailed evaluation for the 2.7B model, as I can't find one in the H3 paper.

The motivation for not fusing fft(k) into the kernel

Thanks for your great work. I want to ask why fft(k) is not fused into the kernel; is it a performance issue?
I mean, why is it implemented as follows:

def fftconv_fast(u, k, D, dropout_mask):
    """Fuse padding + rfft + pointwise mult + ifft + multiply with D + gelu + dropout
    """
    seqlen = u.shape[-1]
    fft_size = 2 * seqlen
    k_f = torch.fft.rfft(k, n=fft_size)
    out = fftconv_fwd(u, k_f, D, dropout_mask, fft_size)
    return out

instead of:

def fftconv_fast(u, k, D, dropout_mask):
    """Fuse padding + rfft + pointwise mult + ifft + multiply with D + gelu + dropout
    """
    seqlen = u.shape[-1]
    fft_size = 2 * seqlen
    out = fftconv_fwd(u, k, D, dropout_mask, fft_size)
    return out

Error running benchmarks/benchmark_generation.py

Hi there. It's great to see another LM trained on the Pile.

When I run benchmarks/benchmark_generation.py:

[KeOps] Compiling cuda jit compiler engine ... OK
[pyKeOps] Compiling nvrtc binder for python ... OK
Number of parameters: 1326096384
[KeOps] Generating code for formula Sum_Reduction(ComplexMult(Var(0,2,1),ComplexExp(ComplexMult(Var(1,2,1),Var(2,2,0)))),0) ... OK
Segmentation fault

and it exits after "Segmentation fault".

So I uninstalled pykeops, and then the new error is:

Traceback (most recent call last):
  File "/fsx/BlinkDL/CODE/_PUBLIC_/H3/benchmarks/benchmark_generation_h3.py", line 68, in <module>
    fn()
  File "/fsx/BlinkDL/CODE/_PUBLIC_/H3/benchmarks/benchmark_generation_h3.py", line 65, in <lambda>
    fn = lambda: model.generate(input_ids=input_ids, max_length=max_length,
  File "/fsx/BlinkDL/conda/lib/python3.9/site-packages/flash_attn-0.2.8-py3.9-linux-x86_64.egg/flash_attn/utils/generation.py", line 150, in generate
    output = decode(input_ids, self, max_length, top_k=top_k, top_p=top_p,
  File "/fsx/BlinkDL/conda/lib/python3.9/site-packages/flash_attn-0.2.8-py3.9-linux-x86_64.egg/flash_attn/utils/generation.py", line 107, in decode
    logits = model(input_ids, inference_params=inference_params).logits[:, -1]
  File "/fsx/BlinkDL/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/fsx/BlinkDL/CODE/_PUBLIC_/H3/src/models/ssm_seq.py", line 186, in forward
    hidden_states = self.backbone(input_ids, position_ids=position_ids,
  File "/fsx/BlinkDL/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/fsx/BlinkDL/CODE/_PUBLIC_/H3/src/models/ssm_seq.py", line 141, in forward
    hidden_states, residual = layer(hidden_states, residual, mixer_kwargs=mixer_kwargs)
  File "/fsx/BlinkDL/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/fsx/BlinkDL/conda/lib/python3.9/site-packages/flash_attn-0.2.8-py3.9-linux-x86_64.egg/flash_attn/modules/block.py", line 126, in forward
    hidden_states = self.mixer(hidden_states, **mixer_kwargs)
  File "/fsx/BlinkDL/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/fsx/BlinkDL/conda/lib/python3.9/site-packages/flash_attn-0.2.8-py3.9-linux-x86_64.egg/flash_attn/modules/mha.py", line 481, in forward
    kv = self._update_kv_cache(qkv[:, :, 1:], inference_params)
  File "/fsx/BlinkDL/conda/lib/python3.9/site-packages/flash_attn-0.2.8-py3.9-linux-x86_64.egg/flash_attn/modules/mha.py", line 419, in _update_kv_cache
    assert self.layer_idx is not None, 'Generation requires layer_idx in the constructor'
AssertionError: Generation requires layer_idx in the constructor
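
For context on what the assertion checks: during generation the decoding loop keeps one key/value cache entry per attention layer, indexed by layer_idx, so each attention layer has to receive its index at construction time. Below is a minimal toy sketch of that pattern; the class, shapes, and cache layout are made up for illustration and are not flash_attn's actual API.

import torch

class ToyCachedAttention(torch.nn.Module):
    def __init__(self, layer_idx=None):
        super().__init__()
        self.layer_idx = layer_idx

    def forward(self, kv_new, kv_cache):
        assert self.layer_idx is not None, 'Generation requires layer_idx in the constructor'
        cache = kv_cache.setdefault(self.layer_idx, [])
        cache.append(kv_new)               # store this decode step's key/value
        return torch.cat(cache, dim=1)     # all key/values seen so far for this layer

kv_cache = {}                              # shared across layers, keyed by layer_idx
layer = ToyCachedAttention(layer_idx=0)
for _ in range(3):                         # three single-token decode steps
    kv = layer(torch.randn(1, 1, 8), kv_cache)
print(kv.shape)                            # torch.Size([1, 3, 8])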

Release of pretraining and fine-tuning code

Hi, thanks for a very nice piece of work. Do you also plan to release pretraining and fine-tuning code for this model? Also, is there a way to apply the model to long sequences, such as those from the LRA benchmark? Thanks!

Having trouble compiling fftconv

Thanks for your great work. I have encountered difficulties compiling fftconv; could you provide the torch and CUDA versions, as well as the other environment variables? One more question: do you have plans to release the following version of the fftconv code:

u_f = torch.fft.fft(u)
k_f = torch.fft.fft(k)
y_f = u_f * k_f
y = torch.fft.ifft(y_f)
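
For reference, here is a minimal pure-PyTorch sketch of the padded causal FFT convolution that such a version would compute. This is my own sketch under assumed shapes, not the repo's fftconv_ref, and it omits the D skip term, gelu, and dropout that the fused kernel handles.

import torch

def fftconv_torch(u, k):
    # u: (batch, d_model, seqlen), k: (d_model, seqlen); zero-pad to 2*seqlen
    # so the circular FFT convolution equals a causal linear convolution.
    seqlen = u.shape[-1]
    fft_size = 2 * seqlen
    u_f = torch.fft.rfft(u.float(), n=fft_size)
    k_f = torch.fft.rfft(k.float(), n=fft_size)
    y = torch.fft.irfft(u_f * k_f, n=fft_size)[..., :seqlen]
    return y.to(u.dtype)

u = torch.randn(2, 768, 1024)
k = torch.randn(768, 1024)
print(fftconv_torch(u, k).shape)   # torch.Size([2, 768, 1024])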

TypeError: forward() got an unexpected keyword argument 'last_token_only'

Traceback (most recent call last):
  File "/home/boofboy/Desktop/x/main.py", line 124, in <module>
    output_ids = model.generate(input_ids=input_ids, max_length=max_length,
  File "/home/boofboy/miniconda3/envs/mini/lib/python3.9/site-packages/flash_attn-1.0.4-py3.9-linux-x86_64.egg/flash_attn/utils/generation.py", line 167, in generate
    output = decode(input_ids, self, max_length, top_k=top_k, top_p=top_p,
  File "/home/boofboy/miniconda3/envs/mini/lib/python3.9/site-packages/flash_attn-1.0.4-py3.9-linux-x86_64.egg/flash_attn/utils/generation.py", line 115, in decode
    logits = model(input_ids, inference_params=inference_params, last_token_only=True).logits
  File "/home/boofboy/miniconda3/envs/mini/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
TypeError: forward() got an unexpected keyword argument 'last_token_only'

When I add the last_token_only option, it then gives the following error during torch.multinomial(torch.softmax(logits_top, dim=-1), num_samples=1).squeeze(dim=-1):
RuntimeError: prob_dist must be 1 or 2 dim.

I'm using PyTorch 2.0. All libraries have been installed.
Any help would be much appreciated!

Inconsistent output from fftconv_func and native PyTorch FFT

Hello,
I have noticed that the output of an h3 layer, when provided the same input tensor, is different when use_fast_fftconv=True versus when use_fast_fftconv=False. Similarly, I have tested this on the fftconv_func and fftconv_ref functions (in fftconv.py) and they give different outputs when given the same input arguments. Is this behavior expected? Is fftconv_func performing any approximation that causes this relatively large Euclidean distance between the two outputs?
Thanks in advance for your help.
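
As a general note on how such differences are usually quantified (a sketch of standard practice, not the repo's tests): fused CUDA kernels often run part of the computation in fp16/bf16, so comparisons against an fp32 reference are typically made with a relative error or a torch.allclose tolerance rather than exact equality.

import torch

def rel_err(a, b):
    return ((a - b).norm() / b.norm()).item()

x = torch.randn(4, 64, 1024)
x_lowprec = x.half().float()       # stand-in for a lower-precision fast path
print(rel_err(x_lowprec, x))       # small but nonzero, as expected for fp16
print(torch.allclose(x_lowprec, x, rtol=1e-3, atol=1e-3))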

Why divide kv_f by fft_size?

Hi,

Thanks for this great work! I have a quick question about the implementation of H3.

kv_f = torch.fft.rfft(kv.to(dtype=ssm_kernel.dtype), n=fft_size) / fft_size

In the FFT of SSM_diag, you divide kv_f by fft_size, but in the FFT of SSM_shift you do not do this for k_f.
Could you please explain the reasoning behind this?
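
For reference, the 1/fft_size factor interacts with PyTorch's FFT normalization conventions: under the default norm the 1/n scaling is applied inside irfft, so pre-dividing one operand's spectrum by fft_size and skipping the scaling on the inverse (norm='forward') gives the same result. The sketch below only illustrates that equivalence; it does not say where the repo compensates for the factor in the SSM_shift path.

import torch

L = 8
u, k = torch.randn(L), torch.randn(L)
n = 2 * L

# Default ("backward") norm: the 1/n factor is applied inside irfft.
y1 = torch.fft.irfft(torch.fft.rfft(u, n=n) * torch.fft.rfft(k, n=n), n=n)[:L]

# Pre-divide one spectrum by n instead, and skip the scaling in the inverse.
y2 = torch.fft.irfft(torch.fft.rfft(u, n=n) * (torch.fft.rfft(k, n=n) / n),
                     n=n, norm='forward')[:L]

print(torch.allclose(y1, y2, atol=1e-6))   # True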

What is fftconv_bwd doing?

Great work, in this snippet:

    out = fftconv_ref(u, k, D, dropout_mask)
    out = fftconv_fast(u, k, D, dropout_mask)
    g = torch.randn_like(out)
    fftconv_fast_bwd(g, u, k, D, dropout_mask)

What is 'g'? Does fast_bwd perform something like an MSE loss over g and the output?
Also, fftconv_func doesn't seem to be within src/ops/; is this intentional?
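
For what it's worth, g plays the role of the upstream gradient dL/d(out) that any backward pass is fed (the grad_outputs of a vector-Jacobian product), not a loss. A minimal sketch of that convention, using a stand-in op rather than the repo's fftconv:

import torch

u = torch.randn(2, 8, 16, requires_grad=True)
w = torch.randn(8, 16, requires_grad=True)
out = (u * w).cumsum(-1)            # stand-in for fftconv_ref(u, k, D, dropout_mask)
g = torch.randn_like(out)           # random cotangent playing the role of dL/d(out)
du, dw = torch.autograd.grad(out, (u, w), grad_outputs=g)
# A fused backward such as fftconv_fast_bwd plays the same role for the FFT
# convolution: given (g, u, k, D, dropout_mask), it computes the input
# gradients directly in one kernel.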

Error when using use_fast_fftconv option in generate_text_h3.py

It seems like the two pieces of code below cause an error when using the use_fast_fftconv option with generate_text_h3.py.

PYTHONPATH=$(pwd)/H3 python ./H3/examples/generate_text_h3.py --ckpt ./H3-125M/model.pt --prompt "Hungry Hungry Hippos: Towards Language Modeling With State Space Models is a new language model that" --dmodel 768 --nlayer 12 --attn-layer-idx 6 --nheads=12 --genlen 128

einops.EinopsError: Error while processing rearrange-reduction pattern "b 1 h -> b h".
Input tensor shape: torch.Size([1, 2, 768]). Additional info: {}.
Shape mismatch, 2 != 1

        if self.use_fast_fftconv and L_og % 2 != 0:
            u = F.pad(u, (0, 0, 0, 1))

https://github.com/HazyResearch/H3/blob/main/src/models/ssm/h3.py#L189

        shift_k, next_state_k = self.ssm_k_kernel.step(rearrange(k, 'b 1 h -> b h'), state_k)

https://github.com/HazyResearch/H3/blob/main/src/models/ssm/h3.py#L80

By the way, why does u need to be padded to an even length when using fast_fftconv?

FFT Conv on Seq > 8192?

In the paper, FFTConv is used on sequence lengths > 8192; however, this line in the C++ code has:

TORCH_CHECK(fft_size >= 16 && fft_size <= 16384 && (fft_size == 1 << int(log2(float(fft_size)))));

And since fft_size = 2 * seq_len, this effectively limits the seqlen to 8192.

How did you guys overcome this?
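
The paper's FlashConv handles longer sequences by processing them in blocks and passing SSM state between blocks. The sketch below is not that algorithm; it is a generic overlap-add block decomposition, included only to show one way of keeping every FFT within a fixed size limit, under the assumption that u and k both have shape (..., L).

import torch
import torch.nn.functional as F

def blocked_causal_fftconv(u, k, block=4096):
    # Causal convolution of u (..., L) with k (..., L), computed block by block
    # so every FFT is over 2*block points and the fft_size limit is never hit.
    L = u.shape[-1]
    nb = (L + block - 1) // block
    u_b = F.pad(u, (0, nb * block - L)).unflatten(-1, (nb, block))
    k_b = F.pad(k, (0, nb * block - L)).unflatten(-1, (nb, block))
    n = 2 * block
    u_f = torch.fft.rfft(u_b, n=n)
    k_f = torch.fft.rfft(k_b, n=n)
    y = torch.zeros(*u.shape[:-1], (nb + 1) * block, dtype=u.dtype, device=u.device)
    for j in range(nb):                      # output block j ...
        acc = 0
        for i in range(j + 1):               # ... sums input block i times kernel block (j - i)
            acc = acc + u_f[..., i, :] * k_f[..., j - i, :]
        y[..., j * block : j * block + n] += torch.fft.irfft(acc, n=n)
    return y[..., :L]

# Agrees with a single big-FFT convolution on a sequence longer than 8192:
u, k = torch.randn(2, 4, 10000), torch.randn(4, 10000)
ref = torch.fft.irfft(torch.fft.rfft(u, n=20000) * torch.fft.rfft(k, n=20000), n=20000)[..., :10000]
print(torch.allclose(blocked_causal_fftconv(u, k), ref, rtol=1e-3, atol=1e-3))  # True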

Unpickling errors when loading models

Hi,

When loading pretrained models from checkpoints using the generate_text_h3 script, I'm getting the following issue:

Traceback (most recent call last):
  File "H3/examples/generate_text_h3.py", line 41, in <module>
    state_dict = torch.load(args.ckpt, map_location=device)
  File "/home/.../.conda/envs/work/lib/python3.8/site-packages/torch/serialization.py", line 795, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/home/.../.conda/envs/work/lib/python3.8/site-packages/torch/serialization.py", line 1002, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
_pickle.UnpicklingError: invalid load key, 'v'.

Python version = 3.8.2
torch version = 1.13.1

Any ideas what I might be doing wrong? Likely a versioning issue.
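
One possible thing to check, offered as a guess rather than a confirmed diagnosis: an UnpicklingError with load key 'v' is often what you get when the checkpoint on disk is still a Git LFS pointer file (a small text file starting with "version https://git-lfs..."), e.g. if git lfs install was skipped before cloning. A quick check:

with open('H3-125M/model.pt', 'rb') as f:
    head = f.read(64)
print(head)
# A real torch checkpoint starts with a binary zip/pickle header;
# an LFS pointer prints as readable text beginning with b'version https://git-lfs'.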

Correct method to load 2.7B?

Hi, I can run 1.3B using the benchmark code here, but 2.7B is still not working (bad results) with the following params:

parser = argparse.ArgumentParser(description='H3 generation benchmarking')
parser.add_argument('--dmodel', type=int, default=2560) # 2048
parser.add_argument('--nlayer', type=int, default=32) # 24
parser.add_argument('--attn-layer-idx', type=list, default=[8, 16, 24]) # [8, 16]
parser.add_argument('--nheads', type=int, default=20) # 16
parser.add_argument('--ckpt', type=str, default='/fsx/BlinkDL/CODE/_PUBLIC_/H3/H3-2.7B/model-3attn.pt')
parser.add_argument('--promptlen', type=int, default=1024)
parser.add_argument('--genlen', type=int, default=128)
args = parser.parse_args()

CPU Port

How can I port these models to CPU-only code? I have a 128 GB Mac M1 Ultra Studio system with 21 TFLOPS of throughput, which renders the use of block fftconv moot.

Is it possible to create a CPU-only model? How should one start working on it?
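
Not an answer on the kernels themselves, but as a first step the checkpoint can be deserialized on a machine without a GPU by remapping storages to CPU. This assumes the fused fftconv/FlashAttention CUDA paths would then be replaced by the pure-PyTorch ones (e.g. use_fast_fftconv=False); the snippet below only covers the loading step.

import torch

# Remap all storages in the checkpoint to CPU; no CUDA device is required.
state_dict = torch.load('H3-125M/model.pt', map_location=torch.device('cpu'))
print(type(state_dict), len(state_dict))   # inspect what the checkpoint contains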

Licensing information

Hello,
Thanks for publishing this repository.
Where can I find the licensing terms for the project(s) published in this repo?
Thanks in advance.

Error Running `generate_text_h3.py` (`CUDA error: CUBLAS_STATUS_NOT_INITIALIZED`)

Hey There!

I followed the steps mentioned in the README.md, and when I try running generate_text_h3.py I get the following error.

Some Notes:

  1. I installed https://github.com/HazyResearch/flash-attention from source.
  2. As I got import errors, I installed the libraries one-by-one.
    Note: I installed the default versions.
  3. I'm using a Linux Ubuntu machine

I'll be happy to share any other info (regarding versioning, etc.) you might need.

Here is the error trace for your reference:

(h3) debayan@lambda-femtosense-2:~/h3$ PYTHONPATH=$(pwd)/H3 python3 -i  H3/examples/generate_text_h3.py --ckpt H3-125M/model.pt --prompt "Hungry Hungry Hippos: Towards Language Modeling With State" --dmodel 768 --nlayer 12 --attn-layer-idx 6 --nheads=12
args.ckpt H3-125M/model.pt
Traceback (most recent call last):
  File "/home/debayan/h3/H3/examples/generate_text_h3.py", line 60, in <module>
    output_ids = model.generate(input_ids=input_ids, max_length=max_length,
  File "/home/debayan/miniconda3/envs/h3/lib/python3.10/site-packages/flash_attn-0.2.8-py3.10-linux-x86_64.egg/flash_attn/utils/generation.py", line 150, in generate
    output = decode(input_ids, self, max_length, top_k=top_k, top_p=top_p,
  File "/home/debayan/miniconda3/envs/h3/lib/python3.10/site-packages/flash_attn-0.2.8-py3.10-linux-x86_64.egg/flash_attn/utils/generation.py", line 107, in decode
    logits = model(input_ids, inference_params=inference_params).logits[:, -1]
  File "/home/debayan/miniconda3/envs/h3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/debayan/h3/H3/src/models/ssm_seq.py", line 187, in forward
    hidden_states = self.backbone(input_ids, position_ids=position_ids,
  File "/home/debayan/miniconda3/envs/h3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/debayan/h3/H3/src/models/ssm_seq.py", line 142, in forward
    hidden_states, residual = layer(hidden_states, residual, mixer_kwargs=mixer_kwargs)
  File "/home/debayan/miniconda3/envs/h3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/debayan/miniconda3/envs/h3/lib/python3.10/site-packages/flash_attn-0.2.8-py3.10-linux-x86_64.egg/flash_attn/modules/block.py", line 126, in forward
    hidden_states = self.mixer(hidden_states, **mixer_kwargs)
  File "/home/debayan/miniconda3/envs/h3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/debayan/h3/H3/src/models/ssm/h3.py", line 114, in forward
    q = self.q_proj.weight @ u.T + self.q_proj.bias.to(dtype).unsqueeze(-1)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
>>> 

Training code?

Amazing work; it has some really interesting implications for the research field.

Is there any possibility of seeing the scripts used to actually train these models?

Thanks

Setup for repo unclear

Hey all, sick model.

I've been playing around with the code and wanted to submit some feedback.

The readme could use an update as far as dependency installation goes.

Running the commands specified in the readme does not lead to successful execution of the code.

I've got a Docker container running the code, but a start-to-finish repro would be immensely useful for other users. Setup is implied, so unless I'm missing something, there's a lot of guesswork and trial and error required to get the thing running.

I've worked with a number of ML models before, so I generally know what's needed, but for someone who doesn't do this often, there's quite a hurdle to playing with this thing.

Question about methodology used for evaluating FlashConv against cuFFT

Hi, I have a question related to a figure in the H3 paper.
Figure 2 shows a performance evaluation of FlashConv against cuFFT and attention.
Is it correct to think that it compares all operations in H3, including the qkv computation and kernel generation, and not just the FFTconv-related operations (FFTconv + elementwise multiplication + residual computation)?
