
safari's Introduction

Convolutions for Sequence Modeling

This repository provides implementations and experiments for the following papers, as well as simplified presentations of earlier work such as S4.

Please see these instructions for how to download weights and run our pretrained models:

  • H3 (125m-2.7B)
  • Hyena (small, 150M)

Hyena

Hyena Hierarchy: Towards Larger Convolutional Language Models
Michael Poli*, Stefano Massaroli*, Eric Nguyen*, Daniel Y. Fu, Tri Dao, Stephen Baccus, Yoshua Bengio, Stefano Ermon, Christopher Ré
ICML 2023. Oral.
Paper Hyena

Long Convs

Simple Hardware-Efficient Long Convolutions for Sequence Modeling
Daniel Y. Fu*, Elliot L. Epstein*, Eric Nguyen, Armin W. Thomas, Michael Zhang, Tri Dao, Atri Rudra, Christopher Ré
ICML 2023.
Paper LongConvs

Hungry Hungry Hippos (H3)

Hungry Hungry Hippos: Towards Language Modeling with State Space Models
Daniel Y. Fu*, Tri Dao*, Khaled K. Saab, Armin W. Thomas, Atri Rudra, Christopher Ré
ICLR 2023. Notable top-25% (spotlight).
Paper H3

Roadmap

  • Include H3, LLM training, and synthetics in this repository
  • Move in fast convolution code
  • Add Hyena implementation and experiments
  • pip package

Changelog

See CHANGELOG.md

Setup

Requirements

This repository requires Python 3.8+ and PyTorch 1.10+. Other packages are listed in requirements.txt.
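
For example, assuming a fresh environment that already has a suitable PyTorch build, the remaining dependencies can typically be installed with:

pip install -r requirements.txt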

Getting Started

The easiest way to get started is to run the standalone_cifar.py script. This script trains a simple long convolution model on CIFAR-10:

python -m standalone_cifar

See the experiments page for more:

  • LRA experiments from the Long Convs paper
  • H3 experiments (language model, synthetics)
  • H3 + Long Conv experiments
  • Hyena language and vision experiments

Resources

We're happy to share independent reimplementations and explainer posts about methods presented in this repository.

Hyena:

Citation

If you use this codebase, or otherwise found our work valuable, you can cite us as follows:

@article{poli2023hyena,
  title={Hyena Hierarchy: Towards Larger Convolutional Language Models},
  author={Poli, Michael and Massaroli, Stefano and Nguyen, Eric and Fu, Daniel Y and Dao, Tri and Baccus, Stephen and Bengio, Yoshua and Ermon, Stefano and R{\'e}, Christopher},
  journal={arXiv preprint arXiv:2302.10866},
  year={2023}
}

@article{fu2023simple,
  title={Simple Hardware-Efficient Long Convolutions for Sequence Modeling},
  author={Fu, Daniel Y. and Epstein, Elliot L. and Nguyen, Eric and Thomas, Armin W. and Zhang, Michael and Dao, Tri and Rudra, Atri and R{\'e}, Christopher},
  journal={International Conference on Machine Learning},
  year={2023}
}

@inproceedings{fu2023hungry,
  title={Hungry {H}ungry {H}ippos: Towards Language Modeling with State Space Models},
  author={Fu, Daniel Y. and Dao, Tri and Saab, Khaled K. and Thomas, Armin W.
  and Rudra, Atri and R{\'e}, Christopher},
  booktitle={International Conference on Learning Representations},
  year={2023}
}

Acknowledgements

This repo was forked from Albert Gu's state spaces repo and borrows its structure. It also contains code from the FlashAttention training scripts.

safari's People

Contributors

andrewnc, danfu09, janebert, pglolo, yuigawada, zymrael


safari's Issues

Questions about bidirectional-version of H3

Thank you for the amazing code and work! I am interested in using H3 bidirectionally and have some questions:

  1. Would I instantiate H3 with bidirectional shift and diagonal SSMs, similar to S4 and S4D?
  2. If so, can this be achieved by passing a kernel (to be applied in the reverse direction) as the k_rev argument of fftconv(), where it is applied?
  3. Below, I have written a script comparing bidirectional convolutions under various implementations. The S4 & S4D convolution implementation seems to give differing results from the H3 and naive implementations. Assuming I have not made an error in my code, is this intended?
import numpy as np
import scipy
import torch
import torch.nn.functional as F

L = 100
n_fft = L * 2


def conv_direct(u, k, k_rev):
    fwd = scipy.signal.convolve(u.numpy(), k.numpy(), method="direct")[:L]
    bwd = scipy.signal.convolve(u.flip(-1).numpy(), k_rev.numpy(), method="direct")[:L]
    return fwd + np.flip(bwd, -1)


def conv_fft_s4(u, k, k_rev):
    k = F.pad(k, (0, L)) + F.pad(k_rev.flip(-1), (L, 0))
    u_f = torch.fft.rfft(u, n=n_fft)
    k_f = torch.fft.rfft(k, n=n_fft)
    return torch.fft.irfft(u_f * k_f, n=n_fft)[..., :L].numpy()


def conv_fft_h3(u, k, k_rev):
    u_f = torch.fft.rfft(u, n=n_fft, norm="backward")
    k_f = torch.fft.rfft(k, n=n_fft, norm="forward") + torch.fft.rfft(k_rev, n=n_fft, norm="forward").conj()
    return torch.fft.irfft(u_f * k_f, n=n_fft, norm="forward")[..., :L].numpy()


def conv_fft_s4_v2(u, k, k_rev):
    k = F.pad(k, (0, L)) + torch.roll(F.pad(k_rev.flip(-1), (L, 0)), 1, -1)
    u_f = torch.fft.rfft(u, n=n_fft)
    k_f = torch.fft.rfft(k, n=n_fft)
    return torch.fft.irfft(u_f * k_f, n=n_fft)[..., :L].numpy()


def compare():
    u = torch.randn(L)
    k = torch.randn(L)
    k_rev = torch.randn(L)

    direct = conv_direct(u=u, k=k, k_rev=k_rev)
    fft_s4 = conv_fft_s4(u=u, k=k, k_rev=k_rev)
    fft_h3 = conv_fft_h3(u=u, k=k, k_rev=k_rev)
    fft_s4v2 = conv_fft_s4_v2(u=u, k=k, k_rev=k_rev)

    print("Direct:  ", direct[:5])
    print("FFT S4:  ", fft_s4[:5])
    print("FFT H3:  ", fft_h3[:5])
    print("FFT S4v2:", fft_s4v2[:5])

    assert np.abs(direct - fft_h3).max() <= 1e-5
    assert np.abs(fft_s4v2 - fft_h3).max() <= 1e-5


if __name__ == "__main__":
    compare()

Output:

Direct:   [-0.8028737 -2.0864778 -3.4721029 11.840934  12.045782 ]
FFT S4:   [-2.326698  -3.9480963 12.992389  12.143847  10.611053 ]
FFT H3:   [-0.80287194 -2.0864816  -3.4721045  11.840934   12.045784  ]
FFT S4v2: [-0.8028715 -2.0864806 -3.4721053 11.840935  12.045782 ]

Thanks in advance!

Question about synthetic dataset.

Hi,

I just finished reading your Hyena paper, and first of all, great work 💪

I wanted to run the same synthetic experiments that you did, to get a sense of why the other models fail.

However, I didn't understand how you generate the dataset. For example, in the majority voting/counting tasks, do you add a token distinguishing the input from the output?

RuntimeError: u must have shape (batch_size, H, L)

Hello,
I am trying to run the benchmark here with fused_fft_conv enabled, but I am getting a RuntimeError: u must have shape (batch_size, H, L) error. In this case the shape of u is [1, 1, 768, 1, 2048], but it expects [1, 1, 768]. Normally fftconv handles the last dimension, but in this case the shape check fails.

Log:

Traceback (most recent call last):
  File "/localscratch/safari/benchmarks/runtime_hyena_flashmha.py", line 77, in <module>
    m, t = benchmark_forward(hyena, x, repeats=10, desc='', verbose=False)
  File "/localscratch/safari/benchmarks/runtime_hyena_flashmha.py", line 23, in benchmark_forward
    m = t.timeit(repeats)
  File "/opt/conda/envs/gps/lib/python3.9/site-packages/torch/utils/benchmark/utils/timer.py", line 266, in timeit
    self._timeit(number=max(int(number // 100), 2))
  File "/opt/conda/envs/gps/lib/python3.9/site-packages/torch/utils/benchmark/utils/timer.py", line 256, in _timeit
    return max(self._timer.timeit(number), 1e-9)
  File "/opt/conda/envs/gps/lib/python3.9/timeit.py", line 177, in timeit
    timing = self.inner(it, self.timer)
  File "<timeit-src>", line 6, in inner
  File "/opt/conda/envs/gps/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/localscratch/safari/src/models/sequence/hyena.py", line 361, in forward
    v = self.filter_fn(v, l_filter, k=k[o], bias=bias[o, None, :, None])
  File "/opt/conda/envs/gps/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/localscratch/safari/src/models/sequence/hyena.py", line 218, in forward
    y = fftconv_func(
  File "/localscratch/safari/src/ops/fftconv.py", line 102, in fftconv_func
    return FFTConvFunc.apply(u, k, D, dropout_mask, gelu, force_fp16_output,
  File "/localscratch/safari/src/ops/fftconv.py", line 79, in forward
    out = fftconv_fwd(u, k_f, D, v, head_dim, q, dropout_mask, gelu, False, False, fft_size, force_fp16_output, output_hbl_layout, fftfp16)
RuntimeError: u must have shape (batch_size, H, L)

H3 / LongConvKernel - l_max=None isn't working

In the H3 model, it is noted here that

l_max: the maximum kernel length, also denoted by L. Set l_max=None to always use a global kernel

But it isn't working, because of

return torch.randn(self.channels, self.H, self.L) * 0.002

This leads to a torch.randn(int, int, None) error:

TypeError: randn(): argument 'size' must be tuple of ints, but found element of type NoneType at pos 3

So, are global kernels supported right now? Or, to make the kernel effectively global, should I just use a large l_max value?
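
Until this is answered upstream, one possible stopgap (a sketch only, not the maintainers' fix) is to fall back to an explicit length when self.L is None, so that the random initialization quoted above still receives integers:

# Sketch of a guard around the quoted initialization; the attribute names come from
# the quoted line, and l_max_default is a hypothetical fallback (e.g. the longest
# sequence length you plan to train on).
L = self.L if self.L is not None else l_max_default
return torch.randn(self.channels, self.H, L) * 0.002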

Pickling error with standalone_cifar

System:

  • Windows 11
  • AMD 5600x
  • GTX 1080
  • 32 GB RAM
  • python 3.11.3
  • torch: 2.0.0+cu117

Traceback:

==> Preparing data..
Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./data/cifar/cifar-10-python.tar.gz
100%|██████████| 170498071/170498071 [00:02<00:00, 60617744.05it/s]
Extracting ./data/cifar/cifar-10-python.tar.gz to ./data/cifar/
Files already downloaded and verified
Files already downloaded and verified
==> Building model..
Optimizer group 0 | 34 tensors | lr 0.01 | weight_decay 0.05
Optimizer group 1 | 6 tensors | lr 0.001 | weight_decay 0.0
Epoch: 0:   0%|          | 0/300 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\RyanD\PycharmProjects\safari\standalone_cifar.py", line 264, in <module>
    train()
  File "C:\Users\RyanD\PycharmProjects\safari\standalone_cifar.py", line 199, in train
    pbar = tqdm(enumerate(trainloader))
                ^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\RyanD\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\utils\data\dataloader.py", line 442, in __iter__
    return self._get_iterator()
           ^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\RyanD\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\utils\data\dataloader.py", line 388, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\RyanD\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\utils\data\dataloader.py", line 1043, in __init__
    w.start()
  File "C:\Users\RyanD\AppData\Local\Programs\Python\Python311\Lib\multiprocessing\process.py", line 121, in start
    self._popen = self._Popen(self)
                  ^^^^^^^^^^^^^^^^^
  File "C:\Users\RyanD\AppData\Local\Programs\Python\Python311\Lib\multiprocessing\context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\RyanD\AppData\Local\Programs\Python\Python311\Lib\multiprocessing\context.py", line 336, in _Popen
    return Popen(process_obj)
           ^^^^^^^^^^^^^^^^^^
  File "C:\Users\RyanD\AppData\Local\Programs\Python\Python311\Lib\multiprocessing\popen_spawn_win32.py", line 94, in __init__
    reduction.dump(process_obj, to_child)
  File "C:\Users\RyanD\AppData\Local\Programs\Python\Python311\Lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
_pickle.PicklingError: Can't pickle <function <lambda> at 0x00000160ACB009A0>: attribute lookup <lambda> on __main__ failed
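
Not an official fix, but a common workaround for this Windows-specific failure: multiprocessing on Windows uses the "spawn" start method, which pickles the dataset (including its transforms) for every DataLoader worker, and lambdas cannot be pickled. A minimal sketch; the flattening transform here is a hypothetical stand-in for whatever lambda standalone_cifar.py actually uses:

import torch
import torchvision
import torchvision.transforms as T

def to_flat_sequence(x):
    # module-level function instead of a lambda, so it can be pickled by worker processes
    return x.view(3, 32 * 32).t()

transform = T.Compose([T.ToTensor(), to_flat_sequence])
trainset = torchvision.datasets.CIFAR10(root="./data/cifar", train=True, download=True, transform=transform)

# and/or run the loader single-process so nothing needs to be pickled at all
trainloader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True, num_workers=0)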

How to set the config properly to launch an experiment on wt103 (with any model)?

I run the command

python train.py wandb.mode=offline experiment=wt103/base pipeline=wt103

from the safari/ directory, and the program gives:

Error executing job with overrides: ['wandb.mode=offline', 'experiment=wt103/base', 'pipeline=wt103']
Traceback (most recent call last):
  File "/public/home/chenzhuo/cz/safari/train.py", line 679, in main
    config = utils.train.process_config(config)
  File "/public/home/chenzhuo/cz/safari/src/utils/train.py", line 69, in process_config
    config = omegaconf_filter_keys(config, lambda k: not k.startswith('__'))
  File "/public/home/chenzhuo/cz/safari/src/utils/config.py", line 121, in omegaconf_filter_keys
    {k: omegaconf_filter_keys(v, fn) for k, v in d.items() if fn(k)}
  File "/public/home/chenzhuo/cz/safari/src/utils/config.py", line 121, in <dictcomp>
    {k: omegaconf_filter_keys(v, fn) for k, v in d.items() if fn(k)}
  File "/public/home/chenzhuo/cz/safari/src/utils/config.py", line 121, in omegaconf_filter_keys
    {k: omegaconf_filter_keys(v, fn) for k, v in d.items() if fn(k)}
omegaconf.errors.InterpolationResolutionError: KeyError raised while resolving interpolation: "Environment variable 'DATA_PATH' not found"
    full_key: dataset.cache_dir
    object_type=dict

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

Then I set HYDRA_FULL_ERROR=1:

Error executing job with overrides: ['wandb.mode=offline', 'experiment=wt103/base', 'pipeline=wt103']
Traceback (most recent call last):
  File "/public/home/chenzhuo/cz/safari/train.py", line 689, in <module>
    main()
  File "/public/home/chenzhuo/anaconda3/envs/effseq/lib/python3.9/site-packages/hydra/main.py", line 94, in decorated_main
    _run_hydra(
  File "/public/home/chenzhuo/anaconda3/envs/effseq/lib/python3.9/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
    _run_app(
  File "/public/home/chenzhuo/anaconda3/envs/effseq/lib/python3.9/site-packages/hydra/_internal/utils.py", line 457, in _run_app
    run_and_report(
  File "/public/home/chenzhuo/anaconda3/envs/effseq/lib/python3.9/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
    raise ex
  File "/public/home/chenzhuo/anaconda3/envs/effseq/lib/python3.9/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
  File "/public/home/chenzhuo/anaconda3/envs/effseq/lib/python3.9/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
  File "/public/home/chenzhuo/anaconda3/envs/effseq/lib/python3.9/site-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/public/home/chenzhuo/anaconda3/envs/effseq/lib/python3.9/site-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/public/home/chenzhuo/anaconda3/envs/effseq/lib/python3.9/site-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "/public/home/chenzhuo/cz/safari/train.py", line 679, in main
    config = utils.train.process_config(config)
  File "/public/home/chenzhuo/cz/safari/src/utils/train.py", line 69, in process_config
    config = omegaconf_filter_keys(config, lambda k: not k.startswith('__'))
  File "/public/home/chenzhuo/cz/safari/src/utils/config.py", line 121, in omegaconf_filter_keys
    {k: omegaconf_filter_keys(v, fn) for k, v in d.items() if fn(k)}
  File "/public/home/chenzhuo/cz/safari/src/utils/config.py", line 121, in <dictcomp>
    {k: omegaconf_filter_keys(v, fn) for k, v in d.items() if fn(k)}
  File "/public/home/chenzhuo/cz/safari/src/utils/config.py", line 121, in omegaconf_filter_keys
    {k: omegaconf_filter_keys(v, fn) for k, v in d.items() if fn(k)}
  File "/public/home/chenzhuo/anaconda3/envs/effseq/lib/python3.9/site-packages/omegaconf/dictconfig.py", line 562, in items
    return dict(self.items_ex(resolve=True, keys=None)).items()
  File "/public/home/chenzhuo/anaconda3/envs/effseq/lib/python3.9/site-packages/omegaconf/dictconfig.py", line 588, in items_ex
    value = self[key]
  File "/public/home/chenzhuo/anaconda3/envs/effseq/lib/python3.9/site-packages/omegaconf/dictconfig.py", line 375, in __getitem__
    self._format_and_raise(key=key, value=None, cause=e)
  File "/public/home/chenzhuo/anaconda3/envs/effseq/lib/python3.9/site-packages/omegaconf/base.py", line 231, in _format_and_raise
    format_and_raise(
  File "/public/home/chenzhuo/anaconda3/envs/effseq/lib/python3.9/site-packages/omegaconf/_utils.py", line 899, in format_and_raise
    _raise(ex, cause)
  File "/public/home/chenzhuo/anaconda3/envs/effseq/lib/python3.9/site-packages/omegaconf/_utils.py", line 797, in _raise
    raise ex.with_traceback(sys.exc_info()[2])  # set env var OC_CAUSE=1 for full trace
  File "/public/home/chenzhuo/anaconda3/envs/effseq/lib/python3.9/site-packages/omegaconf/dictconfig.py", line 369, in __getitem__
    return self._get_impl(key=key, default_value=_DEFAULT_MARKER_)
  File "/public/home/chenzhuo/anaconda3/envs/effseq/lib/python3.9/site-packages/omegaconf/dictconfig.py", line 451, in _get_impl
    return self._resolve_with_default(
  File "/public/home/chenzhuo/anaconda3/envs/effseq/lib/python3.9/site-packages/omegaconf/basecontainer.py", line 98, in _resolve_with_default
    resolved_node = self._maybe_resolve_interpolation(
  File "/public/home/chenzhuo/anaconda3/envs/effseq/lib/python3.9/site-packages/omegaconf/base.py", line 719, in _maybe_resolve_interpolation
    return self._resolve_interpolation_from_parse_tree(
  File "/public/home/chenzhuo/anaconda3/envs/effseq/lib/python3.9/site-packages/omegaconf/base.py", line 584, in _resolve_interpolation_from_parse_tree
    resolved = self.resolve_parse_tree(
  File "/public/home/chenzhuo/anaconda3/envs/effseq/lib/python3.9/site-packages/omegaconf/base.py", line 769, in resolve_parse_tree
    raise InterpolationResolutionError(
  File "/public/home/chenzhuo/anaconda3/envs/effseq/lib/python3.9/site-packages/omegaconf/base.py", line 764, in resolve_parse_tree
    return visitor.visit(parse_tree)
  File "/public/home/chenzhuo/anaconda3/envs/effseq/lib/python3.9/site-packages/antlr4/tree/Tree.py", line 34, in visit
    return tree.accept(self)
  File "/public/home/chenzhuo/anaconda3/envs/effseq/lib/python3.9/site-packages/omegaconf/grammar/gen/OmegaConfGrammarParser.py", line 206, in accept
    return visitor.visitConfigValue(self)
  File "/public/home/chenzhuo/anaconda3/envs/effseq/lib/python3.9/site-packages/omegaconf/grammar_visitor.py", line 101, in visitConfigValue
    return self.visit(ctx.getChild(0))
  File "/public/home/chenzhuo/anaconda3/envs/effseq/lib/python3.9/site-packages/antlr4/tree/Tree.py", line 34, in visit
    return tree.accept(self)
  File "/public/home/chenzhuo/anaconda3/envs/effseq/lib/python3.9/site-packages/omegaconf/grammar/gen/OmegaConfGrammarParser.py", line 342, in accept
    return visitor.visitText(self)
  File "/public/home/chenzhuo/anaconda3/envs/effseq/lib/python3.9/site-packages/omegaconf/grammar_visitor.py", line 301, in visitText
    return self._unescape(list(ctx.getChildren()))
  File "/public/home/chenzhuo/anaconda3/envs/effseq/lib/python3.9/site-packages/omegaconf/grammar_visitor.py", line 389, in _unescape
    text = str(self.visitInterpolation(node))
  File "/public/home/chenzhuo/anaconda3/envs/effseq/lib/python3.9/site-packages/omegaconf/grammar_visitor.py", line 125, in visitInterpolation
    return self.visit(ctx.getChild(0))
  File "/public/home/chenzhuo/anaconda3/envs/effseq/lib/python3.9/site-packages/antlr4/tree/Tree.py", line 34, in visit
    return tree.accept(self)
  File "/public/home/chenzhuo/anaconda3/envs/effseq/lib/python3.9/site-packages/omegaconf/grammar/gen/OmegaConfGrammarParser.py", line 1041, in accept
    return visitor.visitInterpolationResolver(self)
  File "/public/home/chenzhuo/anaconda3/envs/effseq/lib/python3.9/site-packages/omegaconf/grammar_visitor.py", line 179, in visitInterpolationResolver
    return self.resolver_interpolation_callback(
  File "/public/home/chenzhuo/anaconda3/envs/effseq/lib/python3.9/site-packages/omegaconf/base.py", line 750, in resolver_interpolation_callback
    return self._evaluate_custom_resolver(
  File "/public/home/chenzhuo/anaconda3/envs/effseq/lib/python3.9/site-packages/omegaconf/base.py", line 694, in _evaluate_custom_resolver
    return resolver(
  File "/public/home/chenzhuo/anaconda3/envs/effseq/lib/python3.9/site-packages/omegaconf/omegaconf.py", line 445, in resolver_wrapper
    ret = resolver(*args, **kwargs)
  File "/public/home/chenzhuo/anaconda3/envs/effseq/lib/python3.9/site-packages/omegaconf/resolvers/oc/__init__.py", line 38, in env
    raise KeyError(f"Environment variable '{key}' not found")
omegaconf.errors.InterpolationResolutionError: KeyError raised while resolving interpolation: "Environment variable 'DATA_PATH' not found"
    full_key: dataset.cache_dir
    object_type=dict

My questions are:

  1. Am I using the config correctly? If not, how should I set the config?
  2. What does the output mean?
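
Regarding the error itself: per the stack trace, dataset.cache_dir is resolved through OmegaConf's oc.env resolver, so the DATA_PATH environment variable must be set before launching. For example, assuming your data lives under ./data:

export DATA_PATH=./data
python train.py wandb.mode=offline experiment=wt103/base pipeline=wt103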

fftconv compiling issue

      In file included from /usr/local/cuda-11.3/include/cuda/std/cmath:19,                                                      
                       from /usr/local/cuda-11.3/include/cuda/std/complex:12,                                                    
                       from fftconv.cpp:6:                                                                                       
      /usr/local/cuda-11.3/include/cuda/std/detail/libcxx/include/cmath:569:101: error: ‘float cuda::std::__3::hypot(float, float
, float)’ conflicts with a previous declaration                                                                                  
        569 | inline _LIBCUDACXX_INLINE_VISIBILITY float       hypot(       float x,       float y,       float z ) { return sqrt
(x*x + y*y + z*z); }
            |                                                                                                     ^
      In file included from fftconv.cpp:3:
      /usr/include/c++/9/cmath:1868:3: note: previous declaration ‘float std::hypot(float, float, float)’
       1868 |   hypot(float __x, float __y, float __z)
            |   ^~~~~
      In file included from /usr/local/cuda-11.3/include/cuda/std/cmath:19,
                       from /usr/local/cuda-11.3/include/cuda/std/complex:12,
                       from fftconv.cpp:6:
      /usr/local/cuda-11.3/include/cuda/std/detail/libcxx/include/cmath:570:101: error: ‘double cuda::std::__3::hypot(double, dou
ble, double)’ conflicts with a previous declaration
        570 | inline _LIBCUDACXX_INLINE_VISIBILITY double      hypot(      double x,      double y,      double z ) { return sqrt
(x*x + y*y + z*z); }
            |                                                                                                     ^
      In file included from fftconv.cpp:3:
      /usr/include/c++/9/cmath:1872:3: note: previous declaration ‘double std::hypot(double, double, double)’
       1872 |   hypot(double __x, double __y, double __z)
            |   ^~~~~
      In file included from /usr/local/cuda-11.3/include/cuda/std/cmath:19,
                       from /usr/local/cuda-11.3/include/cuda/std/complex:12,
                       from fftconv.cpp:6:
      /usr/local/cuda-11.3/include/cuda/std/detail/libcxx/include/cmath:572:101: error: ‘long double cuda::std::__3::hypot(long d
ouble, long double, long double)’ conflicts with a previous declaration
        572 | inline _LIBCUDACXX_INLINE_VISIBILITY long double hypot( long double x, long double y, long double z ) { return sqrt
(x*x + y*y + z*z); }
            |                                                                                                     ^
      In file included from fftconv.cpp:3:
      /usr/include/c++/9/cmath:1876:3: note: previous declaration ‘long double std::hypot(long double, long double, long double)’
       1876 |   hypot(long double __x, long double __y, long double __z)
            |   ^~~~~
      error: command '/usr/bin/gcc' failed with exit code 1

Environment: CUDA 11.3, gcc 9.4.0, PyTorch 2.1.0, running pip install . in the csrc/fftconv folder.
Has anyone else encountered this issue? Thank you!

Question about long sequence lengths with Hyena

Hello!

In the Hyena paper, section 4.4 says "Hyena speedups reach 100x at sequence length 64K."

The figure referenced by that section (figure 4.3) stops at a sequence length short of 10k and the optimized implementation in this repo appears to be limited to an 8k sequence length.

There are a few other references to a 100x speedup over FlashAttention in the paper (and in blog posts). Are these measured speedups or extrapolated from smaller sequence lengths?

I've experimented with the implementation in standalone_hyena.py but it appears to be ~3x slower than FlashAttention at sequence lengths > 32k tokens.

Do you have an estimate for when the fftconv implementation in this repo will support longer sequence lengths (or a pointer to another Hyena codebase if the speedups in the paper were measured)?

Thanks for the great work!

learn_ifft in long_conv.py

In sequence.long_conv on line 144 there is a clause:

            if self.learn_ifft:
                y = self.block_fft_u(y_f, N=L_kernel+L,forward=False).real[..., :L]
            else:
                y = torch.fft.ifft(y_f, n=L_kernel+L, dim=-1).real[..., :L] # (B C H L)

However, learn_ifft is not defined anywhere and hence throws an error.

Long latency when loading fft_conv kernel for the first time

The first kernel launch of fft_conv_fwd takes an abnormally long time (about 100 seconds).
After the first launch it works fine, so it's not much of an issue for training, but it makes debugging very cumbersome.
Could it be a problem with my ld options?


Arguments for a simple H3 model that could do Causal or Masked LM modeling?

Hi all! I've been trying to get the training experiments to run and am struggling with some errors where Hydra cannot parse the configs given at /experiment/pile/h3 (I was following the instructions in experiments.md for the Pile). I'm actually hoping to train on a different dataset entirely, though, for which I already have a working pipeline. Given a correct installation of the dependencies in this repo, is there a recommended way to instantiate an H3 model suitable for causal language modeling and/or masked LM?

For example, I'm hoping to come up with something comparable to this and just use it in my existing pipeline:

from transformers import AutoConfig, RobertaForCausalLM

config = AutoConfig.from_pretrained(
    "roberta-base",
    vocab_size=tokenizer.vocab_size,
    random_init=True,
    is_decoder=True,
)

model = RobertaForCausalLM(config)

Here is what I have for Causal LM, though I am still trying to sort out the dependencies to get it to run.

model = ConvLMHeadModel(
    d_model=768, n_layer=12, d_inner=768 * 4,
    vocab_size=tokenizer.vocab_size, resid_dropout=0.0, embed_dropout=0.1,
    layer="h3", attn_layer_idx=[1, 8],
    attn_cfg=None,
    fused_mlp=True,
    fused_dropout_add_ln=True,
    residual_in_fp32=True,
    pad_vocab_size_multiple=8,
)

Are these reasonable choices for these arguments, and are there others I would need to specify? Are there important pieces of the training or data preparation specified in the configs that I would need to replicate in another pipeline designed for a HF transformer LM? Sorry if these are obvious/documented; I was just having a hard time reading through the configs.

Downstream evaluation on SuperGLUE

Thank you for your great work! In the paper, you describe evaluating the LM on SuperGLUE for zero-shot learning. Could you please share the code for that evaluation?

Bidirectional Hyena

Hello,
I'm currently trying to build a bidirectional prediction model using standalone_hyena.py.

  1. Is it possible to make a bidirectional Hyena, as in #7?
  2. If so, how can I implement that in standalone_hyena.py?

Thank you!

Hyena YAML parsing fails

Hey, I was trying to run the Hyena 150B token training according to experiments.md, but the YAML parsing fails and complains about

yaml.scanner.ScannerError: mapping values are not allowed here
  in "[...]/safari/configs/experiment/pile/hyena.yaml", line 9, column 18

This is probably due to the ${eval: [...]} in the mentioned line and the parser assumes eval: [...] should be parsed as a YAML expression. I haven't found this eval: feature in the Hydra or lightning-hydra-template documentation.

I'm probably missing something here due to unfamiliarity with lightning-hydra-template. :)
However, if I'm not, has there possibly been dependency breakage or are you maybe using a custom Hydra fork?

Environment

PyTorch 1.12, Lightning 1.8.6, Hydra 1.3.2

I simply installed requirements.txt as specified in README.md.

Command

As specified in experiments.md for the small Hyena training, with two adjustments:

  1. The YAML config name was adjusted to hyena-150b-tokens since hyena-150b does not exist.
  2. experiment/... was changed to experiment=... so the argument is correctly parsed.
python -m train experiment=pile/hyena-150b-tokens

Accuracy on CIFAR is not similar to that in the paper

First of all, thanks for your great work and maintaining this repository.

I have tried to reproduce Hyena's results on CIFAR, but the accuracy only reaches ~60% after 100 epochs. From the Appendix, the model dimension is 128, which differs from experiment/cifar/hyena-vit-cifar.yaml. So I wonder: is this the only setting in the config file that needs to be changed to get the accuracy reported in the paper (91%)?

What is the suggested config for running LRA exps with Hyena?

I tried the Hyena model on the PathX experiment but got bad results (val/loss=nan, and the grad_norm of the later layers is near infinite).
My config:
# @package _global_
defaults:
  - /pipeline: pathx
  - override /scheduler: cosine_warmup

scheduler:
  num_training_steps: 125000  # 50 epochs
  num_warmup_steps: 2500  # 1 epoch

model:
  name: model
  n_layers: 6
  d_model: 256
  norm: batch
  layer:
    name: hyena
    emb_dim: 3
    filter_order: 64
    local_order: 3
    modulate: True
    l_max: 16384
    w: 1
    lr: ${optimizer.lr}
    lr_pos_emb: ${optimizer.lr}
    return_state: True

loader:
  batch_size: 25

optimizer:
  lr: 0.0005
  weight_decay: 0.05

trainer:
  max_epochs: 50

train:
  seed: 2222
  interval: step  # For cosine scheduler

My command:
python -m train trainer.devices=8 experiment=lra/hyena-lra-pathx +dataset.data_dir=./data/pathfinder128

Is there something I've set up incorrectly?

How to reproduce the Hyena-Imagenet experiment result simillar with the paper?

Hello all,
I am trying to reproduce the ImageNet task from your great work!
After installing by following the procedure in the repository, I ran the command python -m train wandb=null experiment=imagenet/hyena-vit from experiments.md.

My system has 4 A6000 GPUs, so I changed the number of devices from 8 to 4 in hyena-vit.yaml.
When the last epoch ends, I get a message like the one below.

Epoch 304: 100%|█| 2893/2893 [20:58<00:00, 2.30it/s, loss=4.29, val/accuracy=0.685, val/accuracy@5=0.885, val/accuracy@10=0.927, val/loss=1.510, train/accuracy=0.452, train/accuracy@5=0.667, train/accuracy@10=0.7

As I understand it, the val/accuracy should be close to the 79.8 reported in the paper.
Is there anything wrong with my training setup?
The attachment is my training config file; if I made any mistake, please let me know...

Or, is it possible to request the checkpoint file of vit_hyena?

Thank you.
config_tree.txt

What is the suggested way to implement an autograd version of nn.functional.conv1d?

Suppose I want to use fftconv_fwd to implement nn.functional.conv1d; how would I go about doing this? Is there already a function in safari that operates on PyTorch tensors and computes a differentiable version of both the convolution and the concatenation of the filters? In particular, I'm looking for kernel sizes > 100K.
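
Not an answer about the fused kernel's API, but for reference: a differentiable long convolution can be written directly with torch.fft, through which autograd already flows, and it has no limit on kernel length. This is only a reference-speed sketch under assumed (batch, H, L) shapes, not the repository's optimized fftconv_fwd path:

import torch

def fft_conv1d(u, k):
    # Causal depthwise convolution of u (B, H, L) with kernels k (H, K) via FFT.
    # Differentiable in both u and k because torch.fft ops support autograd.
    L, K = u.shape[-1], k.shape[-1]
    n = L + K  # zero-pad so the circular convolution equals the linear one
    u_f = torch.fft.rfft(u, n=n)
    k_f = torch.fft.rfft(k, n=n)
    return torch.fft.irfft(u_f * k_f, n=n)[..., :L]

u = torch.randn(2, 8, 4096, requires_grad=True)
k = torch.randn(8, 100_000, requires_grad=True)  # kernels longer than the input are fine
y = fft_conv1d(u, k)
y.sum().backward()  # gradients flow to both u and k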

Hyena seems forward leakage?

Hi, I'm testing the Hyena structure; for simplicity, let's focus on standalone_hyena.py. I revised the existing example just a little, below, to demonstrate forward leakage.

I understand there is a "causality check by gradient", but we can alternatively check causality by applying a disturbance to the input and observing the output. Specifically, a disturbance applied to the input at a future position should not affect the output at earlier positions.


if __name__ == "__main__":
    x = torch.randn(1, 1024, 512, requires_grad=True)
    x2 = x.detach().clone()
    x2[:, 3, :] = 1e10 * torch.sign(x2[:, 3, :])      # an obvious interrupt/disturbance added at somewhere, set to 3 in this example

    m_hyena = HyenaOperator(
        d_model=x.shape[2],
        l_max=1024,
        order=2,
        filter_order=64
    )

    yhat = m_hyena(x)
    yhat2 = m_hyena(x2)
    print(yhat[0,:20,0])
    print(yhat2[0,:20,0])

tensor([-0.0627, -0.0652, -0.0135, -0.0132, -0.0580, -0.0491, 0.0646, -0.1024,
-0.0986, -0.0817, -0.0367, -0.1008, 0.0637, -0.0156, -0.0301, -0.1105,
-0.0579, -0.1214, 0.0497, -0.1366], grad_fn=)
tensor([ 5.7309e+08, 3.5364e+09, 2.8772e+09, 1.9058e+27, 7.9790e+27,
1.5603e+27, 1.4734e+16, -4.5865e+16, -1.1996e+17, -6.1985e+16,
5.0280e+16, 2.5590e+16, 6.0684e+16, 1.7392e+16, -8.0251e+15,
6.0426e+16, 1.1078e+16, 8.8085e+16, 1.4208e+16, 1.3096e+17],
grad_fn=)

Intuitively, we would expect the first three outputs of yhat and yhat2 to be the same. However, as printed above, they are not. I think this indicates a "forward leakage" problem in the Hyena implementation.

Non-causal implementation of language model for synthetic datasets

Regarding the synthetic datasets: from the implementation, and as explained in the issue, the train loss is evaluated on all tokens while the test loss is only evaluated on the last token. My question is: what is the advantage of such an autoregressive training strategy, which requires the model to be causal, rather than simply treating training as a classification problem, i.e. evaluating the training loss and accuracy only on the last token, such that
$p(y[..., -1]) \simeq \mathrm{Hyena}(x)[..., -1]$?
If we follow this training approach, the target is estimated from all the tokens in the sequence, and it seems the model is not required to be causal for the Associative Recall and Induction Head datasets. Is that true?

Question about the model size

Hi,

I am trying to build a Hyena model using the hyperparameters from Table A4 (the 4th row). I am using the standalone implementation:

layer2 = HyenaOperator(d_model=1024, l_max=19072, order=36, filter_order=64, num_inner_mlps=4, emb_dim=17, w=14)

However, when I check the number of parameters, I get ~42M instead of the 355M stated in the paper. Is it because I am using the standalone implementation? But even then, how come the difference is so big? Or maybe I am missing something?

def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

count_parameters(layer2)

GPU memory requirements of Hyena

Hi,

I am trying to use Hyena for a use case where the input sequence length is up to 1500 and the overall batch size is 40k tokens, and where scaling beyond this length is challenging because of the quadratic memory requirements of attention-based models. Replacing the (self-)attention with Hyena did not reduce the memory requirements, though, and scaling beyond that length still causes OOM issues on a GPU with 40 GB of RAM. In the paper I have only seen scaling curves for runtime, but they show sequence lengths far beyond that limit. I am wondering if you have scaling curves for the amount of memory required at different sequence lengths, to understand how the memory requirements scale with the input length.

Thanks.

dropout_add_layer_norm is not installed

Hi All,

We are trying to run Hyena with lambada.py and get this error: "ImportError: dropout_add_layer_norm is not installed".
Many attempts at installing various versions of flash-attn and other packages did not help.

Any hints or tips on how to solve this?

Thx!
Ofer
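
One hedged pointer rather than an official answer: dropout_add_layer_norm is provided by the fused dropout + layer-norm CUDA extension that is built separately from the main flash-attn package; a later issue in this list builds it with the same kind of csrc install step, roughly:

cd csrc/layer_norm && pip install .

(run from a FlashAttention source checkout; the exact path is an assumption based on the build log quoted in that issue). Alternatively, model configurations that disable the fused path, e.g. fused_dropout_add_ln=False where it is exposed (see the ConvLMHeadModel example in an earlier issue), avoid the import entirely.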

Relative positional encodings with Hyena

Hello,

Is there a way to implement relative positional encodings with Hyena similar to what was done in the Transformer-XL paper? Any tips on how to implement that?

RWKV

The corresponding paper references RWKV, but I don't see any references to experiments with RWKV in the codebase?

Question about discrepancy between implementations available in the repo and related papers

Hi, I'm a bit confused about the current implementations in the repo versus the implementations used/discussed in the related papers. I'll just state what I think is true; please correct me if I'm wrong.

  • FlashConv from H3
    The fused kernel is implemented in fftconv_cuda.cu, but it does not use block FFT.

  • FlashButterfly in "Simple Hardware-Efficient Long Convolutions for Sequence Modeling"
    long_conv.py uses BlockFFT (which is the same as the butterfly decomposition) with support for learnable parameters in dft_matrix, but it does not use a fused kernel, and the three-pass algorithm is also not implemented.

low train accuracy (10% / 40%) on synthetic language modeling tasks for H3

I was reproducing these two runs

python -m train experiment=synthetics/associative_recall/h3
python -m train experiment=synthetics/induction_head/h3

from https://github.com/HazyResearch/safari/blob/main/experiments.md and got the following results respectively:

# associative recall
Epoch 399: 100%|██████████| 189/189 [00:02<00:00, 66.60it/s, loss=1.47, val/accuracy_ignore_index=0.980, val/loss=0.0471, val/perplexity=1.050, test/accuracy_ignore_index=0.980, test/loss=0.0471, test/perplexity=1.050, train/accuracy_ignore_index=0.410, train/loss=1.470, train/perplexity=4.370]
# induction head
Epoch 399: 100%|██████████| 189/189 [00:02<00:00, 65.47it/s, loss=2.63, val/accuracy_ignore_index=1.000, val/loss=0.0804, val/perplexity=1.080, test/accuracy_ignore_index=1.000, test/loss=0.0804, test/perplexity=1.080, train/accuracy_ignore_index=0.165, train/loss=2.640, train/perplexity=14.00]

For both tasks, test accuracies are above 97% as expected, but train accuracies are around train/accuracy_ignore_index=0.410 and train/accuracy_ignore_index=0.165. Any reason why this might happen?

RuntimeError: Expected fft_size >= 16 && fft_size <= 16384

Hello, thanks for the interesting research and open source repo!

I'm trying to integrate the HyenaOperator (with default settings) in a sequence modeling task and am running into the error in the title when using the fftconv extension.

My sequence (u in the trace below) has the shape (batch=10, channels=32, seq_len=8760) which apparently leads to an fft_size of 32768.

  File ".../hyena.py", line 31, in fftconv_fused
    return fftconv_func(u, k, D, gelu=False, force_fp16_output=torch.is_autocast_enabled())
  File ".../extensions/fftconv/fftconv.py", line 175, in fftconv_func
    return FFTConvFunc.apply(
  File ".../extensions/fftconv/fftconv.py", line 98, in forward
    out = fftconv_fwd(
RuntimeError: Expected fft_size >= 16 && fft_size <= 16384 && (fft_size == 1 << int(log2(float(fft_size)))) to be true, but got false.  (Could this error message be improved?  If so, please report an enhancement request to PyTorch.)

Is the maximum supported sequence length 8192? Is this a theoretical/hardware limitation, or just a limitation of the current implementation? Would it be possible to support longer sequences?

Thanks!
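
For what it's worth, a workaround sketch rather than an official answer: the 16384-point limit belongs to the fused CUDA extension, not to the method itself; the pure-PyTorch path in standalone_hyena.py computes the convolution with torch.fft and has no such cap, at reference speed. A minimal sketch matching the shapes in this issue (run from the repository root; any constructor defaults beyond those shown are assumptions):

import torch
from standalone_hyena import HyenaOperator  # pure-PyTorch path, no fused fftconv extension

layer = HyenaOperator(d_model=32, l_max=8760, order=2, filter_order=64)
u = torch.randn(10, 8760, 32)  # (batch, seq_len, d_model)
y = layer(u)                   # no 16384-point FFT-size restriction on this path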

Inconsistent between implementation and paper descriptions

In Algorithm 3 of the paper, for Hyena order N there are (N+1) projections and N filters.
With order=2, it returns:
mlp2(x) * FFTConv(mlp1(x) * FFTConv(mlp0(x), filter0), filter1)

However, in the implementation, e.g.

v = self.dropout(v * x_i)

for Hyena order N there are (N+1) projections but only (N-1) filters.
In the code, for example, with order=2, it will compute
mlp2(x) * FFTConv(mlp0(x) * mlp1(x), filter0)

i.e., for order=N there are only (N-1) FFTConv applications.

Is this intentional, or am I missing something (the code is quite convoluted)?

A lot of the experiments were done with order=2. Does that mean one application of FFTConv per layer is enough?

Does Hyena support BERT style LLM?

Hi, thanks for this awesome work! I am wondering if this could be applied to a BERT-style model, since the paper describes the Hyena filter as preserving causality so that predictions depend only on the past. I have read your HyenaDNA paper and am thinking about using Hyena in my project, which needs to look at both the future and the past. Thanks a lot in advance.

Question concerning FFT operation.

k_f = torch.fft.rfft(k, n=fft_size) / fft_size
u_f = torch.fft.rfft(u.to(dtype=k.dtype), n=fft_size)
if len(u.shape) > 3: k_f = k_f.unsqueeze(1)
y = torch.fft.irfft(u_f * k_f, n=fft_size, norm='forward')[..., :seqlen]

Hello. Thank you for the great work in this paper. I only have a minor question concerning the code.

When performing the FFT, it is my understanding that the inputs should be shifted before and after the operation to be equivalent to the DFT.

Therefore, fftshift(fft(ifftshift(x))) and fftshift(ifft(ifftshift(X))) are the correct methods.

Because the rfft function removes half of the frequency space, I believe that the correct transformation should be rfft(ifftshift(x)) and fftshift(irfft(X)) for the conversions to and from the frequency domain. This may not impact the model performance, and there may be no great difference in the outputs, but I believe that it may be worth noting.

I have included the following links for reference.

https://groups.google.com/g/comp.soft-sys.matlab/c/rUcc0bRRZf4?pli=1

https://dsp.stackexchange.com/questions/66716/why-do-we-have-to-rearrange-a-vector-and-shift-the-zero-point-to-the-first-index

Potential reasons for stuck building wheel for dropout-layer-norm

Hi! It's a really nice repo and the code is quite clear to read.

While trying to run the Hyena language model code, the build of the third package (layer_norm) seems to get stuck; I've waited a whole day for the third install.

cd ./csrc/fused_dense_lib && pip install .
cd ../xentropy && pip install .
cd ../layer_norm && pip install .

I just wonder if it's normal for a build to take this long, or whether there is some bug in my configuration.

Basically, the current output (in the stuck terminal) is:

Building wheels for collected packages: dropout-layer-norm
  Building wheel for dropout-layer-norm (setup.p) ... |

Thank you so much for reading my issue!

About the squash operator of long convolution

Hi,
Thanks for sharing this nice work. I have one question regarding the squashing operator in the forward method of the long convolution module:

k = F.relu(torch.abs(k)-self.kernel_lam)*torch.sign(k)

Doesn't this lead to completely zero kernels as training continues over many training steps? In other words, do you keep eroding the kernel weights until they reach zero everywhere after, e.g., 1M training steps? Is this correct?

Thanks!
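
For context, an illustration only (it does not answer the question about long-run training dynamics): the quoted line is a soft-threshold (shrinkage) applied to the kernel in the forward pass; entries with magnitude below kernel_lam are zeroed, and larger entries are shrunk toward zero by kernel_lam. A minimal sketch with a hypothetical kernel_lam:

import torch
import torch.nn.functional as F

kernel_lam = 0.1  # hypothetical threshold value
k = torch.tensor([-0.05, -0.30, 0.02, 0.50])
squashed = F.relu(torch.abs(k) - kernel_lam) * torch.sign(k)
print(squashed)  # entries with |k| < kernel_lam become 0; the rest shrink by kernel_lam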

configs for Hyena Wikitext103 experiments

Your work is excellent! I am trying to follow it and am facing some problems. Could you share the Hyena config for the WikiText-103 dataset? I tried to run experiments with 125-slim, but the test perplexity is higher than the reported result (about 21 with Hyena). I am also wondering whether the removal of flash-attn influences the result or not.

How to visualize Hyena Matrix H(u) from the Paper

Firstly, I would like to extend my sincere gratitude for your amazing work and the code that you have shared. It's truly impressive.

I was reading through your paper and noticed the visualization of the Hyena Matrix H(u), which I found to be particularly interesting. However, when I went through the codebase, I realized that there isn't a direct implementation for materializing this matrix.

Could you kindly guide me on how I might go about visualizing the Hyena Matrix H(u) using my own dataset? Any pointers or additional information you could provide would be greatly appreciated.

Encoder decoder

Hi
Can we make the Hyena model work like a full vanilla transformer, where we pass the encoder's last hidden state as memory to the decoder?
I was trying to build an OCR system with the Hyena model, so I tried prepending the image embeddings to the text embeddings, but it doesn't seem to learn anything.

Thanks
