jzhang38 / longmamba

Some preliminary explorations of Mamba's context scaling.

Python 98.87% Shell 1.13%

longmamba's Issues

Memory Usage

8x A100 80 GB strikes me as a lot of memory for a 2.8B-parameter model at 16k context.

Is this attributable to the trainable A matrix being large (owing to the size of the state, h)?
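
For a rough sense of scale, here is a back-of-the-envelope sketch. Every number in it is an assumption rather than something taken from the repo: the dimensions are the commonly cited Mamba-2.8B configuration (`d_model=2560`, `n_layer=64`, `d_state=16`, `expand=2`), the 16 bytes per parameter is a typical AdamW mixed-precision budget, and the "materialized state" figure is a hypothetical worst case in which every per-token state h is kept in memory.

```python
# Back-of-the-envelope sizes. All dimensions are ASSUMPTIONS
# (commonly cited Mamba-2.8B config), not values read from the repo.
D_MODEL = 2560
N_LAYER = 64
D_STATE = 16
EXPAND = 2
SEQ_LEN = 16_384
N_PARAMS = 2_800_000_000


def a_matrix_params(d_model=D_MODEL, n_layer=N_LAYER,
                    d_state=D_STATE, expand=EXPAND):
    """Total trainable entries in A: one (d_inner, d_state) matrix per layer."""
    d_inner = expand * d_model
    return n_layer * d_inner * d_state


def materialized_state_bytes(seq_len=SEQ_LEN, batch=1, bytes_per_el=2,
                             d_model=D_MODEL, n_layer=N_LAYER,
                             d_state=D_STATE, expand=EXPAND):
    """Hypothetical cost of storing h for every token in every layer (bf16)."""
    d_inner = expand * d_model
    return n_layer * batch * seq_len * d_inner * d_state * bytes_per_el


def adam_training_bytes(n_params=N_PARAMS, bytes_per_param=16):
    """Assumed budget: fp32 weights + fp32 grads + two fp32 Adam moments."""
    return n_params * bytes_per_param


if __name__ == "__main__":
    print(f"A matrix parameters:      {a_matrix_params():,}")
    print(f"Materialized states (GB): {materialized_state_bytes() / 1e9:.1f}")
    print(f"Weights + optimizer (GB): {adam_training_bytes() / 1e9:.1f}")
```

Under these assumptions A itself is tiny (about 5 million entries), so it is unlikely to be the culprit on its own. Weights, gradients, and optimizer state, plus whatever activations are kept for the backward pass at 16k tokens, would account for far more.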

cache clearing interval for previous hidden states

I love this exploration! Thanks for writing and coding this up. Right now, we're working on modifications to the causal conv1d and selective scan CUDA kernels to support defining the input state, so we are reviewing your code carefully.

What is the objective of the exponential fall-off in the cache clearing in train-infinite.py?

        if completed_steps % clear_cache_interval == 0:
            for layer_idx in range(model.config.n_layer):
                conv_state = torch.zeros((1, model.config.d_model*2, 3), dtype=torch.bfloat16, device=accelerator.device).detach()
                ssm_state = torch.zeros((1, model.config.d_model*2, 16), dtype=torch.bfloat16, device=accelerator.device).detach()
                previous_hidden_states.append((conv_state, ssm_state))
            clear_cache_interval *= 2
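
To make the schedule concrete, here is a minimal sketch that replays only the `%` check and the doubling from the snippet above, with no model involved. The starting interval and the first step number are assumptions, since the snippet does not show how `clear_cache_interval` or `completed_steps` are initialized; `reset_steps` is a hypothetical helper name.

```python
def reset_steps(total_steps, initial_interval, first_step=1):
    """Return the steps at which the cache would be reset.

    Mirrors the check in the snippet: reset when the step is a multiple
    of the current interval, then double the interval.
    initial_interval and first_step are assumed values.
    """
    interval = initial_interval
    fired = []
    for step in range(first_step, total_steps + 1):
        if step % interval == 0:
            fired.append(step)
            interval *= 2
    return fired


if __name__ == "__main__":
    print(reset_steps(64, 1))    # [1, 2, 4, 8, 16, 32, 64]
    print(reset_steps(100, 10))  # [10, 20, 40, 80]
```

With these starting values the gap between resets doubles each time, so the states are carried across exponentially longer stretches of training as it progresses.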

Also, a general question: do you have a sense of why your current implementation isn't working? Might vanishing gradients be an issue when running over longer sequences? I noticed that you're using bf16. In my experience this caused instability, and using AMP (automatic mixed precision) to keep some computation in higher precision seemed to help.
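
As an illustration of one way low precision can hurt over long sequences (not a diagnosis of this repo), the sketch below emulates bf16 rounding in pure Python and accumulates many small updates into a running value. The helper names `to_bf16` and `accumulate` are hypothetical, and the numbers are chosen only to make the effect visible.

```python
import struct


def to_bf16(x):
    """Round a Python float to the nearest bfloat16 value (ties to even).

    Emulation only: goes through float32 bits and keeps the top 16.
    Intended for ordinary finite values, not NaN/inf.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round to nearest, ties to even
    bits &= 0xFFFF0000                   # drop the low 16 bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def accumulate(start, delta, steps, bf16):
    """Repeatedly add delta to a running value, optionally rounding to bf16."""
    h = start
    for _ in range(steps):
        h = h + delta
        if bf16:
            h = to_bf16(h)
    return h


if __name__ == "__main__":
    print(accumulate(1.0, 0.001, 1000, bf16=False))  # close to 2.0
    print(accumulate(1.0, 0.001, 1000, bf16=True))   # stays at 1.0
```

Near 1.0, adjacent bf16 values are 2^-7 apart, so each 0.001 update is rounded away and the running value never moves. Keeping accumulations in fp32, which is roughly what mixed-precision training does for sensitive operations, avoids this particular failure.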

Question: Is an instruction-tuned model available?

Hello,

Just wondering if the model that you provided on Hugging Face was instruction tuned to perform the needle-in-a-haystack test.

Also, hypothetically speaking, would some of the practices for reducing GPU requirements also apply to SSM models? For example, Unsloth reduces GPU demand enough that consumer GPUs can train Llama 2 7B and Mistral 7B models. My 8 GB GPU was able to fine-tune Mistral for a small use case of mine. It would be absolutely amazing to see a Mamba-7B model train with half the resources that Unsloth Mistral 7B needs.
