
long_llama's People

Contributors

cstankonrad, isaacbmiller, syzymon


long_llama's Issues

Comparison with RAG techniques

Hi! This is great work and I'm very interested in FoT.
But I'm curious how it compares to RAG techniques. For example, would it be better to use OpenAI's text-embedding-ada-002 for the passkey retrieval task? What is the advantage of FoT?
I would appreciate your insight. @CStanKonrad
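
(For context, the passkey retrieval task referenced above is typically constructed along the lines of the sketch below. This is a minimal illustration, not the repo's evaluation code; the filler text and prompt wording are assumptions.)

import random

def make_passkey_prompt(num_filler_repeats: int = 2000) -> tuple[str, str]:
    # Hide a random passkey inside long distractor text and ask the model to recall it.
    passkey = str(random.randint(10000, 99999))
    filler = "The grass is green. The sky is blue. The sun is yellow. " * num_filler_repeats
    prompt = (
        "There is important info hidden in a lot of irrelevant text. Find it and memorize it.\n"
        + filler
        + f"\nThe pass key is {passkey}. Remember it.\n"
        + filler
        + "\nWhat is the pass key?"
    )
    return prompt, passkey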

Has the context window truly been expanded?

It feels like this is not really about expanding the context window, but rather about augmenting it with the key-value pairs stored during training as external knowledge. This means that once training is complete, the memory no longer changes; it is akin to an external knowledge base for the training-data domain. If I fine-tune on financial data and then try to run inference in the technology sector, it is completely ineffective: the context window has not been expanded, long-range dependencies in the text are still not captured, and the external memory becomes useless in this scenario.
I am very much looking forward to an answer. @CStanKonrad

About the use of rotary positional encoding

I have a question about the rotary positional encoding part of the code.

Your code:


def rotate_as_if_first(x, rotary_emb):
    # x: [bs, num_attention_heads, seq_len, head_size]
    # apply rotary as if all elements were first in the sequence
    cos, sin = rotary_emb(x, x.shape[-2])
    return rotate_one(x, cos, sin, torch.zeros(x.shape[0], x.shape[-2], dtype=torch.long, device=cos.device))

Should it be like this:


def rotate_as_if_first(x, rotary_emb, position_ids):
    # x: [bs, num_attention_heads, seq_len, head_size]
    # apply rotary as if all elements were first in the sequence
    cos, sin = rotary_emb(x, x.shape[-2])
    return rotate_one(x, cos, sin, position_ids)

When rotate_as_if_first calls rotate_one, shouldn't the position_ids parameter be passed in, instead of generating a position tensor with torch.zeros(x.shape[0], x.shape[-2], dtype=torch.long, device=cos.device)?
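
(For readers following this issue, here is a rough sketch of what the rotate_one helper presumably does, modeled on the standard Hugging Face LLaMA rotary code rather than copied from this repo; with all-zero position_ids, every token is rotated as if it sat at position 0, which is what rotate_as_if_first relies on.)

import torch

def rotate_half(x):
    # Split the head dimension in half and recombine with a sign flip (standard RoPE helper).
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def rotate_one(x, cos, sin, position_ids):
    # Gather cos/sin at the requested positions and apply the rotation.
    # cos, sin: [1, 1, seq_len, head_size]; position_ids: [bs, seq_len]
    cos = cos.squeeze(1).squeeze(0)[position_ids].unsqueeze(1)  # [bs, 1, seq_len, head_size]
    sin = sin.squeeze(1).squeeze(0)[position_ids].unsqueeze(1)
    return (x * cos) + (rotate_half(x) * sin)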

0-shot long-context summarization / QA inference

Hi,

Thank you for this great effort.
I am trying to use your 3B m-instruct-v1_1 model to evaluate on my custom long-context QA dataset with context length up to 200k.

I have a question: I couldn't find keywords like 256k in your Colab / .py examples, while there are several mentions of 1024, 2048, etc., as in normal LLaMA. So this model does support long context, right? In which case I should not be using the "drop-in" replacement example.

Thank you very much.
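
(For anyone in a similar situation, a minimal sketch of what long-context usage could look like. The checkpoint name, the trust_remote_code path, and the last_context_length generation argument are assumptions based on the repo's examples; double-check them against the actual Colab / .py files.)

import torch
from transformers import LlamaTokenizer, AutoModelForCausalLM

tokenizer = LlamaTokenizer.from_pretrained("syzymon/long_llama_3b_instruct")
model = AutoModelForCausalLM.from_pretrained(
    "syzymon/long_llama_3b_instruct",   # assumed checkpoint name
    torch_dtype=torch.float32,
    trust_remote_code=True,             # loads the custom LongLLaMA modeling code with memory layers
)

long_prompt = "..."  # long document followed by the question
input_ids = tokenizer(long_prompt, return_tensors="pt").input_ids

# The 1024/2048 values in the examples refer to the local context window; the memory layers are
# what let the total input grow far beyond that.
output = model.generate(
    input_ids=input_ids,
    max_new_tokens=256,
    last_context_length=1792,  # assumed to match the flag of the same name in the fine-tuning script
)
print(tokenizer.decode(output[0], skip_special_tokens=True))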

Could LongNet be easily combined with FoT attention?

https://arxiv.org/abs/2307.02486
The "scaling to 1 billion tokens" paper, combined with this work, seems like it could solve the pursuit of practically infinite context length. FoT also feels similar to L2P (Learning to Prompt), which maintains a pool of prompts to mitigate forgetting while applying continual learning to a model. Maybe the kNN-accessed database of key-value pairs could blend well with L2P, and the LongNet dilation algorithm could definitely benefit from contrastive learning too.

Thoughts?

How to integrate the method with GQA?

I have read the cross-batch code, and I see that the implementation modifies the multi-head attention version of LLaMA attention. But the larger LLaMA 2 models use GQA (grouped-query attention). Is it possible to integrate the method with GQA as well?
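
(Not an answer from the authors, but for context: in the Hugging Face LLaMA-2 code, GQA expands each key/value head so that several query heads share it, roughly as sketched below. Presumably the memory-cached keys and values would need the same expansion, or the kNN lookup would have to operate per key/value head; the sketch shows only the standard GQA helper, not this repo's code.)

import torch

def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
    # hidden_states: [batch, num_key_value_heads, seq_len, head_dim]
    # Expand each key/value head so that n_rep query heads attend to it.
    batch, num_kv_heads, seq_len, head_dim = hidden_states.shape
    if n_rep == 1:
        return hidden_states
    hidden_states = hidden_states[:, :, None, :, :].expand(
        batch, num_kv_heads, n_rep, seq_len, head_dim
    )
    return hidden_states.reshape(batch, num_kv_heads * n_rep, seq_len, head_dim)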

Finetuning code?

That sounds massively interesting. While we try to run inference and read the paper, should we expect a release of the fine-tuning code?

Code for zero-shot arXiv evaluation

Hi,

Can you provide the code, or more detail on how you zero-shot evaluate the arXiv dataset?
I cannot get good results when trying arXiv summarization. I guess it is because I don't know the right prompt, or because my model size is not 7B?

Comparison with other tuning methods

Thanks for your interesting work! I've some questions about your work:

  1. In my opinion, TruthfulQA is just an ordinary dataset, and there is no difference between it and other datasets (like MedicalQA). So is your work just an interesting method for fitting the given dataset (by adjusting the distribution), or can it improve the general ability of the model to generate more "truthful" answers?

  2. In Table 1, you compared your method with supervised fine-tuning and few-shot prompting. Is there any comparison between your method and other tuning methods like LLaMA+LoRA? If possible, could you also compare your method with LLaMA+LangChain? In practice, if we want LLaMA to generate more precise answers, we'd consider LLaMA+LangChain first, though I think that method is inelegant and I don't like the idea of an LLM with a database.

Need clarification on the token limit of inputs used for fine-tuning

Hi, I am going through the page https://huggingface.co/syzymon/long_llama_code_7b_instruct. Under Training, I found the text "All inputs were truncated and randomly padded (left/right) to 3072 tokens". Is there a reason behind this truncation? I have noticed that when creating the instruct version from the LongLLaMA base model, the context lengths used for fine-tuning are significantly smaller than the context length supported at inference time. I am asking because I have prepared a dataset in a format similar to the JSONL file at https://github.com/chrishayuk/opl-train/blob/main/JSONL/train.jsonl, where each line is one input. Several of the inputs in my custom JSONL file are long, around ~15k tokens. If I follow the fine-tuning script you provide at https://github.com/CStanKonrad/long_llama/tree/main/instruction_fine_tuning for my dataset, will the longer inputs get truncated (or ignored beyond a certain number of tokens) during fine-tuning? Or is there a possibility that a long input gets split into several windows during the fine-tuning process?
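
(For concreteness, here is a hypothetical record matching the field names that the fine-tuning script's flags reference, as seen in the training script quoted in a later issue on this page; the content is made up and only illustrates the JSONL layout.)

import json

record = {
    "system_prompt": "You are a helpful assistant.",               # --prompt_field
    "question": "Summarize the following ~15k-token report: ...",  # --question_field
    "response": "The report covers ...",                           # --response_field
}
with open("train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")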

Support for gradient_checkpointing

Thanks for your awesome work! There is a small problem: when I fine-tune long_llama with gradient_checkpointing, it raises an error:
(error traceback was attached as a screenshot)
Could you please update the code in transformers so that long_llama supports gradient_checkpointing? I think it would be useful for the community using long_llama.
@CStanKonrad
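
(For anyone hitting the same error: the usual transformers-level switches are shown below. Whether they work here depends on the custom LongLLaMA modeling code actually implementing gradient-checkpointing support, which is exactly what this issue requests, so treat this as a sketch rather than a fix.)

from transformers import TrainingArguments

# Standard way to request activation checkpointing through the Trainer.
args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    gradient_checkpointing=True,
    bf16=True,
)

# Or directly on an already-loaded model (also standard transformers API):
# model.gradient_checkpointing_enable()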

CrossBatch details in appendix A.2

As mentioned in Appendix A.2:

...across elements of the batch by dividing batch entries into four segments of equal size...The last segment consists of elements exposed to three additional contexts coming from the same document. We abbreviate this setup as 1/4(0,0), 1/4(1,1), 1/4(2,1), 1/4(3,0).

I am a little confused about how to implement the description above. For the segment 1/4(2,1), suppose we are now at the i-th batch: do we need to maintain the memory of both batch i-2 and batch i-1 so that we can obtain two different positive ones, or do we just maintain batch i-1 and repeat the positive one twice? And what about the one negative?

Beyond that, there are no ablation experiments on this trick; is it necessary for LongLLaMA training? Thanks.


UPDATE: I guess maybe I misunderstood the description in A.2. Are the four segments divided vertically (along the batch dimension) within a batch?
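
(To make the "divided vertically" question concrete, one possible reading is that the batch is split along the batch dimension into four equal index segments, as sketched below. What the two numbers in each 1/4(.,.) entry mean is exactly what this issue asks the authors to confirm, so the sketch shows only the segmenting itself.)

import torch

def split_batch_into_segments(batch_size: int, num_segments: int = 4):
    # For batch_size=8: [tensor([0, 1]), tensor([2, 3]), tensor([4, 5]), tensor([6, 7])]
    assert batch_size % num_segments == 0
    seg = batch_size // num_segments
    return [torch.arange(i * seg, (i + 1) * seg) for i in range(num_segments)]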

Help: questions about training on 8k input text length

Hi, long_llama looks very impressive according to the results reported in the paper, and thank you for your great work.
I'm interested in training LongLLaMA-3B on some long-text corpora.
But I frequently run out of memory on my A100-80G.
Are there any solutions for fine-tuning this model on texts of 10k length?
Do you have any ideas for reducing memory usage?
I noticed in your paper that the model was trained at a length of 8k.
Can you share your training script so I can learn from it?

Below is my training script:

#!/bin/bash
EXP_NAME="example_inst_ft_3b_low_budget"
accelerate launch -m instruction_fine_tuning.fine_tuning \
    --run_name "$EXP_NAME" \
    --ddp_find_unused_parameters False \
    --output_dir "$EXP_NAME"/ \
    --model_path "/mnt/shared_home/zhenghao2022/FormatGPT/long_llama/longllama" \
    --torch_dtype bfloat16 \
    --data_type "instructions" \
    --data_path "/mnt/shared_home/zhenghao2022/FormatGPT/long_llama/data/18-09-CC-NEWS-20180929021529-00549.json" \
    --data_revision "f0823c7ffc48c9d33a42f16cf0b885fed4a7d0a1" \
    --dataset_split "train" \
    --prompt_field "system_prompt" \
    --post_prompt_text "
" \
    --question_field "question" \
    --post_question_text "
" \
    --response_field "response" \
    --last_context_length 768 \
    --max_input_length 4096 \
    --max_output_length 4096 \
    --max_total_length 10240 \
    --always_pad False \
    --random_pad True \
    --max_steps 300 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --learning_rate 1.0e-5 \
    --weight_decay 0. \
    --warmup_steps 100 \
    --lr_scheduler_type "cosine" \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_total_limit 3 \
    --save_steps 500 \
    --logging_strategy "steps" \
    --logging_steps 25 \
    --gradient_checkpointing False \
    --tf32 True \
    --bf16 True

Utilizing LongLLaMA with the Mojo framework, applying 4-bit quantization, using FlashAttention-2, and thoughts on speculative execution for LLMs

I am interested in loading LongLLaMA with the Mojo framework, as mentioned here: https://github.com/tairov/llama2.mojo, to increase model speed while applying 4-bit quantization for model compression. Could you provide guidance or examples on how this can be achieved? In particular, I am curious how to maintain model performance while reducing model size with 4-bit quantization. Is it possible to use FlashAttention-2? And what do you think about using LongLLaMA-3B together with the code variant of LongLLaMA for speculative execution, as mentioned here: https://twitter.com/karpathy/status/1697318534555336961?
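
(Not an official answer: 4-bit loading through transformers + bitsandbytes usually looks like the sketch below. Whether it composes with the custom LongLLaMA attention and memory code, with FlashAttention-2, or with a Mojo port is exactly the open question here; the checkpoint name is an assumption.)

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "syzymon/long_llama_3b",        # assumed checkpoint name
    quantization_config=bnb_config,
    trust_remote_code=True,         # required for the custom LongLLaMA modeling code
)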

I have some questions

Could you give me a way to contact you? I copied the code and moved both the model and the input to the GPU, and my results are gibberish that makes no sense...

How is the contrastive data pipeline implemented?

Hi, I saw the paper mention that C_curr and C_prev come from the same document within a batch, but I didn't really see how this is implemented.

It seems that in the data_processing part of the code, the processor just samples a new piece of data each time. How does it guarantee that the next batch will contain the same context (document) across different steps? Thanks.
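
(Not the repo's pipeline, just a sketch of one common way to arrange such pairing: pre-chunk each document and emit consecutive chunks of the same document at the same batch position on consecutive steps, so that one step holds C_prev and the next holds C_curr for each slot. The function name and details are made up.)

def consecutive_chunk_batches(documents, chunk_len, batch_size):
    # documents: list of token-id lists. Yields batches whose b-th element at step t and step t+1
    # are consecutive chunks of the same document.
    streams = [
        iter([doc[i:i + chunk_len] for i in range(0, len(doc), chunk_len)])
        for doc in documents[:batch_size]
    ]
    while True:
        batch = []
        for stream in streams:
            chunk = next(stream, None)
            if chunk is None:  # stop when any document runs out (simplification)
                return
            batch.append(chunk)
        yield batch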

FoT attention and the scaling trick

In your paper, you say:

Position Interpolation (PI, [Chen et al., 2023] and [kaiokendev, 2023]) introduces a modification to the rotary positional encoding scheme that enables fine-tuning for 32K context. In contrast to this work, our method does not rely on positional encodings, following the findings from [Haviv et al., 2022]. Removing positional encoding in memory allows us to extrapolate to 256k tokens, although the model was only trained on sequences up to 8K, yielding theoretically unbounded context length.

Does that mean that one can't use both scaled positional embeddings and FoT attention?
