
long_llama's People

Contributors

cstankonrad, isaacbmiller, syzymon


long_llama's Issues

Comparison with RAG techniques

Hi! This is great work and I'm very interested in FoT.
But I'm curious how it compares to RAG techniques. For example, would it be better to use OpenAI's text-embedding-ada-002 for the passkey retrieval task? What is the advantage of FoT?
I would appreciate your insight. @CStanKonrad
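
(For context, the passkey retrieval task referenced above is typically constructed along the lines of the sketch below. This is a minimal illustration, not the repo's evaluation code; the filler text and prompt wording are assumptions.)

import random

def make_passkey_prompt(num_filler_repeats: int = 2000) -> tuple[str, str]:
    # Hide a random passkey inside long distractor text and ask the model to recall it.
    passkey = str(random.randint(10000, 99999))
    filler = "The grass is green. The sky is blue. The sun is yellow. " * num_filler_repeats
    prompt = (
        "There is important info hidden in a lot of irrelevant text. Find it and memorize it.\n"
        + filler
        + f"\nThe pass key is {passkey}. Remember it.\n"
        + filler
        + "\nWhat is the pass key?"
    )
    return prompt, passkey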

Has the context window truly been expanded?

It feels like this is not really about expanding the context window, but rather about augmenting it with the key-value pairs stored during training as external knowledge. This means that once training is complete, the memory no longer changes; it is akin to an external knowledge base for the training-data domain. If I fine-tune on financial data and then try to run inference in the technology sector, it is completely ineffective: the context window has not been expanded, long-range dependencies in the text are still not captured, and the external memory becomes useless in this scenario.
I am very much looking forward to an answer. @CStanKonrad

About the use of rotary positional encoding

I have a question about the rotary positional encoding part of the code.

Your code:


def rotate_as_if_first(x, rotary_emb):
    # x: [bs, num_attention_heads, seq_len, head_size]
    # apply rotary as if all elements were first in the sequence
    cos, sin = rotary_emb(x, x.shape[-2])
    return rotate_one(x, cos, sin, torch.zeros(x.shape[0], x.shape[-2], dtype=torch.long, device=cos.device))

Should it be like this:


def rotate_as_if_first(x, rotary_emb, position_ids):
    # x: [bs, num_attention_heads, seq_len, head_size]
    # apply rotary as if all elements were first in the sequence
    cos, sin = rotary_emb(x, x.shape[-2])
    return rotate_one(x, cos, sin, position_ids)

When rotate_as_if_first calls rotate_one, shouldn't the position_ids parameter be passed in, instead of generating a position tensor with torch.zeros(x.shape[0], x.shape[-2], dtype=torch.long, device=cos.device)?
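
(For readers following this issue, here is a rough sketch of what the rotate_one helper presumably does, modeled on the standard Hugging Face LLaMA rotary code rather than copied from this repo; with all-zero position_ids, every token is rotated as if it sat at position 0, which is what rotate_as_if_first relies on.)

import torch

def rotate_half(x):
    # Split the head dimension in half and recombine with a sign flip (standard RoPE helper).
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def rotate_one(x, cos, sin, position_ids):
    # Gather cos/sin at the requested positions and apply the rotation.
    # cos, sin: [1, 1, seq_len, head_size]; position_ids: [bs, seq_len]
    cos = cos.squeeze(1).squeeze(0)[position_ids].unsqueeze(1)  # [bs, 1, seq_len, head_size]
    sin = sin.squeeze(1).squeeze(0)[position_ids].unsqueeze(1)
    return (x * cos) + (rotate_half(x) * sin)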

0-shot long-context summarization / QA inference

Hi,

Thank you for this great effort.
I am trying to use your 3B m-instruct-v1_1 model to evaluate on my custom long-context QA dataset with context length up to 200k.

I have a question: I couldn't find keywords like 256k in your Colab / .py examples, while there are several mentions of 1024, 2048, etc., as in normal LLaMA. So this model does support long context, right? In which case I should not be using the "drop-in" replacement example.

Thank you very much.
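
(For anyone in a similar situation, a minimal sketch of what long-context usage could look like. The checkpoint name, the trust_remote_code path, and the last_context_length generation argument are assumptions based on the repo's examples; double-check them against the actual Colab / .py files.)

import torch
from transformers import LlamaTokenizer, AutoModelForCausalLM

tokenizer = LlamaTokenizer.from_pretrained("syzymon/long_llama_3b_instruct")
model = AutoModelForCausalLM.from_pretrained(
    "syzymon/long_llama_3b_instruct",   # assumed checkpoint name
    torch_dtype=torch.float32,
    trust_remote_code=True,             # loads the custom LongLLaMA modeling code with memory layers
)

long_prompt = "..."  # long document followed by the question
input_ids = tokenizer(long_prompt, return_tensors="pt").input_ids

# The 1024/2048 values in the examples refer to the local context window; the memory layers are
# what let the total input grow far beyond that.
output = model.generate(
    input_ids=input_ids,
    max_new_tokens=256,
    last_context_length=1792,  # assumed to match the flag of the same name in the fine-tuning script
)
print(tokenizer.decode(output[0], skip_special_tokens=True))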

Could LongNet be easily combined with FoT attention?

https://arxiv.org/abs/2307.02486
The "scaling to 1 billion tokens" paper, combined with this work, seems like it could solve the pursuit of practically infinite context length. FoT also feels similar to L2P (Learning to Prompt), which maintains a pool of prompts to mitigate forgetting while applying continual learning to a model. Maybe the kNN-accessed database of key-value pairs could blend well with L2P, and the LongNet dilation algorithm could definitely benefit from contrastive learning too.

Thoughts?

How to integrate the method with GQA?

I have read the cross-batch code, and I see that the implementation modifies the multi-head attention version of LLaMA attention. But the larger LLaMA 2 models use GQA (grouped-query attention). Is it possible to integrate the method with GQA as well?
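
(Not an answer from the authors, but for context: in the Hugging Face LLaMA-2 code, GQA expands each key/value head so that several query heads share it, roughly as sketched below. Presumably the memory-cached keys and values would need the same expansion, or the kNN lookup would have to operate per key/value head; the sketch shows only the standard GQA helper, not this repo's code.)

import torch

def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
    # hidden_states: [batch, num_key_value_heads, seq_len, head_dim]
    # Expand each key/value head so that n_rep query heads attend to it.
    batch, num_kv_heads, seq_len, head_dim = hidden_states.shape
    if n_rep == 1:
        return hidden_states
    hidden_states = hidden_states[:, :, None, :, :].expand(
        batch, num_kv_heads, n_rep, seq_len, head_dim
    )
    return hidden_states.reshape(batch, num_kv_heads * n_rep, seq_len, head_dim)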

Finetuning code?

That sounds massively interesting. While we try to run inference and read the paper, should we expect a release of the fine-tuning code?

Code for zero-shot arXiv evaluation

Hi,

Can you provide the code, or more detail on how you zero-shot evaluate the arXiv dataset?
I cannot get good results when trying arXiv summarization. I guess it is because I don't know the right prompt, or because my model size is not 7B?

Comparison with other tuning methods

Thanks for your interesting work! I've some questions about your work:

  1. In my opinion, TruthfulQA is just an ordinary dataset, and there is no difference between it and other datasets (like MedicalQA). So is your work just an interesting method for fitting the given dataset (by adjusting the distribution), or can it improve the general ability of the model to generate more "truthful" answers?

  2. In Table 1, you compared your method with supervised fine-tuning and few-shot prompting. Is there any comparison between your method and other tuning methods like LLaMA+LoRA? If possible, could you also compare your method with LLaMA+LangChain? In practice, if we want LLaMA to generate more precise answers, we'd consider LLaMA+LangChain first, though I think that method is inelegant and I don't like the idea of an LLM with a database.

Need clarification on the token limit of inputs used for fine-tuning

Hi, I am going through the page https://huggingface.co/syzymon/long_llama_code_7b_instruct. Under Training, I found the text "All inputs were truncated and randomly padded (left/right) to 3072 tokens". Is there a reason behind this truncation? I have noticed that when creating the instruct version from the LongLLaMA base model, the context lengths used for fine-tuning are significantly smaller than the context length supported at inference time. I am asking because I have prepared a dataset in a format similar to the JSONL file at https://github.com/chrishayuk/opl-train/blob/main/JSONL/train.jsonl, where each line is one input. Several of the inputs in my custom JSONL file are long, around ~15k tokens. If I follow the fine-tuning script you provide at https://github.com/CStanKonrad/long_llama/tree/main/instruction_fine_tuning for my dataset, will the longer inputs get truncated (or ignored beyond a certain number of tokens) during fine-tuning? Or is there a possibility that a long input gets split into several windows during the fine-tuning process?
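
(For concreteness, here is a hypothetical record matching the field names that the fine-tuning script's flags reference, as seen in the training script quoted in a later issue on this page; the content is made up and only illustrates the JSONL layout.)

import json

record = {
    "system_prompt": "You are a helpful assistant.",               # --prompt_field
    "question": "Summarize the following ~15k-token report: ...",  # --question_field
    "response": "The report covers ...",                           # --response_field
}
with open("train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")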

Support for gradient_checkpointing

Thanks for your awesome work! There is a small problem: when I fine-tune long_llama with gradient_checkpointing, it raises an error:
(error traceback was attached as a screenshot)
Could you please update the code in transformers so that long_llama supports gradient_checkpointing? I think it would be useful for the community using long_llama.
@CStanKonrad
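
(For anyone hitting the same error: the usual transformers-level switches are shown below. Whether they work here depends on the custom LongLLaMA modeling code actually implementing gradient-checkpointing support, which is exactly what this issue requests, so treat this as a sketch rather than a fix.)

from transformers import TrainingArguments

# Standard way to request activation checkpointing through the Trainer.
args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    gradient_checkpointing=True,
    bf16=True,
)

# Or directly on an already-loaded model (also standard transformers API):
# model.gradient_checkpointing_enable()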

CrossBatch details in appendix A.2

As mentioned in Appendix A.2:

...across elements of the batch by dividing batch entries into four segments of equal size...The last segment consists of elements exposed to three additional contexts coming from the same document. We abbreviate this setup as 1/4(0,0), 1/4(1,1), 1/4(2,1), 1/4(3,0).

I am a little confused about how to implement the description above. For the segment 1/4(2,1), suppose we are now at the i-th batch: do we need to maintain the memory of both batch i-2 and batch i-1 so that we can obtain two different positive ones, or do we just maintain batch i-1 and repeat the positive one twice? And what about the one negative?

Beyond that, there are no ablation experiments on this trick; is it necessary for LongLLaMA training? Thanks.


UPDATE: I guess maybe I misunderstood the description in A.2. Are the four segments divided vertically (along the batch dimension) within a batch?
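
(To make the "divided vertically" question concrete, one possible reading is that the batch is split along the batch dimension into four equal index segments, as sketched below. What the two numbers in each 1/4(.,.) entry mean is exactly what this issue asks the authors to confirm, so the sketch shows only the segmenting itself.)

import torch

def split_batch_into_segments(batch_size: int, num_segments: int = 4):
    # For batch_size=8: [tensor([0, 1]), tensor([2, 3]), tensor([4, 5]), tensor([6, 7])]
    assert batch_size % num_segments == 0
    seg = batch_size // num_segments
    return [torch.arange(i * seg, (i + 1) * seg) for i in range(num_segments)]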

Help: questions about training on 8k input text length

Hi, long_llama looks very impressive according to the results reported in the paper, and thank you for your great work.
I'm interested in training LongLLaMA-3B on some long-text corpora.
But I frequently run out of memory on my A100-80G.
Are there any solutions for fine-tuning this model on texts of 10k length?
Do you have any ideas for reducing memory usage?
I noticed in your paper that the model was trained at a length of 8k.
Can you share your training script so I can learn from it?

Below is my training script:

#!/bin/bash
EXP_NAME="example_inst_ft_3b_low_budget"
accelerate launch -m instruction_fine_tuning.fine_tuning \
    --run_name "$EXP_NAME" \
    --ddp_find_unused_parameters False \
    --output_dir "$EXP_NAME"/ \
    --model_path "/mnt/shared_home/zhenghao2022/FormatGPT/long_llama/longllama" \
    --torch_dtype bfloat16 \
    --data_type "instructions" \
    --data_path "/mnt/shared_home/zhenghao2022/FormatGPT/long_llama/data/18-09-CC-NEWS-20180929021529-00549.json" \
    --data_revision "f0823c7ffc48c9d33a42f16cf0b885fed4a7d0a1" \
    --dataset_split "train" \
    --prompt_field "system_prompt" \
    --post_prompt_text "
" \
    --question_field "question" \
    --post_question_text "
" \
    --response_field "response" \
    --last_context_length 768 \
    --max_input_length 4096 \
    --max_output_length 4096 \
    --max_total_length 10240 \
    --always_pad False \
    --random_pad True \
    --max_steps 300 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --learning_rate 1.0e-5 \
    --weight_decay 0. \
    --warmup_steps 100 \
    --lr_scheduler_type "cosine" \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_total_limit 3 \
    --save_steps 500 \
    --logging_strategy "steps" \
    --logging_steps 25 \
    --gradient_checkpointing False \
    --tf32 True \
    --bf16 True

Utilizing LongLLaMA with the Mojo framework, applying 4-bit quantization, using FlashAttention-2, and thoughts on speculative execution for LLMs

I am interested in loading LongLLaMA with the Mojo framework, as mentioned here: https://github.com/tairov/llama2.mojo, to increase model speed while applying 4-bit quantization for model compression. Could you provide guidance or examples on how this can be achieved? In particular, I am curious how to maintain model performance while reducing model size with 4-bit quantization. Is it possible to use FlashAttention-2? And what do you think about using LongLLaMA-3B together with the code variant of LongLLaMA for speculative execution, as mentioned here: https://twitter.com/karpathy/status/1697318534555336961?
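
(Not an official answer: 4-bit loading through transformers + bitsandbytes usually looks like the sketch below. Whether it composes with the custom LongLLaMA attention and memory code, with FlashAttention-2, or with a Mojo port is exactly the open question here; the checkpoint name is an assumption.)

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "syzymon/long_llama_3b",        # assumed checkpoint name
    quantization_config=bnb_config,
    trust_remote_code=True,         # required for the custom LongLLaMA modeling code
)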

I have some questions

Could you give me a way to contact you? I copied the code and moved both the model and the input to the GPU, and my results are gibberish that makes no sense...

How is the contrastive data pipeline implemented?

Hi, I saw the paper mention that C_curr and C_prev come from the same document within a batch, but I didn't really see how this is implemented.

It seems that in the data_processing part of the code, the processor just samples a new piece of data each time. How does it guarantee that the next batch will contain the same context (document) across different steps? Thanks.
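
(Not the repo's pipeline, just a sketch of one common way to arrange such pairing: pre-chunk each document and emit consecutive chunks of the same document at the same batch position on consecutive steps, so that one step holds C_prev and the next holds C_curr for each slot. The function name and details are made up.)

def consecutive_chunk_batches(documents, chunk_len, batch_size):
    # documents: list of token-id lists. Yields batches whose b-th element at step t and step t+1
    # are consecutive chunks of the same document.
    streams = [
        iter([doc[i:i + chunk_len] for i in range(0, len(doc), chunk_len)])
        for doc in documents[:batch_size]
    ]
    while True:
        batch = []
        for stream in streams:
            chunk = next(stream, None)
            if chunk is None:  # stop when any document runs out (simplification)
                return
            batch.append(chunk)
        yield batch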

FoT attention and the scaling trick

In your paper, you say:

Position Interpolation (PI, [Chen et al., 2023] and [kaiokendev, 2023]) introduces a modification to the rotary positional encoding scheme that enables fine-tuning for 32K context. In contrast to this work, our method does not rely on positional encodings, following the findings from [Haviv et al., 2022]. Removing positional encoding in memory allows us to extrapolate to 256k tokens, although the model was only trained on sequences up to 8K, yielding theoretically unbounded context length.

Does that mean that one can't use both scaled positional embeddings and FoT attention?
