sahil280114 / codealpaca
License: Apache License 2.0
Hi @sahil280114, this is a really cool project and great job augmenting the Alpaca dataset with code instructions!
Would you be interested in hosting this on the Hugging Face Hub (https://huggingface.co/datasets)? The Alpaca dataset is also hosted there (link) and your version would be of wide interest to the community, especially since many people have noted the Alpaca models can't code very well :)
If that sounds interesting, you just need to create a new dataset repo and upload your code_alpaca_20k.json file through the UI. For more details, you can check out our docs here.
PS: LLaMA is now included in the main branch of transformers, so you may also want to update your training instructions :)
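For anyone landing here later: once the file is on the Hub (or locally), it can be loaded with the datasets library. A minimal sketch using the local JSON file; the Hub repo id would be whatever name the dataset gets uploaded under:
from datasets import load_dataset

# Load the local JSON file directly; once the dataset is on the Hub,
# the same call works with its repo id in place of "json".
ds = load_dataset("json", data_files="code_alpaca_20k.json")
print(ds["train"][0])  # {'instruction': ..., 'input': ..., 'output': ...}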
I followed the steps in the README, but I get an empty state dict. Here is the code and the output:
code:
trainer = Trainer(model=model, tokenizer=tokenizer, args=training_args, **data_module)
# inspect one weight before training, after training, and after saving
print(trainer.model.state_dict()['model.layers.30.mlp.gate_proj.weight'])
print('training')
trainer.train()
print(trainer.model.state_dict()['model.layers.30.mlp.gate_proj.weight'])
print('trained')
trainer.save_state()
print(trainer.model.state_dict()['model.layers.30.mlp.gate_proj.weight'])
print('saved')
output:
tensor([[ 1.5984e-03, -1.6602e-02, -1.6460e-03, ..., -1.6632e-02,
-1.9989e-02, 1.1383e-02],
[ 9.5062e-03, 3.3356e-02, 5.6343e-03, ..., -3.6743e-02,
-3.2074e-02, 2.6810e-02],
[ 1.1917e-02, -2.1515e-02, -2.6352e-02, ..., 2.7328e-02,
-4.0550e-03, 1.5320e-02],
...,
[-2.8503e-02, 1.5316e-03, -1.8753e-02, ..., 2.9846e-02,
-1.9440e-02, 2.6703e-02],
[ 5.6505e-05, -4.5898e-02, 2.0660e-02, ..., -6.5689e-03,
-3.2043e-02, 1.8005e-02],
[-7.1106e-03, -7.1487e-03, -4.5624e-03, ..., 1.3138e-02,
-4.3060e-02, -1.5869e-02]])
training
tensor([], device='cuda:0', dtype=torch.float16)
trained
tensor([], device='cuda:0', dtype=torch.float16)
saved
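A plausible explanation (an assumption on my part, since the launch command isn't shown here): under sharded training such as DeepSpeed ZeRO-3 or FSDP, each rank holds only a slice of every parameter, so state_dict() can surface empty tensors for shards the current rank does not own. If DeepSpeed is in use, the full parameter can be gathered before inspecting it; a minimal sketch:
import deepspeed

# Gather the sharded weight onto this rank before reading it
# (assumes the run was launched with a DeepSpeed ZeRO-3 config).
param = trainer.model.model.layers[30].mlp.gate_proj.weight
with deepspeed.zero.GatheredParameters(param):
    print(param.data)
Saving through trainer.save_model() instead of trainer.save_state() should likewise consolidate the shards into a full checkpoint when the DeepSpeed config enables stage3_gather_16bit_weights_on_model_save.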
This produced an instruction-following dataset with 20K examples obtained at a much lower cost (less than $200).
Thank you for your willingness to make the dataset publicly available in a non-commercial context; it is much appreciated.
Hi Sahil, great work!
One question: if I understand correctly, we still need to communicate with OpenAI to generate data from the seed file. Are there any privacy concerns with this? What if the seed file has information that is confidential from the org's point of view?
Does any web UI/GUI exist for running this model offline on a PC?
Are there any docs for training the 13B version of this model?
Would you please be so kind as to publish your current weights as a release or somewhere?
Hi, when I run:
torchrun --nproc_per_node=8 train.py \
--model_name_or_path decapoda-research/llama-7b-hf \
--data_path ./data/code_alpaca_20k.json \
--fp16 True \
--output_dir ./output \
--num_train_epochs 3 \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 8 \
--gradient_accumulation_steps 4 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 500 \
--save_total_limit 1 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--deepspeed ds_config.json \
--tf32 False
It raises this error:
Traceback (most recent call last):
File "/mnt/workspace/huangdong.s/codealpaca/train.py", line 222, in <module>
train()
File "/mnt/workspace/huangdong.s/codealpaca/train.py", line 188, in train
model = transformers.AutoModelForCausalLM.from_pretrained(
File "/mnt/workspace/anaconda3/envs/codebias/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 484, in from_pretrained
return model_class.from_pretrained(
File "/mnt/workspace/anaconda3/envs/codebias/lib/python3.9/site-packages/transformers/modeling_utils.py", line 2670, in from_pretrained
init_contexts = [deepspeed.zero.Init(config_dict_or_path=deepspeed_config())] + init_contexts
File "/mnt/workspace/anaconda3/envs/codebias/lib/python3.9/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 724, in __init__
_ds_config = deepspeed.runtime.config.DeepSpeedConfig(config_dict_or_path,
File "/mnt/workspace/anaconda3/envs/codebias/lib/python3.9/site-packages/deepspeed/runtime/config.py", line 769, in __init__
self._configure_train_batch_size()
File "/mnt/workspace/anaconda3/envs/codebias/lib/python3.9/site-packages/deepspeed/runtime/config.py", line 942, in _configure_train_batch_size
self._batch_assertion()
File "/mnt/workspace/anaconda3/envs/codebias/lib/python3.9/site-packages/deepspeed/runtime/config.py", line 890, in _batch_assertion
assert train_batch == micro_batch * grad_acc * self.world_size, (
AssertionError: Check batch related parameters. train_batch_size is not equal to micro_batch_per_gpu * gradient_acc_step * world_size 256 != 8 * 1 * 8
deepspeed = 0.9.3
accelerate = 0.20.2
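The assertion points to a mismatch between the batch settings hardcoded in ds_config.json and those passed on the command line: the Trainer expects 8 per-device × 4 accumulation steps × 8 GPUs = 256, while DeepSpeed sees an accumulation factor of 1. One fix supported by the Hugging Face DeepSpeed integration is to set the batch fields to "auto" so the Trainer fills them in from its own arguments; a sketch of just those fields, assuming the rest of your config stays unchanged:
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto"
}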
@sahil280114 Please let us know if you manage to fine-tune a 65B model for people to play around with. It will be very interesting to see how good it can get at coding and math tasks.
Hi, I found there are four files in the data folder, but nothing in the README explains their differences. Maybe add something to clarify? E.g.:
new_codealpaca.json
rosetta_alpaca.json
Hi there,
First, thanks for your kindness in making the code alpaca dataset publicly available. 🙏
Based on your open-source code instruction data, I further developed CodeUp, a multilingual code-generation Llama 2 model built with parameter-efficient instruction tuning on a single RTX 3090. If anyone is interested, please head over to https://github.com/juyongjiang/CodeUp. Thanks. 😄
(Figures omitted: CodeUp and its training pipeline.)
Nice work!
One suggestion: why not use the already fairly well-formatted (not exactly, but close) data from Rosetta Code to populate the dataset? There are not that many examples even in the 20K JSON, and this would significantly increase the amount of code (and the number of languages) to train on.
E.g., look at this: https://rosettacode.org/wiki/Accumulator_factory
Please elaborate on the process of generating a coding dataset from a GitHub repo in this format, i.e., instruction, input, output. How did you generate this coding dataset?
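For reference, the target format itself can be pinned down from code_alpaca_20k.json: each record is an object with instruction, input, and output fields. A minimal sketch of emitting one such record (the content below is invented for illustration, not taken from the dataset):
import json

# One record in the instruction/input/output format used by
# code_alpaca_20k.json; the example content is made up.
record = {
    "instruction": "Write a function that reverses a string.",
    "input": "",
    "output": "def reverse_string(s):\n    return s[::-1]",
}

with open("my_records.json", "w") as f:
    json.dump([record], f, indent=2)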
I just saw this project. Would it be a good idea to use this checkpoint that is freely available from Meta?
It cannot solve complex algorithm and graph problems.
Example.
Input : Implementing the Knapsack algorithm to solve the 0/1 Knapsack problem: Description: Write a program to implement the Knapsack algorithm to solve the 0/1 Knapsack problem. Input: int values[] = {60, 100, 120}; int weights[] = {10, 20, 30}; int W = 50; Write program in C++ 17 and without explanations.
Output :
int main() {
int values[] = {60, 100, 120};
int weights[] = {10, 20, 30};
int W = 50;
int n = sizeof(values) / sizeof(values[0]);
int m = sizeof(weights) / sizeof(weights[0]);
int max_value = values[0];
int max_weight = weights[0];
int max_weight_value = 0;
int max_value_weight = 0;
int sum = 0;
int result = 0;
for (int i = 0; i < n; i++) {
for (int j = 0; j < m; j++) {
sum = sum + values[i] * weights[j];
if (sum > max_value) {
max_value = values[i];
}
if (sum > max_weight) {
max_weight = weights[j];
}
if (sum > max_weight_value) {
max_weight_value = sum;
}
if (sum > max_value_weight) {
max_value_weight = sum;
}
}
}
if (max_value_weight > W) {
result = max_value_weight;
} else {
result = max_value;
}
cout << result << endl;
return 0;
}
Output : 16800
Expected Output : 220
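For reference, the expected 220 comes from the standard 0/1 knapsack dynamic program: taking items 2 and 3 gives value 100 + 120 = 220 at weight 20 + 30 = 50, which fits W. A minimal sketch in Python (not the model's output, and not code from this repo):
# dp[w] holds the best value achievable with total weight at most w.
def knapsack(values, weights, W):
    dp = [0] * (W + 1)
    for v, wt in zip(values, weights):
        for w in range(W, wt - 1, -1):  # backwards so each item is used at most once
            dp[w] = max(dp[w], dp[w - wt] + v)
    return dp[W]

print(knapsack([60, 100, 120], [10, 20, 30], 50))  # 220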
How large is the model when it is done training? And does anyone want to share theirs?
Assuming Lambda Labs 8xA100 (80 GB), which is about $12/hour, you can get a reasonable cost estimate that way.
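As a rough sanity check (my own back-of-the-envelope numbers, not from this repo): a 7B-parameter checkpoint in fp16 is about 7×10⁹ parameters × 2 bytes ≈ 14 GB on disk, and the original Alpaca authors reported fine-tuning 7B in roughly 3 hours on 8×80GB A100s, so at ~$12/hour that run would land around $36.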