sahil280114 / codealpaca
License: Apache License 2.0
Hi @sahil280114, this is a really cool project and great job augmenting the Alpaca dataset with code instructions!
Would you be interested in hosting this on the Hugging Face Hub (https://huggingface.co/datasets)? The Alpaca dataset is also hosted there (link) and your version would be of wide interest to the community, especially since many people have noted the Alpaca models can't code very well :)
If that sounds interesting, you just need to create a new dataset repo and upload your code_alpaca_20k.json file through the UI. For more details, you can check out our docs here.
PS: LLaMA is now included in the main branch of transformers, so you may also want to update your training instructions :)
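For anyone landing here later: once the file is on the Hub (or locally), it can be loaded with the datasets library. A minimal sketch using the local JSON file; the Hub repo id would be whatever name the dataset gets uploaded under:
from datasets import load_dataset

# Load the local JSON file directly; once the dataset is on the Hub,
# the same call works with its repo id in place of "json".
ds = load_dataset("json", data_files="code_alpaca_20k.json")
print(ds["train"][0])  # {'instruction': ..., 'input': ..., 'output': ...}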
I followed the steps in the README, but I get an empty state dict. Here is the code and the output:
code:
trainer = Trainer(model=model, tokenizer=tokenizer, args=training_args, **data_module)
# inspect one weight before training, after training, and after saving
print(trainer.model.state_dict()['model.layers.30.mlp.gate_proj.weight'])
print('training')
trainer.train()
print(trainer.model.state_dict()['model.layers.30.mlp.gate_proj.weight'])
print('trained')
trainer.save_state()
print(trainer.model.state_dict()['model.layers.30.mlp.gate_proj.weight'])
print('saved')
output:
tensor([[ 1.5984e-03, -1.6602e-02, -1.6460e-03, ..., -1.6632e-02,
-1.9989e-02, 1.1383e-02],
[ 9.5062e-03, 3.3356e-02, 5.6343e-03, ..., -3.6743e-02,
-3.2074e-02, 2.6810e-02],
[ 1.1917e-02, -2.1515e-02, -2.6352e-02, ..., 2.7328e-02,
-4.0550e-03, 1.5320e-02],
...,
[-2.8503e-02, 1.5316e-03, -1.8753e-02, ..., 2.9846e-02,
-1.9440e-02, 2.6703e-02],
[ 5.6505e-05, -4.5898e-02, 2.0660e-02, ..., -6.5689e-03,
-3.2043e-02, 1.8005e-02],
[-7.1106e-03, -7.1487e-03, -4.5624e-03, ..., 1.3138e-02,
-4.3060e-02, -1.5869e-02]])
training
tensor([], device='cuda:0', dtype=torch.float16)
trained
tensor([], device='cuda:0', dtype=torch.float16)
saved
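A plausible explanation (an assumption on my part, since the launch command isn't shown here): under sharded training such as DeepSpeed ZeRO-3 or FSDP, each rank holds only a slice of every parameter, so state_dict() can surface empty tensors for shards the current rank does not own. If DeepSpeed is in use, the full parameter can be gathered before inspecting it; a minimal sketch:
import deepspeed

# Gather the sharded weight onto this rank before reading it
# (assumes the run was launched with a DeepSpeed ZeRO-3 config).
param = trainer.model.model.layers[30].mlp.gate_proj.weight
with deepspeed.zero.GatheredParameters(param):
    print(param.data)
Saving through trainer.save_model() instead of trainer.save_state() should likewise consolidate the shards into a full checkpoint when the DeepSpeed config enables stage3_gather_16bit_weights_on_model_save.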
This produced an instruction-following dataset with 20K examples obtained at a much lower cost (less than $200).
Thank you for your willingness to make the dataset publicly available in a non-commercial context; it is much appreciated.
Hi Sahil, great work!
One question: if I understand correctly, we still need to communicate with OpenAI to generate data from the seed file. Are there any privacy concerns with this? What if the seed file has information that is confidential from the org's point of view?
Does any web UI/GUI exist for running this model offline on a PC?
Are there any docs for training the 13B version of this model?
Would you please be so kind as to publish your current weights as a release or somewhere?
Hi, when I run:
torchrun --nproc_per_node=8 train.py \
--model_name_or_path decapoda-research/llama-7b-hf \
--data_path ./data/code_alpaca_20k.json \
--fp16 True \
--output_dir ./output \
--num_train_epochs 3 \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 8 \
--gradient_accumulation_steps 4 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 500 \
--save_total_limit 1 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--deepspeed ds_config.json \
--tf32 False
It raises this error:
Traceback (most recent call last):
File "/mnt/workspace/huangdong.s/codealpaca/train.py", line 222, in <module>
train()
File "/mnt/workspace/huangdong.s/codealpaca/train.py", line 188, in train
model = transformers.AutoModelForCausalLM.from_pretrained(
File "/mnt/workspace/anaconda3/envs/codebias/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 484, in from_pretrained
return model_class.from_pretrained(
File "/mnt/workspace/anaconda3/envs/codebias/lib/python3.9/site-packages/transformers/modeling_utils.py", line 2670, in from_pretrained
init_contexts = [deepspeed.zero.Init(config_dict_or_path=deepspeed_config())] + init_contexts
File "/mnt/workspace/anaconda3/envs/codebias/lib/python3.9/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 724, in __init__
_ds_config = deepspeed.runtime.config.DeepSpeedConfig(config_dict_or_path,
File "/mnt/workspace/anaconda3/envs/codebias/lib/python3.9/site-packages/deepspeed/runtime/config.py", line 769, in __init__
self._configure_train_batch_size()
File "/mnt/workspace/anaconda3/envs/codebias/lib/python3.9/site-packages/deepspeed/runtime/config.py", line 942, in _configure_train_batch_size
self._batch_assertion()
File "/mnt/workspace/anaconda3/envs/codebias/lib/python3.9/site-packages/deepspeed/runtime/config.py", line 890, in _batch_assertion
assert train_batch == micro_batch * grad_acc * self.world_size, (
AssertionError: Check batch related parameters. train_batch_size is not equal to micro_batch_per_gpu * gradient_acc_step * world_size 256 != 8 * 1 * 8
deepspeed = 0.9.3
accelerate = 0.20.2
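The assertion points to a mismatch between the batch settings hardcoded in ds_config.json and those passed on the command line: the Trainer expects 8 per-device × 4 accumulation steps × 8 GPUs = 256, while DeepSpeed sees an accumulation factor of 1. One fix supported by the Hugging Face DeepSpeed integration is to set the batch fields to "auto" so the Trainer fills them in from its own arguments; a sketch of just those fields, assuming the rest of your config stays unchanged:
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto"
}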
@sahil280114 Please let us know if you manage to fine-tune a 65B model for people to play around with. It will be very interesting to see how good it can get at coding and math tasks.
Hi, I found there are four files in the data folder, but nothing in the README explains their differences. Maybe add something to clarify? E.g.:
new_codealpaca.json
rosetta_alpaca.json
Hi there,
First, thanks for your kindness in making the code alpaca dataset publicly available. 🙏
Based on your open-source code instruction data, I further developed CodeUp, a multilingual code-generation Llama 2 model built with parameter-efficient instruction tuning on a single RTX 3090. If anyone is interested, please head over to https://github.com/juyongjiang/CodeUp. Thanks. 😄
(Figures omitted: CodeUp and its training pipeline.)
Nice work!
One suggestion: why not use the already fairly well-formatted (not exactly, but close) data from Rosetta Code to populate the dataset? There are not that many examples even in the 20K JSON, and this would significantly increase the amount of code (and the number of languages) to train on.
E.g., look at this: https://rosettacode.org/wiki/Accumulator_factory
Please elaborate on the process of generating a coding dataset from a GitHub repo in this format, i.e., instruction, input, output. How did you generate this coding dataset?
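For reference, the target format itself can be pinned down from code_alpaca_20k.json: each record is an object with instruction, input, and output fields. A minimal sketch of emitting one such record (the content below is invented for illustration, not taken from the dataset):
import json

# One record in the instruction/input/output format used by
# code_alpaca_20k.json; the example content is made up.
record = {
    "instruction": "Write a function that reverses a string.",
    "input": "",
    "output": "def reverse_string(s):\n    return s[::-1]",
}

with open("my_records.json", "w") as f:
    json.dump([record], f, indent=2)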
I just saw this project. Would it be a good idea to use this checkpoint that is freely available from Meta?
It cannot solve complex algorithm and graph problems.
Example.
Input : Implementing the Knapsack algorithm to solve the 0/1 Knapsack problem: Description: Write a program to implement the Knapsack algorithm to solve the 0/1 Knapsack problem. Input: int values[] = {60, 100, 120}; int weights[] = {10, 20, 30}; int W = 50; Write program in C++ 17 and without explanations.
Output :
int main() {
int values[] = {60, 100, 120};
int weights[] = {10, 20, 30};
int W = 50;
int n = sizeof(values) / sizeof(values[0]);
int m = sizeof(weights) / sizeof(weights[0]);
int max_value = values[0];
int max_weight = weights[0];
int max_weight_value = 0;
int max_value_weight = 0;
int sum = 0;
int result = 0;
for (int i = 0; i < n; i++) {
for (int j = 0; j < m; j++) {
sum = sum + values[i] * weights[j];
if (sum > max_value) {
max_value = values[i];
}
if (sum > max_weight) {
max_weight = weights[j];
}
if (sum > max_weight_value) {
max_weight_value = sum;
}
if (sum > max_value_weight) {
max_value_weight = sum;
}
}
}
if (max_value_weight > W) {
result = max_value_weight;
} else {
result = max_value;
}
cout << result << endl;
return 0;
}
Output : 16800
Expected Output : 220
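For reference, the expected 220 comes from the standard 0/1 knapsack dynamic program: taking items 2 and 3 gives value 100 + 120 = 220 at weight 20 + 30 = 50, which fits W. A minimal sketch in Python (not the model's output, and not code from this repo):
# dp[w] holds the best value achievable with total weight at most w.
def knapsack(values, weights, W):
    dp = [0] * (W + 1)
    for v, wt in zip(values, weights):
        for w in range(W, wt - 1, -1):  # backwards so each item is used at most once
            dp[w] = max(dp[w], dp[w - wt] + v)
    return dp[W]

print(knapsack([60, 100, 120], [10, 20, 30], 50))  # 220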
How large is the model when it is done training? And does anyone want to share theirs?
Assuming Lambda Labs 8xA100 (80 GB), which is about $12/hour, you can get a reasonable cost estimate that way.
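As a rough sanity check (my own back-of-the-envelope numbers, not from this repo): a 7B-parameter checkpoint in fp16 is about 7×10⁹ parameters × 2 bytes ≈ 14 GB on disk, and the original Alpaca authors reported fine-tuning 7B in roughly 3 hours on 8×80GB A100s, so at ~$12/hour that run would land around $36.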