Comments (24)
I think they mentioned it indirectly!
They gave the command to train:
accelerate launch --dynamo_backend=inductor --num_processes=8 --num_machines=1 --machine_rank=0 --deepspeed_multinode_launcher standard --mixed_precision=bf16 --use_deepspeed --deepspeed_config_file=configs/deepspeed/ds_config.json train.py --config configs/train/finetune-7b.yaml
This command needs a small correction, because there is no file finetune-7b.yaml in configs/train. It should be:
accelerate launch --dynamo_backend=inductor --num_processes=8 --num_machines=1 --machine_rank=0 --deepspeed_multinode_launcher standard --mixed_precision=bf16 --use_deepspeed --deepspeed_config_file=configs/deepspeed/ds_config.json train.py --config configs/train/finetune.yaml
This command points us to where the config file is located (configs/train/finetune.yaml).
Now put your cleaned dataset in any folder you want, e.g. dataset/your_dataset.
Then set the path to your cleaned dataset in finetune.yaml and run the command again.
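Concretely, the relevant part of configs/train/finetune.yaml might look like the sketch below. Only dataset_path is the key discussed in this thread (and model_name/tokenizer_name values appear later in the thread); the other keys are illustrative assumptions, so check them against your copy of the repo:

```yaml
# configs/train/finetune.yaml (sketch, not the repo's exact file)
model_name: "zpn/llama-7b"
tokenizer_name: "zpn/llama-7b"
dataset_path: "dataset/your_dataset"   # point this at your cleaned dataset
# keys below are assumed placeholders
max_length: 1024
batch_size: 4
```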
from gpt4all.
@TheFaheem Thanks for the explanation.
Now put your cleaned dataset in any folder you want; it can be like dataset/your_dataset
Two questions:
- How does the command pick up the dataset in dataset/your_dataset? I don't see it specified as a CLI argument.
- What should the format of the dataset be?
Sorry for the noob questions, and thanks if you want to reply. 🙇
Don't worry, my friend!
- If you go to generate.py, lines 56-57, the dataset is loaded as a DataLoader using the load_data method from data.py (check the imports).
- Now, in data.py, also around lines 56-57, the load_data function takes the dataset path (the dataset/your_dataset path) that we set in configs/train/finetune.yaml, processes it, and converts it into a DataLoader object.
- So there you go, my friend! I hope your problem is solved.
And my friend, don't be shy to ask any question, or consider yourself beneath anyone. Remember, my friend, even a trillion starts with 0!
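The path-to-batches flow described above can be sketched with the standard library alone. The repo's real load_data in data.py does tokenization and returns a torch DataLoader; the function names and the JSONL assumption below are illustrative, not the repo's actual signatures:

```python
import json
from pathlib import Path

def load_records(dataset_path):
    """Read one JSON object per line (JSONL) from the configured dataset path."""
    records = []
    for line in Path(dataset_path).read_text().splitlines():
        if line.strip():
            records.append(json.loads(line))
    return records

def batched(records, batch_size):
    """Yield fixed-size batches, like a (non-shuffling) DataLoader would."""
    for i in range(0, len(records), batch_size):
        yield records[i:i + batch_size]
```

In the real repo the dataset_path value from configs/train/finetune.yaml ends up here; this sketch only mirrors the shape of that flow.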
Someone who has it running and knows how, just prompt GPT4ALL to write out a guide for the rest of us, eh?
@TheFaheem Thanks for the explanation.
Now put your cleaned dataset in any folder you want; it can be like dataset/your_dataset
Two questions:
- How does the command pick up the dataset in dataset/your_dataset? I don't see it specified as a CLI argument.
- What should the format of the dataset be?
Sorry for the noob questions, and thanks if you want to reply. 🙇
I would also like to know how to do this
Could you please provide a simple example, or just a flag, so we can use our own documents, data, etc.? Something really simple to set up and use would be fantastic.
Don't worry my friend, follow me!
- First, place your dataset anywhere you want inside the repo.
- Copy the path to the dataset.
- Now, go to configs/train/finetune.yaml.
- There you'll find dataset_path; paste in the path you copied.
- That's pretty much it!
- Now run the training command as described, and change the hyperparameters if you want.
👋👋 My friend!
Why did the issue get closed?
Because the issue was solved!
An issue gets solved when documentation adjustments are made or code changes linked to the issue land.
This issue requires extensive, comprehensive explanations and multiple examples to be resolved and closed.
I can't see how this issue is resolved, and it should not be closed. If it is closed, it will come up again in the future, since it is not resolved in the repository by commits or pull requests that would guide people to a resolution.
I would love to know this as well.
Could you please provide a simple example, or just a flag, so we can use our own documents, data, etc.? Something really simple to set up and use would be fantastic.
Why did the issue get closed?
Hello @TheFaheem, I loved your responses. I am a little confused; could you help me through this?
I am planning to create a generative QA model on my own dataset. For that I have chosen GPT-J, specifically nlpcloud/instruct-gpt-j-fp16 (an fp16 version so that it fits under 12GB).
Now, the thing is, I have 2 options:
- Set up a retriever: it can fetch the relevant context from the document store (database) using embeddings, and then pass the top (say 3) most relevant documents as the context in the prompt along with the question. Like:
Answer as truthfully as possible, don't try to make up an answer. Just answer from the context below:
Context:
* doc_1
* doc_2
* doc_3
Question: {question}
Answer:
But in that case, loading GPT-J on my GPU (Tesla T4) gives a CUDA out-of-memory error, possibly because of the large prompt.
For that reason, I think there is option 2.
- Fine-tune the model with the data: this is a one-time cost; first we fine-tune it, and then we just ask the question without the context.
So, my ask is: if I go with the 2nd option, will that be appropriate? And if yes, what should my data look like? How should I structure it?
Please shed some light, mate!
Thanks.
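For the fine-tuning route, instruction-tuning datasets are commonly structured as prompt/response pairs, one JSON object per line. The field names "prompt" and "response" below are an assumption (check what your training script actually reads); the records themselves are made-up examples:

```python
import json

# Illustrative prompt/response records for instruction fine-tuning.
records = [
    {"prompt": "What is the capital of France?", "response": "Paris."},
    {"prompt": "Summarize: The cat sat on the mat.", "response": "A cat sat on a mat."},
]

def to_jsonl(records):
    """Serialize records as JSONL, the usual on-disk layout for such data."""
    return "\n".join(json.dumps(r) for r in records)
```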
Could you please provide a simple example, or just a flag, so we can use our own documents, data, etc.? Something really simple to set up and use would be fantastic.
Don't worry my friend, follow me!
- First, place your dataset anywhere you want inside the repo.
- Copy the path to the dataset.
- Now, go to configs/train/finetune.yaml.
- There you'll find dataset_path; paste in the path you copied.
- That's pretty much it!
- Now run the training command as described, and change the hyperparameters if you want.
👋👋 My friend!
@TheFaheem Thanks for the explanation.
Now put your cleaned dataset in any folder you want; it can be like dataset/your_dataset
Two questions:
- How does the command pick up the dataset in dataset/your_dataset? I don't see it specified as a CLI argument.
- What should the format of the dataset be?
Sorry for the noob questions, and thanks if you want to reply. 🙇
Don't worry, my friend!
- If you go to generate.py, lines 56-57, the dataset is loaded as a DataLoader using the load_data method from data.py (check the imports).
- Now, in data.py, also around lines 56-57, the load_data function takes the dataset path (the dataset/your_dataset path) that we set in configs/train/finetune.yaml, processes it, and converts it into a DataLoader object.
- So there you go, my friends! I hope your problem is solved.
And my friend, don't be shy to ask any question, or consider yourself beneath anyone. Remember, my friend, even a trillion starts with 0!
Thanks a lot! The format of the dataset is still not clear to me. Can you add more details or an example dataset? I have my data in a txt file and need to convert it to an acceptable format.
+1 for this, I believe this is what people want.
I want to be able to fine-tune the model with my own datasets.
Does the data need to be raw, or formatted in some way? For example, I want to train on specific code/classes of a framework I use. Should I provide the raw classes, or do I need to convert them into some format that gpt4all can understand and learn from?
Would it be possible to use LlamaIndex (GPT Index) for this? Has anybody tried?
+1 @jodosha
I am looking for the same, but in the meantime I guess we can pass the relevant documents in the prompt as the "context" and get the answers, if you are looking for generative QA.
But I am still looking for the answers to both of your questions 🤗
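Passing retrieved documents as context, as suggested above, can be sketched like this. The template mirrors the one quoted earlier in the thread; the function and its parameters are illustrative, not part of any library:

```python
def build_qa_prompt(question, docs):
    """Assemble a retrieval-augmented prompt from the top-ranked documents."""
    context = "\n".join(f"* {d}" for d in docs)
    return (
        "Answer as truthfully as possible, don't try to make up an answer. "
        "Just answer from the context below:\n"
        f"Context:\n{context}\n"
        f"Question: {question}\n"
        "Answer:"
    )
```

The assembled string would then be sent to the model as the full prompt.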
What model and tokenizer do we provide in /configs/train/finetune.yaml?
File "/app/gpt-4-all/gpt4all/transformers/src/transformers/utils/hub.py", line 454, in cached_file
raise EnvironmentError(
OSError: nomic-ai/gpt4all-lora does not appear to have a file named config.json. Checkout 'https://huggingface.co/nomic-ai/gpt4all-lora/main' for available files.
What model and tokenizer do we provide in /configs/train/finetune.yaml?
File "/app/gpt-4-all/gpt4all/transformers/src/transformers/utils/hub.py", line 454, in cached_file
raise EnvironmentError(
OSError: nomic-ai/gpt4all-lora does not appear to have a file named config.json. Checkout 'https://huggingface.co/nomic-ai/gpt4all-lora/main' for available files.
You can try "decapoda-research/llama-7b-hf", or you can download the model weights from the official source and convert them to a format suited for Hugging Face using transformers.
Why did the issue get closed?
Because the issue was solved!
Hello @TheFaheem, I loved your responses. I am a little confused; could you help me through this?
I am planning to create a generative QA model on my own dataset. For that I have chosen GPT-J, specifically nlpcloud/instruct-gpt-j-fp16 (an fp16 version so that it fits under 12GB).
Now, the thing is, I have 2 options:
- Set up a retriever: it can fetch the relevant context from the document store (database) using embeddings, and then pass the top (say 3) most relevant documents as the context in the prompt along with the question. Like:
Answer as truthfully as possible, don't try to make up an answer. Just answer from the context below:
Context:
* doc_1
* doc_2
* doc_3
Question: {question}
Answer:
But in that case, loading GPT-J on my GPU (Tesla T4) gives a CUDA out-of-memory error, possibly because of the large prompt.
For that reason, I think there is option 2.
- Fine-tune the model with the data: this is a one-time cost; first we fine-tune it, and then we just ask the question without the context.
So, my ask is: if I go with the 2nd option, will that be appropriate? And if yes, what should my data look like? How should I structure it?
Please shed some light, mate! Thanks.
Can you tell me what your dataset looks like?
You can refer to https://huggingface.co/gaussalgo/T5-LM-Large_Canard-HotpotQA-rephrase, which uses HotpotQA. Maybe you can get some intuition there!
If you check configs/generate.yaml you will see an example for model_name and tokenizer_name. You can just copy the name values.
model_name: "zpn/llama-7b"
tokenizer_name: "zpn/llama-7b"
Don't forget to update the paths.
Can I use a local model, not one from Hugging Face?
Does a CSV work? Or does it have to be a datasets.Dict format?
The thing is, they use .arrow files to store the data when load_dataset is called. So should we just convert our data to .arrow?
The main part has been beautifully covered by the community, especially @TheFaheem. The only remaining question is the format the data should be in. The reference format is this one:
https://huggingface.co/datasets/nomic-ai/gpt4all-j-prompt-generations
Something along those lines; it is a Parquet file, but Hugging Face displays it as a CSV.
So what should be done?