
Comments (24)

TheFaheem commented on May 23, 2024 · 13

I think they mentioned it indirectly!

  • They gave the command to train:
    (accelerate launch --dynamo_backend=inductor --num_processes=8 --num_machines=1 --machine_rank=0 --deepspeed_multinode_launcher standard --mixed_precision=bf16 --use_deepspeed --deepspeed_config_file=configs/deepspeed/ds_config.json train.py --config configs/train/finetune-7b.yaml)

  • This command needs a tiny correction, because there is no file finetune-7b.yaml in configs/train. It should be like this: (accelerate launch --dynamo_backend=inductor --num_processes=8 --num_machines=1 --machine_rank=0 --deepspeed_multinode_launcher standard --mixed_precision=bf16 --use_deepspeed --deepspeed_config_file=configs/deepspeed/ds_config.json train.py --config configs/train/finetune.yaml)

Alright, this command directs us to where the config file is located (configs/train/finetune.yaml).

Now put your cleaned dataset in any folder you want; it can be something like dataset/your_dataset.

Then set your cleaned dataset's path in finetune.yaml and run the command again.
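Concretely, the edit in configs/train/finetune.yaml comes down to pointing its dataset_path key at your data. A sketch of the relevant excerpt (dataset_path is the key named in this thread; everything else in the real file stays untouched, and the value shown is a placeholder):

```yaml
# configs/train/finetune.yaml (excerpt; only the dataset path changes)
dataset_path: "dataset/your_dataset"
```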

from gpt4all.

TheFaheem commented on May 23, 2024 · 9

@TheFaheem Thanks for the explanation.

Now Put Your Cleaned Dataset in Any Folders you want it can be like dataset/your_dataset

Two questions:

  1. How does the command pick the dataset in dataset/your_dataset? I don't see it specified as a CLI argument.
  2. What should the format of the dataset be?

Sorry for the noob questions, and thanks if you want to reply. 🙇

Don't worry, my friend!

  • If you go to generate.py, lines 56-57, our dataset is loaded as a DataLoader using the load_data method from data.py (check the imports).
  • Now look at data.py, lines 56-57: in the load_data function, we take the dataset path (the dataset/your_dataset path we set in configs/train/finetune.yaml), process it, and convert it into a DataLoader object.
  • So there you go, my friends! I hope your problem is solved.

And my friend, don't be shy to ask any question or consider yourself low. Remember, my friend, even a trillion starts with 0!
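The flow described above (path from the config → records → batches) can be sketched like this. This is a stdlib-only illustration of the idea, not the actual gpt4all code: the real load_data in data.py tokenizes with a Hugging Face tokenizer and returns a torch DataLoader.

```python
import json
from pathlib import Path

def load_data(dataset_path, batch_size=2):
    """Read one JSON record per line from dataset_path and group the
    records into fixed-size batches (a real load_data would tokenize
    them and wrap them in a torch DataLoader)."""
    records = [
        json.loads(line)
        for line in Path(dataset_path).read_text().splitlines()
        if line.strip()
    ]
    # group records into batches of batch_size (last batch may be short)
    return [records[i:i + batch_size] for i in range(0, len(records), batch_size)]
```

The point is just that the path you put in finetune.yaml is all the loader needs; everything else happens inside load_data.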

iterhating commented on May 23, 2024 · 8

Someone who has it running and knows how, just prompt GPT4ALL to write out a guide for the rest of us, eh?

jodosha commented on May 23, 2024 · 6

@TheFaheem Thanks for the explanation.

Now Put Your Cleaned Dataset in Any Folders you want it can be like dataset/your_dataset

Two questions:

  1. How does the command pick the dataset in dataset/your_dataset? I don't see it specified as a CLI argument.
  2. What should the format of the dataset be?

Sorry for the noob questions, and thanks if you want to reply. 🙇

iterhating commented on May 23, 2024 · 5

I would also like to know how to do this

TheFaheem commented on May 23, 2024 · 4

Could you please provide a simple example, or just a flag, so we can use our own documents, data, etc.? Something really simple to set up and use would be fantastic.

Don't worry, my friend, follow me!

  • First, place your dataset anywhere you want inside the repo.
  • Copy the path of the dataset.
  • Now, go to configs/train/finetune.yaml.
  • There you'll find dataset_path; paste in the copied path of your dataset.
  • And that's pretty much it!
  • Now run the training command as they described, and change the hyperparameters if you want.

👋👋 My friend!

BoQsc commented on May 23, 2024 · 4

Why did the issue get closed?

As the issue was solved!

An issue gets solved when documentation adjustments are made or when code changes are linked to the issue.

This issue requires extensive, comprehensive explanations and multiple examples to be resolved and closed.

I can't see how this issue is resolved, and it should not be closed. If closed, this issue will appear again in the future, since it is not resolved in the repository by commits or pull requests that would guide people to a resolution.

tprinty commented on May 23, 2024 · 3

I would love to know this as well.

FiveTechSoft commented on May 23, 2024 · 2

Could you please provide a simple example, or just a flag, so we can use our own documents, data, etc.? Something really simple to set up and use would be fantastic.

BoQsc commented on May 23, 2024 · 2

Why did the issue get closed?

AayushSameerShah commented on May 23, 2024 · 2

Hello @TheFaheem, I loved your responses. I am a little confused; perhaps you could help me through this.

I am planning to create a generative QA model on my own dataset. For that I have chosen GPT-J, specifically nlpcloud/instruct-gpt-j-fp16 (an fp16 version, so that it fits under 12 GB).

Now, the thing is I have 2 options:

  1. Set up a retriever: it can fetch the relevant context from the document store (database) using embeddings and then pass the top (say 3) most relevant documents as context in the prompt along with the question. Like:
Answer as truthfully as possible, don't try to make up an answer. Just answer from the context below:

Context:
* doc_1
* doc_2
* doc_3

Question: {question}
Answer:

But in that case, loading GPT-J onto my GPU (Tesla T4) gives a CUDA out-of-memory error, possibly because of the large prompt.

For that reason, I think there is option 2:

  2. Fine-tune the model with the data: this is a one-time cost; first we fine-tune it, and then we just ask the question without the context.

So, my ask is...

If I go with the 2nd option, will that be appropriate? And if yes, what should my data look like? How should I structure it?

Please shed some light, mate!
Thanks.
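Option 1 above is mostly string assembly once the retriever has returned the top documents; the embedding search itself is out of scope here, but the prompt template quoted above can be sketched as:

```python
def build_prompt(question, docs):
    """Assemble a grounded-QA prompt: list each retrieved document as
    context, then append the question."""
    context = "\n".join(f"* {doc}" for doc in docs)
    return (
        "Answer as truthfully as possible, don't try to make up an answer. "
        "Just answer from the context below:\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

Note that prompt length (not this assembly step) is what drives the memory pressure described below, so trimming docs to the top 1-2 is one cheap mitigation.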

itsalwaysamir commented on May 23, 2024 · 2

Could you please provide a simple example, or just a flag, so we can use our own documents, data, etc.? Something really simple to set up and use would be fantastic.

Don't worry, my friend, follow me!

* First, place your dataset anywhere you want inside the repo.

* Copy the path of the dataset.

* Now, go to configs/train/finetune.yaml.

* There you'll find dataset_path; paste in the copied path of your dataset.

* And that's pretty much it!

* Now run the training command as they described, and change the hyperparameters if you want.

👋👋 My friend!

@TheFaheem Thanks for the explanation.

Now Put Your Cleaned Dataset in Any Folders you want it can be like dataset/your_dataset

Two questions:

  1. How does the command pick the dataset in dataset/your_dataset? I don't see it specified as a CLI argument.
  2. What should the format of the dataset be?

Sorry for the noob questions, and thanks if you want to reply. 🙇

Don't worry, my friend!

* If you go to generate.py, lines 56-57, our dataset is loaded as a DataLoader using the load_data method from data.py (check the imports).

* Now look at data.py, lines 56-57: in the load_data function, we take the dataset path (the dataset/your_dataset path we set in configs/train/finetune.yaml), process it, and convert it into a DataLoader object.

* So there you go, my friends! I hope your problem is solved.

And my friend, don't be shy to ask any question or consider yourself low. Remember, my friend, even a trillion starts with 0!

Thanks a lot! The format of the dataset is still not clear to me; can you add more details or an example dataset? I have my data in the form of a txt file and need to convert it to an acceptable format.
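For the .txt case, the data ultimately needs to be prompt/response pairs. A conversion sketch, assuming the txt file alternates question and answer lines (that layout, and the prompt/response key names, are assumptions here; match whatever columns the loader referenced in finetune.yaml actually expects):

```python
import json
from pathlib import Path

def txt_to_jsonl(txt_path, jsonl_path):
    """Pair up alternating non-empty lines of a txt file (question line,
    then answer line) and write them as one JSON record per line."""
    lines = [l.strip() for l in Path(txt_path).read_text().splitlines() if l.strip()]
    with open(jsonl_path, "w") as f:
        # zip even-indexed lines (prompts) with odd-indexed lines (responses)
        for prompt, response in zip(lines[0::2], lines[1::2]):
            f.write(json.dumps({"prompt": prompt, "response": response}) + "\n")
```

If your txt is free-form prose rather than Q/A pairs, you would instead need to generate questions for each chunk before fine-tuning in this style.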

TheKelvinPerez commented on May 23, 2024 · 1

+1 for this, I believe this is what people want.

I want to be able to fine tune the model with my own datasets.

rajanpanchal commented on May 23, 2024 · 1

Does the data need to be raw, or formatted in some way? For example, I want to train on specific code/classes of a framework that I use. Should I provide the raw classes, or do I need to convert them into some format that gpt4all can understand and learn from?

PhillipRt commented on May 23, 2024

Would it be possible to use llamaindex (GPT Index) for this? Has anybody tried?

AayushSameerShah commented on May 23, 2024

+1 @jodosha
I am looking for the same, but in the meantime I guess we can pass the relevant documents in the prompt as the "context" and then get the answers, if you are after generative QA.

But still, I too am looking for the answers to both your questions 🤗

ChiralCarbon commented on May 23, 2024

What model and tokenizer do we provide in /configs/train/finetune.yaml?
File "/app/gpt-4-all/gpt4all/transformers/src/transformers/utils/hub.py", line 454, in cached_file
raise EnvironmentError(
OSError: nomic-ai/gpt4all-lora does not appear to have a file named config.json. Checkout 'https://huggingface.co/nomic-ai/gpt4all-lora/main' for available files.

TheFaheem commented on May 23, 2024

What model and tokenizer do we provide in /configs/train/finetune.yaml? File "/app/gpt-4-all/gpt4all/transformers/src/transformers/utils/hub.py", line 454, in cached_file raise EnvironmentError( OSError: nomic-ai/gpt4all-lora does not appear to have a file named config.json. Checkout 'https://huggingface.co/nomic-ai/gpt4all-lora/main' for available files.

You can try "decapoda-research/llama-7b-hf", or you can download the official model weights and convert them to the format Hugging Face expects using Transformers.

TheFaheem commented on May 23, 2024

Why did the issue get closed?

As the issue was solved!

TheFaheem commented on May 23, 2024

Hello @TheFaheem, I loved your responses. I am a little confused; perhaps you could help me through this.

I am planning to create a generative QA model on my own dataset. For that I have chosen GPT-J, specifically nlpcloud/instruct-gpt-j-fp16 (an fp16 version, so that it fits under 12 GB).

Now, the thing is I have 2 options:

  1. Set up a retriever: it can fetch the relevant context from the document store (database) using embeddings and then pass the top (say 3) most relevant documents as context in the prompt along with the question. Like:
Answer as truthfully as possible, don't try to make up an answer. Just answer from the context below:

Context:
* doc_1
* doc_2
* doc_3

Question: {question}
Answer:

But in that case, loading GPT-J onto my GPU (Tesla T4) gives a CUDA out-of-memory error, possibly because of the large prompt.

For that reason, I think there is option 2:

  2. Fine-tune the model with the data: this is a one-time cost; first we fine-tune it, and then we just ask the question without the context.

So, my ask is...

If I go with the 2nd option, will that be appropriate? And if yes, what should my data look like? How should I structure it?

Please shed some light, mate! Thanks.

Can you tell me what your dataset looks like?

TheFaheem commented on May 23, 2024

You can refer to https://huggingface.co/gaussalgo/T5-LM-Large_Canard-HotpotQA-rephrase, which uses HotpotQA. Maybe you can get the intuition there!

dancasmed commented on May 23, 2024

If you check configs/generate.yaml, you will see an example of model_name and tokenizer_name. You can just copy the name values.

model_name: "zpn/llama-7b"
tokenizer_name: "zpn/llama-7b"

Don't forget to update the paths.

alinh99 commented on May 23, 2024

Can I use a local model, not one from the Hugging Face Hub?

tiwarikaran commented on May 23, 2024

Does a csv work? Or does it have to be a datasets.DatasetDict format?
The thing is that they use .arrow files to store their data when load_dataset is called. So should we just convert our data into .arrow?

The main part has been beautifully covered by the community, especially @TheFaheem. The only remaining question is how the data should be laid out, i.e. in which format. The main format is this one:
https://huggingface.co/datasets/nomic-ai/gpt4all-j-prompt-generations

So something similar along these lines; this is a parquet, but Hugging Face shows it as a csv.

So what should be done?
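A csv should work: Hugging Face datasets can load it directly via load_dataset("csv", data_files="your.csv") (and likewise "parquet"), producing the .arrow cache itself, so no hand-conversion to .arrow is needed. Purely as a format illustration, here is a stdlib-only sketch of reading a csv whose columns mirror the prompt/response layout of the linked dataset (those column names are an assumption; check what the training loader expects):

```python
import csv
import io

def read_pairs(csv_text):
    """Parse csv text with prompt/response columns into a list of
    training records (one dict per row)."""
    return [
        {"prompt": row["prompt"], "response": row["response"]}
        for row in csv.DictReader(io.StringIO(csv_text))
    ]
```

If your columns are named differently, renaming them to match the expected schema is usually all the "conversion" required.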
