Comments (17)
It's very easy: just change dataset_path in the config (configs/train/finetune_lora.yaml or configs/train/finetune.yaml). The data format is trivial, just JSONL like this:
{"prompt": "Russia Finishes Building Iran Nuclear Plant MOSCOW (Reuters) - Russia and Iran said Thursday they had finished construction of an atomic power plant in the Islamic Republic -- a project the United States fears Tehran could use to make nuclear arms. \nIs this a piece of news regarding world politics, sports, business, or science and technology? ", "response": "This is a piece of news regarding world politics and science and technology.", "source": "bigscience/p3"}
{"prompt": "Goosen brings a sparkle to the gloom With overnight rain causing a 2-hour delay to the start of play in the HSBC World Matchplay Championship at Wentworth, Retief Goosen did his bit to help to make up ground. \nIs this a piece of news regarding world politics, sports, business, or science and technology? ", "response": "sports", "source": "bigscience/p3"}
{"prompt": "Geiberger has share of clubhouse lead at Greensboro Brent Geiberger took advantage of ideal morning conditions to earn a share of the clubhouse lead with a 6-under-par 66 during the first \nIs this a piece of news regarding world politics, sports, business, or science and technology? ", "response": "sports", "source": "bigscience/p3"}
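A file in that format can be produced with a few lines of Python. This is just a sketch: the file name and the example rows below are made up, only the prompt/response/source keys come from the examples above.

```python
import json

# Hypothetical examples -- each row needs "prompt" and "response";
# "source" is the optional provenance tag seen in the samples above.
examples = [
    {"prompt": "Is this headline about sports or politics? Goosen wins at Wentworth ",
     "response": "sports", "source": "my-data"},
    {"prompt": "Is this headline about sports or politics? Summit held in Moscow ",
     "response": "politics", "source": "my-data"},
]

# JSONL = one JSON object per line.
with open("my_dataset.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")

# Sanity check: every line must parse and carry the required keys.
with open("my_dataset.jsonl", encoding="utf-8") as f:
    rows = [json.loads(line) for line in f]
assert all({"prompt", "response"} <= row.keys() for row in rows)
print(f"wrote {len(rows)} examples")  # -> wrote 2 examples
```

Then point dataset_path in the yaml at the resulting file.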
from gpt4all.
What if I do not have formatted data, just lots of pages of knowledge? Do I have to convert it to the above-mentioned format? If yes, how?
I'm also interested
Same, would be awesome to work on private knowledge base
Updated the title here to make it more like a feature request. We plan on offering this soon.
Thanks, buddy. Unfortunately, my dataset format is based on prompt and completion.
I'm doing this for Pashto poetry generation.
But I want to run the model on a single GPU. Is something like this possible?
But what do we set as the model name in the file, and the wandb entities? These are also requested, but not explained; if we understood those, I'd be up for a documentation PR 👀
There seem to be other questions to answer before one can train on custom data. Could we reopen this issue? It would be helpful if these terms were in the documentation so others can train their own chat on their own data.
Put the filesystem path to the directory containing your HF-formatted model and tokenizer files in those fields. If you don't have a wandb account, which I assume is the case since otherwise it would be obvious, disable wandb. You may have to comment out a line or two if you disable wandb tracking. You may find it easier to just get a wandb account.
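As a rough sanity check before pointing the yaml at a directory, something like this could verify the path looks like a Hugging Face model folder. The expected file names are an assumption based on the usual HF layout, not a gpt4all requirement, and the example path is hypothetical:

```python
import os

# Files typically present in an HF-formatted model directory; the exact
# set varies by model, so treat this list as an assumption.
EXPECTED = ["config.json", "tokenizer_config.json"]

def looks_like_hf_model_dir(path: str) -> bool:
    """Rough check that `path` could serve as the model/tokenizer path in the config."""
    if not os.path.isdir(path):
        return False
    present = set(os.listdir(path))
    return all(name in present for name in EXPECTED)

# Hypothetical path -- use whatever directory you would put in the yaml.
print(looks_like_hf_model_dir("models/my-base-model"))
```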
There is a blog post regarding this question:
https://levelup.gitconnected.com/training-your-own-llm-using-privategpt-f36f0c4f01ec
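Beyond the blog post, one simple illustrative way to turn unformatted pages into the prompt/response JSONL shown earlier is to chunk the text and split each chunk into a continuation pair. The chunking scheme, file name, and field values here are assumptions, not the project's official pipeline; a real pipeline might instead generate question/answer pairs with another model:

```python
import json
import textwrap

def pages_to_jsonl(text: str, out_path: str, chunk_chars: int = 600) -> int:
    """Turn raw, unformatted text into continuation-style prompt/response pairs.

    Each chunk is cut at the space nearest its midpoint: the first half becomes
    the prompt and the second half the response. Returns the number of rows written.
    """
    chunks = textwrap.wrap(text, chunk_chars, break_long_words=False)
    n = 0
    with open(out_path, "w", encoding="utf-8") as f:
        for chunk in chunks:
            cut = chunk.rfind(" ", 0, len(chunk) // 2)
            if cut <= 0:
                continue  # chunk too small or has no space to split on
            row = {"prompt": chunk[:cut] + " ",
                   "response": chunk[cut + 1:],
                   "source": "my-docs"}
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
            n += 1
    return n

# Hypothetical input and output file name.
print(pages_to_jsonl("raw page text " * 100, "pages.jsonl"))
```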
Also interested. This could be covered by an additional docs file.
Thanks, I will also contribute...
I have tons of documentation in the form of GitLab pages, such as https://batchdocs.web.cern.ch/. Has anyone worked on automatic preprocessing of markup-language pages to be fed as a training dataset to gpt4all?
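No official preprocessor is mentioned in this thread, but a rough sketch of stripping common Markdown syntax before chunking could look like this. The regexes are illustrative and far from exhaustive:

```python
import re

def strip_markdown(md: str) -> str:
    """Very rough Markdown-to-plain-text cleanup (illustrative, not exhaustive)."""
    text = re.sub(r"```.*?```", "", md, flags=re.DOTALL)        # drop fenced code blocks
    text = re.sub(r"!\[[^\]]*\]\([^)]*\)", "", text)            # drop images
    text = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", text)        # links -> anchor text
    text = re.sub(r"^#{1,6}\s*", "", text, flags=re.MULTILINE)  # strip heading markers
    text = re.sub(r"[*_`]", "", text)                           # strip emphasis markers
    return re.sub(r"\n{3,}", "\n\n", text).strip()

print(strip_markdown("# Batch docs\nSee [the guide](https://batchdocs.web.cern.ch/) for *details*."))
# -> Batch docs
#    See the guide for details.
```

The cleaned text could then go through whatever chunking step produces the prompt/response JSONL described above.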
@daleevans Thank you for your feedback! I will try your instructions soon!
I cannot find any 'yaml' or 'finetune...' or 'config' folders or files. I'm using the Windows version. Is it possible to train with custom data on Windows?
They're under the configs folder of the repository...
Is there any step-by-step documentation?
It looks easy; I'm just wondering how to use the downloaded model together with my additional dataset.
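Conceptually, the finetune script consumes the downloaded base model (the model path in the yaml) plus your JSONL dataset (dataset_path), where each example is a prompt/response pair joined into one training string. A minimal illustrative loader, not the repo's actual code:

```python
import json

def load_examples(path: str):
    """Join each prompt/response pair into a single training string.

    This mirrors what a causal-LM finetuning script conceptually does with
    the JSONL dataset; the actual gpt4all training code may format and
    tokenize examples differently.
    """
    texts = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            row = json.loads(line)
            texts.append(row["prompt"] + row["response"])
    return texts

# Hypothetical file name -- point this at the file set as dataset_path in the yaml:
# texts = load_examples("my_dataset.jsonl")
```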