Comments (17)
It's very easy: just change dataset_path in the config (configs/train/finetune_lora.yaml or configs/train/finetune.yaml). The data format is trivial, just JSONL like this:
{"prompt": "Russia Finishes Building Iran Nuclear Plant MOSCOW (Reuters) - Russia and Iran said Thursday they had finished construction of an atomic power plant in the Islamic Republic -- a project the United States fears Tehran could use to make nuclear arms. \nIs this a piece of news regarding world politics, sports, business, or science and technology? ", "response": "This is a piece of news regarding world politics and science and technology.", "source": "bigscience/p3"}
{"prompt": "Goosen brings a sparkle to the gloom With overnight rain causing a 2-hour delay to the start of play in the HSBC World Matchplay Championship at Wentworth, Retief Goosen did his bit to help to make up ground. \nIs this a piece of news regarding world politics, sports, business, or science and technology? ", "response": "sports", "source": "bigscience/p3"}
{"prompt": "Geiberger has share of clubhouse lead at Greensboro Brent Geiberger took advantage of ideal morning conditions to earn a share of the clubhouse lead with a 6-under-par 66 during the first \nIs this a piece of news regarding world politics, sports, business, or science and technology? ", "response": "sports", "source": "bigscience/p3"}
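A file in that format can be produced with a few lines of Python. This is just a sketch: the file name and the example rows below are made up, only the prompt/response/source keys come from the examples above.

```python
import json

# Hypothetical examples -- each row needs "prompt" and "response";
# "source" is the optional provenance tag seen in the samples above.
examples = [
    {"prompt": "Is this headline about sports or politics? Goosen wins at Wentworth ",
     "response": "sports", "source": "my-data"},
    {"prompt": "Is this headline about sports or politics? Summit held in Moscow ",
     "response": "politics", "source": "my-data"},
]

# JSONL = one JSON object per line.
with open("my_dataset.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")

# Sanity check: every line must parse and carry the required keys.
with open("my_dataset.jsonl", encoding="utf-8") as f:
    rows = [json.loads(line) for line in f]
assert all({"prompt", "response"} <= row.keys() for row in rows)
print(f"wrote {len(rows)} examples")  # -> wrote 2 examples
```

Then point dataset_path in the yaml at the resulting file.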
from gpt4all.
What if I do not have formatted data, just lots of pages of knowledge? Do I have to convert it to the above-mentioned format? If yes, how?
I'm also interested
Same, would be awesome to work on private knowledge base
Updated the title here to make it more like a feature request. We plan on offering this soon.
Thanks, buddy. Unfortunately, my dataset format is based on prompt and completion.
I'm doing this for Pashto poetry generation.
But I want to run the model on a single GPU. Is something like this possible?
But what do we set as the model name in the file, and the wandb entities? These are also requested, but not explained; if we understood those, I'd be up for a documentation PR 👀
There seem to be other questions to answer before one can train on custom data. Could we reopen this issue? It would be helpful if these terms were in the documentation so others can train their own chat on their own data.
Put the filesystem path to the directory containing your HF-formatted model and tokenizer files in those fields. If you don't have a wandb account, which I assume is the case since otherwise it would be obvious, disable wandb. You may have to comment out a line or two if you disable wandb tracking. You may find it easier to just get a wandb account.
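As a rough sanity check before pointing the yaml at a directory, something like this could verify the path looks like a Hugging Face model folder. The expected file names are an assumption based on the usual HF layout, not a gpt4all requirement, and the example path is hypothetical:

```python
import os

# Files typically present in an HF-formatted model directory; the exact
# set varies by model, so treat this list as an assumption.
EXPECTED = ["config.json", "tokenizer_config.json"]

def looks_like_hf_model_dir(path: str) -> bool:
    """Rough check that `path` could serve as the model/tokenizer path in the config."""
    if not os.path.isdir(path):
        return False
    present = set(os.listdir(path))
    return all(name in present for name in EXPECTED)

# Hypothetical path -- use whatever directory you would put in the yaml.
print(looks_like_hf_model_dir("models/my-base-model"))
```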
There is a blog post regarding this question:
https://levelup.gitconnected.com/training-your-own-llm-using-privategpt-f36f0c4f01ec
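Beyond the blog post, one simple illustrative way to turn unformatted pages into the prompt/response JSONL shown earlier is to chunk the text and split each chunk into a continuation pair. The chunking scheme, file name, and field values here are assumptions, not the project's official pipeline; a real pipeline might instead generate question/answer pairs with another model:

```python
import json
import textwrap

def pages_to_jsonl(text: str, out_path: str, chunk_chars: int = 600) -> int:
    """Turn raw, unformatted text into continuation-style prompt/response pairs.

    Each chunk is cut at the space nearest its midpoint: the first half becomes
    the prompt and the second half the response. Returns the number of rows written.
    """
    chunks = textwrap.wrap(text, chunk_chars, break_long_words=False)
    n = 0
    with open(out_path, "w", encoding="utf-8") as f:
        for chunk in chunks:
            cut = chunk.rfind(" ", 0, len(chunk) // 2)
            if cut <= 0:
                continue  # chunk too small or has no space to split on
            row = {"prompt": chunk[:cut] + " ",
                   "response": chunk[cut + 1:],
                   "source": "my-docs"}
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
            n += 1
    return n

# Hypothetical input and output file name.
print(pages_to_jsonl("raw page text " * 100, "pages.jsonl"))
```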
Also interested. This could be covered by an additional docs file.
Thanks, I will also contribute...
I have tons of documentation in the form of GitLab pages, such as https://batchdocs.web.cern.ch/. Has anyone worked on automatic preprocessing of markup-language pages to be fed as a training dataset to gpt4all?
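No official preprocessor is mentioned in this thread, but a rough sketch of stripping common Markdown syntax before chunking could look like this. The regexes are illustrative and far from exhaustive:

```python
import re

def strip_markdown(md: str) -> str:
    """Very rough Markdown-to-plain-text cleanup (illustrative, not exhaustive)."""
    text = re.sub(r"```.*?```", "", md, flags=re.DOTALL)        # drop fenced code blocks
    text = re.sub(r"!\[[^\]]*\]\([^)]*\)", "", text)            # drop images
    text = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", text)        # links -> anchor text
    text = re.sub(r"^#{1,6}\s*", "", text, flags=re.MULTILINE)  # strip heading markers
    text = re.sub(r"[*_`]", "", text)                           # strip emphasis markers
    return re.sub(r"\n{3,}", "\n\n", text).strip()

print(strip_markdown("# Batch docs\nSee [the guide](https://batchdocs.web.cern.ch/) for *details*."))
# -> Batch docs
#    See the guide for details.
```

The cleaned text could then go through whatever chunking step produces the prompt/response JSONL described above.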
@daleevans Thank you for your feedback! I will try your instructions soon!
I cannot find any 'yaml' or 'finetune...' or 'config' folders or files. I'm using the Windows version. Is it possible to train with custom data on Windows?
They're under the configs folder of the repository...
Is there any step-by-step documentation?
It looks easy; I'm just wondering how to use the downloaded model together with my additional dataset.
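Conceptually, the finetune script consumes the downloaded base model (the model path in the yaml) plus your JSONL dataset (dataset_path), where each example is a prompt/response pair joined into one training string. A minimal illustrative loader, not the repo's actual code:

```python
import json

def load_examples(path: str):
    """Join each prompt/response pair into a single training string.

    This mirrors what a causal-LM finetuning script conceptually does with
    the JSONL dataset; the actual gpt4all training code may format and
    tokenize examples differently.
    """
    texts = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            row = json.loads(line)
            texts.append(row["prompt"] + row["response"])
    return texts

# Hypothetical file name -- point this at the file set as dataset_path in the yaml:
# texts = load_examples("my_dataset.jsonl")
```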