Comments (17)

daleevans avatar daleevans commented on May 25, 2024 13

It's very easy, just change the dataset_path in the config (configs/train/finetune_lora.yaml or configs/train/finetune.yaml). The format of the data is trivial, just jsonl like this

{"prompt": "Russia Finishes Building Iran Nuclear Plant  MOSCOW (Reuters) - Russia and Iran said Thursday they had  finished construction of an atomic power plant in the Islamic  Republic -- a project the United States fears Tehran could use  to make nuclear arms. \nIs this a piece of news regarding world politics, sports, business, or science and technology? ", "response": "This is a piece of news regarding world politics and science and technology.", "source": "bigscience/p3"}
{"prompt": "Goosen brings a sparkle to the gloom With overnight rain causing a 2-hour delay to the start of play in the HSBC World Matchplay Championship at Wentworth, Retief Goosen did his bit to help to make up ground. \nIs this a piece of news regarding world politics, sports, business, or science and technology? ", "response": "sports", "source": "bigscience/p3"}
{"prompt": "Geiberger has share of clubhouse lead at Greensboro Brent Geiberger took advantage of ideal morning conditions to earn a share of the clubhouse lead with a 6-under-par 66 during the first  \nIs this a piece of news regarding world politics, sports, business, or science and technology? ", "response": "sports", "source": "bigscience/p3"}
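For reference, records in this schema can be written with a few lines of Python; the file name and the sample record below are illustrative, not part of the repo:

```python
import json

# Illustrative record following the prompt/response/source schema shown above
records = [
    {
        "prompt": "Goosen brings a sparkle to the gloom ...\nIs this a piece of news regarding world politics, sports, business, or science and technology? ",
        "response": "sports",
        "source": "custom",
    },
]

# jsonl: one JSON object per line, no enclosing array
with open("my_dataset.jsonl", "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```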

from gpt4all.

intc-hharshtk avatar intc-hharshtk commented on May 25, 2024 7

What if I do not have formatted data, just lots of pages of knowledge. Do I have to convert it to the above-mentioned format? If yes, how?

suzhidong avatar suzhidong commented on May 25, 2024 4

I'm also interested

intc-hharshtk avatar intc-hharshtk commented on May 25, 2024 3

Same, would be awesome to work on a private knowledge base.

bstadt avatar bstadt commented on May 25, 2024 3

Updated the title here to make it more like a feature request. We plan on offering this soon.

imrankh46 avatar imrankh46 commented on May 25, 2024 1

It's very easy, just change the dataset_path in the config (configs/train/finetune_lora.yaml or configs/train/finetune.yaml). The format of the data is trivial, just jsonl like this [jsonl example quoted in full above]

Thanks buddy. Unfortunately, my dataset format is based on prompt and completion. I'm doing this for Pashto poetry generation.

But I want to run the model on a single GPU. Is something like this possible?
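A prompt/completion dataset can be adapted with a small rename pass; the key names (`completion`) and the `source` tag below are assumptions about your files, not something the repo prescribes:

```python
import json

def convert(in_path: str, out_path: str, source: str = "custom") -> int:
    """Rewrite {"prompt", "completion"} jsonl records as {"prompt", "response", "source"}."""
    n = 0
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            if not line.strip():
                continue  # skip blank lines
            rec = json.loads(line)
            out = {
                "prompt": rec["prompt"],
                "response": rec["completion"],  # rename to the expected key
                "source": source,
            }
            fout.write(json.dumps(out, ensure_ascii=False) + "\n")
            n += 1
    return n
```

Something like `convert("poetry.jsonl", "poetry_converted.jsonl")` would then produce a file in the format quoted above.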

daleevans avatar daleevans commented on May 25, 2024 1

But what do we set as model name in the file, and wandb entities? These are also requested, but not explained, and if we understood those, I'd be up for a documentation PR 👀

There seem to be other questions to answer before one can train on custom data. Could we reopen this issue? It would be helpful if these terms were in the documentation so that others can train their own chat with their own data.

Put the filesystem path to the directory containing your HF-formatted model and tokenizer files in those fields. If you don't have a wandb account (which I assume is the case, since otherwise it would be obvious), disable wandb. You may have to comment out a line or two if you disable wandb tracking. You may find it easier to just get a wandb account.
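As a rough sanity check before editing the config, you can verify that the path you intend to use points at a Hugging Face-style model directory; this heuristic (checking for config.json) is my own, and the exact file list varies by model:

```python
import os

def looks_like_hf_model_dir(path: str) -> bool:
    # An HF-formatted model/tokenizer directory normally contains a
    # config.json alongside the weight and tokenizer files.
    return os.path.isdir(path) and os.path.isfile(os.path.join(path, "config.json"))
```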

farnazgh avatar farnazgh commented on May 25, 2024 1

What if I do not have formatted data, just lots of pages of knowledge. Do I have to convert it to the above-mentioned format? If yes, how?

There is this blog regarding your question
https://levelup.gitconnected.com/training-your-own-llm-using-privategpt-f36f0c4f01ec

tprinty avatar tprinty commented on May 25, 2024

Also interested. This could be covered in an additional doc file.

ivanol55 avatar ivanol55 commented on May 25, 2024

But what do we set as model name in the file, and wandb entities? These are also requested, but not explained, and if we understood those, I'd be up for a documentation PR 👀

imrankh46 avatar imrankh46 commented on May 25, 2024

Updated the title here to make it more like a feature request. We plan on offering this soon.

Thanks, I will also contribute...

guijarro avatar guijarro commented on May 25, 2024

I have tons of documentation in the form of GitLab pages, such as https://batchdocs.web.cern.ch/. Has anyone worked on automatically preprocessing markup-language pages to be fed as a training dataset to gpt4all?
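I don't know of anything built in for this; as a starting point, Python's stdlib html.parser can strip the markup from scraped pages before you chunk the text into jsonl records (the class and function names here are illustrative):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text from an HTML page, ignoring script/style contents."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)
```

The extracted text would still need to be split into chunks and paired with prompts before it matches the jsonl format discussed above.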

marcosjoao37 avatar marcosjoao37 commented on May 25, 2024

But what do we set as model name in the file, and wandb entities? These are also requested, but not explained, and if we understood those, I'd be up for a documentation PR 👀

There seem to be other questions to answer before one can train on custom data. Could we reopen this issue? It would be helpful if these terms were in the documentation so that others can train their own chat with their own data.

marcosjoao37 avatar marcosjoao37 commented on May 25, 2024

@daleevans Thank you for your feedback! I will try your instructions soon! 🤗

pinsystem avatar pinsystem commented on May 25, 2024

I cannot find any 'yaml' or 'finetune...' or 'config' folders or files. I'm using the Windows version. Is it possible to train with custom data on Windows?

giorgionetg avatar giorgionetg commented on May 25, 2024

Under the configs folder...

dominikj111 avatar dominikj111 commented on May 25, 2024

It's very easy, just change the dataset_path in the config (configs/train/finetune_lora.yaml or configs/train/finetune.yaml). The format of the data is trivial, just jsonl like this [jsonl example quoted in full above]

Is there any step-by-step documentation?
It looks easy; I'm just wondering how to use the downloaded model together with my additional dataset.
