Comments (11)

benleetownsend commented on September 3, 2024

@xuy2 This code is merged into master now.

madisonmay commented on September 3, 2024

It means the latter: randomly choosing 512 contiguous tokens from an article, i.e. a random slice of text.
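In case it's useful, here's a minimal sketch of that sampling scheme (illustrative only, not the actual finetune code):

    import random

    def sample_window(tokens, window_size=512):
        # Randomly choose a contiguous window of `window_size` tokens;
        # articles shorter than the window are used whole.
        if len(tokens) <= window_size:
            return tokens
        start = random.randint(0, len(tokens) - window_size)
        return tokens[start:start + window_size]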

madisonmay commented on September 3, 2024

Thanks for the ticket @elyase!

I have support for language model pretraining up in a branch now: https://github.com/IndicoDataSolutions/finetune/tree/madison/lm-pretraining.

It wasn't too hard to add -- we actually already had the interface support for this, but unintentionally stopped supporting it during a past refactor.

Still need to put together some documentation, but it works as you've requested, so for now you can work directly off of that branch if you'd like.
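Rough usage sketch of what that looks like -- the exact interface may still shift a little before the merge, so treat the call pattern below as an approximation rather than final documentation:

    from finetune import Classifier

    model = Classifier()          # load the pretrained base model
    model.fit(unlabeled_texts)    # fitting on text with no targets trains the LM only
    model.save('lm_checkpoint')   # persist the adapted weights for later tasks

Here unlabeled_texts is just a list of strings from your target domain.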

madisonmay commented on September 3, 2024

@elyase

Thought about this a bit more: the PR we have in will not allow you to train on different languages, but it will give you some benefit in the scenario where you have a lot of unlabeled data for a specific English domain and a limited amount of labeled training data. Just wanted to clarify which half of your issue we would be able to resolve.

elyase commented on September 3, 2024

@madisonmay, thanks a lot, great work. Is this line:

    self.nlp = spacy.load('en', disable=['parser', 'tagger', 'ner', 'textcat'])

the only missing piece needed to train on, let's say, German, or is there something else missing?

madisonmay commented on September 3, 2024

@elyase there are a few pieces that would need to be modified in order to support a new language.

One of those changes would be swapping out the English tokenizer for a German tokenizer. You can find a German tokenizer as part of spacy, so that portion should be relatively straightforward.
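For example, the swap itself mirrors the line quoted above (this assumes you've pulled down the German model with python -m spacy download de):

    import spacy

    # same pipeline components disabled as in the English version
    nlp = spacy.load('de', disable=['parser', 'tagger', 'ner', 'textcat'])
    print([token.text for token in nlp('Das ist ein Beispielsatz.')])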

Secondly, the byte-pair encoder (used to decide how to split words into subword pieces, giving a useful fallback for out-of-vocabulary words) was "fit" on English text. This means the current pretrained model's vocabulary primarily contains English word pieces, chosen based on English frequencies.
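To illustrate why that matters, here's a toy greedy BPE encoder; merge rules learned from English character statistics (like the hard-coded ones below) leave non-English words fragmented into single characters:

    def bpe_encode(word, merges):
        # Greedily apply the highest-priority learned merge to adjacent
        # pieces until no learned merge applies.
        pieces = list(word)
        while len(pieces) > 1:
            pairs = [(pieces[i], pieces[i + 1]) for i in range(len(pieces) - 1)]
            candidates = [p for p in pairs if p in merges]
            if not candidates:
                break
            best = min(candidates, key=merges.index)
            i = pairs.index(best)
            pieces = pieces[:i] + [best[0] + best[1]] + pieces[i + 2:]
        return pieces

    # toy merge list "fit" on English; real vocabularies hold tens of thousands of merges
    merges = [('t', 'h'), ('th', 'e'), ('i', 'n'), ('in', 'g')]
    print(bpe_encode('thing', merges))    # ['th', 'ing']
    print(bpe_encode('zeitung', merges))  # ['z', 'e', 'i', 't', 'u', 'n', 'g']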

Finally, there would be some minimal changes to make in order to ensure that you're starting from randomly initialized weights rather than the weights learned by the English model.

There might be some other required changes that I'm overlooking but that's what comes to mind right now.

Note that training a language model from scratch on a new language will be a pretty big computational investment -- think along the lines of 4-8 GPUs + a week of training time.

xuy2 commented on September 3, 2024

Hi, I cannot open the language model pretraining branch anymore. Can you kindly tell me how I can find it?

madisonmay commented on September 3, 2024

Closing this issue as finetuning the language model only is now fully supported on the master branch as of #58. Thanks again for the feature request / bug report @xuy2! Feel free to open another issue if there's something else we can help out with.

xuy2 commented on September 3, 2024

@madisonmay @benleetownsend I have a question about the pre-trained model. The paper says it uses "randomly sampled, contiguous sequences of 512 tokens" for pre-training. Does it mean padding a sentence out to 512 tokens, or randomly choosing 512 contiguous tokens from an article?

madisonmay commented on September 3, 2024

@xuy2 that's actually a valid point -- I'm not sure if the model ever received inputs of < 512 tokens at train time.

xuy2 commented on September 3, 2024

@madisonmay Thank you! With the latter method, it seems that we can only train the model by batches rather than epochs. How can we evaluate the model's performance on test data? With random slices, it's hard to guarantee that the whole test set gets evaluated.
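Would it make sense to evaluate with deterministic, non-overlapping windows instead, so that every test token is seen exactly once? Something along these lines:

    def eval_windows(tokens, window_size=512):
        # Cover the whole test set with consecutive, non-overlapping windows.
        for start in range(0, len(tokens), window_size):
            yield tokens[start:start + window_size]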
