Comments (11)

benleetownsend commented on September 3, 2024

@xuy2 This code is merged into master now.

madisonmay commented on September 3, 2024

It means the latter: randomly choosing 512 contiguous tokens from an article, i.e. a random slice of text.
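In case it's useful, here's a minimal sketch of that sampling scheme (illustrative only, not the actual finetune code):

    import random

    def sample_window(tokens, window_size=512):
        # Randomly choose a contiguous window of `window_size` tokens;
        # articles shorter than the window are used whole.
        if len(tokens) <= window_size:
            return tokens
        start = random.randint(0, len(tokens) - window_size)
        return tokens[start:start + window_size]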

madisonmay commented on September 3, 2024

Thanks for the ticket @elyase!

I have support for language model pretraining up in a branch now: https://github.com/IndicoDataSolutions/finetune/tree/madison/lm-pretraining.

It wasn't too hard to add -- we actually already had the interface support for this, but unintentionally stopped supporting it during a past refactor.

Still need to put together some documentation, but it works as you've requested, so for now you can work directly off of that branch if you'd like.
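Rough usage sketch of what that looks like -- the exact interface may still shift a little before the merge, so treat the call pattern below as an approximation rather than final documentation:

    from finetune import Classifier

    model = Classifier()          # load the pretrained base model
    model.fit(unlabeled_texts)    # fitting on text with no targets trains the LM only
    model.save('lm_checkpoint')   # persist the adapted weights for later tasks

Here unlabeled_texts is just a list of strings from your target domain.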

madisonmay commented on September 3, 2024

@elyase

Thought about this a bit more: the PR we have in will not allow you to train on different languages, but it will give you some benefit in the scenario where you have a lot of unlabeled data for a specific English domain and a limited amount of labeled training data. Just wanted to clarify which half of your issue we would be able to resolve.

elyase commented on September 3, 2024

@madisonmay, thanks a lot, great work. Is this line:

    self.nlp = spacy.load('en', disable=['parser', 'tagger', 'ner', 'textcat'])

the only missing piece needed to train on, let's say, German, or is there something else missing?

madisonmay commented on September 3, 2024

@elyase there are a few pieces that would need to be modified in order to support a new language.

One of those changes would be swapping out the English tokenizer for a German tokenizer. You can find a German tokenizer as part of spacy, so that portion should be relatively straightforward.
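For example, the swap itself mirrors the line quoted above (this assumes you've pulled down the German model with python -m spacy download de):

    import spacy

    # same pipeline components disabled as in the English version
    nlp = spacy.load('de', disable=['parser', 'tagger', 'ner', 'textcat'])
    print([token.text for token in nlp('Das ist ein Beispielsatz.')])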

Secondly, the byte-pair encoder (used to decide how to split words into subword pieces, giving a useful fallback for out-of-vocabulary words) was "fit" on English text. This means the current pretrained model's vocabulary primarily contains English word pieces, chosen based on English frequencies.
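To illustrate why that matters, here's a toy greedy BPE encoder; merge rules learned from English character statistics (like the hard-coded ones below) leave non-English words fragmented into single characters:

    def bpe_encode(word, merges):
        # Greedily apply the highest-priority learned merge to adjacent
        # pieces until no learned merge applies.
        pieces = list(word)
        while len(pieces) > 1:
            pairs = [(pieces[i], pieces[i + 1]) for i in range(len(pieces) - 1)]
            candidates = [p for p in pairs if p in merges]
            if not candidates:
                break
            best = min(candidates, key=merges.index)
            i = pairs.index(best)
            pieces = pieces[:i] + [best[0] + best[1]] + pieces[i + 2:]
        return pieces

    # toy merge list "fit" on English; real vocabularies hold tens of thousands of merges
    merges = [('t', 'h'), ('th', 'e'), ('i', 'n'), ('in', 'g')]
    print(bpe_encode('thing', merges))    # ['th', 'ing']
    print(bpe_encode('zeitung', merges))  # ['z', 'e', 'i', 't', 'u', 'n', 'g']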

Finally, there would be some minimal changes to make in order to ensure that you're starting from randomly initialized weights rather than the weights learned by the English model.

There might be some other required changes that I'm overlooking but that's what comes to mind right now.

Note that training a language model from scratch on a new language will be a pretty big computational investment -- think along the lines of 4-8 GPUs + a week of training time.

xuy2 commented on September 3, 2024

Hi, I cannot open the language model pretraining branch anymore. Can you kindly tell me how I can find it?

madisonmay commented on September 3, 2024

Closing this issue as finetuning the language model only is now fully supported on the master branch as of #58. Thanks again for the feature request / bug report @xuy2! Feel free to open another issue if there's something else we can help out with.

xuy2 commented on September 3, 2024

@madisonmay @benleetownsend I have a question about the pre-trained model. The paper says it uses "randomly sampled, contiguous sequences of 512 tokens" for pre-training. Does it mean padding a sentence out to 512 tokens, or randomly choosing 512 contiguous tokens from an article?

madisonmay commented on September 3, 2024

@xuy2 that's actually a valid point -- I'm not sure if the model ever received inputs of < 512 tokens at train time.

xuy2 commented on September 3, 2024

@madisonmay Thank you! With the latter method, it seems that we can only train the model by batches rather than epochs. How can we evaluate the model's performance on test data? With random slices, it's hard to guarantee that the whole test set gets evaluated.
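Would it make sense to evaluate with deterministic, non-overlapping windows instead, so that every test token is seen exactly once? Something along these lines:

    def eval_windows(tokens, window_size=512):
        # Cover the whole test set with consecutive, non-overlapping windows.
        for start in range(0, len(tokens), window_size):
            yield tokens[start:start + window_size]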
