
fullstop-deep-punctuation-prediction's People

Contributors

jordimas, oliverguhr, zevaverbach

fullstop-deep-punctuation-prediction's Issues

What is this model field?

Hello,
What is this model field? I am training on a new language, and I don't know what to put in it.

Another question: in the tutorial, I think there is a conflict between those instructions.

Thanks

Publicly available code

Hello,
I've enjoyed reading your paper "FullStop: Multilingual Deep Models for Punctuation Prediction".
In the paper, you stated that your code is publicly available, but I couldn't find it in the repo. Do you plan to share it?

Kind regards

Info on training

What sort of provider and hardware did you use to train it, and how long should I expect training to take? Just a little info to orient myself...
Thank you very much for the great project!

Support for TensorFlow Serving

How do we use the PunctuationModel in production for faster inference? Is there TensorFlow Serving support for this?
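
The published checkpoints appear to be PyTorch models (see the torch traceback in a later issue), so TensorFlow Serving would not apply directly. A minimal sketch of one alternative, wrapping PunctuationModel in a small HTTP service; the endpoint name and request shape here are my own assumptions, not part of the repo:

from fastapi import FastAPI
from pydantic import BaseModel
from deepmultilingualpunctuation import PunctuationModel

app = FastAPI()
model = PunctuationModel()  # loads the default multilingual checkpoint

class PunctuateRequest(BaseModel):
    text: str

@app.post("/punctuate")
def punctuate(req: PunctuateRequest):
    # restore_punctuation is the same entry point used elsewhere in these issues
    return {"result": model.restore_punctuation(req.text)}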

True-casing support?

Hey.

Thanks a lot for this project!

So, I have a question. Do we have true-casing (capitalization) support in fullstop-deep-punctuation-prediction? If not, what would you recommend to solve it?
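
If the model itself does not handle casing, one naive post-processing sketch (my own suggestion, not something the repo provides) is to capitalize the first letter after each sentence-ending mark; proper nouns would still need a dedicated true-casing model:

import re

def naive_truecase(text: str) -> str:
    # uppercase the first character of each sentence; everything else untouched
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return " ".join(s[:1].upper() + s[1:] for s in sentences if s)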

TIA for any help!

Performance differences

With German sentences, the model seems to perform better on the Hugging Face model card than elsewhere.

The huggingface.co hosted inference API output is correct; elsewhere (e.g., in Colab), the same example hiccups on umlaut characters (or nearby):

man merkt ja schon in den ersten Sä,tzen,,, dass dieser Halbexot mit der Patentanwältin sch.mus.t.

My code:

from transformers import pipeline

pipe = pipeline("token-classification", "oliverguhr/fullstop-punctuation-multilang-large")
text = "man merkt ja schon in den ersten Sätzen dass dieser Halbexot mit der Patentanwältin schmust"
output_json = pipe(text)

# rebuild the sentence, appending each token's predicted punctuation
# label ('0' means no punctuation) and restoring word boundaries ('▁')
s = ''
for n in output_json:
    s += n['word'].replace('▁', ' ') + n['entity'].replace('0', '')
s
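
For comparison, a minimal sketch using the repo's own wrapper, which merges subword pieces internally instead of concatenating raw tokens (assuming the deepmultilingualpunctuation package is installed); this may avoid the split-umlaut artifacts:

from deepmultilingualpunctuation import PunctuationModel

model = PunctuationModel(model="oliverguhr/fullstop-punctuation-multilang-large")
text = "man merkt ja schon in den ersten Sätzen dass dieser Halbexot mit der Patentanwältin schmust"
print(model.restore_punctuation(text))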

PS: thanks for all your work, and for inquiring about the 512-character PyTorch limit :-)

Error when training on new languages and reproducing results.

Hi, thank you for the great work.
I followed your guide on preparing my own dataset and training on a new language, but during training I got this error:
Unexpected error: <class 'RuntimeError'>
It appeared right after the training info and progress bars were printed.

So I then tried to train on the dataset you provided, to see if I had somehow prepared my dataset wrong. I still got an error in the same place, though it is a different one:
***** Running training *****
Num examples = 107145
Num Epochs = 2
Instantaneous batch size per device = 8
Total train batch size (w. parallel, distributed & accumulation) = 32
Gradient Accumulation steps = 1
Total optimization steps = 6698
0%| | 0/6698 [00:00<?, ?it/s]/home/yidansun/anaconda3/envs/new_env/lib/python3.10/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
warnings.warn('Was asked to gather along dimension 0, but all '
Unexpected error: <class 'AttributeError'>
0%| | 0/6698 [00:10<?, ?it/s]

As the training is conducted with the transformers Trainer, it is hard to debug the training process. Did you ever encounter similar problems during training?
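
One thing worth ruling out first (an assumption based on the gather warning above, which points at torch.nn.DataParallel running across multiple GPUs) is a multi-GPU issue; a quick sketch that pins the run to a single device:

import os

# must run before torch/transformers are imported, so that only one GPU is visible
os.environ["CUDA_VISIBLE_DEVICES"] = "0"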

Thanks in advance.

What's the word limit for the model?

Hi, I'm trying to parse some texts that are pretty long, and I run into this error:

AssertionError                            Traceback (most recent call last)
Cell In[47], line 1
----> 1 restored_text=df.loc[df['unpunc'] == True, 0].map(model.restore_punctuation)

File /work/reddit-policomp/miniconda3/envs/my-py310env/lib/python3.10/site-packages/pandas/core/series.py:4539, in Series.map(self, arg, na_action)
   4460 def map(
   4461     self,
   4462     arg: Callable | Mapping | Series,
   4463     na_action: Literal["ignore"] | None = None,
   4464 ) -> Series:
   4465     """
   4466     Map values of Series according to an input mapping or function.
   4467 
   (...)
   4537     dtype: object
   4538     """
-> 4539     new_values = self._map_values(arg, na_action=na_action)
   4540     return self._constructor(new_values, index=self.index).__finalize__(
   4541         self, method="map"
   4542     )

File /work/reddit-policomp/miniconda3/envs/my-py310env/lib/python3.10/site-packages/pandas/core/base.py:890, in IndexOpsMixin._map_values(self, mapper, na_action)
    887         raise ValueError(msg)
    889 # mapper is a function
--> 890 new_values = map_f(values, mapper)
    892 return new_values

File /work/reddit-policomp/miniconda3/envs/my-py310env/lib/python3.10/site-packages/pandas/_libs/lib.pyx:2924, in pandas._libs.lib.map_infer()

File /work/reddit-policomp/miniconda3/envs/my-py310env/lib/python3.10/site-packages/deepmultilingualpunctuation/punctuationmodel.py:21, in PunctuationModel.restore_punctuation(self, text)
     20 def restore_punctuation(self,text):        
---> 21     result = self.predict(self.preprocess(text))
     22     return self.prediction_to_text(result)

File /work/reddit-policomp/miniconda3/envs/my-py310env/lib/python3.10/site-packages/deepmultilingualpunctuation/punctuationmodel.py:49, in PunctuationModel.predict(self, words)
     47 text = " ".join(batch)
     48 result = self.pipe(text)      
---> 49 assert len(text) == result[-1]["end"], "chunk size too large, text got clipped"
     51 char_index = 0
     52 result_index = 0

AssertionError: chunk size too large, text got clipped

I didn't use any other config, just the default model and predict function. It looks like the text is too long, or the chunk size is too large (which I didn't configure)? Is there anything I should do to make it function properly?
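
One workaround sketch (my own, not part of the library; the window size of 200 words is an assumed safe value below the model's limit) is to split long texts into word windows and punctuate each window separately:

from deepmultilingualpunctuation import PunctuationModel

model = PunctuationModel()

def restore_long(text: str, window: int = 200) -> str:
    # split into fixed-size word windows, punctuate each, and rejoin
    words = text.split()
    chunks = (words[i:i + window] for i in range(0, len(words), window))
    return " ".join(model.restore_punctuation(" ".join(c)) for c in chunks)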

Continue training

For another language, if training reaches the specified number of epochs, how can I continue training from the last epoch?
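
Assuming the training scripts use the Hugging Face Trainer (as the logs in other issues here suggest), resuming from the last saved checkpoint is built in; a minimal sketch, reusing the model, arguments, and datasets from the existing training setup:

from transformers import Trainer

# model, training_args, and train_dataset come from the existing training script
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)

# True resumes from the newest checkpoint-* directory inside output_dir;
# a specific path such as "models/<run>/checkpoint-6698" also works
trainer.train(resume_from_checkpoint=True)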

What is the best model configuration to use?

I have already trained and fine-tuned using both my native language and the provided dataset (sh download-dataset.sh). However, both produced inappropriate results, as shown below.
I have followed all the steps and used the provided configuration, as well as modifying some settings, but I'm still getting the same results. Could you help me with how to address this?

For the NLG dataset:
from deepmultilingualpunctuation import PunctuationModel

model = PunctuationModel(model='models/xlm-roberta-base-en-1-task2/final')
text = "My name is Clara and I live in Berkeley California "
result = model.restore_punctuation(text)
print(result)
# output: MynameisClaraandIliveinBerkeleyCalifornia


For my native language:
from deepmultilingualpunctuation import PunctuationModel

model = PunctuationModel(model='models/xlm-roberta-base-id-1-task2/final')
text = "Apakah kamu sudah makan"
result = model.restore_punctuation(text)
print(result)
# output: Apakahkamusudahmakan

Fail to test model

Excuse me Sir, I tried to train on my native language and it worked. But when I try the model after pushing it to Hugging Face, the result looks like this:

from deepmultilingualpunctuation import PunctuationModel

model = PunctuationModel(model="candra/punctuatorid")
text = "Nama saya candra Saya kerja di Sleman Yogyakarta"
result = model.restore_punctuation(text)
labeled_words = model.predict(text)
print(result)
print(labeled_words)

NamasayacandraSayakerjadiSlemanYogyakarta
[['N', 'LABEL_0', 0.92213744], ['a', 'LABEL_0', 0.88981384], ['m', 'LABEL_0', 0.82173777], ['a', 'LABEL_0', 0.8361282], [' ', 'LABEL_0', 0.8732005], ['s', 'LABEL_0', 0.87928015], ['a', 'LABEL_0', 0.8990202], ['y', 'LABEL_0', 0.8346261], ['a', 'LABEL_0', 0.8432487], [' ', 'LABEL_0', 0.87998515], ['c', 'LABEL_0', 0.838636], ['a', 'LABEL_0', 0.90778375], ['n', 'LABEL_0', 0.81340396], ['d', 'LABEL_0', 0.88555896], ['r', 'LABEL_0', 0.8849649], ['a', 'LABEL_0', 0.84936905], [' ', 'LABEL_0', 0.8485689], ['S', 'LABEL_0', 0.913819], ['a', 'LABEL_0', 0.9147434], ['y', 'LABEL_0', 0.84088546], ['a', 'LABEL_0', 0.84060436], [' ', 'LABEL_0', 0.87031215], ['k', 'LABEL_0', 0.8766002], ['e', 'LABEL_0', 0.8716935], ['r', 'LABEL_0', 0.8410843], ['j', 'LABEL_0', 0.8109067], ['a', 'LABEL_0', 0.87118506], [' ', 'LABEL_0', 0.88127214], ['d', 'LABEL_0', 0.88965994], ['i', 'LABEL_0', 0.8115389], [' ', 'LABEL_0', 0.83680826], ['S', 'LABEL_0', 0.89733636], ['l', 'LABEL_0', 0.82763183], ['e', 'LABEL_0', 0.86441326], ['m', 'LABEL_0', 0.78594685], ['a', 'LABEL_0', 0.8554612], ['n', 'LABEL_0', 0.7310551], [' ', 'LABEL_0', 0.86217946], ['Y', 'LABEL_0', 0.8851356], ['o', 'LABEL_0', 0.8768013], ['g', 'LABEL_0', 0.8701269], ['y', 'LABEL_0', 0.8675242], ['a', 'LABEL_0', 0.86862904], ['k', 'LABEL_0', 0.8152715], ['a', 'LABEL_0', 0.8852416], ['r', 'LABEL_0', 0.8483949], ['t', 'LABEL_0', 0.8477803], ['a', 'LABEL_0', 0.8666682]]

I'm having trouble analyzing the error. Any suggestions? Thank you, Sir.
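
The per-character labels suggest (an assumption on my part) that the tokenizer pushed to the Hub is not the XLM-R subword tokenizer the model was trained with. A quick check:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("candra/punctuatorid")
print(tokenizer.tokenize("Nama saya candra"))
# subword pieces like ['▁Nama', '▁saya', ...] are expected; single characters
# would confirm the tokenizer files were not uploaded correctly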

Error when training a new language

I am still a beginner in language modeling, and I have tried to follow the step-by-step training process for other languages. However, I encountered an error, as shown below. Could you please help me resolve it?

"""
invoking test run for model: xlm-roberta-large-id-1-task2
loading data from pickle data/sepp_nlg_2021_train_dev_data_v5.zip_dev_id_2.pickle
loading data from pickle data/sepp_nlg_2021_train_dev_data_v5.zip_train_id_2.pickle
tokenize training data
100%|██████████| 1/1 [00:08<00:00,  8.45s/it]
tokenize validation data
100%|██████████| 1/1 [00:01<00:00,  1.90s/it]
Unexpected error: <class 'KeyError'>
"""

Missing datasets

Hello Oliver,

thank you for working on punctuation.
I tried to execute the published scripts, but unfortunately the datasets.py file is missing.
Could you upload it?

With best regards
Rene
