
fullstop-deep-punctuation-prediction's People

Contributors

jordimas, oliverguhr, zevaverbach

fullstop-deep-punctuation-prediction's Issues

What is this model field?

Hello,
What is this model field? I am training on a new language, and I don't know what to put in it.

Another question: in the tutorial, I think there is a conflict between those instructions.

Thanks

Publicly available code

Hello,
I've enjoyed reading your paper "FullStop: Multilingual Deep Models for Punctuation Prediction".
In the paper, you stated that your code is publicly available, but I couldn't find it in the repo. Do you plan to share it?

Kind regards

Info on training

What sort of provider and hardware did you use to train it, and how long should I expect training to take? Just a little info to orient myself...
Thank you very much for the great project!

Support for TensorFlow Serving

How do we use the PunctuationModel in production for faster inference? Is there TensorFlow Serving support for this?
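
The published checkpoints appear to be PyTorch models (see the torch traceback in a later issue), so TensorFlow Serving would not apply directly. A minimal sketch of one alternative, wrapping PunctuationModel in a small HTTP service; the endpoint name and request shape here are my own assumptions, not part of the repo:

from fastapi import FastAPI
from pydantic import BaseModel
from deepmultilingualpunctuation import PunctuationModel

app = FastAPI()
model = PunctuationModel()  # loads the default multilingual checkpoint

class PunctuateRequest(BaseModel):
    text: str

@app.post("/punctuate")
def punctuate(req: PunctuateRequest):
    # restore_punctuation is the same entry point used elsewhere in these issues
    return {"result": model.restore_punctuation(req.text)}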

True-casing support?

Hey.

Thanks a lot for this project!

So, I have a question. Do we have true-casing (capitalization) support in fullstop-deep-punctuation-prediction? If not, what would you recommend to solve it?
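
If the model itself does not handle casing, one naive post-processing sketch (my own suggestion, not something the repo provides) is to capitalize the first letter after each sentence-ending mark; proper nouns would still need a dedicated true-casing model:

import re

def naive_truecase(text: str) -> str:
    # uppercase the first character of each sentence; everything else untouched
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return " ".join(s[:1].upper() + s[1:] for s in sentences if s)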

TIA for any help!

Performance differences

With German sentences, the model seems to perform better on the Hugging Face model card than elsewhere.

The huggingface.co hosted inference API output is correct; elsewhere (e.g., in Colab), the same example hiccups on umlaut characters (or nearby):

man merkt ja schon in den ersten Sä,tzen,,, dass dieser Halbexot mit der Patentanwältin sch.mus.t.

My code:

from transformers import pipeline

pipe = pipeline("token-classification", "oliverguhr/fullstop-punctuation-multilang-large")
text = "man merkt ja schon in den ersten Sätzen dass dieser Halbexot mit der Patentanwältin schmust"
output_json = pipe(text)

# rebuild the sentence, appending each token's predicted punctuation
# label ('0' means no punctuation) and restoring word boundaries ('▁')
s = ''
for n in output_json:
    s += n['word'].replace('▁', ' ') + n['entity'].replace('0', '')
s
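
For comparison, a minimal sketch using the repo's own wrapper, which merges subword pieces internally instead of concatenating raw tokens (assuming the deepmultilingualpunctuation package is installed); this may avoid the split-umlaut artifacts:

from deepmultilingualpunctuation import PunctuationModel

model = PunctuationModel(model="oliverguhr/fullstop-punctuation-multilang-large")
text = "man merkt ja schon in den ersten Sätzen dass dieser Halbexot mit der Patentanwältin schmust"
print(model.restore_punctuation(text))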

PS: thanks for all your work, and for inquiring about the 512-character PyTorch limit :-)

Error when training on new languages and reproducing results.

Hi, thank you for the great work.
I followed your guide on preparing my own dataset and training on a new language, but during training I got this error:
Unexpected error: <class 'RuntimeError'>
It appeared right after the training info and progress bars were printed.

So I then tried to train on the dataset you provided, to see if I had somehow prepared my dataset wrong. I still got an error in the same place, though it is a different one:
***** Running training *****
Num examples = 107145
Num Epochs = 2
Instantaneous batch size per device = 8
Total train batch size (w. parallel, distributed & accumulation) = 32
Gradient Accumulation steps = 1
Total optimization steps = 6698
0%| | 0/6698 [00:00<?, ?it/s]/home/yidansun/anaconda3/envs/new_env/lib/python3.10/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
warnings.warn('Was asked to gather along dimension 0, but all '
Unexpected error: <class 'AttributeError'>
0%| | 0/6698 [00:10<?, ?it/s]

As the training is conducted with the transformers Trainer, it is hard to debug the training process. Did you ever encounter similar problems during training?
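
One thing worth ruling out first (an assumption based on the gather warning above, which points at torch.nn.DataParallel running across multiple GPUs) is a multi-GPU issue; a quick sketch that pins the run to a single device:

import os

# must run before torch/transformers are imported, so that only one GPU is visible
os.environ["CUDA_VISIBLE_DEVICES"] = "0"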

Thanks in advance.

What's the word limit for the model?

Hi, I'm trying to parse some texts that are pretty long, and I run into this error:

AssertionError                            Traceback (most recent call last)
Cell In[47], line 1
----> 1 restored_text=df.loc[df['unpunc'] == True, 0].map(model.restore_punctuation)

File /work/reddit-policomp/miniconda3/envs/my-py310env/lib/python3.10/site-packages/pandas/core/series.py:4539, in Series.map(self, arg, na_action)
   4460 def map(
   4461     self,
   4462     arg: Callable | Mapping | Series,
   4463     na_action: Literal["ignore"] | None = None,
   4464 ) -> Series:
   4465     """
   4466     Map values of Series according to an input mapping or function.
   4467 
   (...)
   4537     dtype: object
   4538     """
-> 4539     new_values = self._map_values(arg, na_action=na_action)
   4540     return self._constructor(new_values, index=self.index).__finalize__(
   4541         self, method="map"
   4542     )

File /work/reddit-policomp/miniconda3/envs/my-py310env/lib/python3.10/site-packages/pandas/core/base.py:890, in IndexOpsMixin._map_values(self, mapper, na_action)
    887         raise ValueError(msg)
    889 # mapper is a function
--> 890 new_values = map_f(values, mapper)
    892 return new_values

File /work/reddit-policomp/miniconda3/envs/my-py310env/lib/python3.10/site-packages/pandas/_libs/lib.pyx:2924, in pandas._libs.lib.map_infer()

File /work/reddit-policomp/miniconda3/envs/my-py310env/lib/python3.10/site-packages/deepmultilingualpunctuation/punctuationmodel.py:21, in PunctuationModel.restore_punctuation(self, text)
     20 def restore_punctuation(self,text):        
---> 21     result = self.predict(self.preprocess(text))
     22     return self.prediction_to_text(result)

File /work/reddit-policomp/miniconda3/envs/my-py310env/lib/python3.10/site-packages/deepmultilingualpunctuation/punctuationmodel.py:49, in PunctuationModel.predict(self, words)
     47 text = " ".join(batch)
     48 result = self.pipe(text)      
---> 49 assert len(text) == result[-1]["end"], "chunk size too large, text got clipped"
     51 char_index = 0
     52 result_index = 0

AssertionError: chunk size too large, text got clipped

I didn't use any other config, just the default model and predict function. It looks like the text is too long, or the chunk size is too large (which I didn't configure)? Is there anything I should do to make it function properly?
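
One workaround sketch (my own, not part of the library; the window size of 200 words is an assumed safe value below the model's limit) is to split long texts into word windows and punctuate each window separately:

from deepmultilingualpunctuation import PunctuationModel

model = PunctuationModel()

def restore_long(text: str, window: int = 200) -> str:
    # split into fixed-size word windows, punctuate each, and rejoin
    words = text.split()
    chunks = (words[i:i + window] for i in range(0, len(words), window))
    return " ".join(model.restore_punctuation(" ".join(c)) for c in chunks)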

Continue training

For another language, if training reaches the specified number of epochs, how can I continue training from the last epoch?
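
Assuming the training scripts use the Hugging Face Trainer (as the logs in other issues here suggest), resuming from the last saved checkpoint is built in; a minimal sketch, reusing the model, arguments, and datasets from the existing training setup:

from transformers import Trainer

# model, training_args, and train_dataset come from the existing training script
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)

# True resumes from the newest checkpoint-* directory inside output_dir;
# a specific path such as "models/<run>/checkpoint-6698" also works
trainer.train(resume_from_checkpoint=True)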

What is the best model configuration to use?

I have already trained and fine-tuned using both my native language and the provided dataset (sh download-dataset.sh). However, both produced inappropriate results, as shown below.
I have followed all the steps and used the provided configuration, as well as modifying some settings, but I'm still getting the same results. Could you help me with how to address this?

For the NLG dataset:
from deepmultilingualpunctuation import PunctuationModel

model = PunctuationModel(model='models/xlm-roberta-base-en-1-task2/final')
text = "My name is Clara and I live in Berkeley California "
result = model.restore_punctuation(text)
print(result)
# output: MynameisClaraandIliveinBerkeleyCalifornia


For my native language:
from deepmultilingualpunctuation import PunctuationModel

model = PunctuationModel(model='models/xlm-roberta-base-id-1-task2/final')
text = "Apakah kamu sudah makan"
result = model.restore_punctuation(text)
print(result)
# output: Apakahkamusudahmakan

Fail to test model

Excuse me Sir, I tried to train on my native language and it worked. But when I try the model after pushing it to Hugging Face, the result looks like this:

from deepmultilingualpunctuation import PunctuationModel

model = PunctuationModel(model="candra/punctuatorid")
text = "Nama saya candra Saya kerja di Sleman Yogyakarta"
result = model.restore_punctuation(text)
labeled_words = model.predict(text)
print(result)
print(labeled_words)

NamasayacandraSayakerjadiSlemanYogyakarta
[['N', 'LABEL_0', 0.92213744], ['a', 'LABEL_0', 0.88981384], ['m', 'LABEL_0', 0.82173777], ['a', 'LABEL_0', 0.8361282], [' ', 'LABEL_0', 0.8732005], ['s', 'LABEL_0', 0.87928015], ['a', 'LABEL_0', 0.8990202], ['y', 'LABEL_0', 0.8346261], ['a', 'LABEL_0', 0.8432487], [' ', 'LABEL_0', 0.87998515], ['c', 'LABEL_0', 0.838636], ['a', 'LABEL_0', 0.90778375], ['n', 'LABEL_0', 0.81340396], ['d', 'LABEL_0', 0.88555896], ['r', 'LABEL_0', 0.8849649], ['a', 'LABEL_0', 0.84936905], [' ', 'LABEL_0', 0.8485689], ['S', 'LABEL_0', 0.913819], ['a', 'LABEL_0', 0.9147434], ['y', 'LABEL_0', 0.84088546], ['a', 'LABEL_0', 0.84060436], [' ', 'LABEL_0', 0.87031215], ['k', 'LABEL_0', 0.8766002], ['e', 'LABEL_0', 0.8716935], ['r', 'LABEL_0', 0.8410843], ['j', 'LABEL_0', 0.8109067], ['a', 'LABEL_0', 0.87118506], [' ', 'LABEL_0', 0.88127214], ['d', 'LABEL_0', 0.88965994], ['i', 'LABEL_0', 0.8115389], [' ', 'LABEL_0', 0.83680826], ['S', 'LABEL_0', 0.89733636], ['l', 'LABEL_0', 0.82763183], ['e', 'LABEL_0', 0.86441326], ['m', 'LABEL_0', 0.78594685], ['a', 'LABEL_0', 0.8554612], ['n', 'LABEL_0', 0.7310551], [' ', 'LABEL_0', 0.86217946], ['Y', 'LABEL_0', 0.8851356], ['o', 'LABEL_0', 0.8768013], ['g', 'LABEL_0', 0.8701269], ['y', 'LABEL_0', 0.8675242], ['a', 'LABEL_0', 0.86862904], ['k', 'LABEL_0', 0.8152715], ['a', 'LABEL_0', 0.8852416], ['r', 'LABEL_0', 0.8483949], ['t', 'LABEL_0', 0.8477803], ['a', 'LABEL_0', 0.8666682]]

I'm having trouble analyzing the error. Any suggestions? Thank you, Sir.
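
The per-character labels suggest (an assumption on my part) that the tokenizer pushed to the Hub is not the XLM-R subword tokenizer the model was trained with. A quick check:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("candra/punctuatorid")
print(tokenizer.tokenize("Nama saya candra"))
# subword pieces like ['▁Nama', '▁saya', ...] are expected; single characters
# would confirm the tokenizer files were not uploaded correctly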

Error when training a new language

I am still a beginner in language modeling, and I have tried to follow the step-by-step training process for other languages. However, I encountered an error, as shown below. Could you please help me resolve it?

"""
invoking test run for model: xlm-roberta-large-id-1-task2
loading data from pickle data/sepp_nlg_2021_train_dev_data_v5.zip_dev_id_2.pickle
loading data from pickle data/sepp_nlg_2021_train_dev_data_v5.zip_train_id_2.pickle
tokenize training data
100%|██████████| 1/1 [00:08<00:00,  8.45s/it]
tokenize validation data
100%|██████████| 1/1 [00:01<00:00,  1.90s/it]
Unexpected error: <class 'KeyError'>
"""

Missing datasets

Hello Oliver,

thank you for working on punctuation.
I tried to execute the published scripts, but unfortunately the datasets.py file is missing.
Could you upload it?

With best regards
Rene
