nkrnrnk / bertpunc
SOTA punctuation restoration (for e.g. automatic speech recognition) deep learning model based on BERT pre-trained model
License: Apache License 2.0
Hi, I think there is a bug in train.py: right at the start of the main function, the variable punctuation_enc is defined twice, as you can see below. The second definition needs to be commented out in order to train the model on the LREC dataset.
punctuation_enc = {
    'O': 0,
    'COMMA': 1,
    'PERIOD': 2,
    'QUESTION': 3
}
punctuation_enc = {
    'O': 0,
    'PERIOD': 1,
}
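A minimal sketch of why the duplicate definition matters: in Python the second assignment simply rebinds the name, so the four-class mapping is discarded and only the two-class one remains. The `missing` check is purely illustrative and is not taken from train.py.

```python
# First definition: four punctuation classes.
punctuation_enc = {'O': 0, 'COMMA': 1, 'PERIOD': 2, 'QUESTION': 3}

# Second definition rebinds the same name; the mapping above is lost.
punctuation_enc = {'O': 0, 'PERIOD': 1}

# Illustrative check: labels that can no longer be encoded.
missing = [label for label in ('COMMA', 'QUESTION') if label not in punctuation_enc]
print(missing)
```

Any data containing COMMA or QUESTION labels would therefore fail to encode unless the second definition is removed or commented out.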
I found that insert_target() in data.py splits the input data into many sequences that overlap heavily with one another.
Why is the data processed this way? It seems to create a lot of repeated data.
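To make the overlap concrete, here is a minimal sketch of a per-token sliding window. It is not the repository's insert_target(): the function name `sliding_segments`, the zero padding, and the centering rule are all assumptions for illustration. One plausible reason for such a scheme, assuming the model predicts a single punctuation label per segment, is that every word needs its own window of surrounding context.

```python
def sliding_segments(token_ids, segment_size, pad_id=0):
    """Return one fixed-size segment per token (hypothetical helper).

    Each segment is centered on its token, so consecutive segments
    share segment_size - 1 positions.
    """
    left = segment_size // 2            # context positions before the token
    right = segment_size - left - 1     # context positions after the token
    padded = [pad_id] * left + list(token_ids) + [pad_id] * right
    return [padded[i:i + segment_size] for i in range(len(token_ids))]


segments = sliding_segments([1, 2, 3, 4, 5], segment_size=3)
for seg in segments:
    print(seg)
```

Under this scheme the number of segments equals the number of tokens, which is why the processed data looks highly repetitive.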
@nkrnrnk : Could you please add the format for the input data?
Could you please upload an example of the datasets you load in train.py, lines 190-192?
The training and evaluation code is written in PyTorch, which is difficult for someone like me to read or convert to TensorFlow, and with PyTorch I run into an out-of-memory (OOM) error.
Hi,
I have trained the model with a segment size of 32. Now I want to run inference on unpunctuated text that is shorter than the segment size. I am stuck here; can anyone help me with this?
Thanks in advance,
Venkatesh
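One common way to handle inputs shorter than the segment size is to pad them up to the fixed length and ignore predictions at the padded positions. The sketch below assumes this approach; `pad_to_segment`, the pad id of 0, and the mask are hypothetical and not part of this repository, and whether the trained model behaves well on padded input is not confirmed here.

```python
def pad_to_segment(token_ids, segment_size, pad_id=0):
    """Pad a short token-id list to segment_size (hypothetical helper).

    Returns the padded ids and a mask with 1 for real tokens, 0 for padding.
    """
    n_real = len(token_ids)
    if n_real > segment_size:
        raise ValueError("input is longer than the segment size")
    n_pad = segment_size - n_real
    padded = list(token_ids) + [pad_id] * n_pad
    mask = [1] * n_real + [0] * n_pad
    return padded, mask


padded, mask = pad_to_segment([101, 7592, 2088], segment_size=8)
print(padded)
print(mask)
```

Predictions for positions where the mask is 0 would then be discarded before the punctuation is inserted back into the text.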
Just fyi for future users, I got this code running in April 2020 in a colab notebook by reverting to some earlier versions of libraries. I'm not sure what was originally used, so I was guessing based on the original code being from ~March 2019.
!pip install -q torch==1.0.0 torchvision==0.2.0
!pip install pytorch_pretrained_bert==0.5.0
Warning! I don't know if it actually worked as a match to the original experiment since I don't have an exact dataset match.
Hi,
Since this model is fine-tuned on a pretrained reimplementation of BERT, could anyone please share the model weights produced by this fine-tuning experiment?
Thanks in advance!
Would it be possible to include one of the datasets? It's difficult to tell the data format without an included dataset.
I tried to reproduce the results using your Jupyter notebook, but for some reason I only got:
| | COMMA | PERIOD | QUESTION | OVERALL |
|---|---|---|---|---|
| | 0.062041 | 0.063562 | 0.001647 | 0.042417 |
| | 0.307018 | 0.231150 | 0.171429 | 0.236532 |
| | 0.103223 | 0.099707 | 0.003264 | 0.068731 |
(for test2011asr)
Could you please tell me why the results are so bad?