snakers4 / russian_stt_text_normalization

Russian text normalization pipeline for speech-to-text and other applications based on tagging s2s networks

License: GNU General Public License v3.0

Python 100.00%

speech russian-language python3 pytorch torchscript text-normalization speech-to-text

russian_stt_text_normalization's Issues

Examples on which the model crashes

The two sentences from the example work when joined by a period and \n, but crash when joined by a period and a space, by a comma, or by a space:
"С 12.01.1943 г. площадь сельсовета — 1785,5 га. С 12.01.1943 г. площадь сельсовета — 1785,5 га".
These two sentences, by contrast, fail when joined by \n but work when joined by a space:
"""Для нач+ала работы введите Ваш текст сюда.
Для нач+ала работы введите Ваш текст сюда"""
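
A small harness can make reports like this easier to reproduce. The sketch below joins a sentence with itself using several separators and records which variants raise. The helper name probe_separators and the default separator list are hypothetical, chosen only for illustration; the normalize argument is any callable, such as norm.norm_text from the snippets in these issues.

```python
from typing import Callable, Dict, Iterable


def probe_separators(
    normalize: Callable[[str], str],
    sentence: str,
    separators: Iterable[str] = ("\n", " ", ", "),
) -> Dict[str, str]:
    """Join `sentence` with itself using each separator and record the outcome."""
    report = {}
    for sep in separators:
        text = sentence + sep + sentence
        try:
            normalize(text)
            report[repr(sep)] = "ok"
        except Exception as exc:  # record any failure instead of stopping
            report[repr(sep)] = f"failed: {type(exc).__name__}"
    return report


# Intended use with the library (not executed here):
#   from normalizer import Normalizer
#   norm = Normalizer()
#   print(probe_separators(norm.norm_text, "Для нач+ала работы введите Ваш текст сюда."))
```

The resulting dictionary maps each separator to either "ok" or the exception type, which can be pasted directly into a bug report.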

What about a pip package?

I think this model would be more convenient to use if I could install it with pip.

Number endings

from normalizer import Normalizer
 
text_list = [
    'в 23 кабинете', 
    'разделить на 2 части', 
    'нет 2 части', 
    'я хочу попасть в 156 квартиру'
]
norm = Normalizer()
 
results = [norm.norm_text(text) for text in text_list]
print(results)

The code above outputs:

[
    'в двадцать три кабинете', 
    'разделить на два части', 
    'нет два части', 
    'я хочу попасть в сто пятьдесят шесть квартиру'
]

Some of the examples are taken from https://habr.com/ru/post/491260. Perhaps the wrong pretrained model was published?

RuntimeError in a notebook on a repeated call to norm_text

In a notebook, the model crashes with a RuntimeError when the method is called a second time.
torch==1.8.0

Reproduction:

Python 3.8.0 (default, Jul 24 2020, 06:59:58)                                                                                                                  
Type 'copyright', 'credits' or 'license' for more information                                                                                                  
IPython 7.19.0 -- An enhanced Interactive Python. Type '?' for help. 

In [1]: from russian_stt_text_normalization.normalizer import Normalizer                                                                                       
                                                                                                                                                               
In [2]: norm = Normalizer(jit_model='../src/russian_stt_text_normalization/jit_s2s.pt')                                                                        
                                                                                                                                                               
In [3]: norm.norm_text('тестовый текст про 101 проблему')                                                                                                      
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.95s/it]
Out[3]: 'тестовый текст про сто один проблему'                                                                                                                 
                                                                                                                                                               
In [4]: norm.norm_text('тестовый текст про 101 проблему')                                                                                                      
  0%|                                                                                                                                    | 0/1 [00:00<?, ?it/s]
---------------------------------------------------------------------------                                                                                    
RuntimeError                              Traceback (most recent call last)                                                                                    
<ipython-input-4-11fb846935c0> in <module>                                                                                                                     
----> 1 norm.norm_text('тестовый текст про 101 проблему')                                                                                                      
                                                                                                                                                               
~/projects/asr/src/russian_stt_text_normalization/normalizer.py in norm_text(self, text)                                                                       
     95                 weighted_len = sum(weighted_string)                                                                                                    
     96                 if sum(weighted_string) <= self.max_len:                                                                                               
---> 97                     norm_parts.append(self._norm_string(part))                                                                                         
     98                 else:                                                                                                                                  
     99                     spaces = [m.start() for m in re.finditer(' ', part)]                                                                               
                                                                                                                                                               
~/projects/asr/src/russian_stt_text_normalization/normalizer.py in _norm_string(self, string)                                                                  
     70                                                                                                                                                        
     71         src = torch.LongTensor(src).unsqueeze(0).to(self.device)                                                                                       
---> 72         out = self.model(src, src2tgt)                                                                                                                 
     73         pred_words = self.decode_words(out, unk_list)                                                                                                  
     74         if len(pred_words) > 199:                                                                                                                      
                                                                                                                                                               
~/projects/asr/venv/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)                                               
    887             result = self._slow_forward(*input, **kwargs)
    888         else:                                                          
--> 889             result = self.forward(*input, **kwargs)    
    890         for hook in itertools.chain(                                   
    891                 _global_forward_hooks.values(),
                                                                               

RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/test_jit2.py", line 333, in forward
            _120 = torch.select(scores0, 0, b0)
            _121 = torch.select(torch.select(_120, 0, 0), 0, d)
            _122 = torch.copy_(_121, _119)
                   ~~~~~~~~~~~ <--- HERE
          else:
            pass

Traceback of TorchScript, original code (most recent call last):
  File "/home/keras/notebook/nvme/islanna/ruhe_mono/models/seq2seq/jit_model.py", line 128, in forward
            for d in range(scores.shape[2]):
                if int(mask[b, 0, d].item()) == 0:
                    scores[b, 0, d] = -float('inf')
                    ~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE

        # Turn scores to probabilities. 
RuntimeError: a view of a leaf Variable that requires grad is being used in an in-place operation.

It looks like a with torch.no_grad(): ... block was forgotten:
https://discuss.pytorch.org/t/leaf-variable-was-used-in-an-inplace-operation/308
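
The error class and the suggested fix can be reproduced in isolation. The sketch below is not the library's code: the scores tensor is a stand-in that simply mimics the situation named in the traceback, an in-place write into a view of a leaf tensor that requires grad.

```python
import torch

# Stand-in for the score tensor from the traceback:
# a leaf tensor that requires grad (an assumption for illustration).
scores = torch.zeros(1, 1, 3, requires_grad=True)

# Outside no_grad, the in-place masking write raises a RuntimeError.
try:
    scores[0, 0, 1] = -float("inf")
    raised = False
except RuntimeError:
    raised = True

# Inside no_grad, autograd does not track the write, so it is allowed.
with torch.no_grad():
    scores[0, 0, 1] = -float("inf")

print(raised, scores[0, 0, 1].item())
```

On the caller's side, the equivalent would be wrapping the call, for example with torch.no_grad(): result = norm.norm_text(text). Whether that is enough to resolve the crash in this model is an assumption based on the report, not something confirmed here.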

Can't run it

Could you suggest what the problem might be? I am new to Python. I cloned this repository, installed the dependencies, and created test.py with the example from the README. When I run it, I get this error:
Legacy model format is not supported on mobile.
File "C:\Users\Mi\Documents\GitHub\russian_stt_text_normalization\normalizer.py", line 18, in __init__
self.model = torch.jit.load(jit_model, map_location=device)
File "C:\Users\Mi\Documents\GitHub\russian_stt_text_normalization\test.py", line 5, in <module>
norm = Normalizer()
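
This message is commonly reported when the file passed to torch.jit.load is not a valid TorchScript zip archive, for example an incomplete download or a Git LFS pointer file. Whether that is the cause here is an assumption. The sketch below is a standard-library check of what the model file actually contains; the helper name describe_model_file is hypothetical, and jit_s2s.pt is the file name that appears in another issue on this page.

```python
import zipfile
from pathlib import Path


def describe_model_file(path: str) -> str:
    """Return a rough description of what the file at `path` contains."""
    p = Path(path)
    if not p.is_file():
        return "missing"
    if zipfile.is_zipfile(str(p)):
        # Archives written by torch.jit.save are zip files.
        return "zip archive"
    with open(p, "rb") as fh:
        head = fh.read(64)
    if head.startswith(b"version https://git-lfs"):
        # Git LFS pointer files are small text files starting with this line.
        return "git-lfs pointer"
    return "not a zip archive"


# Intended use (not executed here):
#   print(describe_model_file("jit_s2s.pt"))
```

Any result other than "zip archive" suggests the model file should be downloaded again before debugging further.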

What is the intuition behind using string.punctuation together with uppercase or lowercase letters? Should I provide this (below) as labels, or leave only the space and letters (e.g. lowercase)?

# punctuation + space + rus
self.tgt_vocab = {token: i+5 for i, token in enumerate(punctuation + rus_letters + ' ' + '«»—')}
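
To see what that line produces, here is a standalone version of the same construction. The definition of rus_letters is not shown in the issue, so the lowercase Russian alphabet used below is an assumption, as is the reading that ids 0 to 4 are left free for special tokens.

```python
from string import punctuation

# Assumption: rus_letters is the lowercase Russian alphabet
# (32 letters from 'а' to 'я', plus 'ё'); the real definition is not shown.
rus_letters = "".join(chr(c) for c in range(ord("а"), ord("я") + 1)) + "ё"

# Same construction as the quoted line: ids start at 5,
# presumably leaving 0-4 for special tokens (an assumption).
tgt_vocab = {
    token: i + 5
    for i, token in enumerate(punctuation + rus_letters + " " + "«»—")
}

print(len(tgt_vocab), min(tgt_vocab.values()), max(tgt_vocab.values()))
```

Under these assumptions the target vocabulary has 69 character-level tokens: 32 ASCII punctuation marks, 33 letters, the space, and three extra typographic marks.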

The README example produces a different result

from normalizer import Normalizer

text = 'С 12.01.1943 г. площадь сельсовета — 1785,5 га.'

norm = Normalizer()
result = norm.norm_text(text)
print(result)

In the README:

>>> С двенадцатого января тысяча девятьсот сорок третьего года площадь сельсовета
>>> — тысяча семьсот восемьдесят пять целых и пять десятых гектара

But the actual output is:

С двенадцати.один.тысяча девятьсот сорок третий год. площадь сельсовета — тысяча семьсот восемьдесят пять целых и пять десятых гектара.
