
NeMo text processing for ASR and TTS

Home Page: https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/nlp/text_normalization/wfst/wfst_text_normalization.html

License: Apache License 2.0


NeMo Text Processing

Introduction

nemo-text-processing is a Python package for text normalization and inverse text normalization.

Documentation

Full documentation for NeMo-text-processing (text normalization and inverse text normalization) is available at the home page linked above.

Tutorials

Google Colab notebooks:
  • Text_(Inverse)_Normalization.ipynb: quick-start guide
  • WFST_Tutorial: in-depth tutorial on grammar customization

Getting help

If you have a question that is not answered in the GitHub discussions, or if you encounter a bug or have a feature request, please create a GitHub issue. You are also welcome to open a pull request directly to fix a bug or add a feature.

Installation

Conda virtual environment

We recommend setting up a fresh Conda environment to install NeMo-text-processing.

conda create --name nemo_tn python==3.10
conda activate nemo_tn

(Optional) To use hybrid text normalization, install PyTorch using its official configurator.

conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch

NOTE: The command used to install PyTorch may depend on your system.

Pip

Use this installation mode if you want the latest released version.

pip install nemo_text_processing

NOTE: This should work on any Linux x86_64 system. Pip installation on macOS and Windows is not supported because of the Pynini dependency: on platforms other than Linux x86_64, pip tries to compile Pynini from source, which requires OpenFst headers and libraries to already be installed in the expected locations (the Pynini README for the required version lists which OpenFst version it needs and which --enable-foo flags to use). If a plain pip install happens to work on such a platform, it is because OpenFst was already installed in the right way in the right place. Instead, we recommend installing Pynini from conda-forge on macOS or Windows: conda install -c conda-forge pynini=2.1.6.post1.

Pip from source

Use this installation mode if you want a version from a particular GitHub branch (e.g., main).

pip install Cython
python -m pip install git+https://github.com/NVIDIA/NeMo-text-processing.git@{BRANCH}#egg=nemo_text_processing

From source

Use this installation mode if you are contributing to NeMo-text-processing.

git clone https://github.com/NVIDIA/NeMo-text-processing
cd NeMo-text-processing
./reinstall.sh

NOTE: If you only want the toolkit without the additional conda-based dependencies, you may replace reinstall.sh with pip install -e ., run from the NeMo-text-processing root directory.

Contributing

We welcome community contributions! Please refer to the CONTRIBUTING.md for guidelines.

Citation

@inproceedings{zhang21ja_interspeech,
  author={Yang Zhang and Evelina Bakhturina and Boris Ginsburg},
  title={{NeMo (Inverse) Text Normalization: From Development to Production}},
  year=2021,
  booktitle={Proc. Interspeech 2021},
  pages={4857--4859}
}

@inproceedings{bakhturina22_interspeech,
  author={Evelina Bakhturina and Yang Zhang and Boris Ginsburg},
  title={{Shallow Fusion of Weighted Finite-State Transducer and Language Model for
Text Normalization}},
  year=2022,
  booktitle={Proc. Interspeech 2022}
}

License

NeMo-text-processing is released under the Apache License 2.0.

nemo-text-processing's People

Contributors

anand-nv, buyuancui, chinmaypatil11, davidks13, dependabot[bot], ealbasiri, eginhard, ekmb, giacomoleonemaria, jimregan, karpnv, kevsan4, larisake, lleaver, mgrafu, pplantinga, rlangman, seannaren, ssh-meister, tbartley94, vsl9, xuesongyang, yzhang123, zoobereq


nemo-text-processing's Issues

fst_alignment for ITN

Hi,

nemo_text_processing/fst_alignment/alignment.py works fine for TN case when we are aligning input to output words:

inp string: |1994|
out string: |tokens { date { year: "nineteen ninety four" } }|
inp indices: [0:4] out indices: [23:43]
in: |1994| out: |nineteen ninety four|

But in ITN case, the alignment seems to be broken as the input words that are inverse normalized are mapped to empty strings:

inp string: |nineteen ninety four|
out string: |tokens { date { year: "1994" preserve_order: true } }|
inp indices: [0:8] out indices: [23:23]
in: |nineteen| out: ||
inp indices: [9:15] out indices: [25:25]
in: |ninety| out: ||
inp indices: [16:20] out indices: [26:26]
in: |four| out: ||

Is there a way to get the form below in the ITN case?

in: |nineteen ninety four| out: |1994|

Thank you very much.

zh TN is very slow and has poor accuracy

A simple zh-CN sentence takes 1.32 seconds to process, and the result is not right:

>python normalize.py --text="123" --language=en
INFO:NeMo-text-processing:one hundred and twenty three
WARNING:NeMo-text-processing:Execution time: 0.02 sec

>python normalize.py --text="我出生于1998年7月22日" --language=zh
INFO:NeMo-text-processing:我出生于1998年7月22日
WARNING:NeMo-text-processing:Execution time: 1.32 sec

>python normalize.py --text="I'm born in 22/3/1990" --language=en
INFO:NeMo-text-processing:I'm born in the twenty second of march nineteen ninety
WARNING:NeMo-text-processing:Execution time: 0.02 sec

unexpected normalized text for Arabic

Describe the bug

Unable to convert some currency correctly for Arabic

Steps/Code to reproduce bug

normalizer = Normalizer(input_case='cased',
                        lang='ar',
                        cache_dir=normalize_cache_dir,
                        overwrite_cache=False)

text = 'aed1.2'
normalized_text = normalizer.normalize(text=text, verbose=True)
print(normalized_text)

Then:

normalizer/escape: aed1.2
normalizer/select_tag: tokens { money { integer_part: "واحد" currency_maj: "درهم إماراتي" fractional_part: "عشرون" preserve_order: true } }
ERROR: StringFstToOutputLabels: Invalid start state
Traceback (most recent call last):
  File "run_normalize_file.py", line 24, in <module>
    normalized_text = normalizer.normalize(text=text, verbose=True)
  File "~/text_normalization/normalize.py", line 320, in normalize
    output += ' ' + Normalizer.select_verbalizer(verbalizer_lattice)
  File "~/text_normalization/normalize.py", line 479, in select_verbalizer
    output = pynini.shortestpath(lattice, nshortest=1, unique=True).string()
  File "extensions/_pynini.pyx", line 462, in _pynini.Fst.string
  File "extensions/_pynini.pyx", line 507, in _pynini.Fst.string
_pywrapfst.FstOpError: Operation failed

Expected behavior

No error.

Environment overview (please complete the following information)

  • Environment location: Bare-metal
  • Method of NeMo install: pip install or from source

French normalizer crashes

Describe the bug

Instantiating a Normalizer object with lang set to "fr" leads to a crash after waiting for 2-3 min.

Steps/Code to reproduce bug

Please list minimal steps or code snippet for us to be able to reproduce the bug.

>>> from nemo_text_processing.text_normalization import Normalizer
>>> norm = Normalizer(input_case="cased", lang="fr")
 NeMo-text-processing :: INFO     :: Creating ClassifyFst grammars. This might take some time...
Killed

Expected behavior

A Normalizer object for normalizing French text would be instantiated.

Environment overview (please complete the following information)

  • Environment location: local
  • Method of NeMo install: pip install nemo_text_processing (I tried releases 0.2.2rc0 and 0.3.0rc0)

Environment details

  • OS version: Windows 10.0.19045 (using WSL: Linux version 5.15.146.1-microsoft-standard-WSL2)
  • PyTorch version: 2.1.2
  • Python version: 3.10.13

Additional context

It works fine for other languages (English, Italian, German, Spanish).

[zh] WARNING:NeMo-text-processing:Failed text: 免除GOOGLE在一桩诽谤官司中的法律责任。Key: integer_part Value: None

Received warning message when normalizing text. Could you pls provide what the message indicates?

Reproducible code:

from nemo_text_processing.text_normalization.normalize import Normalizer
text_normalizer = Normalizer(lang="zh", input_case="cased", overwrite_cache=True, cache_dir=str("cache_dir"))
text_normalizer_call_kwargs = {"punct_pre_process": True, "punct_post_process": True}
normalizer_call = lambda x: text_normalizer.normalize(x, **text_normalizer_call_kwargs)

text = "免除GOOGLE在一桩诽谤官司中的法律责任。"
normed_text = normalizer_call(text)
print(normed_text)

Output

 NeMo-text-processing :: INFO     :: Created cache_dir/zh_tn_True_deterministic__tokenize.far
INFO:NeMo-text-processing:Created cache_dir/zh_tn_True_deterministic__tokenize.far
 NeMo-text-processing :: WARNING  :: Failed text: 免除GOOGLE在一桩诽谤官司中的法律责任。Key: integer_part Value: None
WARNING:NeMo-text-processing:Failed text: 免除GOOGLE在一桩诽谤官司中的法律责任。Key: integer_part Value: None
免除GOOGLE在一桩诽谤官司中的法律责任。

text field

Describe the bug
NeMo-text-processing/nemo_text_processing/text_normalization/normalize_with_audio.py", line 294, in normalize_line
text=line["text"],
KeyError: 'text'

Why is "text" hardcoded here?

How to fix:
use the text_field variable instead of the hardcoded "text".
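A minimal sketch of the proposed fix (the function below is a toy stand-in, not the real normalize_line; only the parameterization of the field name is the point):

```python
# Toy stand-in for normalize_line: read the text from a configurable
# field instead of the hardcoded "text" key.
def normalize_line(line: dict, text_field: str = "text") -> str:
    return line[text_field]  # was: line["text"]

# Works for manifests whose text lives under a different key:
print(normalize_line({"sentence": "hello world"}, text_field="sentence"))
```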


bug in graph_utils.py of zh ITN and decimal tagger of ar TN

In ./nemo_text_processing/inverse_text_normalization/zh/graph_utils.py, line 79, the load_labels() method is called but never imported, so it raises an error. This can be resolved by adding the following method to ./nemo_text_processing/inverse_text_normalization/zh/utils.py:

import csv

def load_labels(abs_path):
    """
    Loads a TSV file as a list of label mappings.

    Args:
        abs_path: absolute path

    Returns a list of mappings
    """
    with open(abs_path, encoding="utf-8") as label_tsv:
        labels = list(csv.reader(label_tsv, delimiter="\t"))
    return labels

and import it in ./nemo_text_processing/inverse_text_normalization/zh/graph_utils.py:

from nemo_text_processing.inverse_text_normalization.zh.utils import load_labels

Also there is another bug in arabic TN tagger:
In nemo_text_processing/text_normalization/ar/taggers/decimal.py, line 40, quantities is not defined. This can be resolved by adding the following line before it (around line 33):

quantities = pynini.string_file(get_abs_path("data/numbers/quantities.tsv"))

Decade Pluralization Doesn't Work For Years Pre-1000

Describe the bug

Nemo text processing 0.1.7rc0 will pluralize e.g., 1980s as "nineteen eighties" (correct) but 830s becomes "Eight Thirty S" (incorrect).

Steps/Code to reproduce bug

from nemo_text_processing.text_normalization.normalize import Normalizer

text = "In the 1980s personal computers became more widely available. In the 830s the Abbasid Caliphate started military excursions culminating with a victory in the Sack of Amorium."
normalizer = Normalizer(input_case='cased', lang='en')
normalized_text = normalizer.normalize(text, verbose=False, punct_post_process=True)
print(normalized_text)

Expected output

In the nineteen eighties personal computers became more widely available. In the eight hundred and thirties the Abbasid Caliphate started military excursions culminating with a victory in the Sack of Amorium.

Actual output

In the nineteen eighties personal computers became more widely available. In the eight hundred and thirty S the Abbasid Caliphate started military excursions culminating with a victory in the Sack of Amorium.

Environment overview (please complete the following information)

  • Environment location: metal
  • Method of NeMo install: pip

Environment details

If NVIDIA docker image is used you don't need to specify these.
Otherwise, please provide:

  • OS version Ubuntu 22.04.2 LTS
  • PyTorch version 1.13.1+cu117
  • Python version 3.9

Poetry support

Is your feature request related to a problem? Please describe.

Poetry cannot install nemo-text-processing. See following script:

#!/bin/bash

set -ex

# Demo the failure of installing pynini==2.1.5 and that this also blocks installing nemo text processing

# Install poetry if not already installed
if ! command -v poetry &> /dev/null
then
    echo "poetry not found, installing poetry"
    curl -sSL https://install.python-poetry.org | python3 - -y
fi

# Add poetry to path
PATH="$PATH:$HOME/.local/bin"

# Make project and install pynini==2.1.5
# Set clean up on exit using trap
trap "rm -rf /tmp/pynini-test" EXIT

cd /tmp
rm -rf pynini-test
mkdir pynini-test
cd pynini-test
poetry init -n

# Install nemo text processing will show an error
poetry add nemo_text_processing==0.2.2rc0

Describe the solution you'd like

No error on poetry add.

Describe alternatives you've considered

Alternatives are pip install, but poetry happens to be a widely adopted package manager, so support for this would be great.

Additional context

I've opened a PR to support poetry here #144

Sparrowhawk slower than Python implementation

Describe the bug

As you suggested, I tried exporting the grammars and running the normalizer with Sparrowhawk, but it actually takes even longer than Python.
n_utts vs time_taken for Sparrowhawk:
100: 0m27.430s
50: 0m15.666s
10: 0m4.690s

For python:
100: 11s
50: 5.6s
10: 0.85s

The time taken for Sparrowhawk seems a bit non-linear.

Steps/Code to reproduce bug

Exported my custom grammar and ran the Sparrowhawk docker.
There was another issue reporting this slowdown: #82

Expected behavior

C++ is supposed to be faster.

Environment overview (please complete the following information)

  • Environment location: GCP
  • Method of NeMo install: poetry


English text normalization MoneyFst conflict with SerialFst and small weight does not take effect

Rule conflicting between MoneyFst and SerialFst tagger

Steps/Code to reproduce bug

Command:

python nemo_text_processing/text_normalization/normalize.py --verbose --text 'Thank you for the quantities. Now, lets talk about the pricing. The price for each canned salmon is $5, each bottle of peanut butter is $3'

Output:

Thank you for the quantities. Now, lets talk about the pricing. The price for each canned salmon is five dollars, each bottle of peanut butter is dollar three

Expected behavior

Expected output:

Thank you for the quantities. Now, lets talk about the pricing. The price for each canned salmon is five dollars, each bottle of peanut butter is three dollar

Environment overview

  • Environment location: Bare-metal
  • Method of NeMo install: pip install

Environment details

  • OS version: Fedora 38
  • PyTorch version: 2.0.0
  • Python version: 3.10.10

Additional information
I found that there is a conflict between the MoneyFst and SerialFst taggers.
Both taggers return the same weight (2404.29785), computed using pynini.shortestdistance(tagged_lattice, delta=10**-8)[-1].


This is due to this code:

classify = (
    pynutil.add_weight(whitelist_graph, 1.01)
    | pynutil.add_weight(time_graph, 1.1)
    | pynutil.add_weight(date_graph, 1.09)
    | pynutil.add_weight(decimal_graph, 1.1)
    | pynutil.add_weight(measure_graph, 1.1)
    | pynutil.add_weight(cardinal_graph, 1.1)
    | pynutil.add_weight(ordinal_graph, 1.1)
    | pynutil.add_weight(money_graph, 1.1)
    | pynutil.add_weight(telephone_graph, 1.1)
    | pynutil.add_weight(electonic_graph, 1.1)
    | pynutil.add_weight(fraction_graph, 1.1)
    | pynutil.add_weight(range_graph, 1.1)
    | pynutil.add_weight(serial_graph, 1.1001)  # should be higher than the rest of the classes
)

I think serial_graph's weight should be higher than money_graph's, but effectively it is not. So I disabled MoneyFst to get the weight from SerialFst (I changed its olabel to ensure that the weight came from the best path containing SerialFst) for this text; here is each serial_graph weight I tried with the corresponding path weight in ClassifyFst.classify:

serial_graph weight	resulting path weight
1.1000	2404.29785
1.1001	2404.29785
1.1002	2404.2981
1.1003	2404.29858
1.1004	2404.29858
1.1005	2404.29883
1.1006	2404.29883
1.1007	2404.29907

English is not my native language, so please forgive me if there is any ambiguity.

"Creating ClassifyFst grammars" message to stdout even with verbose=false

I'm sure I'll figure out how to configure the logger so this doesn't go to stdout, so it's not a huge issue. But it would still be nice if, by default, info messages were not mixed in with my program's output when verbose=False is specified. I'm using, basically from the tutorials:

from nemo_text_processing.text_normalization.normalize import Normalizer
normalizer = Normalizer(input_case='cased', lang='en')
normalizer.normalize(text, verbose=False, punct_post_process=True)

Note that verbose is False. But this still outputs

[NeMo I 2023-02-02 03:32:26 tokenize_and_classify:87] Creating ClassifyFst grammars.

to stdout. Please consider not printing this to stdout when "Verbose=False" was specified. Thank you.
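Until the default changes, a possible user-side workaround is to raise the library logger's level before creating the Normalizer. The logger name "NeMo-text-processing" is an assumption here, inferred from the prefixes visible in the log output quoted elsewhere on this page:

```python
import logging

# Assumed logger name, based on the "NeMo-text-processing" prefix seen
# in the printed messages; raise its level so INFO records are dropped.
logging.getLogger("NeMo-text-processing").setLevel(logging.WARNING)
```

If the messages still appear, the actual logger name may differ; printing `logging.Logger.manager.loggerDict.keys()` after importing the package shows which loggers exist.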

Digits Remain Unnormalized in European Languages Output

Hello,

I have observed an issue where digits remain unnormalized in the output text when using the NeMo text normalization library, specifically with European languages such as German (de), Italian (it), and French (fr). This behavior occurs even though the expected output should not contain any digits.

Here is an example:

from nemo_text_processing.text_normalization.normalize import Normalizer
normalizer = Normalizer(input_case="cased", lang="it")
text = "il 48% ha risposto che avrebbe dovuto provenire dal proprio budget."
norm_text = normalizer.normalize(text, punct_post_process=True)
print(norm_text)

Expected output: No digits in the normalized text.
Actual output: 'il 48% ha risposto che avrebbe dovuto provenire dal proprio budget.'

Additional Examples:

Other examples with similar behavior, in the format (text, normalized_text):

[('Hier zoome ich auf die Läsion. Wir befinden uns also auf der 2D-Mammographie.',
  'Hier zoome ich auf die Läsion. Wir befinden uns also auf der 2D-Mammographie.'),
 ('Aber die Tatsache, dass andere Leute bieten nur 800.000 zu diesem Zeitpunkt der Marktpreis ist auch 800.000.',
  'Aber die Tatsache, dass andere Leute bieten nur 800.000 zu diesem Zeitpunkt der Marktpreis ist auch 800.000.'),
 ('Les Tech Clippings seront diffusés en exclusivité sur la chaîne Youtube DIGITIMES tous les vendredis à 20h.',
  'Les Tech Clippings seront diffusés en exclusivité sur la chaîne Youtube DIGITIMES tous les vendredis à 20h.'),
 ('Ich gebe Ihnen ein anderes Beispiel: Wenn Sie einmal unseren OPP sprechen und ich gebe Ihnen auf der Stelle 1.000 Dollar.',
  'Ich gebe Ihnen ein anderes Beispiel: Wenn Sie einmal unseren OPP sprechen und ich gebe Ihnen auf der Stelle 1.000 Dollar.'),
 ('Il y a 1,08 milliard de vaches dans le monde qui émettent 18% des émissions de carbone.',
  'Il y a un virgule zéro huit milliard de vaches dans le monde qui émettent 18% des émissions de carbone'),
 ('Ci sono 1,08 miliardi di mucche nel mondo che emettono il 18% delle emissioni di carbonio.',
  'Il y a un virgule zéro huit milliard de vaches dans le monde qui émettent 18% des émissions de carbone.')]

Expected Behavior:
The normalized text should not contain any digits.

Actual Behavior:
Digits are retained in the normalized output, which contradicts the expected behavior of a text normalization tool. The issue does not occur consistently, which is particularly problematic for tasks that require clean, digit-free text, such as grapheme-to-phoneme (g2p) conversion.

Environment:

nemo_text_processing version: 0.3.0rc0
Python version: Python 3.11.8

New release version?

Hi! I see a lot of new changes in main. Would it be possible to create a new release version soon?

Some bugs in English, German, Spanish, Italian normalizers

Hi!

I found a bug in English normalization. The following code is applied:

normalizer = Normalizer(
  input_case="cased",
  lang="en",
  deterministic=True,
)
norm_text = normalizer.normalize(text, punct_post_process=True)

text=Here is mail.nasa.gov.
norm_text=Here is mail dot nasa dot gov dot
expected output=Here is mail dot nasa dot gov.

Similar bug can be reached in German normalization. The following code is applied:

normalizer = Normalizer(
  input_case="cased",
  lang="de",
)
norm_text = normalizer.normalize(text, punct_post_process=True)

text=Here is brettspielversand.de.
norm_text=Here is b r e t t s p i e l v e r s a n d punkt de punkt
expected output=Here is brettspielversand punkt de.

There is a similar problem with text=KIM.com-Specials..
I get the same problem with websites in Spanish and Italian text.

I also found a specific bug in Spanish normalization. The following code is applied:

normalizer = Normalizer(
  input_case="cased",
  lang="es",
)
norm_text = normalizer.normalize(text, punct_post_process=True)

text=El texto de Li Qin en este libro ahora está disponible en forma de libro electrónico.
norm_text=El texto de quincuagésimo primero Qin en este libro ahora está disponible en forma de libro electrónico.
I am not sure what the expected output is, but the current norm_text does not look right.

logs added directly to the root logger making it harder to mute or control

I believe this line adds log records directly to the root logger. Shouldn't it be logger.info instead of logging.info?

I'm using nemo-text-processing within another process and want to change the log level for nemo-text-processing only.

logging.info(f"Skipping post-processing of {''.join(normalized_text)} for '{punct}'")
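The suggested change would look something like this sketch (illustrative, not the actual NeMo-text-processing source): a module-level named logger lets downstream code control the package's verbosity without touching the root logger:

```python
import logging

# Module-level logger named after the module, instead of logging.info,
# which goes through the root logger.
logger = logging.getLogger(__name__)

def skip_post_processing(normalized_text, punct):
    # Same message as the line quoted above, emitted via the named logger.
    logger.info("Skipping post-processing of %s for '%s'",
                "".join(normalized_text), punct)
```

A host application could then call logging.getLogger with the package's logger name and set whatever level it wants for just this package.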

zh text normalizer cannot handle "

In [12]: written = "你好\""

In [13]: normalizer.normalize(written, verbose=True)
tokens { name: "你好"" }
ERROR: StringFstToOutputLabels: Invalid start state
---------------------------------------------------------------------------
FstOpError                                Traceback (most recent call last)
Cell In[13], line 1
----> 1 normalizer.normalize(written, verbose=True)

File ~/LocalCodes/NeMo-text-processing/nemo_text_processing/text_normalization/normalize.py:354, in Normalizer.normalize(self, text, verbose, punct_pre_process, punct_post_process)
    352     if verbalizer_lattice is None:
    353         raise ValueError(f"No permutations were generated from tokens {s}")
--> 354     output += ' ' + Normalizer.select_verbalizer(verbalizer_lattice)
    355 output = SPACE_DUP.sub(' ', output[1:])
    357 if self.lang == "en" and hasattr(self, 'post_processor'):

File ~/LocalCodes/NeMo-text-processing/nemo_text_processing/text_normalization/normalize.py:642, in Normalizer.select_verbalizer(lattice)
    632 @staticmethod
    633 def select_verbalizer(lattice: 'pynini.FstLike') -> str:
    634     """
    635     Given verbalized lattice return shortest path
    636
   (...)
    640     Returns: shortest path
    641     """
--> 642     output = pynini.shortestpath(lattice, nshortest=1, unique=True).string()
    643     # lattice = output @ self.verbalizer.punct_graph
    644     # output = pynini.shortestpath(lattice, nshortest=1, unique=True).string()
    645     return output

File extensions/_pynini.pyx:462, in _pynini.Fst.string()

File extensions/_pynini.pyx:507, in _pynini.Fst.string()

FstOpError: Operation failed

Question: How does the logic for TimeFst for En work?

Hi.
Sorry if this is a basic question, but I am a beginner with pynini and confused by the logic behind the TimeFst in the En inverse text normalization.

In the file nemo_text_processing/inverse_text_normalization/en/taggers/time.py,
the time components are tagged into minutes and hours as required
e.g. twelve thirty -> time { hours: "12" minutes: "30" }
e.g. twelve past one -> time { minutes: "12" hours: "1" }

In the file nemo_text_processing/inverse_text_normalization/en/verbalizers/time.py,
the tagged strings are verbalized and only the time remains,
e.g. time { hours: "12" minutes: "30" } -> 12:30

I am unable to understand how the second case, the tagged string 'time { minutes: "12" hours: "1" }', is handled in the code.
Is it handled by reversing the terms during processing, or during the final processing in FinalVerbFst?

Would be glad if anyone could help. Thank you!

malloc error on initialization of inverse normalizer

During initialization of the inverse normalizer for the English language, the code sometimes crashes with the following error:

malloc(): unaligned tcache chunk detected

I traced the code and found that this occasional error occurred while importing some modules from the pynini package in nemo_text_processing/text_normalization/en/graph_utils.py, lines 23 to 25.

I fixed the issue by updating pynini to version 2.1.6; however, its version in the requirements file is explicitly pinned to 2.1.5, and that version seems to be the reason for the crash.

I think the pynini version in the requirements/requirements.txt file should be updated to 2.1.6.
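The suggested change is a one-line pin update (sketch; the exact current contents of requirements/requirements.txt are assumed):

```
# requirements/requirements.txt
pynini==2.1.6
```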

  • OS: Ubuntu 22.04
  • HW: CPU
  • python 3.10

German TN: normalized numbers wrongly include spaces

Describe the bug

In German, numbers are currently normalized with spaces between each digit and unit, although these should normally be written without spaces. In TTS systems, this leads to unnatural pauses in the output.

Steps/Code to reproduce bug

  1. Normalize "18940722"
  2. Output is "achtzehn millionen neun hundert vierzig tausend sieben hundert zwei und zwanzig"; compare the test-data line:
     achtzehn millionen neun hundert vierzig tausend sieben hundert zwei und zwanzig~18940722

Expected behavior

Output should be "achtzehn millionen neunhundertvierzigtausendsiebenhundertzweiundzwanzig" (spaces are introduced for millions and above).

Environment details

NVIDIA NeMo Text Processing 0.1.7rc0

Russian inverse text normalization is broken for numbers less than 10

Example:

from nemo_text_processing.inverse_text_normalization.inverse_normalize import InverseNormalizer
inv_norm = InverseNormalizer(lang='ru')

inv_norm.normalize('тридцать')
'30'

inv_norm.normalize('три')
'три'  # expected '3'

inv_norm.normalize('два')
'два'  # expected '2'

how to run Zh context-aware TN

I successfully ran the EN context-aware TN from the documentation:

wfst_lm_rescoring.py

I created a file "ChineseTN.json":

{"text": "全国有211所211高校。", "gt_normalized": "全国有二百一十一所二幺幺高校。"}

After changing the data and lang args in wfst_lm_rescoring.py, I got the following error:

FileNotFoundError: [Errno 2] No such file or directory: '/**/NeMo-text-processing/nemo_text_processing/text_normalization/zh/data/measure/measurements.tsv'

How can I solve this problem?
