scrapinghub / number-parser
Parse numbers written in natural language
License: BSD 3-Clause "New" or "Revised" License
I think it could be really cool to add an optional parameter to ignore some words.
Example:
>>> parse('twenty one')
'21'
>>> parse('twenty one', ignore=["one"])
'20 one'
or
>>> parse('I have three apples and one pear.')
'I have 3 apples and 1 pear.'
>>> parse('I have three apples and one pear.', ignore=["three"])
'I have three apples and 1 pear.'
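A minimal sketch of how such an `ignore` parameter could work, using a tiny stand-in word table (`WORDS` and `parse_with_ignore` are hypothetical names, not the library's API; the real `parse()` also merges adjacent words like "twenty one" into 21, which this word-by-word demo skips):

```python
WORDS = {"twenty": 20, "one": 1, "three": 3}  # hypothetical stand-in for the language data

def parse_with_ignore(text, ignore=()):
    # Convert known number words to digits, but leave ignored words untouched.
    # (Word-by-word only, to keep the demo short; the real parser also merges
    # adjacent number words into one value.)
    out = []
    for word in text.split():
        if word in WORDS and word not in ignore:
            out.append(str(WORDS[word]))
        else:
            out.append(word)
    return " ".join(out)

print(parse_with_ignore("twenty one", ignore=["one"]))  # -> 20 one
```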
Is there any documentation for the number-parser library? I was not able to find any, so I am thinking of writing some.
Also, how can I understand each step in the code? For now, I have been reading through every commit from the start.
I believe the number_parser.egg-info folder should not be in the repository. After you remove it, you should probably add *.egg-info/ to .gitignore. See protego's .gitignore for reference.
In [21]: number_parser.__version__
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[21], line 1
----> 1 number_parser.__version__
AttributeError: module 'number_parser' has no attribute '__version__'
In [22]: dir(number_parser)
Out[22]:
['__builtins__',
'__cached__',
'__doc__',
'__file__',
'__loader__',
'__name__',
'__package__',
'__path__',
'__spec__',
'data',
'parse',
'parse_fraction',
'parse_number',
'parse_ordinal',
'parser']
At this moment, the main goal of number-parser is to return number equivalences from different languages, but only when those words represent the number using the decimal numeral system (https://en.wikipedia.org/wiki/Decimal).
However, some numeral systems don't rely on the decimal system and use other structures. That's the case for the Roman numeral system (https://en.wikipedia.org/wiki/Numeral_system) and the Chinese/Japanese/Korean/Vietnamese numeral systems (https://en.wikipedia.org/wiki/Chinese_numerals and https://en.wikipedia.org/wiki/Suzhou_numerals).
We could probably add support for them in a future version, as they will probably need another kind of parser.
For more on this, you can also check: https://en.wikipedia.org/wiki/Numeral_system
When running tox some errors are raised from the README.rst file:
This:
>>> output = parse_number("not_a_number")
>>> output
None
should be:
>>> output = parse_number("not_a_number")
>>> output
(removing the None)
This:
>>> parse("two thousand thousand")
2000 1000
should be:
>>> parse("two thousand thousand")
'2000 1000'
And there is a trailing space after 1432524 in:
>>> parse_number("चौदह लाख बत्तीस हज़ार पाँच सौ चौबीस", language='hi')
One use case I can think of is parsing phone numbers or zip codes. These might be written in the form "two three zero two five eight", with each digit spelled out. Using parse would return the space-separated string '2 3 0 2 5 8', while parse_number would give None. Neither gives the wanted output, the number 230258. (Of course, the user can do some additional processing on the parse output, which would work, but having this feature in the library itself might be better.)
We could have a parameter in parse_number, say relaxed, which when set to true would build this up as one large number.
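A sketch of the post-processing such a relaxed mode could do on the parse() output, assuming the spelled-out digits have already been converted to '2 3 0 2 5 8' (concat_digits is a hypothetical helper, not the library's API):

```python
import re

def concat_digits(parsed: str):
    # If every whitespace-separated token is a single digit, join them
    # into one number; otherwise signal that this is not a digit sequence.
    tokens = parsed.split()
    if tokens and all(re.fullmatch(r"\d", t) for t in tokens):
        return int("".join(tokens))
    return None

print(concat_digits("2 3 0 2 5 8"))  # -> 230258
```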
When passing a non-existent locale, the parse_number function raises a FileNotFoundError:
>>> parse_number('one', lang='aaa')
...
FileNotFoundError: [Errno 2] No such file or directory: '../number_parser/translation_data_merged/aaa.json'
We should handle this case by raising a ValueError with a message about the error ("locale not supported").
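A minimal sketch of that validation, assuming a known set of supported locales (SUPPORTED_LANGUAGES and check_language are illustrative names, not the library's actual internals):

```python
SUPPORTED_LANGUAGES = {"en", "es", "hi", "ru"}  # assumed subset of supported locales

def check_language(lang: str) -> None:
    # Fail fast with a clear message instead of letting a missing JSON
    # data file surface as a FileNotFoundError deep inside the parser.
    if lang not in SUPPORTED_LANGUAGES:
        raise ValueError(f"locale not supported: {lang!r}")
```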
For the given two examples, the results are different but they should be similar.
>>> parse("I will eat banana and apple in two and three minutes")
'I will eat banana and apple in 2 and 3 minutes'
>>> parse("I will eat banana and apple in two and 3 minutes")
'I will eat banana and apple in 2 3 minutes'
The es-419.py file doesn't exist (while the JSONs do).
root is not a locale (as explained here: http://cldr.unicode.org/core-spec), so the JSON files for root should be deleted.
I open this ticket to track the ordinal's feature.
From my understanding, what we should achieve is:
>>> parse('first')
'1st'
>>> parse('second')
'2nd'
>>> parse('third')
'3rd'
>>> parse('twenty-third')
'23rd'
>>> parse('thirtieth')
'30th'
However, as we support other words in the sentence, we should probably take care of some ambiguous words. I would take special care with "second". I think it should be translated to "2nd" only when it's not preceded by:
- 1 (example: "1 second")
- one (example: "one second")
- a (example: "a second")
- another ordinal (examples: "first second" --> "1st second", "fourth second" --> "4th second")
Of course, this logic would probably only be necessary for some languages, so it shouldn't be inside the main logic but in a language-specific section.
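The preceding-word check could be sketched like this (an English-specific heuristic; is_time_unit_context is a hypothetical helper, and it only covers already-converted ordinals like "1st", not the word "first"):

```python
import re

def is_time_unit_context(prev):
    # "second" should stay a time unit when preceded by a quantity or an
    # ordinal: "1 second", "one second", "a second", "1st second".
    if prev is None:
        return False
    return bool(re.fullmatch(r"\d+(st|nd|rd|th)?", prev)) or prev in {"a", "one"}

print(is_time_unit_context("1st"))  # -> True
```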
Hi,
We can't really use "first day" as "1 day" or "second day" as "2 day" in a real-life context. For this reason we can't use this wonderful library.
Please leave them as they are or think of an alternative. In the meantime, give us an exception-list option.
Prabhat
This library should support multiple languages.
As a first approach, we could support English (default) and three more languages, which could be Spanish, Russian, and Hindi, as they are broadly used and have different alphabets.
We could use an approach similar to the one used in dateparser.
It works like this:
- json files coming from CLDR
- yaml files containing language-specific exceptions
- py files merging both sources
The script used in dateparser is this one: https://github.com/scrapinghub/dateparser/blob/master/dateparser_scripts/write_complete_data.py but it's not a good example, as there are a lot of things to be improved and some bad practices.
To allow this library to support this, we could just add a locale or similar argument to the parse() function (defaulting to English). I don't expect it to autodetect the language, at least in this first iteration.
@arnavkapoor feel free to implement it in the way you think is better. We can also achieve this in separated PRs, no need to do just one PR.
I’ve been thinking about how we could extensively test our library against a language to be able to affirm that the library officially supports it.
My first ideas were:
Doing the first is easy, but doing the second is not. We can't test all numbers, so we need to select a set of different numbers to check. After some time trying different, crazy things, I got this list:
1234
23451
345612
4567123
56781234
678912345
7890123456
89091234567
909812345678
987123456789
98761234567890
876512345678909
7654123456789098
65431234567890987
543212345678909876
4321123456789098765
32101234567890987654
210012345678909876543
1000123456789098765432
1234567890987654321
12345678909876543210
123456789098765432100
1234567890987654321000
It’s created by appending the next digit to the end and then moving the first digit to the end (and then doing the same but with the digit before).
It’s not perfect, but I think that it’s diverse enough to be able to say that if the parser works for these cases, it will probably work for the most common existing combinations.
What do you think, guys? @arnavkapoor @lopuhin @kishan3
I built some spiders to scrape different websites to get these numbers in words and added here the datasets: https://github.com/noviluni/numbers-data
Sources:
Unfortunately, they don't support Hindi, but I can search a website supporting Hindi and create a new spider if it's necessary.
If you like this idea, we can use these CSVs directly in the tests (don't worry about this, @arnavkapoor, I could show you how I would do it).
In case you have another idea of input numbers for the tests, I can generate for you a dataset for a lot of locales with the numbers you want.
Let me know what you think or if you have any other idea/approach. 🙂
Originally posted by @noviluni in #15 (comment)
Some languages have accents, etc. and they are commonly missing, so when trying to do something like this:
>>> parse_number('veintiun', lang='es')
it doesn't work, but doing this:
>>> parse_number('veintiún', lang='es')
it works. We should normalize the language data to avoid this kind of issue.
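One way to normalize is to strip combining marks from both the language data and the input, e.g. (strip_accents is a sketch, not the library's API):

```python
import unicodedata

def strip_accents(text: str) -> str:
    # Decompose accented characters (NFKD), then drop the combining marks,
    # so 'veintiún' and 'veintiun' normalize to the same key.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("veintiún"))  # -> veintiun
```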
We can add support for simple fractions as well, like three-fourths, half, etc.
Example:
>>> parse("I have a three-fourth cup of glass.")
'I have a 3/4 cup of glass.'
>>> parse_number("half")
1/2
Thank you for your package, but there is a case which isn't handled: mixing digits and number words.
I tried this:
1. print(parse_number("2.4 million")) gives None
2. parse_ordinal("2.4 million") gives None
3. parse("2.4 million") gives '2.4 1000000'
My expectation is to get 2400000 (i.e. 2.4 * 10**6).
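A sketch of a post-processing step that could cover this case, assuming a small multiplier table (parse_scaled and MULTIPLIERS are illustrative names, not part of the library):

```python
import re
from decimal import Decimal

MULTIPLIERS = {"thousand": 10**3, "million": 10**6, "billion": 10**9}  # assumed subset

def parse_scaled(text):
    # Match "<decimal number> <multiplier word>"; Decimal keeps the decimal
    # literal exact, so no float rounding concerns when scaling.
    m = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s+(\w+)\s*", text)
    if not m:
        return None
    value, word = m.groups()
    scale = MULTIPLIERS.get(word.lower())
    if scale is None:
        return None
    return int(Decimal(value) * scale)

print(parse_scaled("2.4 million"))  # -> 2400000
```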
When I enter "hundredandfive thirtyone and some other text" I get ['105', '31', 'some', 'other', 'text'].
It seems as if the dangling "and" is swallowed.
Small bug / issue. In normal operation the parse() function respects white space around sentence separators ...
>>> parse("foo , bar , baz .")
'foo , bar , baz .'
This is not the case when the white space follows a number ...
>>> parse("one , two , three .")
'1, 2, 3.'
Ideally the behaviour would be consistent and respect the original whitespace around any separators / words.
I have implemented a fix and pull request for this (issue #77).
One of the major uses after building number-parser might be to integrate it with price-parser and dateparser, as suggested by @noviluni.
There are many features that are similar in these libraries. Consider #4 in number-parser and #1 in price-parser: in both of these, the basic idea is to take a string of numbers mixed with words and return it as a number.
Example:
>>> parse("1 million")
'1000000'
>>> parse("2k in USD")
'2000 in USD'
This feature seems to fit better in number-parser, which could then be integrated with price-parser.
There are many similar features that are related to each other in many ways, hence integrating might be a good option, but I have a few questions related to it.
Instead of
>>> parse_price('115 millions de dollars (estimation)')  # parse_price of price-parser
Price(amount=Decimal('115000000'), currency='dollars')
should we use both libraries in the application, or implement this in the backend:
>>> price1 = parse('115 millions de dollars (estimation)')  # parse of number-parser
>>> parse_price(price1)  # parse_price of price-parser
Price(amount=Decimal('115000000'), currency='dollars')
(and similarly for dateparser and number-parser)?
I was also thinking of taking this up as part of GSoC 2021, and would like to hear your views.
Hi @arnavkapoor!
I was trying the package, and I found a side effect when trying to parse sentences in Spanish.
>>> parse('Avisté tres y luego nos fuimos.', language='es')
'Avisté 3 luego nos fuimos.'
this would be approx. equivalent to:
>>> parse('I sighted three and then we left')
'I sighted 3 then we left'
As you can see, the "y" (or "and") disappears, but it should remain. We should probably avoid taking the "SKIP_TOKENS" when they are at the end of the number. What do you think?
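One way this could look, assuming number chunks are collected as token lists before being replaced (the SKIP_TOKENS contents and trim_trailing_skips are illustrative, the source only confirms that a SKIP_TOKENS set exists):

```python
SKIP_TOKENS = {"and", "y"}  # assumed connector words for en/es

def trim_trailing_skips(chunk):
    # Drop connector words from the end of a matched number chunk so they
    # are written back into the sentence instead of being swallowed.
    while chunk and chunk[-1] in SKIP_TOKENS:
        chunk = chunk[:-1]
    return chunk

print(trim_trailing_skips(["tres", "y"]))  # -> ['tres']
```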
In [11]: parse_number("мільйон") is None
Out[11]: True
In [12]: parse_number("міліон") is None
Out[12]: True
These are two spellings of the word "million" in Ukrainian.
Is it a problem that Ukrainian is not in supported languages? https://github.com/scrapinghub/number-parser/blob/master/number_parser/parser.py#L5
But I see Ukrainian here https://github.com/scrapinghub/number-parser/blob/master/number_parser/data/uk.py#L53
Thank you.
Hello!
After updating to 0.3.1, I can no longer import the library. This happens because the VERSION file is not included in the built distribution, and the call to pkgutil.get_data() raises an error in that case. I'm not sure why, though, as it's in your MANIFEST.in.
In [1]: import number_parser
------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
Cell In[1], line 1
----> 1 import number_parser
File /opt/homebrew/anaconda3/envs/venv/lib/python3.10/site-packages/number_parser/__init__.py:6
1 import pkgutil
3 from number_parser.parser import (parse, parse_fraction, parse_number,
4 parse_ordinal)
----> 6 __version__ = (pkgutil.get_data(__package__, "VERSION") or b"").decode("ascii").strip()
8 del pkgutil
File /opt/homebrew/anaconda3/envs/venv/lib/python3.10/pkgutil.py:639, in get_data(package, resource)
637 parts.insert(0, os.path.dirname(mod.__file__))
638 resource_name = os.path.join(*parts)
--> 639 return loader.get_data(resource_name)
File <frozen importlib._bootstrap_external>:1073, in get_data(self, path)
FileNotFoundError: [Errno 2] No such file or directory: '/opt/homebrew/anaconda3/envs/venv/lib/python3.10/site-packages/number_parser/VERSION'
I'm on Python 3.10.9, number-parser-0.3.1, MacOS 13.2.1.
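Until the packaging issue is fixed, one robust pattern is to fall back to the installed distribution's metadata instead of a data file (get_version and the "unknown" fallback are a sketch, not the library's actual code):

```python
from importlib.metadata import PackageNotFoundError, version

def get_version(dist_name: str) -> str:
    # Read the version from the installed distribution's metadata, which
    # exists even when package data files were left out of the wheel.
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return "unknown"

print(get_version("surely-not-installed-dist"))  # -> unknown
```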
I looked at German and French, and both have some idiosyncrasies which need to be handled. For French, using quatre-vingt for 80 and then allowing numbers from 1 to 19 as the suffix, e.g. quatre-vingt-dix-neuf for 99 (80 + 19), would need to be handled. With German, it's a more fundamental issue, as units come before tens: achtundzwanzig (28) is literally "eight-and-twenty". Can refer to this for more details about this building method; other languages use it too. This might be fixed by reversing the list of tokens (but I need to look more into it). Also, for larger numbers (greater than one thousand, I believe) German does revert to left-to-right.
While this only mentions two languages, there would definitely be such cases and exceptions in other languages too.
We should add tests for the "officially" supported languages, which are (at this moment): Hindi, Spanish, and Russian (English already has tests).
Please update the package on PyPI with the latest features.
Parsing a fraction is not available in the current number-parser release on PyPI:
from number_parser import parse_fraction
The following links:
Source code: https://github.com/arnavkapoor/number-parser
Issue tracker: https://github.com/arnavkapoor/number-parser/issues
are redirecting to:
https://github.com/scrapinghub/number-parser
https://github.com/scrapinghub/number-parser/issues
I was thinking of changing them to point directly to these links.
Opening a new ticket for adding support for decimal and negative numbers. More details are discussed here.
Expected outcomes
>>> parse("he scored fifty three point six percent in the finals")
'he scored 53.6 percent in the finals'
>>> parse_number("twelve point seven")
12.7
>>> parse_number("minus 9.4")
-9.4
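A self-contained sketch of how "point" and "minus" could be layered on top of integer parsing, using a tiny stand-in word table (WORDS and parse_decimal are illustrative; a real implementation would reuse parse_number for the integer parts):

```python
WORDS = {"twelve": 12, "seven": 7, "nine": 9}  # stand-in for the integer parser

def parse_decimal(text):
    # "minus" flips the sign; the words after "point" are read digit by
    # digit, while the word before it is the integer part.
    tokens = text.lower().split()
    sign = 1
    if tokens and tokens[0] == "minus":
        sign, tokens = -1, tokens[1:]
    if "point" in tokens:
        i = tokens.index("point")
        whole = WORDS[tokens[0]] if i else 0  # simplified: one-word integer part
        digits = "".join(str(WORDS[t]) for t in tokens[i + 1:])
        return sign * float(f"{whole}.{digits}")
    return sign * WORDS[tokens[0]]

print(parse_decimal("twelve point seven"))  # -> 12.7
```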
Some edge cases that should be fixed:
On the other hand, when having "5 hundred" I think it should return "500". What do you think?
Example:
>>> parse("one million")
'1000000'
>>> parse("1 million")
'1 1000000'
Please add a test with each of these strings when fixing it :)
On the other hand, I think there are some currently working cases that should be added to the tests:
Let me know if you have any doubt. 🙂
I think it could be really useful to have a function called parse_number() that expects a written number and returns the equivalent number (a Python number):
>>> parse_number("one")
1
>>> parse_number("twenty six")
26
It's easy to imagine use cases, and it will be useful to refactor the current code.
With the currently implemented logic, it shouldn't be difficult to create it by using the number_builder.
>>> number_builder(["twenty", "six"])
['26']
>>> number_builder(["one"])
['1']
We should decide what to do if we pass something else. I think we could just return None.
>>> parse_number("one second")
None
>>> parse_number("twenty six cars")
None
Of course, this will be multi-locale, and it would be useful to accept a locales or locale argument:
>>> parse_number("uno", locales=["en"])
None
>>> parse_number("uno", locales=["es"])
1
We should update the README with more details, like installation, example usage, etc.
Makes sense wherever it does not affect external APIs.
e.g. if a function uses OrderedDict internally, then yes; if a function returns OrderedDict, we probably should follow a deprecation approach, e.g. deprecate that function (but keep it for a time) and provide a new one that returns a dict.
Languages without delimiters: Japanese, Chinese (Simplified, Traditional), and possibly other East Asian languages don't have any delimiter, e.g. 九千九百九十九 (9999 in Japanese). These actually have a very similar structure to English, but the lack of a delimiter makes them tougher.
Also, there isn't a delimiter as such (up to a certain number) for German and Dutch.
One approach for the delimiter issue is reading words character by character, and as soon as we have a match with any known number word, we insert a space; after this pre-processing step we can follow the same logic. This does increase the complexity to O(string_length^2), which shouldn't be a major issue, I believe. (We could use this function only for languages without delimiters.)
Concrete example
five thousand nine hundred and thirteen - English (5913)
fünftausendneunhundertdreizehn - German (5913)
nine hundred and thirteen - English (913)
negenhonderddertien - Dutch (913)
To handle this we first check f, fü, fün and finally hit fünf = 5 (and similarly get negen = 9), insert a space, and then start again from the next character.
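This pre-processing step could be sketched as a segmentation pass over a token table (TOKENS here is a tiny hypothetical subset of the German data; a longest-match rule is used so that a prefix of a longer token doesn't match too early):

```python
# Hypothetical subset of German number tokens; the real data would come
# from the per-language data files.
TOKENS = {"fünf", "tausend", "neun", "hundert", "dreizehn"}

def segment(word):
    # Greedy longest-match segmentation: at each position, take the longest
    # known token and insert a boundary after it.
    parts, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in TOKENS:
                parts.append(word[i:j])
                i = j
                break
        else:
            return None  # no known token starts here; give up
    return parts

print(segment("fünftausendneunhundertdreizehn"))
```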
@arnavkapoor some examples of cardinal numbers in Russian, without any fancy stuff and in nominative case (the "main" one):
0,ноль
1,один
2,два
3,три
4,четыре
6,шесть
30,тридцать
40,сорок
90,девяносто
100,сто
200,двести
596,пятьсот девяносто шесть
25,двадцать пять
15,пятнадцать
44,сорок четыре
357,триста пятьдесят семь
891,восемьсот девяносто один
2000,две тысячи
2020,две тысячи двадцать
7004,семь тысяч четыре
21000000,двадцать один миллион
21000073,двадцать один миллион семьдесят три
Other cases:
1,одного
1,одному
1,одним
1,одном
2,двух
2,двум
2,двумя
3,трех
3,трем
3,тремя
4,четырех
4,четырем
4,четырьмя
6,шести
6,шестью
30,тридцати
30,тридцать
30,тридцатью
40,сорока
90,девяноста
100,ста
200,двухсот
200,двумстам
200,двумястами
200,двухстах
596,пятисот девяноста шести
596,пятистам девяноста шести
596,пятьсот девяносто шесть
596,пятьюстами девяноста шестью
596,пятистах девяноста шести
500,пятьюстами
1000,тысяче
256,двести пятьдесят шесть
256,двухсот пятидесяти шести
256,двумястами пятьюдесятью шестью
21000000,двадцать одним миллионом
I tried to not make any typos (mostly copying examples from various grammatical guides) but if something does not work please tell and I'll double-check.
The current variables/structure for the data files are as follows:
info = {
"UNIT_NUMBERS": { },
"BASE_NUMBERS": { },
"MTENS": { },
"MHUNDREDS": { },
"MULTIPLIERS": { },
"VALID_TOKENS": { }
}
More information on what each variable stores is here.
However, the variable names aren't very descriptive, and based on the discussion here, I plan to change them to the following.
info = {
"UNIT_NUMBERS": { },
"DIRECT_NUMBERS": { },
"TENS": { },
"HUNDREDS": { },
"BIG_POWERS_OF_TEN": { },
"SKIP_TOKENS": [ ]
}
The structural change is that SKIP_TOKENS is no longer a dictionary but a plain list; the rest are variable name changes.
Opinions?
To recreate this, run pytest on Windows. This error is taken from test_language_hi.py:
tests\test_language_hi.py:42 (test_parse_number_till_hundred)
def test_parse_number_till_hundred():
> _test_files(HUNDREDS_DIRECTORY, LANG)
test_language_hi.py:44:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
__init__.py:25: in _test_files
for row in csv_reader:
~\Python\Python39\lib\csv.py:110: in __next__
self.fieldnames
~\Python\Python39\lib\csv.py:97: in fieldnames
self._fieldnames = next(self.reader)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <encodings.cp1252.IncrementalDecoder object at 0x000001743C3E54F0>
input = b'number,text\n0,\xe0\xa4\xb6\xe0\xa5\x82\xe0\xa4\xa8\xe0\xa5\x8d\xe0\xa4\xaf\n1,\xe0\xa4\x8f\xe0\xa4
\x95\n2,\xe0\xa4\...x8d\xe0\xa4\xaf\xe0\xa4\xbe\xe0\xa4\xa8\xe0\xa4\xac\xe0\xa5\x87\n100,\xe0\xa4\x8f\xe0\xa4
\x95 \xe0\xa4\xb8\xe0\xa5\x8c'
final = False
def decode(self, input, final=False):
> return codecs.charmap_decode(input,self.errors,decoding_table)[0]
E UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 25: character maps to <undefined>
~\Python\Python39\lib\encodings\cp1252.py:23: UnicodeDecodeError
FAILED [100%]
tests\test_language_hi.py:46 (test_parse_number_permutations)
def test_parse_number_permutations():
> _test_files(PERMUTATION_DIRECTORY, LANG)
test_language_hi.py:48:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
__init__.py:25: in _test_files
for row in csv_reader:
~\Python\Python39\lib\csv.py:110: in __next__
self.fieldnames
~\Python\Python39\lib\csv.py:97: in fieldnames
self._fieldnames = next(self.reader)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <encodings.cp1252.IncrementalDecoder object at 0x000001743C477700>
input = b'number,text\r\n1234,\xe0\xa4\x8f\xe0\xa4\x95 \xe0\xa4\xb9\xe0\xa4\x9c\xe0\xa4\xbe\xe0\xa4
\xb0 \xe0\xa4\xa6\xe0\xa5\x...0\xa4\xb8\xe0\xa5\x8c \xe0\xa4\xaa\xe0\xa5\x88\xe0\xa4\x82\xe0\xa4\xa4
\xe0\xa4\xbe\xe0\xa4\xb2\xe0\xa5\x80\xe0\xa4\xb8'
final = False
def decode(self, input, final=False):
> return codecs.charmap_decode(input,self.errors,decoding_table)[0]
E UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 20: character maps to <undefined>
~\Python\Python39\lib\encodings\cp1252.py:23: UnicodeDecodeError
This only happens on Windows, not Linux. It can be solved by adding encoding='utf8' to the open() call on line 23 of __init__.py.
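A minimal reproduction of the fix (the filename is made up; the point is that omitting encoding="utf-8" makes open() use the platform default, which is cp1252 on Windows and cannot decode UTF-8 Devanagari bytes):

```python
import csv
import os
import tempfile

# Write a UTF-8 CSV containing Devanagari text, then read it back with an
# explicit encoding; relying on the platform default is what breaks on Windows.
path = os.path.join(tempfile.mkdtemp(), "hundreds.csv")
with open(path, "w", encoding="utf-8", newline="") as fp:
    fp.write("number,text\n0,शून्य\n1,एक\n")

with open(path, encoding="utf-8") as fp:
    rows = list(csv.DictReader(fp))

print(rows[0]["text"])  # -> शून्य
```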
In number-parser 0.3.0, Ukrainian "five" is not parsed with either apostrophe variant (ASCII ' vs typographic ’):
In [17]: parse("п'ять")
Out[17]: "п'ять"
In [18]: parse("п’ять")
Out[18]: 'п’ять'