scrapinghub / number-parser
Parse numbers written in natural language
License: BSD 3-Clause "New" or "Revised" License
I think it could be really cool to add an optional parameter to ignore some words.
Example:
>>> parse('twenty one')
'21'
>>> parse('twenty one', ignore=["one"])
'20 one'
or
>>> parse('I have three apples and one pear.')
'I have 3 apples and 1 pear.'
>>> parse('I have three apples and one pear.', ignore=["three"])
'I have three apples and 1 pear.'
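A minimal sketch of how such an `ignore` parameter could work, using a tiny stand-in word table (`WORDS` and `parse_with_ignore` are hypothetical names, not the library's API; the real `parse()` also merges adjacent words like "twenty one" into 21, which this word-by-word demo skips):

```python
WORDS = {"twenty": 20, "one": 1, "three": 3}  # hypothetical stand-in for the language data

def parse_with_ignore(text, ignore=()):
    # Convert known number words to digits, but leave ignored words untouched.
    # (Word-by-word only, to keep the demo short; the real parser also merges
    # adjacent number words into one value.)
    out = []
    for word in text.split():
        if word in WORDS and word not in ignore:
            out.append(str(WORDS[word]))
        else:
            out.append(word)
    return " ".join(out)

print(parse_with_ignore("twenty one", ignore=["one"]))  # -> 20 one
```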
Is there any documentation for the number-parser library? I was not able to find any, so I am thinking of writing some.
Also, how can I understand each step in the code? For now, I have been reading through every commit from the start.
I believe the number_parser.egg-info folder should not be in the repository. After you remove it, you should probably add *.egg-info/ to .gitignore. See protego's .gitignore for reference.
In [21]: number_parser.__version__
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[21], line 1
----> 1 number_parser.__version__
AttributeError: module 'number_parser' has no attribute '__version__'
In [22]: dir(number_parser)
Out[22]:
['__builtins__',
'__cached__',
'__doc__',
'__file__',
'__loader__',
'__name__',
'__package__',
'__path__',
'__spec__',
'data',
'parse',
'parse_fraction',
'parse_number',
'parse_ordinal',
'parser']
At this moment, the main goal of number-parser is to return number equivalences from different languages, but only when those words represent the number using the decimal numeral system (https://en.wikipedia.org/wiki/Decimal).
However, some numeral systems don't rely on the decimal system and use other structures. That's the case for the Roman numeral system (https://en.wikipedia.org/wiki/Numeral_system) and the Chinese/Japanese/Korean/Vietnamese numeral systems (https://en.wikipedia.org/wiki/Chinese_numerals and https://en.wikipedia.org/wiki/Suzhou_numerals).
We could probably add support for them in a future version, as they will probably need another kind of parser.
For more on this, you can also check: https://en.wikipedia.org/wiki/Numeral_system
When running tox some errors are raised from the README.rst file:
This:
>>> output = parse_number("not_a_number")
>>> output
None
should be:
>>> output = parse_number("not_a_number")
>>> output
(removing the None)
This:
>>> parse("two thousand thousand")
2000 1000
should be:
>>> parse("two thousand thousand")
'2000 1000'
And there is a trailing space after 1432524 in:
>>> parse_number("चौदह लाख बत्तीस हज़ार पाँच सौ चौबीस", language='hi')
One use case I can think of is parsing phone numbers or zip codes. These might be written in the form "two three zero two five eight", with each digit spelled out. Using parse would return the space-separated string '2 3 0 2 5 8', while parse_number would give None. Neither gives the wanted output, the number 230258. (Of course, the user can do some additional processing on the parse output, which would work, but having this feature in the library itself might be better.)
We could have a parameter in parse_number, say relaxed, which when set to true would build this up as one large number.
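A sketch of the post-processing such a relaxed mode could do on the parse() output, assuming the spelled-out digits have already been converted to '2 3 0 2 5 8' (concat_digits is a hypothetical helper, not the library's API):

```python
import re

def concat_digits(parsed: str):
    # If every whitespace-separated token is a single digit, join them
    # into one number; otherwise signal that this is not a digit sequence.
    tokens = parsed.split()
    if tokens and all(re.fullmatch(r"\d", t) for t in tokens):
        return int("".join(tokens))
    return None

print(concat_digits("2 3 0 2 5 8"))  # -> 230258
```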
When passing a non-existent locale, the parse_number function raises a FileNotFoundError:
>>> parse_number('one', lang='aaa')
...
FileNotFoundError: [Errno 2] No such file or directory: '../number_parser/translation_data_merged/aaa.json'
We should handle this case by raising a ValueError with a message about the error ("locale not supported").
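A minimal sketch of that validation, assuming a known set of supported locales (SUPPORTED_LANGUAGES and check_language are illustrative names, not the library's actual internals):

```python
SUPPORTED_LANGUAGES = {"en", "es", "hi", "ru"}  # assumed subset of supported locales

def check_language(lang: str) -> None:
    # Fail fast with a clear message instead of letting a missing JSON
    # data file surface as a FileNotFoundError deep inside the parser.
    if lang not in SUPPORTED_LANGUAGES:
        raise ValueError(f"locale not supported: {lang!r}")
```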
For the given two examples, the results are different but they should be similar.
>>> parse("I will eat banana and apple in two and three minutes")
'I will eat banana and apple in 2 and 3 minutes'
>>> parse("I will eat banana and apple in two and 3 minutes")
'I will eat banana and apple in 2 3 minutes'
The es-419.py file doesn't exist (while the JSONs do).
root is not a locale (as explained here: http://cldr.unicode.org/core-spec), so the JSON files for root should be deleted.
I open this ticket to track the ordinal's feature.
From my understanding, what we should achieve is:
>>> parse('first')
'1st'
>>> parse('second')
'2nd'
>>> parse('third')
'3rd'
>>> parse('twenty-third')
'23rd'
>>> parse('thirtieth')
'30th'
However, as we support other words in the sentence, we should probably take care of some ambiguous words. I would take special care with "second". I think it should be translated to "2nd" only when it's not preceded by:
- 1 (example: "1 second")
- one (example: "one second")
- a (example: "a second")
- another ordinal (examples: "first second" --> "1st second", "fourth second" --> "4th second")
Of course, this logic would probably only be necessary for some languages, so it shouldn't be inside the main logic but in a language-specific section.
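The preceding-word check could be sketched like this (an English-specific heuristic; is_time_unit_context is a hypothetical helper, and it only covers already-converted ordinals like "1st", not the word "first"):

```python
import re

def is_time_unit_context(prev):
    # "second" should stay a time unit when preceded by a quantity or an
    # ordinal: "1 second", "one second", "a second", "1st second".
    if prev is None:
        return False
    return bool(re.fullmatch(r"\d+(st|nd|rd|th)?", prev)) or prev in {"a", "one"}

print(is_time_unit_context("1st"))  # -> True
```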
Hi,
We can't really use "first day" as "1 day" or "second day" as "2 day" in a real-life context. For this reason we can't use this wonderful library.
Please leave them as they are or think of an alternative. In the meantime, give us an exception-list option.
Prabhat
This library should support multiple languages.
As a first approach, we could support English (default) and three more languages, which could be Spanish, Russian, and Hindi, as they are broadly used and have different alphabets.
We could use an approach similar to the one used in dateparser.
It works like this:
- json files coming from CLDR
- yaml files containing language-specific exceptions
- py files merging both sources
The script used in dateparser is this one: https://github.com/scrapinghub/dateparser/blob/master/dateparser_scripts/write_complete_data.py but it's not a good example, as there are a lot of things to be improved and some bad practices.
To allow this library to support this, we could just add a locale or similar argument to the parse() function (defaulting to English). I don't expect it to autodetect the language, at least in this first iteration.
@arnavkapoor feel free to implement it in the way you think is better. We can also achieve this in separated PRs, no need to do just one PR.
I’ve been thinking about how we could extensively test our library against a language to be able to affirm that the library officially supports it.
My first ideas were:
Doing the first is easy, but doing the second is not. We can't test all numbers, so we need to select a set of different numbers to check. After some time trying different, crazy things, I got this list:
1234
23451
345612
4567123
56781234
678912345
7890123456
89091234567
909812345678
987123456789
98761234567890
876512345678909
7654123456789098
65431234567890987
543212345678909876
4321123456789098765
32101234567890987654
210012345678909876543
1000123456789098765432
1234567890987654321
12345678909876543210
123456789098765432100
1234567890987654321000
It’s created by appending the next digit to the end and then moving the first digit to the end (and then doing the same but with the digit before).
It’s not perfect, but I think that it’s diverse enough to be able to say that if the parser works for these cases, it will probably work for the most common existing combinations.
What do you think, guys? @arnavkapoor @lopuhin @kishan3
I built some spiders to scrape different websites to get these numbers in words and added here the datasets: https://github.com/noviluni/numbers-data
Sources:
Unfortunately, they don't support Hindi, but I can search a website supporting Hindi and create a new spider if it's necessary.
If you like this idea, we can use these CSVs directly in the tests (don't worry about this, @arnavkapoor, I could show you how I would do it).
In case you have another idea of input numbers for the tests, I can generate for you a dataset for a lot of locales with the numbers you want.
Let me know what you think or if you have any other idea/approach. 🙂
Originally posted by @noviluni in #15 (comment)
Some languages have accents, etc. and they are commonly missing, so when trying to do something like this:
>>> parse_number('veintiun', lang='es')
it doesn't work, but doing this:
>>> parse_number('veintiún', lang='es')
it works. We should normalize the language data to avoid this kind of issue.
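One way to normalize is to strip combining marks from both the language data and the input, e.g. (strip_accents is a sketch, not the library's API):

```python
import unicodedata

def strip_accents(text: str) -> str:
    # Decompose accented characters (NFKD), then drop the combining marks,
    # so 'veintiún' and 'veintiun' normalize to the same key.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("veintiún"))  # -> veintiun
```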
We can add support for simple fractions as well, like three-fourths, half, etc.
Example:
>>> parse("I have a three-fourth cup of glass.")
'I have a 3/4 cup of glass.'
>>> parse_number("half")
1/2
Thank you for your package, but there is a case which isn't handled: mixing digits and number words.
I tried this:
1. print(parse_number("2.4 million")) gives None
2. parse_ordinal("2.4 million") gives None
3. parse("2.4 million") gives '2.4 1000000'
My expectation is to get 2400000 (i.e. 2.4 * 10**6).
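A sketch of a post-processing step that could cover this case, assuming a small multiplier table (parse_scaled and MULTIPLIERS are illustrative names, not part of the library):

```python
import re
from decimal import Decimal

MULTIPLIERS = {"thousand": 10**3, "million": 10**6, "billion": 10**9}  # assumed subset

def parse_scaled(text):
    # Match "<decimal number> <multiplier word>"; Decimal keeps the decimal
    # literal exact, so no float rounding concerns when scaling.
    m = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s+(\w+)\s*", text)
    if not m:
        return None
    value, word = m.groups()
    scale = MULTIPLIERS.get(word.lower())
    if scale is None:
        return None
    return int(Decimal(value) * scale)

print(parse_scaled("2.4 million"))  # -> 2400000
```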
When I enter "hundredandfive thirtyone and some other text" I get ['105', '31', 'some', 'other', 'text'].
It seems as if the dangling "and" is swallowed.
Small bug / issue. In normal operation the parse() function respects white space around sentence separators ...
>>> parse("foo , bar , baz .")
'foo , bar , baz .'
This is not the case when the white space follows a number ...
>>> parse("one , two , three .")
'1, 2, 3.'
Ideally the behaviour would be consistent and respect the original whitespace around any separators / words.
I have implemented a fix and pull request for this (issue #77).
One of the major uses after building number-parser might be to integrate it with price-parser and dateparser, as suggested by @noviluni.
There are many features that are similar in these libraries. Consider #4 in number-parser and #1 in price-parser: in both of these, the basic idea is to take a string of numbers mixed with words and return it as a number.
Example:
>>> parse("1 million")
'1000000'
>>> parse("2k in USD")
'2000 in USD'
This feature seems to fit better in number-parser, which could then be integrated with price-parser.
There are many similar features that are related to each other in many ways, hence integrating might be a good option, but I have a few questions related to it.
Instead of
>>> parse_price('115 millions de dollars (estimation)')  # parse_price of price-parser
Price(amount=Decimal('115000000'), currency='dollars')
should we use both libraries in the application, or implement this in the backend:
>>> price1 = parse('115 millions de dollars (estimation)')  # parse of number-parser
>>> parse_price(price1)  # parse_price of price-parser
Price(amount=Decimal('115000000'), currency='dollars')
(and similarly for dateparser and number-parser)?
I was also thinking of taking this up as part of GSoC 2021, and would like to hear your views.
Hi @arnavkapoor!
I was trying the package, and I found a side effect when trying to parse sentences in Spanish.
>>> parse('Avisté tres y luego nos fuimos.', language='es')
'Avisté 3 luego nos fuimos.'
this would be approx. equivalent to:
>>> parse('I sighted three and then we left')
'I sighted 3 then we left'
As you can see, the "y" (or "and") disappears, but it should remain. We should probably avoid taking the "SKIP_TOKENS" when they are at the end of the number. What do you think?
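One way this could look, assuming number chunks are collected as token lists before being replaced (the SKIP_TOKENS contents and trim_trailing_skips are illustrative, the source only confirms that a SKIP_TOKENS set exists):

```python
SKIP_TOKENS = {"and", "y"}  # assumed connector words for en/es

def trim_trailing_skips(chunk):
    # Drop connector words from the end of a matched number chunk so they
    # are written back into the sentence instead of being swallowed.
    while chunk and chunk[-1] in SKIP_TOKENS:
        chunk = chunk[:-1]
    return chunk

print(trim_trailing_skips(["tres", "y"]))  # -> ['tres']
```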
In [11]: parse_number("мільйон") is None
Out[11]: True
In [12]: parse_number("міліон") is None
Out[12]: True
These are two spellings of the word "million" in Ukrainian.
Is it a problem that Ukrainian is not in supported languages? https://github.com/scrapinghub/number-parser/blob/master/number_parser/parser.py#L5
But I see Ukrainian here https://github.com/scrapinghub/number-parser/blob/master/number_parser/data/uk.py#L53
Thank you.
Hello!
After updating to 0.3.1, I can no longer import the library. This happens because the VERSION file is not included in the built distribution, and the call to pkgutil.get_data() raises an error in that case. I'm not sure why, though, as it's in your MANIFEST.in.
In [1]: import number_parser
------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
Cell In[1], line 1
----> 1 import number_parser
File /opt/homebrew/anaconda3/envs/venv/lib/python3.10/site-packages/number_parser/__init__.py:6
1 import pkgutil
3 from number_parser.parser import (parse, parse_fraction, parse_number,
4 parse_ordinal)
----> 6 __version__ = (pkgutil.get_data(__package__, "VERSION") or b"").decode("ascii").strip()
8 del pkgutil
File /opt/homebrew/anaconda3/envs/venv/lib/python3.10/pkgutil.py:639, in get_data(package, resource)
637 parts.insert(0, os.path.dirname(mod.__file__))
638 resource_name = os.path.join(*parts)
--> 639 return loader.get_data(resource_name)
File <frozen importlib._bootstrap_external>:1073, in get_data(self, path)
FileNotFoundError: [Errno 2] No such file or directory: '/opt/homebrew/anaconda3/envs/venv/lib/python3.10/site-packages/number_parser/VERSION'
I'm on Python 3.10.9, number-parser-0.3.1, MacOS 13.2.1.
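Until the packaging issue is fixed, one robust pattern is to fall back to the installed distribution's metadata instead of a data file (get_version and the "unknown" fallback are a sketch, not the library's actual code):

```python
from importlib.metadata import PackageNotFoundError, version

def get_version(dist_name: str) -> str:
    # Read the version from the installed distribution's metadata, which
    # exists even when package data files were left out of the wheel.
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return "unknown"

print(get_version("surely-not-installed-dist"))  # -> unknown
```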
I looked at German and French, and both have some idiosyncrasies which need to be handled. For French, using quatre-vingt for 80 and then allowing numbers from 1 to 19 as the suffix, e.g. quatre-vingt-dix-neuf for 99 (80 + 19), would need to be handled. With German, it's a more fundamental issue, as units come before tens: achtundzwanzig (28) is literally "eight-and-twenty". Can refer to this for more details about this building method; other languages use it too. This might be fixed by reversing the list of tokens (but I need to look more into it). Also, for larger numbers (greater than one thousand, I believe) German does revert to left-to-right.
While this only mentions two languages, there would definitely be such cases and exceptions in other languages too.
We should add tests for the "officially" supported languages, which are (at this moment): Hindi, Spanish, and Russian (English already has tests).
Please update the package on PyPI with the latest features.
Parsing a fraction is not available in the current number-parser release on PyPI:
from number_parser import parse_fraction
The following links:
Source code: https://github.com/arnavkapoor/number-parser
Issue tracker: https://github.com/arnavkapoor/number-parser/issues
are redirecting to:
https://github.com/scrapinghub/number-parser
https://github.com/scrapinghub/number-parser/issues
I was thinking of changing them to point directly to these links.
Opening a new ticket for adding support for decimal and negative numbers. More details are discussed here.
Expected outcomes
>>> parse("he scored fifty three point six percent in the finals")
'he scored 53.6 percent in the finals'
>>> parse_number("twelve point seven")
12.7
>>> parse_number("minus 9.4")
-9.4
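A self-contained sketch of how "point" and "minus" could be layered on top of integer parsing, using a tiny stand-in word table (WORDS and parse_decimal are illustrative; a real implementation would reuse parse_number for the integer parts):

```python
WORDS = {"twelve": 12, "seven": 7, "nine": 9}  # stand-in for the integer parser

def parse_decimal(text):
    # "minus" flips the sign; the words after "point" are read digit by
    # digit, while the word before it is the integer part.
    tokens = text.lower().split()
    sign = 1
    if tokens and tokens[0] == "minus":
        sign, tokens = -1, tokens[1:]
    if "point" in tokens:
        i = tokens.index("point")
        whole = WORDS[tokens[0]] if i else 0  # simplified: one-word integer part
        digits = "".join(str(WORDS[t]) for t in tokens[i + 1:])
        return sign * float(f"{whole}.{digits}")
    return sign * WORDS[tokens[0]]

print(parse_decimal("twelve point seven"))  # -> 12.7
```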
Some edge cases that should be fixed:
On the other hand, when having "5 hundred" I think it should return "500". What do you think?
Example:
>>> parse("one million")
'1000000'
>>> parse("1 million")
'1 1000000'
Please add a test with each of these strings when fixing it :)
On the other hand, I think there are some currently working cases that should be added to the tests:
Let me know if you have any doubt. 🙂
I think it could be really useful to have a function called parse_number() that expects a written number and returns the equivalent number (a Python number):
>>> parse_number("one")
1
>>> parse_number("twenty six")
26
It's easy to imagine use cases, and it will be useful to refactor the current code.
With the currently implemented logic, it shouldn't be difficult to create it by using the number_builder.
>>> number_builder(["twenty", "six"])
['26']
>>> number_builder(["one"])
['1']
We should decide what to do if we pass something else. I think we could just return None.
>>> parse_number("one second")
None
>>> parse_number("twenty six cars")
None
Of course, this will be multi-locale, and it would be useful to accept a locales or locale argument:
>>> parse_number("uno", locales=["en"])
None
>>> parse_number("uno", locales=["es"])
1
We should update the README with more details, like installation, example usage, etc.
Makes sense wherever it does not affect external APIs.
e.g. if a function uses OrderedDict internally, then yes; if a function returns OrderedDict, we probably should follow a deprecation approach, e.g. deprecate that function (but keep it for a time) and provide a new one that returns a dict.
Languages without delimiters: Japanese, Chinese (Simplified, Traditional), and possibly other East Asian languages don't have any delimiter, e.g. 九千九百九十九 (9999 in Japanese). These actually have a very similar structure to English, but the lack of a delimiter makes them tougher.
Also, there isn't a delimiter as such (up to a certain number) for German and Dutch.
One approach for the delimiter issue is reading words character by character, and as soon as we have a match with any known number word, we insert a space; after this pre-processing step we can follow the same logic. This does increase the complexity to O(string_length^2), which shouldn't be a major issue, I believe. (We could use this function only for languages without delimiters.)
Concrete example
five thousand nine hundred and thirteen - English (5913)
fünftausendneunhundertdreizehn - German (5913)
nine hundred and thirteen - English (913)
negenhonderddertien - Dutch (913)
To handle this we first check f, fü, fün and finally hit fünf = 5 (and similarly get negen = 9), insert a space, and then start again from the next character.
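This pre-processing step could be sketched as a segmentation pass over a token table (TOKENS here is a tiny hypothetical subset of the German data; a longest-match rule is used so that a prefix of a longer token doesn't match too early):

```python
# Hypothetical subset of German number tokens; the real data would come
# from the per-language data files.
TOKENS = {"fünf", "tausend", "neun", "hundert", "dreizehn"}

def segment(word):
    # Greedy longest-match segmentation: at each position, take the longest
    # known token and insert a boundary after it.
    parts, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in TOKENS:
                parts.append(word[i:j])
                i = j
                break
        else:
            return None  # no known token starts here; give up
    return parts

print(segment("fünftausendneunhundertdreizehn"))
```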
@arnavkapoor some examples of cardinal numbers in Russian, without any fancy stuff and in nominative case (the "main" one):
0,ноль
1,один
2,два
3,три
4,четыре
6,шесть
30,тридцать
40,сорок
90,девяносто
100,сто
200,двести
596,пятьсот девяносто шесть
25,двадцать пять
15,пятнадцать
44,сорок четыре
357,триста пятьдесят семь
891,восемьсот девяносто один
2000,две тысячи
2020,две тысячи двадцать
7004,семь тысяч четыре
21000000,двадцать один миллион
21000073,двадцать один миллион семьдесят три
Other cases:
1,одного
1,одному
1,одним
1,одном
2,двух
2,двум
2,двумя
3,трех
3,трем
3,тремя
4,четырех
4,четырем
4,четырьмя
6,шести
6,шестью
30,тридцати
30,тридцать
30,тридцатью
40,сорока
90,девяноста
100,ста
200,двухсот
200,двумстам
200,двумястами
200,двухстах
596,пятисот девяноста шести
596,пятистам девяноста шести
596,пятьсот девяносто шесть
596,пятьюстами девяноста шестью
596,пятистах девяноста шести
500,пятьюстами
1000,тысяче
256,двести пятьдесят шесть
256,двухсот пятидесяти шести
256,двумястами пятьюдесятью шестью
21000000,двадцать одним миллионом
I tried to not make any typos (mostly copying examples from various grammatical guides) but if something does not work please tell and I'll double-check.
The current variables/structure for the data files are as follows:
info = {
"UNIT_NUMBERS": { },
"BASE_NUMBERS": { },
"MTENS": { },
"MHUNDREDS": { },
"MULTIPLIERS": { },
"VALID_TOKENS": { }
}
More information on what each variable stores is here.
However, the variable names aren't very descriptive, and based on the discussion here, I plan to change them to the following.
info = {
"UNIT_NUMBERS": { },
"DIRECT_NUMBERS": { },
"TENS": { },
"HUNDREDS": { },
"BIG_POWERS_OF_TEN": { },
"SKIP_TOKENS": [ ]
}
The structural change is that SKIP_TOKENS is no longer a dictionary but a plain list; the rest are variable name changes.
Opinions?
To recreate this, run pytest on Windows. This error is taken from test_language_hi.py:
tests\test_language_hi.py:42 (test_parse_number_till_hundred)
def test_parse_number_till_hundred():
> _test_files(HUNDREDS_DIRECTORY, LANG)
test_language_hi.py:44:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
__init__.py:25: in _test_files
for row in csv_reader:
~\Python\Python39\lib\csv.py:110: in __next__
self.fieldnames
~\Python\Python39\lib\csv.py:97: in fieldnames
self._fieldnames = next(self.reader)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <encodings.cp1252.IncrementalDecoder object at 0x000001743C3E54F0>
input = b'number,text\n0,\xe0\xa4\xb6\xe0\xa5\x82\xe0\xa4\xa8\xe0\xa5\x8d\xe0\xa4\xaf\n1,\xe0\xa4\x8f\xe0\xa4
\x95\n2,\xe0\xa4\...x8d\xe0\xa4\xaf\xe0\xa4\xbe\xe0\xa4\xa8\xe0\xa4\xac\xe0\xa5\x87\n100,\xe0\xa4\x8f\xe0\xa4
\x95 \xe0\xa4\xb8\xe0\xa5\x8c'
final = False
def decode(self, input, final=False):
> return codecs.charmap_decode(input,self.errors,decoding_table)[0]
E UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 25: character maps to <undefined>
~\Python\Python39\lib\encodings\cp1252.py:23: UnicodeDecodeError
FAILED [100%]
tests\test_language_hi.py:46 (test_parse_number_permutations)
def test_parse_number_permutations():
> _test_files(PERMUTATION_DIRECTORY, LANG)
test_language_hi.py:48:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
__init__.py:25: in _test_files
for row in csv_reader:
~\Python\Python39\lib\csv.py:110: in __next__
self.fieldnames
~\Python\Python39\lib\csv.py:97: in fieldnames
self._fieldnames = next(self.reader)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <encodings.cp1252.IncrementalDecoder object at 0x000001743C477700>
input = b'number,text\r\n1234,\xe0\xa4\x8f\xe0\xa4\x95 \xe0\xa4\xb9\xe0\xa4\x9c\xe0\xa4\xbe\xe0\xa4
\xb0 \xe0\xa4\xa6\xe0\xa5\x...0\xa4\xb8\xe0\xa5\x8c \xe0\xa4\xaa\xe0\xa5\x88\xe0\xa4\x82\xe0\xa4\xa4
\xe0\xa4\xbe\xe0\xa4\xb2\xe0\xa5\x80\xe0\xa4\xb8'
final = False
def decode(self, input, final=False):
> return codecs.charmap_decode(input,self.errors,decoding_table)[0]
E UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 20: character maps to <undefined>
~\Python\Python39\lib\encodings\cp1252.py:23: UnicodeDecodeError
This only happens on Windows, not Linux. It can be solved by adding encoding='utf8' to the open() call on line 23 of __init__.py.
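A minimal reproduction of the fix (the filename is made up; the point is that omitting encoding="utf-8" makes open() use the platform default, which is cp1252 on Windows and cannot decode UTF-8 Devanagari bytes):

```python
import csv
import os
import tempfile

# Write a UTF-8 CSV containing Devanagari text, then read it back with an
# explicit encoding; relying on the platform default is what breaks on Windows.
path = os.path.join(tempfile.mkdtemp(), "hundreds.csv")
with open(path, "w", encoding="utf-8", newline="") as fp:
    fp.write("number,text\n0,शून्य\n1,एक\n")

with open(path, encoding="utf-8") as fp:
    rows = list(csv.DictReader(fp))

print(rows[0]["text"])  # -> शून्य
```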
In number-parser 0.3.0, Ukrainian "five" is not parsed with either apostrophe variant (ASCII ' vs typographic ’):
In [17]: parse("п'ять")
Out[17]: "п'ять"
In [18]: parse("п’ять")
Out[18]: 'п’ять'