price-parser

price-parser is a small library for extracting price and currency from raw text strings.

Features:

  • robust price amount and currency symbol extraction
  • zero-effort handling of thousand and decimal separators

The main use case is parsing prices extracted from web pages. For example, you can write a CSS/XPath selector which targets an element with a price, and then use this library for cleaning it up, instead of writing custom site-specific regex or Python code.

License is BSD 3-clause.

Installation

pip install price-parser

price-parser requires Python 3.6+.

Usage

Basic usage

>>> from price_parser import Price
>>> price = Price.fromstring("22,90 €")
>>> price
Price(amount=Decimal('22.90'), currency='€')
>>> price.amount        # numeric price amount
Decimal('22.90')
>>> price.currency      # currency symbol, as appears in the string
'€'
>>> price.amount_text   # price amount, as appears in the string
'22,90'
>>> price.amount_float  # price amount as float, not Decimal
22.9

If you prefer, price_parser.parse_price is an alias for Price.fromstring; the two do the same thing:

>>> from price_parser import parse_price
>>> parse_price("22,90 €")
Price(amount=Decimal('22.90'), currency='€')

The library has extensive tests (900+ real-world examples of price strings). Some of the supported cases are described below.

Supported cases

Unclean price strings with various currencies are supported; thousand separators and decimal separators are handled:

>>> Price.fromstring("Price: $119.00")
Price(amount=Decimal('119.00'), currency='$')

>>> Price.fromstring("15 130 Р")
Price(amount=Decimal('15130'), currency='Р')

>>> Price.fromstring("151,200 تومان")
Price(amount=Decimal('151200'), currency='تومان')

>>> Price.fromstring("Rp 1.550.000")
Price(amount=Decimal('1550000'), currency='Rp')

>>> Price.fromstring("Běžná cena 75 990,00 Kč")
Price(amount=Decimal('75990.00'), currency='Kč')

The euro sign is sometimes used as a decimal separator in the wild:

>>> Price.fromstring("1,235€ 99")
Price(amount=Decimal('1235.99'), currency='€')

>>> Price.fromstring("99 € 95 €")
Price(amount=Decimal('99'), currency='€')

>>> Price.fromstring("35€ 999")
Price(amount=Decimal('35'), currency='€')

Some special cases are handled:

>>> Price.fromstring("Free")
Price(amount=Decimal('0'), currency=None)

When the price or the currency can't be extracted, the corresponding attribute is set to None:

>>> Price.fromstring("")
Price(amount=None, currency=None)

>>> Price.fromstring("Foo")
Price(amount=None, currency=None)

>>> Price.fromstring("50% OFF")
Price(amount=None, currency=None)

>>> Price.fromstring("50")
Price(amount=Decimal('50'), currency=None)

>>> Price.fromstring("R$")
Price(amount=None, currency='R$')

Currency hints

The currency_hint argument lets you pass a text string which may (or may not) contain currency information. This feature is most useful for automated price extraction.

>>> Price.fromstring("34.99", currency_hint="руб. (шт)")
Price(amount=Decimal('34.99'), currency='руб.')

Note that a currency mentioned in the main price string may be preferred over the currency specified in the currency_hint argument; it depends on the currency symbols found there. If you know the correct currency, you can set it directly:

>>> price = Price.fromstring("1 000")
>>> price.currency = 'EUR'
>>> price
Price(amount=Decimal('1000'), currency='EUR')

Decimal separator

If you know which symbol is used as a decimal separator in the input string, pass that symbol in the decimal_separator argument to prevent price-parser from guessing the wrong decimal separator symbol.

>>> Price.fromstring("Price: $140.600", decimal_separator=".")
Price(amount=Decimal('140.600'), currency='$')

>>> Price.fromstring("Price: $140.600", decimal_separator=",")
Price(amount=Decimal('140600'), currency='$')
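
To see why the separator matters, the standard-library sketch below mimics its effect on a bare number string. to_decimal is a hypothetical helper written for illustration; it is not part of price-parser and it ignores currency text entirely.

```python
from decimal import Decimal

ASCII_DIGITS = "0123456789"


def to_decimal(number_text: str, decimal_separator: str) -> Decimal:
    """Interpret number_text, treating only decimal_separator as the decimal point.

    Hypothetical helper for illustration; not price-parser's implementation.
    """
    cleaned = []
    for ch in number_text:
        if ch in ASCII_DIGITS:
            cleaned.append(ch)
        elif ch == decimal_separator:
            cleaned.append(".")  # normalize the chosen separator
        # every other character (e.g. a thousands separator) is dropped
    return Decimal("".join(cleaned))


print(to_decimal("140.600", "."))  # 140.600
print(to_decimal("140.600", ","))  # 140600
```

The same input yields two different amounts, which is the ambiguity the argument is meant to resolve.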

Contributing

Use tox to run tests with different Python versions:

tox

The command above also runs type checks; we use mypy.

price-parser's People

Contributors

choryuidentify, emarondan, gallaecio, kmike, lopuhin, manycoding, noviluni, rpalsaxena, starrify, uttam1998, wintercomes

price-parser's Issues

Support parsing Chinese currency text

Example:

from price_parser import Price

# ok
price = Price.fromstring("¥36,000")
print(price)
# Price(amount=Decimal('36000'), currency='¥')

# not ok
price = Price.fromstring("36元人民币")
print(price)
# Price(amount=Decimal('36'), currency=None)

Parsing wrong price in presence of $

Hi,

I came across this example:

In [14]: Price.fromstring("2 pairs from $349", decimal_separator=".")
Out[14]: Price(amount=Decimal('2'), currency='$')

Given that the currency is extracted as $, would it make sense to handle this case inside price-parser?

Support for ISO 4217 currency codes

First of all: Nice library, thanks for creating it.

For converting between major currencies it would be nice to have the ISO 4217 code of the parsed price (EUR, USD, AUD, ...) as this is easier for handling exchange rates.

Is there any plan to support that?
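
Until something like this exists in the library, a lookup table applied to the extracted currency string is one possible workaround. The sketch below is an assumption-laden illustration: SYMBOL_TO_ISO and to_iso_code are hypothetical names, the table is deliberately tiny, and ambiguous symbols such as "$" or "¥" are given arbitrary defaults.

```python
import re
from typing import Optional

# Illustrative table only. Symbols such as "$" or "¥" are shared by several
# currencies, so the defaults below are assumptions to adjust per data source.
SYMBOL_TO_ISO = {
    "€": "EUR",
    "$": "USD",   # could also be AUD, CAD, ...
    "£": "GBP",
    "¥": "JPY",   # could also be CNY
    "R$": "BRL",
    "Kč": "CZK",
    "Rp": "IDR",
}


def to_iso_code(currency: Optional[str]) -> Optional[str]:
    """Map an extracted currency string to an ISO 4217 code, or None if unknown."""
    if currency is None:
        return None
    text = currency.strip()
    if re.fullmatch(r"[A-Z]{3}", text):
        return text  # already shaped like an ISO code, e.g. "EUR"
    return SYMBOL_TO_ISO.get(text)


print(to_iso_code("€"))    # EUR
print(to_iso_code("AUD"))  # AUD
```

The function would be called on the currency attribute of a parsed price; unknown symbols map to None rather than a guess.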

"." incorrectly interpreted as a thousands separator

from price_parser import Price
print(Price.fromstring('3.350').amount_float)
print(Price.fromstring('3.3500').amount_float)
print(Price.fromstring('3.35').amount_float)
print(Price.fromstring('3.355').amount_float)

Results:

3350.0
3.35
3.35
3355.0
  • python 3.8.6
  • price-parser 0.3.4

Get 3 letter currency name

I have an example:

price = Price.fromstring("122$")

Is it possible to get USD as the currency instead of $?

Wrong currency extracted in case of long strings containing "$"

These examples:

Price.fromstring('180', currency_hint='$${product.price.currency}')
Price.fromstring('180', currency_hint='qwerty $')
Price.fromstring('180', currency_hint='qwerty $ asdfg')

All of them return Price(amount=Decimal('180'), currency='$').

From what I understand in the code, this happens in _search_safe_currency:

_search_unsafe_currency('asd $ qwe')
# <re.Match object; match='$'>

Would it be in the scope of the library to match the currency more strictly?
This would cover the use-case when we are pretty sure that the currency hint is exact, not fuzzy.

Can not find currency from string

When I run:
Price.fromstring("Today I buy 3 coats with 300.000 VND")
the library returns:
Price(amount=Decimal('3'), currency='VND')
but the result I want is: Price(amount=Decimal('300.000'), currency='VND')

Feature update to convert from one currency to another

Currently the price-parser library can only extract prices and store them with their currency tag. I would like to propose the ability to convert a price from one currency to another. If this proposal is approved, I would start working on the idea.

Update supported Python versions

The lists of supported Python versions are not up to date and are also different: I see the tests run on 3.6-3.10, the setup.py metadata declares 3.6-3.8 support, and the currently supported versions should be 3.8-3.11.

Unable to understand the context

I tried to test the library on this:

from price_parser import parse_price
parse_price("The price of 2000 plates is $13,004")

Output:

Price(amount=Decimal('2000'), currency='$')

It doesn't take the context into account. I have gone through the code in parser.py; it uses regular expressions to extract prices as well as the currency symbol.
Could you please share the vision behind this library?

  1. Was it meant to process a database column that contains prices, like the one in the attached image? That way it can resolve minor human errors and help data scientists during preprocessing.
  2. Or was the plan to create a library that can extract prices from ordinary text, such as "The price of 2000 plates is $13,004"?

Returning currency without price

Does it make sense to return a Price instance with the currency set and without a price value?

Currently, if a string contains a substring that matches an existing currency, that currency is returned even if the string does not actually mention it:

In [1]: Price.fromstring("STRING WITH NO PRICE BUT HAS A WORD THAT CONTAINS PART OF A CURRENCY NAME: EUROPE")
Out[1]: Price(amount=None, currency='EUR')

This happens because we use regexes to match the currencies inside a string. This approach would become a problem once single-letter currencies are handled better (#3).

I can see two options to handle this scenario:

  1. Consider as currency only if the substring is surrounded by whitespaces:
In [2]: Price.fromstring("SOMETHING EUROPE SOMETHING")
Out[2]: Price(amount=None, currency=None)

In [3]: Price.fromstring("SOMETHING EUR SOMETHING")
Out[3]: Price(amount=None, currency='EUR')
  2. Consider as currency only if the entire string matches exactly with the currency name:
In [4]: Price.fromstring("EUR")
Out[4]: Price(amount=None, currency='EUR')

In [5]: Price.fromstring("SOMETHING EUR SOMETHING")
Out[5]: Price(amount=None, currency=None)
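
A variant of the first option can be sketched with regex lookarounds that reject a code when a letter touches it on either side. The currency list and the find_currency helper are assumptions for illustration, not the library's internals.

```python
import re
from typing import Optional

# Assumed, illustrative list of currency codes.
CURRENCY_CODES = ["EUR", "USD", "GBP"]

# [^\W\d_] matches a single letter, so the lookarounds reject a code that is
# glued to a letter ("EUROPE") while still accepting digits next to it ("10EUR").
_STANDALONE_CURRENCY = re.compile(
    r"(?<![^\W\d_])(?:"
    + "|".join(re.escape(code) for code in CURRENCY_CODES)
    + r")(?![^\W\d_])"
)


def find_currency(text: str) -> Optional[str]:
    """Return the first standalone currency code in text, or None."""
    match = _STANDALONE_CURRENCY.search(text)
    return match.group(0) if match else None


print(find_currency("SOMETHING EUROPE SOMETHING"))  # None
print(find_currency("SOMETHING EUR SOMETHING"))     # EUR
```

This is stricter than plain substring matching but looser than the second option, since the code may still appear anywhere in the string.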

API does not handle negative values

The API does not handle negative values:

from price_parser import Price
def test_price_parser():
    value = Price.fromstring("R$ 1,00")
    assert value.amount_float == 1.0

    value = Price.fromstring("R$ -2,00")
    assert value.amount_float == -2.0 # Fails
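
As a stopgap, the sign could be restored from the raw string after parsing. The sketch below assumes the parser yields the magnitude without the sign; apply_sign is a hypothetical helper, and the regex is a deliberately narrow heuristic.

```python
import re
from decimal import Decimal
from typing import Optional

# A minus sign (ASCII "-" or U+2212) that does not follow a letter or digit and
# is immediately followed by a digit. "10-20" and "Now - 2 Litre" do not match.
_NEGATIVE = re.compile(r"(?<!\w)[-\u2212]\d")


def apply_sign(raw_text: str, amount: Optional[Decimal]) -> Optional[Decimal]:
    """Negate amount when raw_text contains a number written with a leading minus."""
    if amount is None:
        return None
    if _NEGATIVE.search(raw_text):
        return -abs(amount)
    return amount


print(apply_sign("R$ -2,00", Decimal("2.00")))  # -2.00
print(apply_sign("R$ 1,00", Decimal("1.00")))   # 1.00
```

A string such as "-R$ 2,00", where the currency sits between the sign and the digits, is not covered by this pattern.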

Allow overriding the decimal separator

Sometimes values like 140.000 can mean 140, not 140K; this happens when e.g. website authors put price in semantic markup with 3 digit precision. It looks like the only way to fix it is to allow customizing price extraction; e.g. it can be fixed by having "decimal_separator_hint" argument for price parsing functions, which one can pass for these "bad" websites.

Support numbers expressed in words in different languages

Some expectations:

>>> parse_price('$ 4 millones')
Price(amount=Decimal('4000000'), currency='$')
>>> parse_price('$ 400 mil')
Price(amount=Decimal('400000'), currency='$')
>>> parse_price('1.45 milliards INR')
Price(amount=Decimal('1450000000'), currency='INR')
>>> parse_price('115 millions de dollars (estimation)')
Price(amount=Decimal('115000000'), currency='dollars')
>>> parse_price('Drei 000 000 $')
Price(amount=Decimal('3000000'), currency='$')
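
One way to approach the simpler cases is a table of multiplier words applied to the number that precedes them. This is only a sketch: the MULTIPLIERS table and the scaled_amount helper are hypothetical, the number is assumed to use "." as its decimal separator, and spelled-out numbers such as "Drei" are out of scope.

```python
import re
from decimal import Decimal
from typing import Optional

# Assumed, illustrative multipliers; a real table would be per-language and larger.
MULTIPLIERS = {
    "mil": 10**3,        # Spanish: thousand
    "millones": 10**6,   # Spanish: millions
    "millions": 10**6,   # French / English
    "milliards": 10**9,  # French: billions
}

# A number, optional whitespace, then one of the multiplier words as a whole word.
_SCALED = re.compile(
    r"(\d+(?:\.\d+)?)\s*("
    + "|".join(sorted(MULTIPLIERS, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def scaled_amount(text: str) -> Optional[Decimal]:
    """Return number * multiplier for the first 'number word' pair, else None."""
    match = _SCALED.search(text)
    if match is None:
        return None
    number, word = match.groups()
    return Decimal(number) * MULTIPLIERS[word.lower()]


print(scaled_amount("$ 4 millones"))  # 4000000
print(scaled_amount("$ 400 mil"))     # 400000
```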

MyPy error thrown when importing using from

When type checking with MyPy, an error is thrown when using from price_parser import parse_price, namely:

Module "price_parser" does not explicitly export attribute "parse_price"; implicit reexport disabled

I suspect this is due to the entities not being defined in __all__

Unable to parse values with scientific notation

First of all, thank you for creating this library. It scales very well and is nice overall.

I came across this issue when a currency value uses E notation.
Example:

from price_parser import Price
Price.fromstring("3.891506499E8")
> Price(amount=Decimal('3.891506499'), currency=None)
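
Python's Decimal already accepts E notation, so one possible workaround is to detect such a value before handing the string to the parser. The helper name parse_e_notation is an assumption made for this sketch.

```python
import re
from decimal import Decimal
from typing import Optional

# Digits, an optional fractional part, then an exponent such as E8, e-3 or E+2.
_E_NOTATION = re.compile(r"\d+(?:\.\d+)?[eE][+-]?\d+")


def parse_e_notation(text: str) -> Optional[Decimal]:
    """Return the first E-notation number in text as a Decimal, else None."""
    match = _E_NOTATION.search(text)
    return Decimal(match.group(0)) if match else None


value = parse_e_notation("3.891506499E8")
print(value == Decimal("389150649.9"))  # True
```

Strings without an exponent return None, so the regular parsing path can be used as the fallback.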

Don't parse dates as prices

Dates like July, 2004 or 15.08.2017 should not be parsed as prices; we should detect them and return amount=None, currency=None.
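
A pre-check along these lines could look like the sketch below. It covers only the two shapes mentioned here (day.month.year and an English month name followed by a year); looks_like_date is a hypothetical helper, not existing library code.

```python
import re

# Numeric dates such as 15.08.2017, 15/08/2017 or 15-08-2017.
_NUMERIC_DATE = re.compile(r"\b\d{1,2}[./-]\d{1,2}[./-]\d{4}\b")

# English month name followed by a four-digit year, e.g. "July, 2004".
_MONTH_YEAR = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|September|"
    r"October|November|December),?\s+\d{4}\b",
    re.IGNORECASE,
)


def looks_like_date(text: str) -> bool:
    """Return True when text contains one of the two supported date shapes."""
    return bool(_NUMERIC_DATE.search(text) or _MONTH_YEAR.search(text))


print(looks_like_date("July, 2004"))    # True
print(looks_like_date("15.08.2017"))    # True
print(looks_like_date("Rp 1.550.000"))  # False
```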

Prefer price number which is close to currency symbol

Currently amount of items may be picked as a price, instead of the price number:

In [5]: Price.fromstring('3 Ausgaben für nur 14,85 EUR')
Out[5]: Price(amount=Decimal('3'), currency='EUR')

In [6]: Price.fromstring('Buy Now - 2 Litre Was $120.00 Now $60.00')
Out[6]: Price(amount=Decimal('2'), currency='$')

I think making price-parser prefer numbers close to the currency symbol (instead of taking the first one) may help to fix such cases; maybe not all of them, but at least some.
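
The idea can be sketched as picking the number with the smallest character gap to any currency match. The patterns and the closest_number helper below are simplified assumptions, not the library's regexes.

```python
import re
from typing import Optional

# Simplified assumptions: digit groups joined by "." or ",", and a currency that
# is either a common symbol or a standalone three-letter uppercase code.
_NUMBER = re.compile(r"\d+(?:[.,]\d+)*")
_CURRENCY = re.compile(r"[$€£¥]|\b[A-Z]{3}\b")


def closest_number(text: str) -> Optional[str]:
    """Return the number text nearest to a currency match (first one on ties)."""
    numbers = list(_NUMBER.finditer(text))
    currencies = list(_CURRENCY.finditer(text))
    if not numbers:
        return None
    if not currencies:
        return numbers[0].group(0)  # nothing to anchor on; keep the first number

    def gap(number):
        # For non-overlapping spans exactly one difference is non-negative:
        # the count of characters between the number and the currency.
        return min(
            max(number.start() - cur.end(), cur.start() - number.end())
            for cur in currencies
        )

    return min(numbers, key=gap).group(0)


print(closest_number("3 Ausgaben für nur 14,85 EUR"))  # 14,85
print(closest_number("2 pairs from $349"))             # 349
```

For the "Was $120.00 Now $60.00" string both prices touch a "$", so this sketch returns the first one, 120.00; choosing between old and new prices needs extra logic.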

Improve handling of single-letter currencies

Some of the single-letter currencies are not picked up:

In [7]: Price.fromstring('R273.00')
Out[7]: Price(amount=Decimal('273.00'), currency=None)

In [9]: Price.fromstring('61.858 L')  # Romanian New Leu
Out[9]: Price(amount=Decimal('61858'), currency=None)

We should support them. The main issue is that they're ambiguous; we should be careful not to pick e.g. L or R letter from the extra text appearing in price string passed.

Problem with CHF and '

I noticed that for the Swiss Franc (CHF) for values above 1,000, the notation CHF 1'000 is used.

parse_price("CHF 1'200.20")
Out[27]: Price(amount=Decimal('1'), currency='CHF')
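
A possible workaround is to strip apostrophes that act as thousands separators before parsing. The helper below is a hypothetical pre-processing step, not part of price-parser; it only removes an apostrophe that sits between a digit and a group of exactly three digits.

```python
import re

# ASCII or typographic apostrophe, preceded by a digit and followed by exactly
# three digits (so "5'11" is left alone).
_APOSTROPHE_GROUP = re.compile(r"(?<=\d)['\u2019](?=\d{3}(?!\d))")


def strip_apostrophe_separators(text: str) -> str:
    """Remove apostrophes used as thousands separators."""
    return _APOSTROPHE_GROUP.sub("", text)


print(strip_apostrophe_separators("CHF 1'200.20"))   # CHF 1200.20
print(strip_apostrophe_separators("CHF 1'000'000"))  # CHF 1000000
```

The cleaned string can then be passed to parse_price in place of the original.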

It can't parse Asian text currency like '원', '円'.

Korean and Japanese each have a dedicated currency symbol, '₩' and '¥' respectively.
Both countries also use a text character to denote their currency:
Korean uses '원' and Japanese uses '円'.
The parser can't parse these characters.

Thank you.
