Comments (10)

madisonmay commented on May 18, 2024

Yup that's another feature that's on my agenda. Thanks for opening this issue -- it's something I need to address.

from commonregex.

talyssonoc commented on May 18, 2024

The port of CommonRegex to Java (CommonRegexJava) now supports multiple languages (although only English is implemented currently); maybe it could give you some ideas.

madisonmay commented on May 18, 2024

So I'm intending on adding support for international phone numbers formatted according to the E.123 specs: http://en.wikipedia.org/wiki/E.123. It appears there's significant variation country to country, though.

madisonmay commented on May 18, 2024

That last commit added nominal support for international phone numbers -- it's certainly not complete, but it's a move in the right direction. Any chance you could give me a few test cases that you would like to see supported, @jackhooper?

I'll take a look at the Java port, talyssonoc. Thanks!

jackhooper commented on May 18, 2024

@madisonmay Being Australian I'd obviously like to see Australian phone numbers supported. Our phone numbers begin with a two-digit area code (optional - and rarely used - if the caller and receiver are in the same area code), followed by an eight-digit phone number (from memory I'm pretty sure the first four represent the exchange, and the latter four represent the individual line). So it goes (XX) XXXX XXXX. The formatting varies - the most common formats would be XXXX XXXX, XXXXXXXX, and maybe XXXX-XXXX, though I've also seen/heard XXX XXX XX.

Australian mobile/cellular phone numbers are ten digits long, they start with 04, and are usually formatted 04XX XXX XXX, but I have seen other formattings.

There are other types of phone numbers, such as 1800 (1800 XXX XXX), 1300 (1300 XXX XXX), 13 (13 XX XX), 1902 (1902 XXX XXX), 1900 (1900 XXX XXX), as well as the rarely used 1802 (1802 XXX - also sometimes formatted 180 2XXX).
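The formats described above could be sketched as three regular expressions. This is only an illustration built from the descriptions in this comment, not anything CommonRegex ships; the names AU_MOBILE, AU_SPECIAL, AU_LANDLINE and find_au_numbers are hypothetical, and the rarer layouts (XXX XXX XX, 1802) are left out.

```python
import re

# Hypothetical patterns written from the formats described in this thread.
# Blocks may be separated by a single space or hyphen, or by nothing.

# Mobile: ten digits starting with 04, usually 04XX XXX XXX
AU_MOBILE = re.compile(r"(?<!\d)04\d{2}[\s-]?\d{3}[\s-]?\d{3}(?!\d)")

# 1800/1300/1900/1902 XXX XXX, or the six-digit 13 XX XX form
AU_SPECIAL = re.compile(
    r"(?<!\d)(?:(?:1800|1300|1900|1902)[\s-]?\d{3}[\s-]?\d{3}"
    r"|13[\s-]?\d{2}[\s-]?\d{2})(?!\d)"
)

# Landline: optional two-digit area code, then eight digits
AU_LANDLINE = re.compile(
    r"(?<!\d)(?:\(\d{2}\)\s?|\d{2}[\s-]?)?\d{4}[\s-]?\d{4}(?!\d)"
)


def find_au_numbers(text):
    """Return (kind, matched_text) pairs for every pattern that matches."""
    results = []
    for kind, pattern in (
        ("mobile", AU_MOBILE),
        ("special", AU_SPECIAL),
        ("landline", AU_LANDLINE),
    ):
        for match in pattern.finditer(text):
            results.append((kind, match.group()))
    return results
```

Because the landline pattern accepts an unbroken ten-digit run, a mobile number written without spaces would be reported by both patterns, which hints at how quickly looser formats start to overlap.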

As of right now, CommonRegex can detect the US international dialing code (1), so long as it isn't prefixed with a '+', which international calling codes often are. It doesn't seem to be working for other calling codes. It also doesn't yet recognize international access numbers, which are prefixed onto international numbers, as they are necessary for a telephone exchange to know that you're calling an overseas number. In Australia this code is 0011, in the US it's 011.
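A rough sketch of how a '+' or an access number in front of a calling code might be recognised is shown here. It assumes a one- to three-digit calling code followed by a space or hyphen; INTL_NUMBER and parse_international are hypothetical names, not part of CommonRegex.

```python
import re

# Hypothetical pattern: a '+' or an international access prefix (0011 in
# Australia, 011 in the US), then a 1-3 digit calling code, a separator,
# and the remaining digits. Not CommonRegex's own expression.
INTL_NUMBER = re.compile(
    r"(?<![\d+])"
    r"(?P<access>\+|0011[\s-]?|011[\s-]?)"
    r"(?P<code>\d{1,3})[\s-]"
    r"(?P<rest>\d[\d\s-]{4,}\d)"
    r"(?!\d)"
)


def parse_international(text):
    """Return (access, calling_code, rest) for the first match, or None."""
    match = INTL_NUMBER.search(text)
    if match is None:
        return None
    return (
        match.group("access").strip(),
        match.group("code"),
        match.group("rest"),
    )
```

Requiring a separator after the calling code is a simplification: without it the pattern cannot tell where a one-, two- or three-digit code ends.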

Last but certainly not least, it does not yet recognize three-digit numbers, such as those for emergency services (911 in the US, 999 in the UK, 000 in Australia, etc.).

Insomuch as I can tell, phone number formatting in the US is pretty standardized - all the numbers I've seen have a given number of digits per block, separated by hyphens. This isn't really the case here. In Australia the use of hyphens is much less common, and the use of spaces is much more popular. Additionally, here (as in many other countries), the number of digits per block is not quite as standardized. There is definitely a most common way of doing it, but some people don't follow it.

Given all of this, adding full international support is likely to be quite a task. Heck, even adding support for one other country, such as Australia, would likely be less than trivial. I certainly don't envy you in having to do so. At present, my Regex skills are rudimentary at best, so I can't help you out much with that side of things. I would be more than happy to test things out, though.

Happy coding,
JH

madisonmay commented on May 18, 2024

Thanks so much for the explanation -- I've got a bit of API design thinking to do before I make any serious changes. For the time being, I've added support for international calling codes -- although I haven't pushed that change to pip as of yet.

I'm afraid that adding support for all formats by default would be detrimental because of the increase in complexity and the increase in false positives. Not adding support for international formats is likewise a poor choice. However, I think I might have found a good middle ground. I'm toying with the idea of adding an initialization argument to the CommonRegex class that controls how strict the regular expressions used are. I could maintain two separate sets of regexes -- one designed for low false positive rates, and another designed for low false negative rates. The set of regexes with low false negative rates could use a much less strict phone number regex to ensure that all phone numbers are captured.
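To make the idea concrete, here is a minimal sketch of two pattern sets selected by an initialization argument. Everything in it is an assumption for illustration: the class name PhoneFinder, the strict argument, and both regular expressions are made up and are not the ones in CommonRegex.

```python
import re

# Hypothetical pattern sets; neither is taken from CommonRegex itself.
STRICT_PATTERNS = {
    # US-style layouts only: 555-123-4567, (555) 123-4567, 555.123.4567
    "phones": re.compile(
        r"(?<!\d)(?:\(\d{3}\)\s?|\d{3}[-.\s])\d{3}[-.\s]\d{4}(?!\d)"
    ),
}
LOOSE_PATTERNS = {
    # Any run of seven or more digits, allowing single spaces, dots or
    # hyphens between digits and an optional leading '+'
    "phones": re.compile(r"(?<![\d+])\+?\d(?:[\s.-]?\d){6,}(?!\d)"),
}


class PhoneFinder:
    """Toy stand-in for a class with a strictness switch (hypothetical)."""

    def __init__(self, text, strict=True):
        self.text = text
        # strict=True favours few false positives; strict=False favours
        # few false negatives
        self.patterns = STRICT_PATTERNS if strict else LOOSE_PATTERNS

    def phones(self):
        return [m.group() for m in self.patterns["phones"].finditer(self.text)]
```

With the text "555-123-4567, 0412 345 678, order 12345678", the strict set would return only the first number, while the loose set would return all three, including the order number, which is exactly the false-positive cost described above.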

What are your thoughts, JH?

Thanks again,

Madison

jackhooper commented on May 18, 2024

@madisonmay That sounds like a reasonable way of doing it. You're more than welcome for the explanation, too - any excuse for me to waffle on about something ;-)

Kind regards,
JH

jackhooper commented on May 18, 2024

It occurred to me that there are a couple of other types of phone numbers which don't currently work. Both are of the (partially) non-numeric variety.

  1. Numbers with * and/or # in them. In Australia, there is a number *10# (pronounced "star-ten-hash"), for example.
  2. Numbers with words in them. A US example might be 1-800-PHONE-THX (the only reason I know that one is because it appears towards the end of the Star Wars credits).

Supporting these types of numbers would almost certainly increase the chance of false positives, so your proposed approach of having two sets of regular expressions - one for low false positives, the other for low false negatives - is looking very good right now.
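As a rough illustration of what the looser set might contain for these two cases, the sketch below uses two hypothetical patterns, STAR_HASH and VANITY. They are guesses based only on the examples above (*10# and 1-800-PHONE-THX), and the vanity pattern assumes upper-case letters and hyphen separators.

```python
import re

# Hypothetical "loose" patterns for the two partially non-numeric cases.

# A run of digits, '*' and '#' that contains at least one digit and at
# least one '*' or '#', e.g. *10#
STAR_HASH = re.compile(
    r"(?<![\w*#])(?=\d*[*#])(?=[*#]*\d)[\d*#]+(?![\w*#])"
)

# 1-NNN- followed by hyphen-separated blocks of upper-case letters or
# digits, with at least one letter somewhere, e.g. 1-800-PHONE-THX
VANITY = re.compile(
    r"(?<![\w-])1-\d{3}-(?=[A-Z0-9-]*[A-Z])[A-Z0-9]+(?:-[A-Z0-9]+)*(?![\w-])"
)


def find_non_numeric(text):
    """Return every star/hash code and vanity number found in text."""
    found = [m.group() for m in STAR_HASH.finditer(text)]
    found += [m.group() for m in VANITY.finditer(text)]
    return found
```

The lookaheads are what keep plain digit runs and ordinary hyphenated numbers out of these patterns; dropping them would widen the match and raise the false-positive rate further.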

JH

madisonmay commented on May 18, 2024

Hi @jackhooper.

I've been trying to think how I should expose the two different sets of regular expressions through the commonregex API. Currently, from commonregex import email gets you the compiled regular expression to manipulate to your heart's content. What do we call the second set of regular expressions that is publicly exposed? Also, should there be a flag that switches all regular expressions from low false positives to low false negatives, or should that be handled on a case by case basis (each regular expression has a different setting)?

Let me know your thoughts.

jackhooper commented on May 18, 2024

@madisonmay I've thought about these questions, and I'm afraid I don't really have any definitive answers for you.

Certainly, on the matter of the first question, I honestly have no idea.

I'm probably closer to having an answer on the second one, though: perhaps there could be an optional parameter (which would default to False) when you initialise the CommonRegex class that switches all regular expressions in that instance to favour low false negatives, plus an optional parameter in each method to override that default? So, if you have an instance of CommonRegex which is set to low false negatives, there would be an optional parameter in each method to return low false positives instead, and vice versa if the CommonRegex object is set to low false positives.
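That suggestion could look something like the following sketch, in which an instance-wide default is overridden per call. The class name RegexSet, the low_false_negatives parameter and both patterns are hypothetical and only stand in for whatever CommonRegex would actually use.

```python
import re

# Hypothetical patterns, for illustration only.
STRICT_PHONE = re.compile(r"(?<!\d)\d{3}-\d{3}-\d{4}(?!\d)")
LOOSE_PHONE = re.compile(r"(?<!\d)\d(?:[\s-]?\d){6,}(?!\d)")


class RegexSet:
    """Instance-wide default with an optional per-method override."""

    def __init__(self, text, low_false_negatives=False):
        self.text = text
        self.low_false_negatives = low_false_negatives

    def phones(self, low_false_negatives=None):
        # None means "use the instance default"; True or False overrides it
        if low_false_negatives is None:
            low_false_negatives = self.low_false_negatives
        pattern = LOOSE_PHONE if low_false_negatives else STRICT_PHONE
        return [m.group() for m in pattern.finditer(self.text)]
```

Using None as the method default is one way to distinguish "not specified" from an explicit False, so an instance set to low false negatives can still be asked for the strict result on a single call.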
