
content-analyzer-regex's Introduction

Samples-RegEx-Patterns

A regular expression is a sequence of characters that defines a search pattern. This repository contains artifacts to extract valid email addresses, URLs, US phone numbers, currency values, UK postal codes, US zipcodes and dates from a given input text file.

The python script Regex.py contains the regular expressions, which extract the matching patterns when applied to the given input file.

Disclaimer: The python script extracts every word in the input file that matches a regular expression pattern. It does not validate the extracted word, so the same word can match more than one regular expression pattern.
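
To illustrate the disclaimer, here is a minimal sketch in which a single token is picked up by two patterns. The phone and zipcode expressions are copied from the "Supported formats" section; the sample text and variable names are made up for illustration and are not taken from Regex.py.

```python
import re

# Patterns copied from the "Supported formats" section of this readme.
PHONE = re.compile(r"[+]?[1\s-]*[\(-]?[0-9]{3}[-\)]?[\s-]?[0-9]{3}[\s-]?[0-9]{4}")
ZIPCODE = re.compile(r"[0-9]{5}(-[0-9]{4})?")

# Illustrative input: one ten-digit token.
text = "Call 4155551234 today"

# group(0) is the whole match; strip() drops whitespace the phone
# pattern may consume before the number.
phone_hits = [m.group(0).strip() for m in PHONE.finditer(text)]
zip_hits = [m.group(0) for m in ZIPCODE.finditer(text)]

print(phone_hits)  # ['4155551234']
print(zip_hits)    # ['41555', '51234']
```

The same ten digits are reported once as a phone number and twice as five-digit zipcodes, so any validation has to happen after extraction.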

Usage

Run the python script by passing in the command line input arguments as below:

Regex.py input_file_name [arguments]
where Regex.py is the name of the python script file (provided)
input_file_name - the input text file from which the text matching the patterns is extracted
[arguments] - specify one or more of the following arguments
-e extracts all the valid email occurrences
-u extracts all the valid url occurrences
-c extracts all the currency value (USD, EUR, CAD) occurrences
-p extracts all the US phone number occurrences
-k extracts all the UK postal code occurrences
-d extracts all the date occurrences
-z extracts all the US zipcode occurrences
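
As a rough sketch of how a command line with these flags could be parsed, the snippet below uses argparse from the standard library. This is an assumption about structure only, not the actual contents of Regex.py; the FLAGS table, the labels and the function names are hypothetical.

```python
import argparse

# Hypothetical labels for the flags documented above.
FLAGS = {
    "-e": "email",
    "-u": "url",
    "-c": "currency",
    "-p": "phone",
    "-k": "uk_postcode",
    "-d": "date",
    "-z": "zipcode",
}

def build_parser():
    """Build a parser for: Regex.py input_file_name [arguments]."""
    parser = argparse.ArgumentParser(prog="Regex.py")
    parser.add_argument("input_file_name")
    for flag, label in FLAGS.items():
        # Each flag is a simple on/off switch.
        parser.add_argument(flag, dest=label, action="store_true",
                            help=f"extract {label} occurrences")
    return parser

def selected_types(argv):
    """Return the labels of the flags present in argv, in table order."""
    args = build_parser().parse_args(argv)
    return [label for label in FLAGS.values() if getattr(args, label)]

print(selected_types(["input.txt", "-e", "-z"]))  # ['email', 'zipcode']
```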

Supported formats

Here are the formats supported for each type.

Note: The regular expressions used in the python script end with a trailing space, which the expressions in this readme omit. The script needs the trailing space to extract words from a sentence. For the US zipcode, the script additionally needs the dollar ($) symbol.

Email

Regular Expression:
[a-zA-Z0-9][a-zA-Z0-9!#$%&\'*+/=.?^_`{|}~-]*@(?:[a-zA-Z](?:[a-zA-Z0-9-]*)?\.)+[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z])?

Email addresses generally take the form local-part@domain.

The local-part of the email address may use any of these ASCII characters:

  1. Uppercase and lowercase letters A to Z and a to z
  2. Digits 0 to 9
  3. Printable characters !#$%&'*+-/=?^_`{|}~
  4. Dot . provided that it is not the first character

The domain may use any of the following:

  1. Uppercase and lowercase letters A to Z and a to z
  2. Digits 0 to 9, provided that top-level domain names are not all-numeric
  3. Hyphen - provided that it is not the first or last character

Ref: https://en.wikipedia.org/wiki/Email_address
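
A minimal sketch of applying the email expression with Python's re module. The pattern is copied from above; the sample sentence and addresses are invented for illustration.

```python
import re

# Email pattern copied from this readme.
EMAIL = re.compile(
    r"[a-zA-Z0-9][a-zA-Z0-9!#$%&\'*+/=.?^_`{|}~-]*"
    r"@(?:[a-zA-Z](?:[a-zA-Z0-9-]*)?\.)+"
    r"[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z])?"
)

text = "Contact alice.smith@example.com or bob_99@mail.example.org."

# The pattern has no capturing groups, so findall returns whole matches.
emails = EMAIL.findall(text)
print(emails)  # ['alice.smith@example.com', 'bob_99@mail.example.org']
```

The full stop that ends the sentence is not part of the second match, because the pattern requires a letter or digit after every dot in the domain.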

URL

Regular Expression:
http[s]?://(?:[a-zA-Z0-9$-_@.&+!*\(\),]|(?:%[0-9a-zA-Z]))+

A typical URL has the form http://www.example.com/index.html, which indicates a protocol (http), a hostname (www.example.com), and a file name (index.html). The regular expression is intended to match URLs that start with either http or https.
Ref: https://en.wikipedia.org/wiki/URL
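
A short sketch of the URL expression in use. The pattern is copied from above; the sample text is illustrative.

```python
import re

# URL pattern copied from this readme.
URL = re.compile(r"http[s]?://(?:[a-zA-Z0-9$-_@.&+!*\(\),]|(?:%[0-9a-zA-Z]))+")

text = "See https://www.example.com/index.html and http://example.org/a?b=1 for details."

urls = URL.findall(text)
print(urls)  # ['https://www.example.com/index.html', 'http://example.org/a?b=1']
```

Inside the character class, $-_ is a range, which is why characters such as /, ? and = are accepted. A dot falls in that range too, so a URL written directly before a sentence-ending full stop would be extracted with the full stop attached.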

US Phone Numbers

Regular Expression:
[+]?[1\s-]*[\(-]?[0-9]{3}[-\)]?[\s-]?[0-9]{3}[\s-]?[0-9]{4}

The regular expression recognises US phone numbers written in the following patterns:

  • (###) ###-####
  • +(#)(###) ###-####
  • (#)(###) ###-####

Ref: https://en.wikipedia.org/wiki/National_conventions_for_writing_telephone_numbers#North_America
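
A minimal sketch of the phone expression in use. The pattern is copied from above; the sample numbers are invented. Because the [1\s-]* part can consume whitespace before the number, the sketch strips each match, which is an addition for readability and not something the readme prescribes.

```python
import re

# US phone pattern copied from this readme.
PHONE = re.compile(r"[+]?[1\s-]*[\(-]?[0-9]{3}[-\)]?[\s-]?[0-9]{3}[\s-]?[0-9]{4}")

text = "Office: (415) 555-2671, mobile: +1 (415) 555-0198."

raw = [m.group(0) for m in PHONE.finditer(text)]
phones = [hit.strip() for hit in raw]

print(raw)     # [' (415) 555-2671', '+1 (415) 555-0198']
print(phones)  # ['(415) 555-2671', '+1 (415) 555-0198']
```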

Currency

Regular Expression:
(?:\$|can\$|C\$|€|USD|CAD|EUR|ATS|BEF|DEM|EEK|ESP|FIM|FRF|GRD|IEP|ITL|LUF|NLG|PTE|can)[\s\S]?\d{1,3}(?:,\d{3})*(?:\.\d{1,3})?

The regular expression matches only US Dollar (USD), Euro (EUR) and Canadian Dollar (CAD) amounts. It considers the following currency symbols:

  • US currency formats starting with the $ or USD symbols.
  • Canadian currency formats starting with the can$, CAD, can or C$ symbols.
  • Euro currency formats starting with the €, EUR, ATS, BEF, DEM, EEK, ESP, FIM, FRF, GRD, IEP, ITL, LUF, NLG or PTE symbols.
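
A small sketch of the currency expression in use. The pattern is copied from above; the sample amounts are illustrative.

```python
import re

# Currency pattern copied from this readme.
CURRENCY = re.compile(
    r"(?:\$|can\$|C\$|€|USD|CAD|EUR|ATS|BEF|DEM|EEK|ESP|FIM|FRF|GRD|IEP|ITL|LUF|NLG|PTE|can)"
    r"[\s\S]?\d{1,3}(?:,\d{3})*(?:\.\d{1,3})?"
)

text = "Price: $1,299.99, shipping EUR 15 and C$20.50 tax."

amounts = CURRENCY.findall(text)
print(amounts)  # ['$1,299.99', 'EUR 15', 'C$20.50']
```

The [\s\S]? part allows any single character, not only a space, between the symbol and the digits.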

UK Postal Code

Regular Expression:
[A-Z]{1,2}[0-9R][0-9A-Z]? [0-9][ABD-HJLNP-UW-Z]{2}

UK Postal code format is as follows, where A signifies a letter and 9 a digit:

Format    Coverage                                      Example
AA9A 9AA  WC postcode area; EC1–EC4, NW1W, SE1P, SW1    EC1A 1BB
A9A 9AA   E1W, N1C, N1P                                 W1A 0AX
A9 9AA    B, E, G, L, M, N, S, W                        M1 1AE
A99 9AA   B, E, G, L, M, N, S, W                        B33 8TH
AA9 9AA   All other postcodes                           CR2 6XH
AA99 9AA  All other postcodes                           DN55 1PT

Ref: https://en.wikipedia.org/wiki/Postcodes_in_the_United_Kingdom
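
A minimal sketch of the UK postal code expression in use. The pattern is copied from above and the sample postcodes are the examples from the format table.

```python
import re

# UK postal code pattern copied from this readme.
UK_POSTCODE = re.compile(r"[A-Z]{1,2}[0-9R][0-9A-Z]? [0-9][ABD-HJLNP-UW-Z]{2}")

text = "Send it to EC1A 1BB or M1 1AE, not to DN55 1PT."

postcodes = UK_POSTCODE.findall(text)
print(postcodes)  # ['EC1A 1BB', 'M1 1AE', 'DN55 1PT']
```

The pattern is case-sensitive and expects exactly one space between the two halves, so "ec1a 1bb" or "EC1A1BB" would not be extracted.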

US Zipcode

Regular Expression:
[0-9]{5}(-[0-9]{4})?

The regular expression matches a five-digit US Zipcode (ddddd), optionally followed by a hyphen and four more digits to form the nine-digit format ddddd-dddd.
Ref: https://en.wikipedia.org/wiki/List_of_postal_codes
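
A short sketch of the zipcode expression in use. The pattern is copied from above; the sample text is illustrative. The expression contains a capturing group, so re.findall returns the group contents, not the whole match; the sketch uses finditer and group(0) to get the full zipcode.

```python
import re

# US zipcode pattern copied from this readme.
ZIP_PATTERN = r"[0-9]{5}(-[0-9]{4})?"

text = "Ship to 94105 or 10001-2345."

# findall reports only the capturing group (the optional "-dddd" part).
group_only = re.findall(ZIP_PATTERN, text)

# finditer with group(0) yields the complete match.
zipcodes = [m.group(0) for m in re.finditer(ZIP_PATTERN, text)]

print(group_only)  # ['', '-2345']
print(zipcodes)    # ['94105', '10001-2345']
```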

Date

The regular expressions match dates that appear in the following formats.

Regular Expressions:
dd/mm/yyyy - \s?(?:0?[1-9]|[12][0-9]|3[0-1])(?:/)(?:0?[1-9]|1[0-2])(?:/)(?:\d{4})

mm/dd/yyyy - \s?(?:0?[1-9]|1[0-2])(?:/)(?:0?[1-9]|[12][0-9]|3[0-1])(?:/)(?:\d{4})

1st mon/month yyyy - \s?(?:0?[1-9]|[12][0-9]|3[0-1])(?:nd|rd|th|st)?(?:[\s,]?\s?)(?:[Jj]an(?:uary)?|[Ff]eb(?:ruary)?|[Mm]ar(?:ch)?|[Aa]pr(?:il)?|[Mm]ay|[Jj]une|[Jj]uly|[Aa]ug(?:ust)?|[Ss]ept(?:ember)?|[Oo]ct(?:ober)?|[Nn]ov(?:ember)?|[Dd]ec(?:ember)?)\s\d{4}

mon/month, 1, yyyy - \s?(?:[Jj]an(?:uary)?|[Ff]eb(?:ruary)?|[Mm]ar(?:ch)?|[Aa]pr(?:il)?|[Mm]ay|[Jj]une|[Jj]uly|[Aa]ug(?:ust)?|[Ss]ept(?:ember)?|[Oo]ct(?:ober)?|[Nn]ov(?:ember)?|[Dd]ec(?:ember)?)(?:[\s,]?\s?)(?:0?[1-9]|[12][0-9]|3[0-1])(?:[\s,]?\s?)\s\d{4}

Note - The date regular expression patterns above work individually and can also be merged with the pipe symbol (|) into a single regular expression for ease of use.
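
The merge described in the note can be sketched as follows for the two slash-separated formats. The DAY and MONTH helper names are illustrative and not taken from Regex.py, and the day alternatives are written with the class [12], which is an assumption of this sketch: a class written as [1,2] would also accept a literal comma.

```python
import re

# Building blocks shared by both slash-separated formats (illustrative names).
DAY = r"(?:0?[1-9]|[12][0-9]|3[0-1])"
MONTH = r"(?:0?[1-9]|1[0-2])"

DD_MM_YYYY = rf"\s?{DAY}/{MONTH}/\d{{4}}"
MM_DD_YYYY = rf"\s?{MONTH}/{DAY}/\d{{4}}"

# Merge the individual patterns with the pipe symbol.
MERGED = re.compile(f"{DD_MM_YYYY}|{MM_DD_YYYY}")

text = "Due 25/12/2023 and 12/25/2023."

# strip() removes the optional leading whitespace matched by \s?.
dates = [m.group(0).strip() for m in MERGED.finditer(text)]
print(dates)  # ['25/12/2023', '12/25/2023']
```

A date such as 01/02/2023 satisfies both formats, so the merged expression extracts it but cannot tell which of the two readings was intended.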
