Regular Expressions are used to find out a sequence of characters that define a search pattern. This repository contains artifacts to extract the valid email, url, US phone number, currency, UK postal code, US zipcode and date formats from the given input text file.
The python script Regex.py
contains the regular expression validations which when applied to the given input file will extract the valid patterns.
Disclaimer: The python script extracts all the possible words (in the input file) that match the regular expression pattern. It does not validate the extracted word. So, there can be instances where the same word could be matched against multiple regular expression patterns.
Run the python script by passing in the command line input arguments as below:
Regex.py input_file_name [arguments]
where Regex.py is the name of the python script file (provided)
input_file_name - the input text file from where the text matching the patterns has to be extracted from
[arguments] - specify one or more of the following arguments
-e
extracts all the valid email occurrences
-u
extracts all the valid url occurrences
-c
extracts all the currency value (USD, EUR, CAN) occurrences
-p
extracts all the US phone number occurrences
-k
extracts all the UK postal code occurrences
-d
extracts all the date occurrences
-z
extracts all the US zipcode occurrences
Here are the various formats supported for different types.
Note: The regular expressions used in the python script have a trailing space at the end when compared with the expressions provided here in the readme. The trailing space in the expression(s) is needed for python script so as to extract the words from a sentence. Also, for US Zipcode, additionally the dollar($) symbol is needed for the python script to extract the words from a sentence.
Regular Expression:
[a-zA-Z0-9][a-zA-Z0-9!#$%&\'*+/=.?^_`{|}~-]*@(?:[a-zA-Z](?:[a-zA-Z0-9-]*)?\.)+[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z])?
The email addresses are generally in the form local-part@domain.
The local-part of the email address may use any of these ASCII characters:
- Uppercase and lowercase letters
A
toZ
anda
toz
- Digits
0
to9
- Printable characters
!#$%&'*+-/=?^_`{|}~
The domain could be in the following formats:
- Uppercase and lowercase letters
A
toZ
anda
toz
- Digits
0
to9
, provided that top-level domain names are not all-numeric - Hyphen
-
provided that it is not the first or last character
Ref: https://en.wikipedia.org/wiki/Email_address
Regular Expression:
http[s]?://(?:[a-zA-Z0-9$-_@.&+!*\(\),]|(?:%[0-9a-zA-Z]))+
A typical URL could have the form http://www.example.com/index.html
, which indicates a protocol (http
), a hostname (www.example.com
), and a file name (index.html
). The regular expression is intended to match URL that starts with either http
or https
.
Ref: https://en.wikipedia.org/wiki/URL
Regular Expression:
[+]?[1\s-]*[\(-]?[0-9]{3}[-\)]?[\s-]?[0-9]{3}[\s-]?[0-9]{4}
The regular expression recognises the US phone number formats in the below patterns:
- (###) ###-####
- +(#)(###) ###-####
- (#)(###) ###-####
Ref: https://en.wikipedia.org/wiki/National_conventions_for_writing_telephone_numbers#North_America
Regular Expression:
(?:\$|can\$|C\$|€|USD|CAD|EUR|ATS|BEF|DEM|EEK|ESP|FIM|FRF|GRD|IEP|ITL|LUF|NLG|PTE|can)[\s\S]?\d{1,3}(?:,\d{3})*(?:\.\d{1,3})?
The regular expression validates only US Dollar (USD), Euro (EUR) and Canadian Dollar (CAD) currency formats as below: Considered the following currency symbols
- US currency format starting with
$
symbol. - Canadian currency formats starting with
can$
,CAD
,can
,C$
symbols. - Euro currency formats starting with
EUR
,ATS
,BEF
,DEM
,EEK
,ESP
,FIM
,FRF
,GRD
,IEP
,ITL
,LUF
,NLG
,PTE
,€
symbols.
Regular Expression:
[A-Z]{1,2}[0-9R][0-9A-Z]? [0-9][ABD-HJLNP-UW-Z]{2}
UK Postal code format is as follows, where A signifies a letter and 9 a digit:
Format | Coverage | Example |
---|---|---|
AA9A 9AA | WC postcode area; EC1–EC4, NW1W, SE1P, SW1 | EC1A 1BB |
A9A 9AA | E1W, N1C, N1P | W1A 0AX |
A9 9AA | B, E, G, L, M, N, S, W | M1 1AE |
A99 9AA | B, E, G, L, M, N, S, W | B33 8TH |
AA9 9AA | All other postcodes | CR2 6XH |
AA99 9AA | All other postcodes | DN55 1PT |
Ref: https://en.wikipedia.org/wiki/Postcodes_in_the_United_Kingdom
Regular Expression:
[0-9]{5}(-[0-9]{4})?
The regular expression matches US Zipcode that includes nine digits in the format ddddd-dddd
Ref: https://en.wikipedia.org/wiki/List_of_postal_codes
The regular expression validates the dates that appear in the following formats
Regular Expressions:
dd/mm/yyyy - \s?(?:0?[1-9]|[1,2][0-9]|3[0-1])(?:/)(?:0?[1-9]|1[0-2])(?:/)(?:\d{4})
mm/dd/yyyy - \s?(?:0?[1-9]|1[0-2])(?:/)(?:0?[1-9]|[1,2][0-9]|3[0-1])(?:/)(?:\d{4})
1st mon/month yyyy - \s?(?:0?[1-9]|[1,2][0-9]|3[0-1])(?:nd|rd|th|st)?(?:[\s|,|]?\s?)(?:[Jj]an(?:uary)?|[Ff]eb(?:ruary)?|[Mm]ar(?:ch)?|[Aa]pr(?:il)?|[Mm]ay|[Jj]une|[Jj]uly|[Aa]ug(?:ust)?|[Ss]ept(?:ember)?|[Oo]ct(?:ober)?|[Nn]ov(?:ember)?|[Dd]ec(?:ember)?)\s\d{4}
mon/month, 1, yyyy - \s?(?:[Jj]an(?:uary)?|[Ff]eb(?:ruary)?|[Mm]ar(?:ch)?|[Aa]pr(?:il)?|[Mm]ay|[Jj]une|[Jj]uly|[Aa]ug(?:ust)?|[Ss]ept(?:ember)?|[Oo]ct(?:ober)?|[Nn]ov(?:ember)?|[Dd]ec(?:ember)?)(?:[\s|,]?\s?)(?:0?[1-9]|[1,2][0-9]|3[0-1])(?:[\s|,]?\s?)\s\d{4}
Note - The above specified date format Regular Expression patterns works individually and also can be merged with pipe symbol to make it as single Regular Expression for ease of use.