By Anthony Nguyen
I wrote this while learning about string tokenization for NLP, as an application of the concepts I was studying.
The lexer stops if it does not find a token at the beginning of the data it is scanning.
import re

import pylexer

lexer = pylexer.Lexer([
    ("word", r"[a-z]+"),
    ("shortdate", r"\d{1,2}/\d{1,2}/\d{4}")
], re.I)

for token in lexer.scan("I was born on 01/01/1970", True):
    print(token)
Lexer.__init__ - takes two arguments: a list of tokens as ("name", "regex") tuples, and any regex flags
Lexer.addTokens - takes a list of tokens as ("name", "regex") tuples
Lexer.addFlags - takes any number of regex flags
Lexer.scan - an iterator that takes two arguments: the string to scan, and whether or not to ignore whitespace (optional, disabled by default). It returns objects of the Token class.
The Token class is basically a variable container. It holds the following information:
name - the token's name
rule - the regex used to match the token
data - the token's matched data
start - the start index (in the original string) of the token's data
end - the end index (in the original string) of the token's data
The Scanner class does the actual scanning work, but it should only ever need to be used through the Lexer class.
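The scanning behavior described above can be sketched with only the standard re module. This is a hypothetical stand-in, not pylexer's actual code: it assumes the scanner simply tries each rule at the current position, yields the first match as a Token, optionally skips whitespace, and stops when no rule matches.

```python
import re
from collections import namedtuple

# Hypothetical Token container with the same five fields the Token class holds.
Token = namedtuple("Token", ["name", "rule", "data", "start", "end"])

def scan(rules, text, ignore_whitespace=False, flags=0):
    """Yield a Token for the first rule matching at the current position;
    stop when no rule matches the beginning of the remaining data."""
    pos = 0
    while pos < len(text):
        if ignore_whitespace:
            ws = re.match(r"\s+", text[pos:])
            if ws:
                pos += ws.end()
                if pos >= len(text):
                    break
        for name, rule in rules:
            m = re.match(rule, text[pos:], flags)
            if m:
                yield Token(name, rule, m.group(), pos, pos + m.end())
                pos += m.end()
                break
        else:
            return  # no token at the start of the remaining data

rules = [("word", r"[a-z]+"), ("shortdate", r"\d{1,2}/\d{1,2}/\d{4}")]
tokens = list(scan(rules, "I was born on 01/01/1970", True, re.I))
```

With the example string from above, this yields word tokens for "I", "was", "born", and "on", then a shortdate token for "01/01/1970".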
MIT licensed. See LICENSE.