The elusive Python sub-string extractor

#!/usr/bin/env python
r'''
my_regex.py

functions to perform regular expression tasks

The monkey-see/monkey-do method (imitate, then figure out how to extend and vary)

1. Drive yourself nuts trying to get from the manual to being able to do a simple sub-string extraction
2. Cobble together some examples from Google hits
3. Tinker, tinker, tinker
4. Find something that works and memorialize it as a function
5. Only after you can use the magic can you start understanding it
6. re.compile is magic
7. pattern = re.compile(r'(January|February|March|April|May|June|July|August|September|October|November|December)\s\d+,\s\d+') translates: create an object (don't ask) that will take a string as an argument and match January or February or ... or December followed by one whitespace character followed by one or more digits followed by a comma followed by one whitespace character followed by one or more digits
8. Given the pattern object, call one of its methods
9. The datestract function uses the search method, which finds the first match only; the group() method of the resulting match object returns the matched text
10. The datelist function uses the finditer method, which yields one match object per match and simply stops when the matches run out (it never yields None)

datestract
    extract the first date sub-string where the date is in the form January 1, 2007
    returns None if no dates found
    example
        dreck = "Pooling and Servicing Agreement, dated as of October 1, 2007"
        print(datestract(dreck))
datelist
    print ALL date sub-strings where the date is in the form January 1, 2007
    prints nothing if no dates found, otherwise iterates over the matches until they are exhausted
    the function itself always returns None
    example
        dreck = "Pooling and Servicing Agreement, dated as of October 1, 2007 and MLPA dated September 30, 2005"
        datelist(dreck)
'''
import re

dreck = "Pooling and Servicing Agreement, dated as of October 1, 2007 and MLPA dated September 30, 2005"

def datestract(dreck):
    pattern = re.compile(r'(January|February|March|April|May|June|July|August|September|October|November|December)\s\d+,\s\d+')
    p = pattern.search(dreck)
    # search returns None when nothing matches, so check before calling group()
    if p is None:
        return None
    return p.group(0)

# print(datestract(dreck))
# >>> October 1, 2007 # only picks up the first date; use when you expect only one or none
# or you could use p = pattern.match(dreck)
# if you are expecting the match to come at the
# beginning of the string or not at all

def datelist(dreck):
    pattern = re.compile(r'(January|February|March|April|May|June|July|August|September|October|November|December)\s\d+,\s\d+')
    for hit in pattern.finditer(dreck):
        print(hit.group())

# datelist(dreck) # prints all dates
# October 1, 2007 # first hit
# September 30, 2005 # second hit
# print(datelist(dreck)) would also print a final None: that is the function's
# return value (it has no return statement), not a non-hit from finditer
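
Once the two functions above make sense, a natural variation is to return the results instead of printing them, and to compile the pattern only once. The following is a minimal sketch along those lines, not a replacement for the script above. The names DATE_PATTERN, first_date, leading_date and all_dates are made up for illustration, and the month group is written as a non-capturing group, (?:...), which is a change from the original pattern.

```python
import re

# Compiled once at module level; the raw string keeps \s and \d intact.
# (?:...) groups the month names without creating a numbered group.
DATE_PATTERN = re.compile(
    r'(?:January|February|March|April|May|June|July|August|'
    r'September|October|November|December)\s\d+,\s\d+'
)


def first_date(text):
    """Return the first date found anywhere in text, or None."""
    hit = DATE_PATTERN.search(text)
    return hit.group(0) if hit else None


def leading_date(text):
    """Return a date only if text starts with one, else None."""
    hit = DATE_PATTERN.match(text)
    return hit.group(0) if hit else None


def all_dates(text):
    """Return every date in text as a list (empty if there are none)."""
    return [hit.group(0) for hit in DATE_PATTERN.finditer(text)]


sample = ("Pooling and Servicing Agreement, dated as of October 1, 2007 "
          "and MLPA dated September 30, 2005")

print(first_date(sample))    # October 1, 2007
print(leading_date(sample))  # None, because the string does not start with a date
print(all_dates(sample))     # ['October 1, 2007', 'September 30, 2005']
print(all_dates("no dates here"))  # []
```

Returning a list means the caller decides whether to print, count or store the dates, and an empty list is an unambiguous "nothing found".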
