If you're looking to match and capture part of a string that matches your pattern, you have to be very careful to ensure that you match the correct part of the incoming string. If - for example - I were to ask you what 3 digit numbers the text "I live at 404 and my phone number is 708225" contains, you would naturally reply "404", but there are more three digit numbers there - 404, 708, 082, 822 and 225. And if I were to ask you for all numbers of 3 or more digits, you would add 7082 0822 8225 70822 08225 and 708225 to that list. This is unimportant when you ask "does this string contain some numbers of 3 or more digits", but it is critically important when you then say "what are those numbers because I want to process them".
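To see those overlapping candidates for yourself, here is a minimal Python sketch. It assumes the standard re module and uses a lookahead purely as a trick to list every possible starting point; the variable names are mine, for illustration only.

```python
import re

text = "I live at 404 and my phone number is 708225"

# A lookahead consumes no characters, so the engine can report a
# candidate at every position in the string, overlaps included.
three_digit = re.findall(r"(?=(\d{3}))", text)
print(three_digit)

# Repeat for each length from 3 to 6 digits to list every candidate
# "number of 3 or more digits" hiding in the text.
all_candidates = []
for length in range(3, 7):
    all_candidates.extend(re.findall(r"(?=(\d{%d}))" % length, text))
print(all_candidates)
```

The second list has eleven entries, which is why the question "which one do you mean?" matters as soon as you want to process the numbers.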
So what are the rules for making selections in regular expressions? They default to:
• Leftmost starting point and then
• Longest possible match
and then, if a global / multiple match is requested, it will carry on looking from the end of each match, so the matches it finds do not overlap.
So ... looking in
I live at 404 and my phone number is 708225
for a number of three or more digits, a regular expression system would find
404
and if it was told to do a global match it would find
404 and 708225
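As a rough illustration in Python (assuming the standard re module, with re.search standing in for a single match and re.findall for a global one):

```python
import re

text = "I live at 404 and my phone number is 708225"

# Single match: leftmost starting point, then as many digits as possible.
first = re.search(r"\d{3,}", text)
print(first.group())

# Global match: every non-overlapping match, working left to right.
all_numbers = re.findall(r"\d{3,}", text)
print(all_numbers)
```

The single match gives 404, and the global match gives 404 and 708225, with none of the shorter overlapping candidates reported.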
These defaults are sensible - they provide what you want in 95% of cases, and they apply to the following counts:
* (zero or more)
? (zero or one)
+ (one or more)
{3,6} (from 3 to 6)
{3,} (3 or more)
and this is known as "greedy matching" because it will grab as many characters as possible.
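A small sketch of that greediness, assuming Python's re module and a made-up sample string of eight letters, shows how much each count grabs:

```python
import re

sample = "aaaaaaaa"  # eight letters, chosen only for illustration

patterns = [r"a*", r"a?", r"a+", r"a{3,6}", r"a{3,}"]

# re.match anchors at the start of the string, so the length of each
# result shows how far the count was prepared to run.
greedy_lengths = {p: len(re.match(p, sample).group()) for p in patterns}

for pattern, length in greedy_lengths.items():
    print(f"{pattern:8} grabbed {length} character(s)")
```

Each count takes the most it is allowed: all eight for * and + and {3,}, one for ?, and six for {3,6}.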
But there are a few occasions where greedy matching is not what you want. Look at this XML:
<grand>Aeryn</grand><grand>Zyliana</grand>
With a greedy match to <(.*)> you'll get a single capture:
grand>Aeryn</grand><grand>Zyliana</grand
whereas you're likely to be wanting to match each of the tag elements:
grand, /grand, grand and /grand
You can achieve this through sparse matching (you'll also see it called lazy or non-greedy matching), adding an extra ? after the count - thus:
*? (zero or more, as few as possible)
?? (zero or one, preferably zero rather than one)
+? (one or more, as few as possible)
{3,6}? (from 3 to 6, as few as possible)
{3,}? (3 or more, as few as possible)
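Here is a short Python comparison of the two behaviours on the XML above, assuming the standard re module. The only difference between the two patterns is the extra ? after the star; the eight-letter sample string at the end is made up for illustration.

```python
import re

xml = "<grand>Aeryn</grand><grand>Zyliana</grand>"

# Greedy: .* runs to the end of the line, then backs off to the last ">".
greedy_tags = re.findall(r"<(.*)>", xml)
print(greedy_tags)

# Sparse: .*? stops at the first ">" it can, so each tag is captured.
sparse_tags = re.findall(r"<(.*?)>", xml)
print(sparse_tags)

# The same extra ? works on every count; each now takes the minimum.
sample = "aaaaaaaa"
sparse_lengths = {
    p: len(re.match(p, sample).group())
    for p in [r"a*?", r"a??", r"a+?", r"a{3,6}?", r"a{3,}?"]
}
print(sparse_lengths)
```

The greedy pattern returns one long capture; the sparse pattern returns grand, /grand, grand and /grand. On the sample string the sparse counts take 0, 0, 1, 3 and 3 characters respectively.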
That's "sparse" v "greedy" but there's one further element to consider - the .* that I've used in the regular expression examples, and what it actually means ...
You'll often find you're told that the dot (".") matches any character, but that's not quite true by default; the dot usually matches any character except a new line. That exception is written into the default regular expression engines to stop the sequence ".*" running on from one line to another within a multiline string, and that's a good default, but not always what you want to do. In Perl / PHP [preg style], you can use an "s" modifier - it stands for single line mode - to specify that you truly want the dot to match absolutely any character. In Python, you'll normally do it by passing the re.DOTALL flag as an extra parameter when you compile your regular expression.
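A minimal Python sketch of the difference, assuming the standard re module; the note element and its two-line content are invented for the example:

```python
import re

block = "<note>line one\nline two</note>"  # hypothetical two-line content

# Default: the dot will not cross the new line, so nothing matches.
default_result = re.search(r"<note>(.*)</note>", block)
print(default_result)

# With re.DOTALL the dot matches the new line character too.
pattern = re.compile(r"<note>(.*)</note>", re.DOTALL)
dotall_result = pattern.search(block)
print(repr(dotall_result.group(1)))

# The inline (?s) form is the same "s" modifier written into the pattern.
inline_result = re.search(r"(?s)<note>(.*)</note>", block)
print(repr(inline_result.group(1)))
```

The first search returns None; the other two capture both lines, new line included.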
The principles in this article apply across almost all of the regular expression implementations; I've chosen to add an example in Python on to our web site to illustrate them - source code and sample output are [here] (written 2010-12-17)