Matching to a string - what if it matches in many possible ways?

If you want to match and capture the part of a string that fits your pattern, you have to be careful to ensure that you capture the correct part of the incoming string. If - for example - I were to ask you what 3 digit numbers the text "I live at 404 and my phone number is 708225" contains, you would naturally reply "404", but there are more three digit numbers there - 404, 708, 082, 822 and 225. And if I were to ask you for all numbers of 3 or more digits, you would add 7082, 0822, 8225, 70822, 08225 and 708225 to that list. This is unimportant when you ask "does this string contain some numbers of 3 or more digits?", but it is critically important when you then say "what are those numbers? I want to process them".
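As a rough sketch of the point above, the following Python lists every candidate by brute force, overlapping ones included. The variable names are my own choice for illustration; the lookahead trick is one way of seeing overlapping matches, not how a normal search behaves.

```python
import re

text = "I live at 404 and my phone number is 708225"

# A lookahead consumes no characters, so a match can be reported at every
# position; the capture group holds the three digits seen from there.
three_digit = re.findall(r"(?=(\d{3}))", text)
print(three_digit)

# Every run of three or more digits, found by testing each substring.
three_or_more = []
for start in range(len(text)):
    for end in range(start + 3, len(text) + 1):
        candidate = text[start:end]
        if candidate.isdigit():
            three_or_more.append(candidate)
print(three_or_more)
```

That is eleven candidates in all, which is why the selection rules that follow matter.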

So what are the rules for making selections in regular expressions? They default to:
• Leftmost starting point and then
• Longest possible match
and then, if a global / multiple match is requested, the search resumes after the end of each match, so the matches found never overlap.

So ... looking in
  I live at 404 and my phone number is 708225
for a number of three or more digits, a regular expression system would find
  404
and if it was told to do a global match it would find
  404 and 708225
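Here is a minimal sketch of that behaviour in Python, assuming \d{3,} as the pattern for "three or more digits"; search gives the single leftmost match and findall plays the part of the global match.

```python
import re

text = "I live at 404 and my phone number is 708225"
number = re.compile(r"\d{3,}")

# A single search stops at the leftmost match.
first = number.search(text).group()
print(first)

# findall carries on after the end of each match, so nothing overlaps.
all_numbers = number.findall(text)
print(all_numbers)
```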

These defaults are sensible - they provide what you want in 95% of cases, and they apply to the following counts:
  * (zero or more)
  ? (zero or one)
  + (one or more)
  {3,6} (from 3 to 6)
  {3,} (3 or more)
and this is known as "greedy matching" because it will grab as many characters as possible.
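To see each of those counts being greedy, here is a short sketch in Python. The sample strings are made up for illustration, and re.match is used so that every pattern starts at the beginning of its sample.

```python
import re

# Each pattern is paired with a sample string of digits to match against.
samples = [
    (r"\d*", "708225"),
    (r"\d?", "708225"),
    (r"\d+", "708225"),
    (r"\d{3,6}", "12345678"),
    (r"\d{3,}", "12345678"),
]

# Anchored at the start, each count takes as many digits as it is allowed.
greedy = {pattern: re.match(pattern, subject).group() for pattern, subject in samples}
for pattern, matched in greedy.items():
    print(pattern, "->", matched)
```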

But there are a few occasions where greedy matching is not what you want. Look at this XML:
  <grand>Aeryn</grand><grand>Zyliana</grand>
With a greedy match to <(.*)>, you'll get a single capture:
  grand>Aeryn</grand><grand>Zyliana</grand
whereas you're likely to be wanting to match each of the tag elements:
  grand, /grand, grand and /grand
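The greedy result is easy to reproduce; this is a small Python sketch using the same XML and the same pattern.

```python
import re

xml = "<grand>Aeryn</grand><grand>Zyliana</grand>"

# .* runs on to the last > in the string, so there is only one capture.
greedy_captures = re.findall(r"<(.*)>", xml)
print(greedy_captures)
```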

You can achieve this through sparse matching (often called "lazy" or "non-greedy" matching), adding an extra ? after the count - thus:
  *? (zero or more, as few as possible)
  ?? (zero or one, preferably zero rather than one)
  +? (one or more, as few as possible)
  {3,6}? (from 3 to 6, as few as possible)
  {3,}? (3 or more, as few as possible)
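As a sketch of the difference in Python, here is the same XML matched with the extra ? added, followed by the counts applied to a made-up string of digits. A negated character class such as [^>]* is shown as a common alternative; that is my addition rather than part of the technique described above.

```python
import re

xml = "<grand>Aeryn</grand><grand>Zyliana</grand>"

# .*? stops at the first > it can, giving one capture per tag.
sparse_captures = re.findall(r"<(.*?)>", xml)
print(sparse_captures)

# Alternative: "anything except >" can never run past the end of a tag.
class_captures = re.findall(r"<([^>]*)>", xml)
print(class_captures)

# The counts with ? added take as few digits as they are allowed to.
digits = "12345678"
sparse = {p: re.match(p, digits).group() for p in (r"\d*?", r"\d??", r"\d+?", r"\d{3,6}?", r"\d{3,}?")}
print(sparse)
```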

That's "sparse" v "greedy" but there's one further element to consider - the .* that I've used in the regular expression examples, and what it actually means ...

You'll often find you're told that the dot (".") matches any character, but that's not quite true by default; the dot usually matches any character except a new line. That exception is written into the default behaviour of regular expression engines to stop the sequence ".*" running on from one line to another within a multiline string, and that's a good default, but not always what you want. In Perl / PHP [preg style], you can use an "s" modifier - it stands for single line mode - to specify that you truly want the dot to match absolutely any character. In Python, you'll normally do it by passing the re.DOTALL flag when you compile your regular expression.
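A short Python sketch of the dot and the new line follows; the <note> element and its contents are made up for illustration. The inline (?s) form at the start of the pattern does the same job as the re.DOTALL flag.

```python
import re

text = "<note>line one\nline two</note>"

# By default the dot stops at the new line, so nothing matches.
default_match = re.search(r"<note>(.*)</note>", text)
print(default_match)

# With re.DOTALL the dot matches the new line as well.
dotall = re.compile(r"<note>(.*)</note>", re.DOTALL)
dotall_capture = dotall.search(text).group(1)
print(repr(dotall_capture))

# The same thing, switched on from inside the pattern.
inline_capture = re.search(r"(?s)<note>(.*)</note>", text).group(1)
print(repr(inline_capture))
```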

The principles in this article apply across almost all of the regular expression implementations; I've chosen to add an example in Python to our web site to illustrate them - source code and sample output are [here]
(written 2010-12-17)

 




This is a page archived from The Horse's Mouth at http://www.wellho.net/horse/ - the diary and writings of Graham Ellis. Every attempt was made to provide current information at the time the page was written, but things do move forward in our business - new software releases, price changes, new techniques. Please check back via our main site for current courses, prices, versions, etc - any mention of a price in "The Horse's Mouth" cannot be taken as an offer to supply at that price.


© WELL HOUSE CONSULTANTS LTD., 2014: Well House Manor • 48 Spa Road • Melksham, Wiltshire • United Kingdom • SN12 7NY
PH: 01144 1225 708225 • FAX: 01144 1225 899360 • EMAIL: info@wellho.net • WEB: http://www.wellho.net • SKYPE: wellho

PAGE: http://www.wellho.net/mouth/3090_Mat ... ways-.html • PAGE BUILT: Thu Sep 18 15:30:25 2014 • BUILD SYSTEM: WomanWithCat