Matching to a string - what if it matches in many possible ways?

If you're looking to match and capture part of a string that matches your pattern, you have to be very careful to ensure that you match the correct part of the incoming string. If - for example - I were to ask you what 3 digit numbers the text "I live at 404 and my phone number is 708225" contains, you would naturally reply "404", but there are more three digit numbers there - 404, 708, 082, 822 and 225. And if I were to ask you for all numbers of 3 or more digits, you would add 7082 0822 8225 70822 08225 and 708225 to that list. This is unimportant when you ask "does this string contain some numbers of 3 or more digits", but it is critically important when you then say "what are those numbers because I want to process them".
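If you want to see every candidate rather than just the one a normal search would report, a zero-width lookahead is one way to do it, because it consumes no characters between attempts. What follows is only a minimal sketch in Python; the variable names are mine and purely illustrative.

```python
import re

text = "I live at 404 and my phone number is 708225"

# The lookahead (?=...) consumes nothing, so the engine tries again at
# every position, and the capture group reports each overlapping run.
three_digit = re.findall(r"(?=(\d{3}))", text)
print(three_digit)   # ['404', '708', '082', '822', '225']
```

This is useful for exploring what could match; it is not what a plain pattern does by default.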

So what are the rules for making selections in regular expressions? They default to:
• Leftmost starting point and then
• Longest possible match
and then, if a global (multiple) match is requested, the search carries on from the end of each match, so the matches it finds do not overlap.

So ... looking in
  I live at 404 and my phone number is 708225
for a number of three or more digits, a regular expression system would find
  404
and if it was told to do a global match it would find
  404 and 708225
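As a small illustration of those defaults, here is a sketch using Python's re module (the names text, pattern, first and every are my own choices for the example). The search call gives the single leftmost match and findall plays the part of the global match.

```python
import re

text = "I live at 404 and my phone number is 708225"
pattern = re.compile(r"\d{3,}")    # three or more digits

first = pattern.search(text)       # leftmost match only
print(first.group())               # 404

every = pattern.findall(text)      # all matches, non-overlapping
print(every)                       # ['404', '708225']
```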

These defaults are sensible - they provide what you want in 95% of cases, and they apply to the following counts:
  * (zero or more)
  ? (zero or one)
  + (one or more)
  {3,6} (from 3 to 6)
  {3,} (3 or more)
and this is known as "greedy matching" because it will grab as many characters as possible.
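To see each of those counts being greedy, the following sketch applies them to an arbitrary run of 11 digits that I have made up for the purpose. In each case the count takes as many digits as it is allowed to.

```python
import re

digits = "01225708225"   # arbitrary run of 11 digits, for illustration only

# Match each greedy count at the start of the string and keep what it grabbed
greedy = {p: re.match(p, digits).group()
          for p in (r"\d*", r"\d?", r"\d+", r"\d{3,6}", r"\d{3,}")}

for p, got in greedy.items():
    print(p, "->", got)
```

Only the ? and {3,6} counts stop short of the full string, and they stop at their upper limits of one and six digits.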

But there are a few occasions where greedy matching is not what you want. Look at this XML:
  <grand>Aeryn</grand><grand>Zyliana</grand>
with a greedy match to <(.*)> , you'll get a single capture:
  grand>Aeryn</grand><grand>Zyliana</grand
whereas you're likely to be wanting to match each of the tag elements:
  grand, /grand, grand and /grand
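The greedy behaviour on that XML can be checked with a couple of lines of Python. This is just a sketch, and the variable names are mine.

```python
import re

xml = "<grand>Aeryn</grand><grand>Zyliana</grand>"

# .* grabs as much as it can, so the capture runs from just after the
# first < right through to just before the very last >
greedy = re.findall(r"<(.*)>", xml)
print(greedy)   # ['grand>Aeryn</grand><grand>Zyliana</grand']
```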

You can achieve this through sparse matching (more commonly called non-greedy or lazy matching), adding an extra ? after the count - thus:
  *? (zero or more, as few as possible)
  ?? (zero or one, preferably zero rather than one)
  +? (one or more, as few as possible)
  {3,6}? (from 3 to 6, as few as possible)
  {3,}? (3 or more, as few as possible)
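Here is a short Python sketch of the extra ? in action, first on the XML from earlier and then on the phone number (variable names are illustrative only).

```python
import re

xml = "<grand>Aeryn</grand><grand>Zyliana</grand>"

# .*? takes as few characters as it can, so each match stops at the
# first > it reaches and we get the tags one at a time
sparse = re.findall(r"<(.*?)>", xml)
print(sparse)   # ['grand', '/grand', 'grand', '/grand']

# The same idea with a counted repeat: stop as soon as 3 digits are found
shortest = re.search(r"\d{3,}?", "708225").group()
print(shortest)  # 708
```

Note that the starting point is still the leftmost one; only the length of the match changes.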

That's "sparse" v "greedy" but there's one further element to consider - the .* that I've used in the regular expression examples, and what it actually means ...

You'll often find you're told that the dot (".") matches any character, but that's not quite true by default; the dot usually matches any character except a new line. That exception is written into the default regular expression engines to stop the sequence ".*" running on from one line to another within a multiline string, and that's a good default, but not always what you want to do. In Perl / PHP [preg style], you can use an "s" modifier - it stands for single line mode - to specify that you truly want the dot to match absolutely any character. In Python, you'll normally do it by passing the re.DOTALL flag as the second argument when you compile your regular expression.
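A quick way to see the difference in Python is to run the same pattern with and without the flag. This is a minimal sketch with a made-up two line string; (?s) is the inline spelling of the same flag.

```python
import re

text = "first line\nsecond line"   # made-up sample with one new line in it

# Default: the dot stops at the new line
default = re.search(r"first.*", text).group()
print(repr(default))    # 'first line'

# With re.DOTALL the dot matches the new line too
dotall = re.compile(r"first.*", re.DOTALL).search(text).group()
print(repr(dotall))     # 'first line\nsecond line'

# Inline form of the same flag, written at the start of the pattern
inline = re.search(r"(?s)first.*", text).group()
print(inline == dotall) # True
```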

The principles in this article apply across almost all of the regular expression implementations; I've chosen to add an example in Python on to our web site to illustrate them - source code and sample output is [here]
(written 2010-12-17)

 




This is a page archived from The Horse's Mouth at http://www.wellho.net/horse/ - the diary and writings of Graham Ellis. Every attempt was made to provide current information at the time the page was written, but things do move forward in our business - new software releases, price changes, new techniques. Please check back via our main site for current courses, prices, versions, etc - any mention of a price in "The Horse's Mouth" cannot be taken as an offer to supply at that price.


© WELL HOUSE CONSULTANTS LTD., 2024: 48 Spa Road • Melksham, Wiltshire • United Kingdom • SN12 7NY
PH: 01144 1225 708225 • EMAIL: info@wellho.net • WEB: http://www.wellho.net • SKYPE: wellho

PAGE: http://www.wellho.net/mouth/3090_Mat ... ways-.html • PAGE BUILT: Sun Oct 11 16:07:41 2020 • BUILD SYSTEM: JelliaJamb