Regex Reference sheet

For PCRE (Perl Compatible Regular Expressions)

Character classes



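[abcd] any one of the characters a, b, c or d
[^abcd] any one character except a, b, c or d
[A-J] any one character in the range A to J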

[[:word:]] style (POSIX named classes, used inside square brackets)
alnum letters and digits
alpha letters
ascii character codes 0 - 127
blank space or tab only
cntrl control characters
digit decimal digits (same as \d)
graph printing characters, excluding space
lower lower case letters
print printing characters, including space
punct printing characters, excluding letters and digits
space white space (not quite the same as \s)
upper upper case letters
word "word" characters (same as \w)
xdigit hexadecimal digits

. yes, just a full stop - (almost) any character

\p{xx} a character with the xx property, see unicode properties for more info
\P{xx} a character without the xx property, see unicode properties for more info

\d any decimal digit
\D any character that is not a decimal digit
\h any horizontal whitespace character
\H any character that is not a horizontal whitespace character
\s any whitespace character
\S any character that is not a whitespace character
\v any vertical whitespace character
\V any character that is not a vertical whitespace character
\w any "word" character
\W any "non-word" character
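As a rough illustration of the classes above, here is a minimal sketch in Python. Python's re module is not PCRE, but it shares the bracket classes and the \d, \w and \s shorthands (it does not accept \h, \v as vertical whitespace, or the [[:word:]] style). The sample text is made up for the example.

```python
import re

text = "Order 66 shipped to Bay_7 at 09:30"

# \d+ : runs of decimal digits
digits = re.findall(r"\d+", text)

# \w+ : runs of "word" characters (letters, digits and underscore)
words = re.findall(r"\w+", text)

# [A-J] : a single capital letter in the range A to J
capitals_a_to_j = re.findall(r"[A-J]", text)

# [^\w\s] : negated class, anything that is neither a word character nor whitespace
symbols = re.findall(r"[^\w\s]", text)

print(digits)
print(words)
print(capitals_a_to_j)
print(symbols)
```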

Anchors (zero width assertions)



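^ start of subject (or of each line in multiline mode)
$ end of subject, or before a newline at the end (or end of each line in multiline mode)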

\b word boundary
\B not a word boundary
\A start of subject (independent of multiline mode)
\Z end of subject or newline at end (independent of multiline mode)
\z end of subject (independent of multiline mode)
\G first matching position in subject
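The anchors can be tried out with a short sketch in Python. This is an illustration under assumptions, not PCRE itself: in Python's re module \Z matches only at the very end of the string (the job PCRE gives to \z), and \G is not available. The subject string is invented for the example.

```python
import re

subject = "cat\ncatalog cat\n"

# ^ without multiline mode: only the start of the subject
start_only = re.findall(r"^cat", subject)

# ^ with multiline mode: the start of every line
each_line = re.findall(r"^cat", subject, re.MULTILINE)

# \b on both sides: "cat" as a whole word, so "catalog" is skipped
whole_words = re.findall(r"\bcat\b", subject)

# $ without multiline mode still matches just before a final newline
before_final_newline = re.search(r"cat$", subject)

# Python's \Z is strict: the trailing newline prevents a match here
very_end = re.search(r"cat\Z", subject)

print(start_only, each_line, whole_words)
print(before_final_newline.start(), very_end)
```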

Individual Characters (literals)



just the character, or

\% really want a "%" (a backslash before any non-alphanumeric character makes it a literal)
\a alarm, that is, the BEL character (hex 07)
\cx "control-x", where x is any character
\e escape (hex 1B)
\f formfeed (hex 0C)
\n newline (hex 0A)
\r carriage return (hex 0D)
\R line break: matches \n, \r and \r\n (and, by default, the other Unicode newline sequences)
\t tab (hex 09)
\xhh character with hex code hh
\ddd character with octal code ddd, or backreference
Note - your version may need [*] to match a literal asterisk (. and | have the same need)
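A small sketch of matching literal characters, written in Python as a stand-in for a PCRE host language. Python's re module accepts the backslash and square bracket forms shown here, but not \e or \R. The helper re.escape is Python's own and is not part of the sheet above.

```python
import re

# A special character inside square brackets is taken literally
stars = re.findall(r"[*]", "2*3*4")

# A backslash in front of a non-alphanumeric character also makes it literal
dot = re.search(r"a\.b", "a.b axb")
percent = re.search(r"100\%", "100% sure")

# Escapes for characters that are awkward to type
tab = re.search(r"a\tb", "a\tb")
hex_a = re.search(r"\x41", "ABC")     # hex 41 is "A"

# re.escape builds a pattern that matches a string exactly as written
raw = "1.5*2|3"
exact = re.fullmatch(re.escape(raw), raw)

print(stars, dot.group(), percent.group(), hex_a.group())
```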

Counts



? 0 or 1

+ 1 or more
* 0 or more
{2} exactly 2
{2,4} 2 to 4
{2,} 2 or more
?? 0 or 1 sparse (as few as possible)
+? 1 or more sparse
*? 0 or more sparse
{2,4}? 2 to 4 sparse
{2,}? 2 or more sparse
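The difference between a plain count and its sparse form is easiest to see side by side. A minimal sketch in Python, whose re module uses the same count syntax; the sample strings are made up.

```python
import re

html = "<b>one</b> and <b>two</b>"

# Greedy: .* takes as much as it can, so one match spans both tags
greedy = re.findall(r"<b>.*</b>", html)

# Sparse: .*? takes as little as it can, so each tag pair matches separately
sparse = re.findall(r"<b>.*?</b>", html)

# {2,4} greedy takes up to four digits at a time
two_to_four = re.findall(r"\d{2,4}", "1 22 333 4444 55555")

# {2,4}? sparse is satisfied with two digits at a time
two_to_four_sparse = re.findall(r"\d{2,4}?", "55555")

print(greedy)
print(sparse)
print(two_to_four)
print(two_to_four_sparse)
```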

Grouping



() Capture, alternation, counts, subpatterns
(?:) Groups regular expressions together (no capture)
(?i) Case Insensitive
(?s) Makes a period (.) match even newline characters
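A short sketch of the grouping forms in Python, which shares this syntax with PCRE. Note that the inline option that lets a period match a newline is (?s). Recent Python versions require inline options such as (?i) and (?s) to sit at the start of the pattern, which is how they are used here. The sample strings are invented.

```python
import re

# () captures; (?:) groups without capturing, so only two groups come back
m = re.search(r"(\d+)-(?:\d+)-(\d+)", "ref 12-34-56")
captured = m.groups()

# A count applied to a whole group
repeated = re.fullmatch(r"(?:ab){2,3}", "ababab")

# (?i) switches on case-insensitive matching
any_case = re.findall(r"(?i)hello", "Hello HELLO hello")

# (?s) lets . match a newline; without it the match fails
dot_plain = re.search(r"a.b", "a\nb")
dot_all = re.search(r"(?s)a.b", "a\nb")

print(captured, any_case, dot_plain, dot_all is not None)
```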

Alternation



| either the pattern on the left or the pattern on the right
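Alternation has the lowest precedence, so it usually needs a group to limit its scope. A minimal sketch in Python (same syntax as PCRE for these patterns), using made-up strings:

```python
import re

# Plain alternation: either word
pets = re.findall(r"cat|dog", "cat dog cow")

# Grouping limits the alternation to one letter
spellings = re.findall(r"gr(?:a|e)y", "gray grey groy")

# Without a group, ^cat|dog$ means "^cat" or "dog$"
loose = re.search(r"^cat|dog$", "hotdog")

# With a group, the anchors apply to both alternatives
strict = re.search(r"^(?:cat|dog)$", "hotdog")

print(pets, spellings, loose.group(), strict)
```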

Subpatterns


\1 back reference to the first captured subpattern
\g{1} the same, in the unambiguous \g form

Back references to the named subpatterns can be achieved by (?P=name) or, since PHP 5.2.2, also by \k<name> or \k'name'. Additionally PHP 5.2.4 added support for \k{name} and \g{name}, and PHP 5.2.7 for \g<name> and \g'name'.
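Back references can be sketched in Python, with one caveat: Python's re module accepts \1 and the (?P=name) form in a pattern, but not the \g{...} or \k forms that PCRE and PHP offer. The sample strings are made up.

```python
import re

# \1 must match the same text that group 1 captured: finds a doubled word
doubled = re.search(r"\b(\w+) \1\b", "it was was a typo")

# A named subpattern and a named back reference: the closing quote
# has to be the same character as the opening quote
quoted = re.search(r"(?P<quote>['\"]).*?(?P=quote)", "say 'hi' now")

print(doubled.group(1), doubled.group(0))
print(quoted.group(0), quoted.group("quote"))
```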


Look ahead and look behind



(?= ) Positive Look Ahead
(?! ) Negative Look Ahead
(?<= ) Positive Look Behind
(?<! ) Negative Look Behind
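Look ahead and look behind check the text around a match without including it in the result. A minimal sketch in Python, which uses the same four forms: (?= ), (?! ), (?<= ) and (?<! ). The sample string is invented.

```python
import re

prices = "cost: $40, weight: 40kg, discount: $5"

# Positive look behind: digits that come straight after a dollar sign
after_dollar = re.findall(r"(?<=\$)\d+", prices)

# Positive look ahead: digits that are followed by "kg"
before_kg = re.findall(r"\d+(?=kg)", prices)

# Negative look behind: whole numbers not preceded by a dollar sign
not_after_dollar = re.findall(r"(?<!\$)\b\d+", prices)

# Negative look ahead: numbers not followed by "k" (or by a further digit)
not_before_k = re.findall(r"\d+(?![\dk])", prices)

print(after_dollar, before_kg, not_after_dollar, not_before_k)
```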

Limitations



Symbols must be in square brackets in order to be matched on.
Symbols .*| are not supported for data identifier patterns.
\w does not match _ when implemented in a Data Identifier pattern.
\s cannot be used to match whitespace; use a literal whitespace character instead.

Efficient



Use PCRE Compatible regex syntax.
Only search in the appropriate message part.
Avoid using an asterisk (*) where possible.
Limit the scope, and change the string to a range instead
Start as literal as possible
Triage
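One way to read "limit the scope" and "avoid using an asterisk" is to swap an open-ended .* for a character range with a bounded count. The sketch below, in Python, only shows how much text each pattern claims; it makes no timing claims, and the ID format is an assumption made for the example.

```python
import re

text = "ID: 12345 and then a long tail of unrelated text"

# Open-ended: .* runs to the end of the line
open_ended = re.search(r"ID: .*", text).group()

# Scoped: a literal start, then a digit range with a bounded count
scoped = re.search(r"ID: [0-9]{1,6}", text).group()

print(repr(open_ended))
print(repr(scoped))
```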

Appendix - Modes



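i - ignore case
s - or dotall - . matches any character, including newline
m - multiline - ^ and $ also match at intermediate newlines
x - extended - whitespace and # comments allowed in the regular expression
o - once only (Perl: compile the pattern only once)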
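In Python the i, s, m and x modes are passed as flags (re.I, re.S, re.M, re.X) rather than written after the closing delimiter. There is no o flag; the nearest equivalent, as an assumption about intent, is to compile the pattern once with re.compile and reuse it. A minimal sketch with made-up data:

```python
import re

# i: ignore case
ignore_case = re.search(r"HELLO", "hello", re.I)

# s: dotall, so . also matches a newline
dotall = re.search(r"a.b", "a\nb", re.S)

# m: ^ and $ also match at intermediate newlines
middle_line = re.findall(r"^b$", "a\nb\nc", re.M)

# x: whitespace and # comments are allowed inside the pattern
date_pattern = re.compile(r"""
    (\d{4})   # year
    -
    (\d{2})   # month
    -
    (\d{2})   # day
""", re.X)

# Compiled once, reused for every search
parts = date_pattern.search("written 2017-10-10").groups()

print(ignore_case is not None, dotall is not None, middle_line, parts)
```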

Appendix - Unicode



\p{L} or \p{Letter}: any kind of letter from any language.
\p{Ll} or \p{Lowercase_Letter}: a lowercase letter that has an uppercase variant.
\p{Lu} or \p{Uppercase_Letter}: an uppercase letter that has a lowercase variant.
\p{Lt} or \p{Titlecase_Letter}: a letter that appears at the start of a word when only the first letter of the word is capitalized.
\p{L&} or \p{Cased_Letter}: a letter that exists in lowercase and uppercase variants (combination of Ll, Lu and Lt).
\p{Lm} or \p{Modifier_Letter}: a special character that is used like a letter.
\p{Lo} or \p{Other_Letter}: a letter or ideograph that does not have lowercase and uppercase variants.
\p{M} or \p{Mark}: a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).
\p{Mn} or \p{Non_Spacing_Mark}: a character intended to be combined with another character without taking up extra space (e.g. accents, umlauts, etc.).
\p{Mc} or \p{Spacing_Combining_Mark}: a character intended to be combined with another character that takes up extra space (vowel signs in many Eastern languages).
\p{Me} or \p{Enclosing_Mark}: a character that encloses the character it is combined with (circle, square, keycap, etc.).
\p{Z} or \p{Separator}: any kind of whitespace or invisible separator.
\p{Zs} or \p{Space_Separator}: a whitespace character that is invisible, but does take up space.
\p{Zl} or \p{Line_Separator}: line separator character U+2028.
\p{Zp} or \p{Paragraph_Separator}: paragraph separator character U+2029.
\p{S} or \p{Symbol}: math symbols, currency signs, dingbats, box-drawing characters, etc.
\p{Sm} or \p{Math_Symbol}: any mathematical symbol.
\p{Sc} or \p{Currency_Symbol}: any currency sign.
\p{Sk} or \p{Modifier_Symbol}: a combining character (mark) as a full character on its own.
\p{So} or \p{Other_Symbol}: various symbols that are not math symbols, currency signs, or combining characters.
\p{N} or \p{Number}: any kind of numeric character in any script.
\p{Nd} or \p{Decimal_Digit_Number}: a digit zero through nine in any script except ideographic scripts.
\p{Nl} or \p{Letter_Number}: a number that looks like a letter, such as a Roman numeral.
\p{No} or \p{Other_Number}: a superscript or subscript digit, or a number that is not a digit 0–9 (excluding numbers from ideographic scripts).
\p{P} or \p{Punctuation}: any kind of punctuation character.
\p{Pd} or \p{Dash_Punctuation}: any kind of hyphen or dash.
\p{Ps} or \p{Open_Punctuation}: any kind of opening bracket.
\p{Pe} or \p{Close_Punctuation}: any kind of closing bracket.
\p{Pi} or \p{Initial_Punctuation}: any kind of opening quote.
\p{Pf} or \p{Final_Punctuation}: any kind of closing quote.
\p{Pc} or \p{Connector_Punctuation}: a punctuation character such as an underscore that connects words.
\p{Po} or \p{Other_Punctuation}: any kind of punctuation character that is not a dash, bracket, quote or connector.
\p{C} or \p{Other}: invisible control characters and unused code points.
\p{Cc} or \p{Control}: an ASCII or Latin-1 control character: 0x00–0x1F and 0x7F–0x9F.
\p{Cf} or \p{Format}: invisible formatting indicator.
\p{Co} or \p{Private_Use}: any code point reserved for private use.
\p{Cs} or \p{Surrogate}: one half of a surrogate pair in UTF-16 encoding.
\p{Cn} or \p{Unassigned}: any code point to which no character has been assigned.
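Python's built-in re module does not accept \p{xx}, but the same two-letter codes are the Unicode general categories, which the standard unicodedata module reports. The sketch below uses that to show which characters a given property would cover. The helper chars_with_property is a hypothetical name made up for this example.

```python
import unicodedata

def chars_with_property(text, prop):
    """Return the characters whose general category starts with prop.

    A one-letter prop such as "L" behaves like \\p{L}; a two-letter
    prop such as "Lu" behaves like \\p{Lu}.
    """
    return [c for c in text if unicodedata.category(c).startswith(prop)]

sample = "A\u00f15 \u20ac_"     # A, n with tilde, 5, space, euro sign, underscore

letters = chars_with_property(sample, "L")
uppercase = chars_with_property(sample, "Lu")
numbers = chars_with_property(sample, "N")
currency = chars_with_property(sample, "Sc")
connectors = chars_with_property(sample, "Pc")
separators = chars_with_property(sample, "Z")

print(letters, uppercase, numbers, currency, connectors, separators)
```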
(written 2017-10-10, updated 2017-10-11)

 




This is a page archived from The Horse's Mouth at http://www.wellho.net/horse/ - the diary and writings of Graham Ellis. Every attempt was made to provide current information at the time the page was written, but things do move forward in our business - new software releases, price changes, new techniques. Please check back via our main site for current courses, prices, versions, etc - any mention of a price in "The Horse's Mouth" cannot be taken as an offer to supply at that price.


© WELL HOUSE CONSULTANTS LTD., 2017: 404 The Spa • Melksham, Wiltshire • United Kingdom • SN12 6QL
PH: 01144 1225 708225 • EMAIL: info@wellho.net • WEB: http://www.wellho.net • SKYPE: wellho

PAGE: http://www.wellho.net/mouth/4763_Reg ... sheet.html • PAGE BUILT: Sat May 27 16:49:10 2017 • BUILD SYSTEM: WomanWithCat