Regex Reference sheet

For PCRE (Perl Compatible Regular Expressions)

Character classes

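[abcd] any one of the characters a, b, c or d
[^abcd] any one character except a, b, c or d
[A-J] any one character in the range A to J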
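A short sketch of the three bracket forms above. It assumes Python's re module as a stand-in for a PCRE engine; for these simple classes the two behave the same way. The sample strings are made up for illustration.

```python
import re

# Assumption: Python's re module stands in for PCRE here.
word = "cabbage"

listed = re.findall(r"[abcd]", word)      # characters that are in the set
excluded = re.findall(r"[^abcd]", word)   # characters that are not in the set
ranged = re.findall(r"[A-J]", "Bath, Jam, Kent")  # capitals A to J only

print(listed)    # ['c', 'a', 'b', 'b', 'a']
print(excluded)  # ['g', 'e']
print(ranged)    # ['B', 'J']
```

Note that K in "Kent" falls outside the A to J range, so it is not reported.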

[[:word:]] style (POSIX named classes, used inside square brackets)
alnum letters and digits
alpha letters
ascii character codes 0 - 127
blank space or tab only
cntrl control characters
digit decimal digits (same as \d)
graph printing characters, excluding space
lower lower case letters
print printing characters, including space
punct printing characters, excluding letters and digits
space white space (not quite the same as \s)
upper upper case letters
word "word" characters (same as \w)
xdigit hexadecimal digits

. yes, just a full stop - (almost) any character

\p{xx} a character with the xx property, see unicode properties for more info
\P{xx} a character without the xx property, see unicode properties for more info

\d any decimal digit
\D any character that is not a decimal digit
\h any horizontal whitespace character
\H any character that is not a horizontal whitespace character
\s any whitespace character
\S any character that is not a whitespace character
\v any vertical whitespace character
\V any character that is not a vertical whitespace character
\w any "word" character
\W any "non-word" character
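The shorthand classes can be tried out as below. This is only a sketch in Python's re module, which shares \d, \D, \s, \S, \w and \W with PCRE; as far as I know it does not accept \h, \H, \v (in the PCRE sense) or \p{xx}, so those are left out. The sample string is an assumption for illustration.

```python
import re

# Assumption: Python's re module stands in for PCRE here.
sample = "Bed 42,\tCab_7!"

digits = re.findall(r"\d+", sample)    # runs of decimal digits
words = re.findall(r"\w+", sample)     # runs of "word" characters (note the _)
chunks = re.findall(r"\S+", sample)    # runs of non-whitespace
gaps = re.findall(r"\s", sample)       # each whitespace character
non_word = re.findall(r"\W", sample)   # each "non-word" character

print(digits)    # ['42', '7']
print(words)     # ['Bed', '42', 'Cab_7']
print(chunks)    # ['Bed', '42,', 'Cab_7!']
print(gaps)      # [' ', '\t']
print(non_word)  # [' ', ',', '\t', '!']
```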

Anchors (zero width assertions)

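^ start of subject (also start of each line in multiline mode)
$ end of subject or before a newline at the end (also end of each line in multiline mode)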

\b word boundary
\B not a word boundary
\A start of subject (independent of multiline mode)
\Z end of subject or newline at end (independent of multiline mode)
\z end of subject (independent of multiline mode)
\G first matching position in subject
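A sketch of the more common anchors, again assuming Python's re module in place of PCRE. Be aware that Python spells some of these differently: its \Z means "end of subject only" (PCRE's \z), and it has no \G, so neither is used here. The sample text is made up.

```python
import re

# Assumption: Python's re module stands in for PCRE here.
text = "cat concat\ncat"

start_only = [m.start() for m in re.finditer(r"^cat", text)]        # ^ at start of subject
each_line = [m.start() for m in re.finditer(r"^cat", text, re.M)]   # ^ in multiline mode
whole_word = [m.start() for m in re.finditer(r"\bcat\b", text)]     # word boundary both sides
inside_word = [m.start() for m in re.finditer(r"\Bcat", text)]      # not at a word boundary
subject_start = [m.start() for m in re.finditer(r"\Acat", text, re.M)]  # \A ignores multiline

print(start_only)     # [0]
print(each_line)      # [0, 11]
print(whole_word)     # [0, 11]
print(inside_word)    # [7]
print(subject_start)  # [0]
```

The last line shows the "independent of multiline mode" note in practice: \A still matches only at offset 0 even with multiline mode switched on.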

Individual Characters (literals)

just the character, or

\% really want a "%" (and others)
\a alarm, that is, the BEL character (hex 07)
\cx "control-x", where x is any character
\e escape (hex 1B)
\f formfeed (hex 0C)
\n newline (hex 0A)
\r carriage return (hex 0D)
\R line break: matches \n, \r and \r\n
\t tab (hex 09)
\xhh character with hex code hh
\ddd character with octal code ddd, or backreference
Note - your version may need square brackets around a symbol (see Limitations):
[*] to match an asterisk (. and | have the same need)
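Some of the literal escapes can be checked as follows. This is a sketch in Python's re module rather than PCRE itself; to my knowledge Python does not accept \e, \cx or \R in a pattern, so \R is imitated with an explicit alternation. The sample strings are assumptions.

```python
import re

# Assumption: Python's re module stands in for PCRE here.
hex_a = re.findall(r"\x41", "ABA")      # hex 41 is "A"
octal_a = re.findall(r"\101", "ABA")    # octal 101 is also "A"
percent = re.search(r"100\%", "100% sure").group(0)  # escaped literal %
fields = re.split(r"\t", "one\ttwo\tthree")           # split on tab

# Imitation of \R: try the two-character \r\n before the single characters.
line_ends = re.findall(r"\r\n|\n|\r", "a\r\nb\nc\r")

print(hex_a)      # ['A', 'A']
print(octal_a)    # ['A', 'A']
print(percent)    # 100%
print(fields)     # ['one', 'two', 'three']
print(line_ends)  # ['\r\n', '\n', '\r']
```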

Counts

? 0 or 1
+ 1 or more
* 0 or more
{2} exactly 2
{2,4} 2 to 4
{2,} 2 or more
?? 0 or 1 sparse (sparse = match as few as possible, also called lazy or non-greedy)
+? 1 or more sparse
*? 0 or more sparse
{2,4}? 2 to 4 sparse
{2,}? 2 or more sparse
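The difference between a plain count and its sparse form is easiest to see side by side. A minimal sketch, assuming Python's re module (whose counts follow the same rules as PCRE's); the sample strings are made up.

```python
import re

# Assumption: Python's re module stands in for PCRE here.
html = "<b>bold</b> and <i>italic</i>"

greedy = re.findall(r"<.+>", html)    # + grabs as much as it can
sparse = re.findall(r"<.+?>", html)   # +? grabs as little as it can

numbers = "7 42 365 1066"
exactly_two = re.findall(r"\b\d{2}\b", numbers)    # exactly 2 digits
two_to_four = re.findall(r"\b\d{2,4}\b", numbers)  # 2 to 4 digits
optional_u = re.findall(r"colou?r", "color colour")  # ? makes the u optional

print(greedy)       # ['<b>bold</b> and <i>italic</i>']
print(sparse)       # ['<b>', '</b>', '<i>', '</i>']
print(exactly_two)  # ['42']
print(two_to_four)  # ['42', '365', '1066']
print(optional_u)   # ['color', 'colour']
```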

Grouping

() Capture, alternation, counts, subpatterns
(?:) Groups regular expressions together (no capture)
(?i) Case Insensitive
(?s) Makes a period (.) match even newline characters
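A sketch of capturing and non-capturing groups and of the inline options, assuming Python's re module in place of PCRE. In both engines the option that lets a period match a newline is written (?s). The sample strings are assumptions.

```python
import re

# Assumption: Python's re module stands in for PCRE here.
date = re.search(r"(\d+)-(?:\d+)-(\d+)", "2017-10-11")
captured = date.groups()   # the middle (?:...) group is not captured

either = re.findall(r"gr(?:a|e)y", "gray grey")        # grouping for alternation
any_case = re.findall(r"(?i)regex", "Regex REGEX regex")  # (?i) ignores case

dot_default = re.search(r"a.b", "a\nb")      # . does not match newline
dot_all = re.search(r"(?s)a.b", "a\nb")      # (?s) lets . match newline

print(captured)             # ('2017', '11')
print(either)               # ['gray', 'grey']
print(any_case)             # ['Regex', 'REGEX', 'regex']
print(dot_default is None)  # True
print(dot_all is not None)  # True
```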

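Alternation

| matches either the alternative on its left or the one on its right

Subpatterns

\1 back reference to the first captured subpattern
\g{1} the same back reference, in the unambiguous \g form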

Back references to the named subpatterns can be achieved by (?P=name) or, since PHP 5.2.2, also by \k<name> or \k'name'. Additionally PHP 5.2.4 added support for \k{name} and \g{name}, and PHP 5.2.7 for \g<name> and \g'name'.
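Numbered and named back references can be sketched as below. This assumes Python's re module, which accepts \1 and (?P=name) in a pattern; as far as I know it does not accept the \g{1} or \k forms inside a pattern, so they are not shown. The group name quote and the sample strings are made up for illustration.

```python
import re

# Assumption: Python's re module stands in for PCRE here.
sentence = "it was the the best"

# \1 must repeat whatever the first group captured.
doubled = re.search(r"\b(\w+) \1\b", sentence).group(1)
repaired = re.sub(r"\b(\w+) \1\b", r"\1", sentence)

# Named group "quote" (hypothetical name) and a back reference to it:
# the closing quote must be the same character as the opening one.
quoted = re.search(r"(?P<quote>['\"]).*?(?P=quote)", "say \"hi\" now").group(0)

print(doubled)   # the
print(repaired)  # it was the best
print(quoted)    # "hi"
```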

Look ahead and look behind

(?= ) Positive Look Ahead
(?! ) Negative Look Ahead
(?<= ) Positive Look Behind
(?<! ) Negative Look Behind
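The four look-around forms are (?= ), (?! ), (?<= ) and (?<! ). A minimal sketch of each, assuming Python's re module in place of PCRE; the price string is made up. None of the assertions consume any text, so only the digits are returned.

```python
import re

# Assumption: Python's re module stands in for PCRE here.
prices = "cost: $40, weight: 40kg, discount: $5"

before_kg = re.findall(r"\d+(?=kg)", prices)             # positive look ahead
not_before_kg = re.findall(r"\b\d+\b(?!kg)", prices)     # negative look ahead
after_dollar = re.findall(r"(?<=\$)\d+", prices)         # positive look behind
not_after_dollar = re.findall(r"(?<!\$)\b\d+", prices)   # negative look behind

print(before_kg)         # ['40']
print(not_before_kg)     # ['40', '5']
print(after_dollar)      # ['40', '5']
print(not_after_dollar)  # ['40']
```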

Limitations

Symbols must be in square brackets in order to be matched on.
Symbols .*| are not supported for data identifier patterns.
\w does not match _ when implemented in a Data Identifier pattern.
\s cannot be used to match whitespace; use a literal whitespace character instead.
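The limitations above describe one particular Data Identifier engine, which cannot be run here. As a sketch only, the bracketed forms it asks for are also valid in ordinary engines, shown below with Python's re module as an assumed stand-in; the sample text is made up.

```python
import re

# Assumption: Python's re module stands in for the restricted engine;
# it only shows that the bracketed forms are portable.
text = "a*b a.b a|b axb"

star = re.findall(r"a[*]b", text)   # [*] is a literal asterisk
dot = re.findall(r"a[.]b", text)    # [.] is a literal full stop
pipe = re.findall(r"a[|]b", text)   # [|] is a literal vertical bar
any_char = re.findall(r"a.b", text) # an unbracketed . matches any character

# A literal space (or a class listing space and tab) in place of \s.
spaced = re.findall(r"[ \t]+", "one  two\tthree")

print(star)      # ['a*b']
print(dot)       # ['a.b']
print(pipe)      # ['a|b']
print(any_char)  # ['a*b', 'a.b', 'a|b', 'axb']
print(spaced)    # ['  ', '\t']
```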

Efficiency

Use PCRE Compatible regex syntax.
Only search in the appropriate message part.
Avoid using an asterisk (*) where possible.
Limit the scope, and change the string to a range instead
Start as literal as possible
Triage

Appendix - Modes

i - ignore case
s - or dotall - . matches all characters, including newline
m - matches ^ and $ at intermediate new lines
x - comments in regular expressions
o - once only
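The i, s, m and x modes map onto flags in most engines. A sketch assuming Python's re module, where they are re.I, re.S, re.M and re.X; the o mode comes from Perl and has no counterpart here, so it is not shown. The sample text is made up.

```python
import re

# Assumption: Python's re module stands in for PCRE here.
text = "First line\nsecond LINE"

ignore_case = re.findall(r"line", text, re.I)           # i - ignore case
dot_default = re.search(r"First.*LINE", text)           # . stops at the newline
dot_all = re.search(r"First.*LINE", text, re.S)         # s - . matches newline too
line_starts = re.findall(r"^\w+", text, re.M)           # m - ^ at each line start

# x - white space and # comments inside the pattern are ignored
dated = re.compile(r"""
    (\d{4})   # year
    -         # separator
    (\d{2})   # month
""", re.X)
year_month = dated.search("written 2017-10-10").groups()

print(ignore_case)          # ['line', 'LINE']
print(dot_default is None)  # True
print(dot_all is not None)  # True
print(line_starts)          # ['First', 'second']
print(year_month)           # ('2017', '10')
```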

Appendix - Unicode

\p{L} or \p{Letter}: any kind of letter from any language.
\p{Ll} or \p{Lowercase_Letter}: a lowercase letter that has an uppercase variant.
\p{Lu} or \p{Uppercase_Letter}: an uppercase letter that has a lowercase variant.
\p{Lt} or \p{Titlecase_Letter}: a letter that appears at the start of a word when only the first letter of the word is capitalized.
\p{L&} or \p{Cased_Letter}: a letter that exists in lowercase and uppercase variants (combination of Ll, Lu and Lt).
\p{Lm} or \p{Modifier_Letter}: a special character that is used like a letter.
\p{Lo} or \p{Other_Letter}: a letter or ideograph that does not have lowercase and uppercase variants.
\p{M} or \p{Mark}: a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).
\p{Mn} or \p{Non_Spacing_Mark}: a character intended to be combined with another character without taking up extra space (e.g. accents, umlauts, etc.).
\p{Mc} or \p{Spacing_Combining_Mark}: a character intended to be combined with another character that takes up extra space (vowel signs in many Eastern languages).
\p{Me} or \p{Enclosing_Mark}: a character that encloses the character it is combined with (circle, square, keycap, etc.).
\p{Z} or \p{Separator}: any kind of whitespace or invisible separator.
\p{Zs} or \p{Space_Separator}: a whitespace character that is invisible, but does take up space.
\p{Zl} or \p{Line_Separator}: line separator character U+2028.
\p{Zp} or \p{Paragraph_Separator}: paragraph separator character U+2029.
\p{S} or \p{Symbol}: math symbols, currency signs, dingbats, box-drawing characters, etc.
\p{Sm} or \p{Math_Symbol}: any mathematical symbol.
\p{Sc} or \p{Currency_Symbol}: any currency sign.
\p{Sk} or \p{Modifier_Symbol}: a combining character (mark) as a full character on its own.
\p{So} or \p{Other_Symbol}: various symbols that are not math symbols, currency signs, or combining characters.
\p{N} or \p{Number}: any kind of numeric character in any script.
\p{Nd} or \p{Decimal_Digit_Number}: a digit zero through nine in any script except ideographic scripts.
\p{Nl} or \p{Letter_Number}: a number that looks like a letter, such as a Roman numeral.
\p{No} or \p{Other_Number}: a superscript or subscript digit, or a number that is not a digit 0 to 9 (excluding numbers from ideographic scripts).
\p{P} or \p{Punctuation}: any kind of punctuation character.
\p{Pd} or \p{Dash_Punctuation}: any kind of hyphen or dash.
\p{Ps} or \p{Open_Punctuation}: any kind of opening bracket.
\p{Pe} or \p{Close_Punctuation}: any kind of closing bracket.
\p{Pi} or \p{Initial_Punctuation}: any kind of opening quote.
\p{Pf} or \p{Final_Punctuation}: any kind of closing quote.
\p{Pc} or \p{Connector_Punctuation}: a punctuation character such as an underscore that connects words.
\p{Po} or \p{Other_Punctuation}: any kind of punctuation character that is not a dash, bracket, quote or connector.
\p{C} or \p{Other}: invisible control characters and unused code points.
\p{Cc} or \p{Control}: an ASCII or Latin-1 control character: 0x00 to 0x1F and 0x7F to 0x9F.
\p{Cf} or \p{Format}: invisible formatting indicator.
\p{Co} or \p{Private_Use}: any code point reserved for private use.
\p{Cs} or \p{Surrogate}: one half of a surrogate pair in UTF-16 encoding.
\p{Cn} or \p{Unassigned}: any code point to which no character has been assigned.
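The two-letter codes above are the Unicode general categories, and they can be looked up for any character. A sketch using Python's standard unicodedata module; Python's own re module does not, as far as I know, accept \p{xx}, so the helper has_property below is a hypothetical stand-in written for this example and does not handle \p{L&}.

```python
import unicodedata

# Look up the general category code of a few sample characters.
codes = {ch: unicodedata.category(ch) for ch in "aA5_ $+"}

def has_property(ch: str, prop: str) -> bool:
    # Hypothetical helper imitating \p{prop}: a one-letter property such
    # as "L" covers every category starting with that letter (Ll, Lu, ...).
    return unicodedata.category(ch).startswith(prop)

print(codes)
# {'a': 'Ll', 'A': 'Lu', '5': 'Nd', '_': 'Pc', ' ': 'Zs', '$': 'Sc', '+': 'Sm'}
print(has_property("\u00e9", "L"))   # True  (e with acute accent is a letter)
print(has_property("5", "L"))        # False (a digit is not a letter)
print(has_property("\u2028", "Zl"))  # True  (the line separator)
```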
(written 2017-10-10, updated 2017-10-11)
