Regular Expressions in Java

As of release 1.4, Java 2 comes with a regular expression package that supports patterns similar in style to Perl regular expressions. This page gives you a brief introduction to regular expressions if you're not already familiar with them, then covers Java-specific topics such as the compile and matches methods, Pattern objects and PatternSyntaxExceptions.

BASIC HANDLING OF STRINGS

The Java language includes a primitive data type char, which holds a 16-bit Unicode character. You can hold multiple characters in a String object, or in a StringBuffer object.

Methods such as equals and equalsIgnoreCase, startsWith and endsWith allow you to test Strings against one another. Methods such as indexOf and substring allow you to perform operations on a String. parseInt and other similarly named methods allow you to extract a number (in this example an int) from a String, although you do need to remember to catch the exception that may be thrown.
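As a minimal sketch of the methods just mentioned (the class name StringBasics and the helper parseOrDefault are made up for this illustration, and returning a fallback value is just one way of handling the exception):

```java
class StringBasics {

    // Returns the int held in text, or fallback if text is not a valid integer.
    // parseInt throws NumberFormatException on bad input, so we catch it here.
    static int parseOrDefault(String text, int fallback) {
        try {
            return Integer.parseInt(text.trim());
        } catch (NumberFormatException e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        String address = "graham@wellho.net";

        // Testing one String against another
        System.out.println(address.equalsIgnoreCase("GRAHAM@WELLHO.NET"));
        System.out.println(address.endsWith(".net"));

        // Operating on a String: locate the @ and cut the String either side of it
        int at = address.indexOf('@');
        System.out.println(address.substring(0, at));
        System.out.println(address.substring(at + 1));

        // Extracting a number, with the exception handled in the helper
        System.out.println(parseOrDefault("1225", -1));
        System.out.println(parseOrDefault("twelve", -1));
    }
}
```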

In certain specialist applications, such as Bioinformatics, multiple characters can also be usefully held in a char array, where they're likely to be dealt with character-by-character in a loop, as in DNA and RNA sequencing.

The StringTokenizer class allows you to take a String and step through it element-by-element (token-by-token) to handle it in chunks or sections. You can choose what character or characters you use between the elements to break up the string in the way you want.
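Here's a short sketch of the StringTokenizer in use. The class name TokenDemo, the helper countTokens and the sample line are assumptions made for the illustration; note that every character in the delimiter String counts as a separator, and that runs of separators are treated as a single break.

```java
import java.util.StringTokenizer;

class TokenDemo {

    // Counts the tokens in text, treating each character of delimiters as a separator.
    static int countTokens(String text, String delimiters) {
        StringTokenizer st = new StringTokenizer(text, delimiters);
        int n = 0;
        while (st.hasMoreTokens()) {
            st.nextToken();
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // Break the line at colons and spaces, then step through token-by-token
        StringTokenizer st = new StringTokenizer("passwd: files nisplus nis", ": ");
        while (st.hasMoreTokens()) {
            System.out.println(st.nextToken());
        }
    }
}
```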

But, until Java release 1.4, the standard classes didn't include any way to ask "does this string look like xxxxxxx". Why would we want to? Well, we might want to ask, "Does this string look like an email address?" and go on to define (in simple terms) an email address as a series of non-spaces, followed by an @ character, followed by another series of non-spaces. Let's see how that is solved in Java 1.4:

$ java Reg1 email graham@wellho.net or lisa@wellho.net for information
"email" is NOT an email address
"graham@wellho.net" IS a possible email address
"or" is NOT an email address
"lisa@wellho.net" IS a possible email address
"for" is NOT an email address
"information" is NOT an email address
$

Looks good.

The Reg1 class is short too, but it includes some strange looking text strings:

import java.util.regex.*;
public class Reg1 {
public static void main (String [] args) {
                Pattern email = Pattern.compile("^\\S+@\\S+$");
                for (int j=0; j<args.length; j++) {
                        Matcher fit = email.matcher(args[j]);
                        if (fit.matches()) {
                                System.out.println (
                                "\"" +args[j] +
                                "\" IS a possible email address");
                        } else {
                                System.out.println (
                                "\"" + args[j] +
                                "\" is NOT an email address");
                        }
                }
        }
}

AN INTRODUCTION TO REGULAR EXPRESSIONS

A regular expression works by matching a String against a template or pattern (a Pattern object in Java), and in its simplest form, returning a boolean to say "yes, the string does look like the pattern" or "no, that doesn't match".

Regular expressions have been around for many years. They became widely known through Unix utilities such as "grep" (the name comes from the ed editor command g/re/p, "globally search for a regular expression and print"), and are now supported by most modern languages. They are, however, a mystery to many people. That string of strange looking characters in our example above:

"^\\S+@\\S+$"

would be enough to put many people off.

If you're familiar with grep, or have programmed in Perl, PHP, awk or Tcl, you'll have come across regular expressions already, at least to some extent. Beware. Each language has its own regular expression engine (PHP has two!); although the basics are the same, more advanced regular expressions differ from one language to another.

In the early days of Java, a regular expression engine was written by Jonathan Locke, and donated to the Apache Software Foundation; you can download it from:
http://jakarta.apache.org/regexp/index.html
As of release 1.4 of Java, though, there's a standard package
java.util.regex
that's shipped with the JRE, and that's what we'll look at in this module.

THE ELEMENTS OF A REGULAR EXPRESSION

A regular expression match is a "woolly match". When you don't want to ask "does A equal B?" but rather "does A look like B?", then you're probably looking for a regular expression.

Does this seem a bit alien to the world of computers? Are you expecting to be giving precise programming instructions and getting a bit worried when I tell you that the match is woolly or fuzzy? Fear not; it's up to the programmer to define every element of the fuzzy-ness.

Let's have a look at our crude email address matcher. What does an email address look like?

It STARTS WITH
                A NON-SPACE CHARACTER
                (and there's ONE OR MORE OF THOSE).
        That's followed by
                LITERALLY AN @ CHARACTER
        then A NON-SPACE CHARACTER
                (and there's ONE OR MORE OF THOSE)
        and that's then
                THE END OF THE STRING

If you read it carefully, the description is reasonable enough, even though I have laid it out in an odd way. That's so that you can see in a moment how we translate this description in English into a regular expression.

Whether we read it as English or as a regular expression, the description above contains four types of element:

-- Assertions ( "Starts With" and "Ends With")
-- Literal characters ( "Literally an @")
-- Any character from a group ( "a non-space character")
-- Counts (1 or more of those)
and those are the basic element groupings in a regular expression.

Translating, we'll add the regular expression down the right hand side:
        It STARTS WITH ^
                A NON-SPACE CHARACTER \S
                (and there's ONE OR MORE OF THOSE). +
        That's followed by
                LITERALLY AN @ CHARACTER @
        then A NON-SPACE CHARACTER \S
                (and there's ONE OR MORE OF THOSE) +
        and that's then
                THE END OF THE STRING. $

All we then need to do is combine it into a single string, and add extra \ characters to ensure that the \ of \S gets past the Java language and into the regex class, thus:

"^\\S+@\\S+$"

Within Java, there are three stages to regular expression matching. Firstly, you define a Pattern object against which you can do the matching, and that typically takes a String parameter. Chances are you'll have a number of matches to do using the same pattern, so you don't want your Java to be slowed down interpreting its curious input string every time. So:

Pattern email = Pattern.compile("^\\S+@\\S+$");

doesn't actually do any matching; it runs the regular expression compiler and prepares to do the matching using the object called email as the pattern or template. The second stage is to create a Matcher, which ties the compiled pattern to one particular incoming String (and we happen to do this within a loop):

Matcher fit = email.matcher(args[j]);

That creates an object of type Matcher, but still no matching has been done; that happens only when you call one of the Matcher's matching methods.

The final stage is to run the match and get the result. In this first instance, we're simply interested in knowing if the match succeeded or not, so we'll use the boolean matches method to find out:

if (fit.matches()) {

EXPANDING THE POWER OF REGULAR EXPRESSIONS

There are complete books on regular expressions, and on our Perl and PHP courses we spend up to half a day studying them. We won't go quite so far in this module, but we will look at each of the elements of the regular expression handler in Java and show you the power and flexibility it gives you. It's not yet a "core Java" subject, but in a short time it may be!

LITERAL CHARACTERS

If you use most characters within a regular expression, then they are matched exactly. For example:

Pattern feline = Pattern.compile("cat");

sets up a pattern which will be found in any String that contains the letters c-a-t in that order somewhere within (when you search with the find method, described later), so it matches
-- cat
-- catalogue
-- The use of regular expressions is vindicated

To match literally a unicode or other character that's special to the String handler, you precede it with a \ in the usual way, thus you may have literals such as:

\n \t \u00a3 \\

To match literally a character that's special to the regular expression engine, you also need to precede the character with a \, but in this case the \ must itself be protected as you require it to be passed on through the double quote handler and reach the compile method. Thus, to match a String that literally contains a + character, you would write:

Pattern adder = Pattern.compile("\\+");
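The following sketch shows the protected + in action, and also what happens if you forget the protection: a + with nothing before it to count is not a legal pattern, so the compile method throws a PatternSyntaxException. The class name LiteralPlus and its two helper methods are made up for this illustration.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class LiteralPlus {

    // True if text contains a literal + character anywhere within it.
    static boolean containsPlus(String text) {
        return Pattern.compile("\\+").matcher(text).find();
    }

    // True if regex compiles; false if the compiler rejects it.
    static boolean isValidRegex(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(containsPlus("2+2"));
        System.out.println(containsPlus("22"));
        System.out.println(isValidRegex("\\+"));  // protected: a literal +
        System.out.println(isValidRegex("+"));    // unprotected: a count with nothing to count
    }
}
```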

ANCHORS AND ASSERTIONS

You'll have noticed that a regular expression that contains only literal characters looks for the given pattern within the string. It does not force a match against the whole String.

If you want to match at the start of a String, start your pattern with a ^ character; if you want to match at the end, conclude your pattern with a $ character. Should you specify both a ^ and a $, then you're looking to match the complete String to your regular expression.

The ^ and $ elements are known as "anchors" as they tie the start and/or the end of the String down; this group as a whole is also known as "assertions" because they don't match any specific characters in the incoming string, they just assert that while the match is running a certain condition must occur at the given point in the match.

Example:
Pattern feline = Pattern.compile("^cat");
matches:
-- cat
-- catalogue
but not:
-- The use of regular expressions is vindicated

Example:
Pattern feline = Pattern.compile("cat$");
matches:
-- cat
but not:
-- catalogue
-- The use of regular expressions is vindicated

Important note:

It might appear that Java regular expressions are anchored by default with both a ^ and a $ character. That is how the matches method that we're using at present works: it succeeds only if the whole String fits the pattern. Alternative methods such as find look for the pattern anywhere within the String, so with them the anchors resume the traditional (but more confusing) "default off" status that they have in other programming languages.
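To see the difference for yourself, here's a minimal sketch that tries the same patterns with both methods. The class name AnchorDemo and the helpers wholeMatch and foundWithin are assumptions made for this illustration.

```java
import java.util.regex.Pattern;

class AnchorDemo {

    // matches: the whole of text must fit the pattern
    static boolean wholeMatch(String regex, String text) {
        return Pattern.compile(regex).matcher(text).matches();
    }

    // find: the pattern may occur anywhere within text
    static boolean foundWithin(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        String[] patterns = { "cat", "^cat", "cat$" };
        String[] samples = { "cat", "catalogue", "vindicated" };
        for (int p = 0; p < patterns.length; p++) {
            for (int s = 0; s < samples.length; s++) {
                System.out.println(patterns[p] + " against " + samples[s]
                    + ": matches=" + wholeMatch(patterns[p], samples[s])
                    + " find=" + foundWithin(patterns[p], samples[s]));
            }
        }
    }
}
```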

CHARACTER GROUPS

With anchors and literal characters, you can look for a String that starts with, contains, ends with, or exactly matches another String. The mechanism is clear enough, but you could (if you think about it) have used methods such as startsWith just as easily. The power of regular expressions really comes into its own when you start adding in character groups.

If you write

[abcdef]

in your regular expression, then you're matching any one character from the list given (a b c d e or f). You can expand this capability further by using a minus sign to specify a character range, thus

[a-z]	any lower case letter
[0-9a-fA-F]	any hexadecimal character

and if you want to match any character except one from a list, you can start the character list with an ^ character, for example:

[^a-z]	any character except a lower case letter
[^%0-9]	any character except a digit or a % character

There are some very common character groups you may want to specify; you could write "any white space character" as:

[ \t\n\r\f\x0B]

but that would get messy really fast, so there are some common groupings available in Java's regular expressions:

\s	any white space character
\d	any digit
\w	any word character (letter, digit, underscore)

If you want any character except one of these, use a capital letter:

\S	any character that is not a white space
\D	any character that is not a digit
\W	any character that is not a word character

Sequences such as \s will be familiar to you if you use Perl's regular expressions, but there are other character groups too; these use a POSIX standard definition of the character groups, but it's extended and the format isn't taken from Perl, nor PHP, nor Tcl nor SQL!

\p{Space}	Alternative to \s for "any white space"
\p{Blank}	Space or tab character
\p{Alpha}	Any letter (upper or lower case)
\p{Graph}	Any visible character
\p{InGreek}	Any character in the Greek Unicode block
\p{Sc}	A currency symbol

You can negate these groups using \P rather than \p, thus:

\P{Graph}	Any character that is not visible

One final grouping, the ultimate group if you like, is the "." (full stop or period) character, which matches any character except a line terminator (and matches those too if you set the DOTALL flag).
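The character groups above can be tried out with a sketch like this one. The class name ClassDemo and the helper looksLike are made up for the illustration; it uses the static Pattern.matches method, which compiles a pattern and tests the whole String against it in one step. The pound sign and the Greek alpha are written as Unicode escapes so that the source file stays plain ASCII.

```java
import java.util.regex.Pattern;

class ClassDemo {

    // True if the whole of text fits the regular expression.
    static boolean looksLike(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        System.out.println(looksLike("[0-9a-fA-F]+", "00A3"));        // hexadecimal characters only
        System.out.println(looksLike("[0-9a-fA-F]+", "00G3"));        // G is not a hex character
        System.out.println(looksLike("[^a-z]", "Q"));                 // anything except lower case
        System.out.println(looksLike("\\d\\d\\s\\w+", "42 dogs"));    // 2 digits, a space, a word
        System.out.println(looksLike("\\p{Sc}\\d+", "\u00a325"));     // currency symbol, then digits
        System.out.println(looksLike("\\p{InGreek}", "\u03b1"));      // a character from the Greek block
        System.out.println(looksLike(".", "\n"));                     // . does not match a new line
    }
}
```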

COUNTS

The fourth main group (after anchors, literal characters, and character groups) is the counts; you use these in regular expressions if you want to give a quantity to a literal character or group, and you add the count character into your pattern directly after the element to which it applies. There are three very common counts:

+	one or more
*	zero or more
?	zero or one

You might find it easier to read these as

+	some
*	perhaps some
?	perhaps a

Remember the example we started this section with?

"^\\S+@\\S+$"

Well, we can now read it through from start to end...
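If you would like to experiment with the three counts, a small sketch along these lines may help. The class name CountDemo, the helper fits and the sample patterns are assumptions made for this illustration; each count applies only to the element directly before it.

```java
import java.util.regex.Pattern;

class CountDemo {

    // True if the whole of text fits the regular expression.
    static boolean fits(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        // ? : perhaps a "u"
        System.out.println(fits("colou?r", "color"));
        System.out.println(fits("colou?r", "colour"));
        System.out.println(fits("colou?r", "colouur"));

        // * : perhaps some "b"s
        System.out.println(fits("ab*c", "ac"));
        System.out.println(fits("ab*c", "abbbc"));

        // + : some "b"s (at least one)
        System.out.println(fits("ab+c", "ac"));
        System.out.println(fits("ab+c", "abbbc"));
    }
}
```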

AN EXAMPLE

Here's a sample program that lets you run a regular expression engine against all the lines from a file. We've really rewritten the "grep" utility in Java, but our handler will take the more powerful regular expressions that Java supports:

import java.util.regex.*;
import java.io.*;

public class Reg2 {

public static void main (String [] args) throws IOException {

        File in = new File(args[1]);
        BufferedReader get = new BufferedReader(
                new FileReader( in ));

        Pattern hunter = Pattern.compile(args[0]);
        String line;
        int lines = 0;
        int matches = 0;
        System.out.print("Looking for "+args[0]);
        System.out.println(" in "+args[1]);

        while ((line = get.readLine()) != null) {
                        lines++;
                        Matcher fit = hunter.matcher(line);

                        if (fit.matches()) {
                                System.out.println (
                                "" + lines +": "+line);
                                matches++;
                        }
                }
        if (matches == 0) {
                System.out.println("No matches in "+lines+" lines");
                }
        }
}

And in use:

$ java Reg2 ".*dog.*" /usr/share/dict/words
Looking for .*dog.* in /usr/share/dict/words
6459: bulldog
6460: bulldogs
13394: dog
13396: dogged
13397: doggedly
13398: doggedness
13399: dogging
13400: doghouse
13401: dogma
13402: dogmas
13403: dogmatic
13404: dogmatism
13405: dogs
$ java Reg2 "[dD][aeiou]gg.*" /usr/share/dict/words
Looking for [dD][aeiou]gg.* in /usr/share/dict/words
11229: dagger
12597: digger
12598: diggers
12599: digging
12600: diggings
13396: dogged
13397: doggedly
13398: doggedness
13399: dogging
$

FLAGS

How did we match to the word "dog"? We wrote ".*dog.*", but alas that would not have matched "Dog" or "DOG" as it's case sensitive. You can pass one or more flags (or'd together) as a second parameter to Pattern.compile. The flags are constants of the Pattern class, and those available include:

CASE_INSENSITIVE	whole match case insensitive
MULTILINE	^ and $ to match at embedded new lines (by default they match as start/end of String)
DOTALL	. to match new line characters (by default it does not)
COMMENTS	White space and # to line end ignored, allowing you to comment your regular expression
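Here's a minimal sketch of the flags at work. The class name FlagDemo, the helper found and the two-line sample text are assumptions made for the illustration; a flags value of 0 means "no flags set".

```java
import java.util.regex.Pattern;

class FlagDemo {

    // True if regex, compiled with the given flags, is found anywhere in text.
    static boolean found(String regex, int flags, String text) {
        return Pattern.compile(regex, flags).matcher(text).find();
    }

    public static void main(String[] args) {
        String twoLines = "passwd: files\nhosts: dns";

        // Case sensitive by default
        System.out.println(found("dog", 0, "BullDOG"));
        System.out.println(found("dog", Pattern.CASE_INSENSITIVE, "BullDOG"));

        // ^ only matches after an embedded new line if MULTILINE is set
        System.out.println(found("^hosts", 0, twoLines));
        System.out.println(found("^hosts", Pattern.MULTILINE, twoLines));

        // . only matches the new line itself if DOTALL is set
        System.out.println(found("files.hosts", 0, twoLines));
        System.out.println(found("files.hosts", Pattern.DOTALL, twoLines));

        // Flags are or'd together with the | operator
        System.out.println(found("^HOSTS",
            Pattern.CASE_INSENSITIVE | Pattern.MULTILINE, twoLines));
    }
}
```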

SPLITTING

If you want to divide an incoming String at a particular regular expression, the split method allows you to do so. It's an alternative to the StringTokenizer, and you can use it without a loop and with a more complex separator.

split returns an array of Strings. An optional additional parameter allows you to specify a limit to the number of strings that you want returned.
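As a quick sketch of the optional limit (the class name SplitDemo, the helper fields and the sample line are made up for this illustration): with a limit of 2 you get the first field on its own, and everything else left unsplit in the second element. A limit of 0 behaves like calling split with no limit at all.

```java
import java.util.regex.Pattern;

class SplitDemo {

    // An optional colon followed by one or more white space characters
    static final Pattern DIVISOR = Pattern.compile(":?\\s+");

    // Splits line at each separator, returning at most limit pieces (0 = no limit).
    static String[] fields(String line, int limit) {
        return DIVISOR.split(line, limit);
    }

    public static void main(String[] args) {
        String line = "passwd:   files nisplus\tnis";

        String[] all = fields(line, 0);
        for (int j = 0; j < all.length; j++) {
            System.out.println("All, part " + j + ": " + all[j]);
        }

        String[] two = fields(line, 2);
        for (int j = 0; j < two.length; j++) {
            System.out.println("Limit 2, part " + j + ": " + two[j]);
        }
    }
}
```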

Here's a data file which has a mixture of spaces (sometimes several of them) and tabs between each field:

passwd: files nisplus nis
shadow: files nisplus nis
group: files nisplus nis
hosts: files dns
bootparams: nisplus [NOTFOUND=return] files
ethers: files
netmasks: files
networks: files
protocols: files nisplus nis
rpc: files
services: files nisplus nis
netgroup: files nisplus nis
publickey: nisplus
automount: files nisplus nis
aliases: files nisplus

And we want to write an application which lets us find a list of all the lookups (the first word on each line) that may be handled by a particular service (the following words).

import java.util.regex.*;
import java.io.*;

public class Reg3 {

public static void main (String [] args) throws IOException {

        File in = new File("confdata");
        BufferedReader get = new BufferedReader(
                new FileReader( in ));

        Pattern hunter = Pattern.compile(args[0],
                        Pattern.CASE_INSENSITIVE);
        Pattern divisor = Pattern.compile(":?\\s+ # any white spaces",
                        Pattern.COMMENTS);

        String line;

        while ((line = get.readLine()) != null) {
                String [] parts = divisor.split(line);
                for (int j=1; j<parts.length; j++) {
                        if (hunter.matcher(parts[j]).matches())
                                System.out.println("Used for "+parts[0]);
                        }
                }
        }
}

And the results:

$ java Reg3 Nis
Used for passwd
Used for shadow
Used for group
Used for protocols
Used for services
Used for netgroup
Used for automount
$ java Reg3 DNS
Used for hosts
$

You'll notice how the separator characters have been stripped out of the array of strings that has been returned – a feature we've used to our benefit to strip off the excess colon on the first field of each line of our incoming data file.

CAPTURING THE STRING THAT MATCHED A PATTERN

The Pattern object is only half of the equation.

We've already made lightweight use of the Matcher object, but it turns out that there's a lot more that we may want to do. Recall our first example of matching email addresses? For sure, it's useful to have the facility that allows us to match against a regular expression and see whether or not we have something of the format of an email address. We may want to go a stage further and save the user name and domain name (the bits before and after the @ character) into separate variables.

Now, when the matching has actually been performed, it's clear that work has been done internally to see which bits of the incoming pattern match which bits of the String that we're matching against; all we need to add is:
-- a way to say "this is a bit that I'm interested in"
-- a way to get back these interesting bits

Firstly, we indicate the "interesting bits" in our regular expression by grouping them in round brackets. Round brackets have a dual function in that a count can also be added directly after the brackets to repeat a pattern. We then use the group method to return the group(s) to us.

import java.util.regex.*;

public class Reg4 {

public static void main (String [] args) {

        Pattern email = Pattern.compile("(\\S+)@(\\S+)");

        Matcher fit = email.matcher(args[0]);

        if (fit.find()) {
                for (int i=0; i<=fit.groupCount(); i++) {
                        System.out.println("We have "+
                                fit.group(i));
                        }
                }
        }
}

Let's run that:

$ java Reg4 "At home, graham@wellho.net but away ..."
We have graham@wellho.net
We have graham
We have wellho.net
$

Note:
-- use of "find" to look within the string
-- use of () capturing brackets for subsequences
-- The whole match is returned as group number 0.

If you call the find method a second and subsequent times on the same Matcher, then you can make a series of successive matches. You'll get a false return when it runs out. Thus, simply by changing "if" to "while" in the previous example, you can look for a whole series of email addresses in a line of text.

$ java Reg5 "Use graham@wellho.net or lisa@wellho.net to reach us"
We have graham@wellho.net
We have graham
We have wellho.net
We have lisa@wellho.net
We have lisa
We have wellho.net
$

Further methods are available to have find start from a particular position, to reset it to look from the start, etc. There are also methods available that will return the start and end positions in the incoming string of the match, rather than the match string itself.
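A short sketch of those further methods follows; the class name PositionDemo and the helper startOfMatchFrom are assumptions made for the illustration. The start and end methods give the position of the first character of the match and the position just after its last character, start with a group number does the same for a single captured group, find with an int parameter searches from that position, and reset puts the Matcher back to the beginning.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PositionDemo {

    static final Pattern EMAIL = Pattern.compile("(\\S+)@(\\S+)");

    // Position of the first match found at or after from, or -1 if there is none.
    static int startOfMatchFrom(String text, int from) {
        Matcher fit = EMAIL.matcher(text);
        return fit.find(from) ? fit.start() : -1;
    }

    public static void main(String[] args) {
        String text = "Use graham@wellho.net or lisa@wellho.net";
        Matcher fit = EMAIL.matcher(text);

        // Report where each match, and the domain within it, sits in the text
        while (fit.find()) {
            System.out.println(fit.group() + " runs from " + fit.start()
                + " up to " + fit.end());
            System.out.println("  domain name starts at " + fit.start(2));
        }

        // After reset, find begins again from the start of the text
        fit.reset();
        if (fit.find()) {
            System.out.println("After reset: " + fit.group());
        }
    }
}
```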

USING REGULAR EXPRESSIONS TO REPLACE ONE STRING BY ANOTHER

There are methods available in the Matcher that will let you replace a matched pattern with a specific string of text. These are replaceFirst and replaceAll. Let's change a phone number from a UK number into a full international one:

import java.util.regex.*;

public class Reg6 {

public static void main (String [] args) {

        Pattern phone = Pattern.compile("\\s0");
        Matcher action = phone.matcher(args[0]);
        String worldwide = action.replaceAll(" +44 (0) ");
        System.out.println(worldwide);

        }
}

Which runs as:

$ java Reg6 "phone 01225 708225 or fax 01225 707126"
phone +44 (0) 1225 708225 or fax +44 (0) 1225 707126
$
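For comparison, here's a hedged sketch of replaceFirst, which changes only the first match, along with a replacement that refers back to captured groups. In the replacement text, $1 and $2 stand for the contents of the first and second pair of capturing brackets (which also means that a literal $ or \ in a replacement needs protecting with a \). The class name ReplaceDemo and the helpers firstOnly and spellOut are made up for this illustration.

```java
import java.util.regex.Pattern;

class ReplaceDemo {

    // Converts only the first phone number in text to international form.
    static String firstOnly(String text) {
        return Pattern.compile("\\s0").matcher(text).replaceFirst(" +44 (0) ");
    }

    // Rewrites each user@domain in text as "user at domain".
    static String spellOut(String text) {
        return Pattern.compile("(\\S+)@(\\S+)").matcher(text).replaceAll("$1 at $2");
    }

    public static void main(String[] args) {
        System.out.println(firstOnly("phone 01225 708225 or fax 01225 707126"));
        System.out.println(spellOut("mail graham@wellho.net now"));
    }
}
```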