Java How to Program, 6/e

© 2005
pages: 1576

This tutorial presents Java's powerful regular-expression processing capabilities using class Pattern, class Matcher and class String's matches method. It is intended for students and developers who are familiar with basic Java string-processing techniques.

Download the code for this tutorial here.

[Note: This tutorial is an excerpt (Section 29.7) of Chapter 29, Strings, Characters and Regular Expressions, from our textbook Java How to Program, 6/e. This tutorial may refer to other chapters or sections of the book that are not included here. Permission Information: Deitel, Harvey M. and Paul J., JAVA HOW TO PROGRAM, ©2005, pp.1378-1387. Electronically reproduced by permission of Pearson Education, Inc., Upper Saddle River, New Jersey.]

29.7 Regular Expressions, Class Pattern and Class Matcher (Continued)
Figure 29.20 validates user input. Line 9 validates the first name. To match a set of characters that does not have a predefined character class, use square brackets, []. For example, the pattern "[aeiou]" matches a single character that is a vowel. Ranges of characters are represented by placing a dash (-) between two characters. In the example, "[A-Z]" matches a single uppercase letter. If the first character in the brackets is "^", the expression accepts any character other than those indicated. Note, however, that "[^Z]" is not the same as "[A-Y]": "[A-Y]" matches only the uppercase letters A–Y, whereas "[^Z]" matches any character other than capital Z, including lowercase letters and non-letters such as the newline character. Ranges in character classes are determined by the letters' integer values. In this example, "[A-Za-z]" matches all uppercase and lowercase letters. The range "[A-z]" matches all letters and also matches those characters (such as [ and \) with an integer value between uppercase Z and lowercase a (for more information on integer values of characters, see Appendix B, ASCII Character Set). Like predefined character classes, character classes delimited by square brackets match a single character in the search object.
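The character classes discussed above can be tried directly with String method matches. The following is only a minimal sketch; the class name CharacterClassDemo and the helper matchesClass are hypothetical names chosen for illustration and do not appear in Fig. 29.20.

```java
// Sketch only: CharacterClassDemo and matchesClass are illustrative names,
// not part of the book's figures.
class CharacterClassDemo
{
   // returns true if the entire string s matches the given character class
   static boolean matchesClass( String s, String characterClass )
   {
      return s.matches( characterClass );
   }

   public static void main( String[] args )
   {
      System.out.println( matchesClass( "e", "[aeiou]" ) );   // true: e is a vowel
      System.out.println( matchesClass( "Q", "[A-Z]" ) );     // true: uppercase letter
      System.out.println( matchesClass( "q", "[A-Z]" ) );     // false: lowercase letter
      System.out.println( matchesClass( "q", "[^Z]" ) );      // true: anything but Z
      System.out.println( matchesClass( "\n", "[^Z]" ) );     // true: newline is not Z
      System.out.println( matchesClass( "q", "[A-Y]" ) );     // false: not in A-Y
      System.out.println( matchesClass( "_", "[A-z]" ) );     // true: '_' lies between 'Z' and 'a'
      System.out.println( matchesClass( "_", "[A-Za-z]" ) );  // false: letters only
   }
}
```

The last two calls show why "[A-Za-z]" is usually the safer way to say "any letter".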

In line 9, the asterisk after the second character class indicates that any number of letters can be matched. In general, when the regular-expression operator "*" appears in a regular expression, the application attempts to match zero or more occurrences of the subexpression immediately preceding the "*". Operator "+" attempts to match one or more occurrences of the subexpression immediately preceding "+". So both "A*" and "A+" will match "AAA", but only "A*" will match an empty string.
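As a quick check of the difference between "*" and "+", the sketch below tests both operators against "AAA" and the empty string. The class name StarPlusDemo is hypothetical, and the first-name pattern "[A-Z][a-zA-Z]*" is an assumption that fits the description of line 9; Fig. 29.20 itself is not reproduced in this excerpt.

```java
// Sketch only: StarPlusDemo is an illustrative name, and FIRST_NAME is an
// assumed pattern that fits the prose (one uppercase letter, then any
// number of letters).
class StarPlusDemo
{
   static final String FIRST_NAME = "[A-Z][a-zA-Z]*";

   static boolean looksLikeFirstName( String name )
   {
      return name.matches( FIRST_NAME );
   }

   public static void main( String[] args )
   {
      System.out.println( "AAA".matches( "A*" ) ); // true
      System.out.println( "AAA".matches( "A+" ) ); // true
      System.out.println( "".matches( "A*" ) );    // true: zero occurrences allowed
      System.out.println( "".matches( "A+" ) );    // false: at least one A required

      System.out.println( looksLikeFirstName( "Jane" ) ); // true
      System.out.println( looksLikeFirstName( "J" ) );    // true: zero trailing letters
      System.out.println( looksLikeFirstName( "jane" ) ); // false: no leading capital
   }
}
```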

If method validateFirstName returns true (line 29), the application attempts to validate the last name (line 31) by calling validateLastName (lines 13–16 of Fig. 29.20). The regular expression to validate the last name matches any number of letters split by spaces, apostrophes or hyphens. Line 33 validates the address by calling method validateAddress (lines 19–23 of Fig. 29.20). The first character class matches any digit one or more times (\\d+). Note that two \ characters are used because \ normally starts an escape sequence in a string. So \\d in a Java string represents the regular expression pattern \d. Then we match one or more whitespace characters (\\s+). The character "|" allows a match of the expression to its left or to its right. For example, "Hi (John|Jane)" matches both "Hi John" and "Hi Jane". The parentheses are used to group parts of the regular expression. In this example, the left side of | matches a single word, and the right side matches two words separated by any amount of white space. So the address must contain a number followed by one or two words. Therefore, "10 Broadway" and "10 Main Street" are both valid addresses in this example. The city (lines 26–29 of Fig. 29.20) and state (lines 32–35 of Fig. 29.20) methods also match any word of at least one character or, alternatively, any two words of at least one character if the words are separated by a single space. This means both Waltham and West Newton would match.

The asterisk (*) and plus (+) are formally called quantifiers. Figure 29.22 lists all the quantifiers. We have already discussed how the asterisk (*) and plus (+) quantifiers work. All quantifiers affect only the subexpression immediately preceding the quantifier. Quantifier question mark (?) matches zero or one occurrences of the expression that it quantifies. A set of braces containing one number ({n}) matches exactly n occurrences of the expression it quantifies. We demonstrate this quantifier to validate the zip code in Fig. 29.20 at line 40. Including a comma after the number enclosed in braces ({n,}) matches at least n occurrences of the quantified expression. A set of braces containing two numbers ({n,m}) matches between n and m occurrences of the expression that it quantifies. Quantifiers may be applied to patterns enclosed in parentheses to create more complex regular expressions.
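To make the escaped classes \\d and \\s and the "|" operator described above concrete, here is a small sketch. The class AlternationDemo, the method isValidAddress and the exact address pattern are assumptions reconstructed from the prose (a number, whitespace, then one or two words); the actual expression in Fig. 29.20 is not shown in this excerpt and may differ.

```java
// Sketch only: the names and the ADDRESS pattern are assumptions based on
// the description in the text, not a copy of Fig. 29.20.
class AlternationDemo
{
   // a number, whitespace, then either one word or two words
   static final String ADDRESS =
      "\\d+\\s+([a-zA-Z]+|[a-zA-Z]+\\s+[a-zA-Z]+)";

   static boolean isValidAddress( String address )
   {
      return address.matches( ADDRESS );
   }

   public static void main( String[] args )
   {
      // parentheses limit the alternation to the two names
      System.out.println( "Hi John".matches( "Hi (John|Jane)" ) ); // true
      System.out.println( "Hi Jane".matches( "Hi (John|Jane)" ) ); // true
      System.out.println( "Hi Joe".matches( "Hi (John|Jane)" ) );  // false

      System.out.println( isValidAddress( "10 Broadway" ) );    // true
      System.out.println( isValidAddress( "10 Main Street" ) ); // true
      System.out.println( isValidAddress( "Main Street" ) );    // false: no number
   }
}
```

Without the parentheses, "Hi John|Jane" would match either "Hi John" or just "Jane", because "|" applies to everything on each side of it.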

Quantifier   Matches
*            Matches zero or more occurrences of the pattern.
+            Matches one or more occurrences of the pattern.
?            Matches zero or one occurrences of the pattern.
{n}          Matches exactly n occurrences.
{n,}         Matches at least n occurrences.
{n,m}        Matches between n and m (inclusive) occurrences.
Fig. 29.22 Quantifiers used in regular expressions.
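Each quantifier in Fig. 29.22 can be exercised with a one-line test. The sketch below is illustrative only; the class name QuantifierDemo, the helper test and the sample patterns are assumptions chosen for this example.

```java
// Sketch only: QuantifierDemo and its sample patterns are illustrative.
class QuantifierDemo
{
   // returns true if the entire input matches the regular expression
   static boolean test( String regex, String input )
   {
      return input.matches( regex );
   }

   public static void main( String[] args )
   {
      // ? : zero or one occurrence of the preceding 'u'
      System.out.println( test( "colou?r", "color" ) );   // true
      System.out.println( test( "colou?r", "colour" ) );  // true

      // {n} : exactly n occurrences
      System.out.println( test( "\\d{5}", "12345" ) );    // true
      System.out.println( test( "\\d{5}", "1234" ) );     // false

      // {n,} : at least n occurrences
      System.out.println( test( "\\d{2,}", "1" ) );       // false
      System.out.println( test( "\\d{2,}", "123456" ) );  // true

      // {n,m} : between n and m occurrences, inclusive
      System.out.println( test( "\\d{2,4}", "123" ) );    // true
      System.out.println( test( "\\d{2,4}", "12345" ) );  // false

      // a quantifier applied to a parenthesized group
      System.out.println( test( "(ab){2}", "abab" ) );    // true
      System.out.println( test( "(ab){2}", "ab" ) );      // false
   }
}
```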

All of the quantifiers are greedy. This means that they will match as many occurrences as they can as long as the match is still successful. However, if any of these quantifiers is followed by a question mark (?), the quantifier becomes reluctant (sometimes called lazy). It then will match as few occurrences as possible as long as the match is still successful.

The zip code (line 40 in Fig. 29.20) matches a digit five times. This regular expression uses the digit character class and a quantifier with the digit 5 between braces. The phone number (line 46 in Fig. 29.20) matches three digits (the first one cannot be zero) followed by a dash followed by three more digits (again the first one cannot be zero) followed by four more digits.
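The difference between greedy and reluctant quantifiers is easiest to see when searching for the first match inside a longer string. The following sketch uses classes Pattern and Matcher for that purpose; the class GreedyReluctantDemo and the helper firstMatch are hypothetical names introduced only for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: GreedyReluctantDemo and firstMatch are illustrative names.
class GreedyReluctantDemo
{
   // returns the first substring of input that matches regex, or null
   static String firstMatch( String regex, String input )
   {
      Matcher matcher = Pattern.compile( regex ).matcher( input );
      return matcher.find() ? matcher.group() : null;
   }

   public static void main( String[] args )
   {
      // greedy: .* takes as much as it can while still ending at an X
      System.out.println( firstMatch( ".*X", "aXbXc" ) );  // aXbX

      // reluctant: .*? takes as little as it can before an X
      System.out.println( firstMatch( ".*?X", "aXbXc" ) ); // aX

      System.out.println( firstMatch( "\\d+", "12345" ) );  // 12345
      System.out.println( firstMatch( "\\d+?", "12345" ) ); // 1
   }
}
```

Note that with String method matches the whole string must match, so a greedy and a reluctant quantifier accept the same strings there; the distinction matters when locating substrings with Matcher method find.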

String method matches checks whether an entire string conforms to a regular expression. For example, we want to accept "Smith" as a last name, but not "9@Smith#". If only a substring matches the regular expression, method matches returns false.
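The whole-string behavior of matches can be contrasted with Matcher method find, which succeeds if any substring matches. This is a minimal sketch; the class MatchesVersusFind, its method names and the simplified last-name pattern "[a-zA-Z]+" are assumptions made for illustration.

```java
import java.util.regex.Pattern;

// Sketch only: the names and the simplified LETTERS pattern are assumptions.
class MatchesVersusFind
{
   static final String LETTERS = "[a-zA-Z]+";

   // true only if the entire string consists of letters
   static boolean matchesWhole( String s )
   {
      return s.matches( LETTERS );
   }

   // true if any substring consists of letters
   static boolean containsMatch( String s )
   {
      return Pattern.compile( LETTERS ).matcher( s ).find();
   }

   public static void main( String[] args )
   {
      System.out.println( matchesWhole( "Smith" ) );     // true
      System.out.println( matchesWhole( "9@Smith#" ) );  // false: only a substring matches
      System.out.println( containsMatch( "9@Smith#" ) ); // true: "Smith" is found inside
   }
}
```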

Replacing Substrings and Splitting Strings
Sometimes it is useful to replace parts of a string or to split a string into pieces. For this purpose, class String provides methods replaceAll, replaceFirst and split. These methods are demonstrated in Fig. 29.23.

// Fig. 29.23:
// Using methods replaceFirst, replaceAll and split.

public class RegexSubstitution
{
   public static void main( String[] args )
   {
      String firstString = "This sentence ends in 5 stars *****";
      String secondString = "1, 2, 3, 4, 5, 6, 7, 8";

      System.out.printf( "Original String 1: %s\n", firstString );

      // replace '*' with '^'
      firstString = firstString.replaceAll( "\\*", "^" );

      System.out.printf( "^ substituted for *: %s\n", firstString );

      // replace 'stars' with 'carets'
      firstString = firstString.replaceAll( "stars", "carets" );

      System.out.printf(
         "\"carets\" substituted for \"stars\": %s\n", firstString );

      // replace words with 'word'
      System.out.printf( "Every word replaced by \"word\": %s\n\n",
         firstString.replaceAll( "\\w+", "word" ) );

      System.out.printf( "Original String 2: %s\n", secondString );

      // replace first three digits with 'digit'
      for ( int i = 0; i < 3; i++ )
         secondString = secondString.replaceFirst( "\\d", "digit" );

      System.out.printf(
         "First 3 digits replaced by \"digit\" : %s\n", secondString );
      String output = "String split at commas: [";

      String[] results = secondString.split( ",\\s*" ); // split on commas

      for ( String string : results )
         output += "\"" + string + "\", "; // output results

      // remove the extra comma and add a bracket
      output = output.substring( 0, output.length() - 2 ) + "]";
      System.out.println( output );
   } // end main
} // end class RegexSubstitution
 Fig. 29.23  Methods replaceFirst, replaceAll and split.

Original String 1: This sentence ends in 5 stars *****
^ substituted for *: This sentence ends in 5 stars ^^^^^
"carets" substituted for "stars": This sentence ends in 5 carets ^^^^^
Every word replaced by "word": word word word word word word ^^^^^

Original String 2: 1, 2, 3, 4, 5, 6, 7, 8
First 3 digits replaced by "digit" : digit, digit, digit, 4, 5, 6, 7, 8
String split at commas: ["digit", "digit", "digit", "4", "5", "6", "7", "8"]
