
Top 10 Questions for Java Regular Expression

2013-10-31 13:11, 411 views
This post summarizes the top questions asked about Java regular expressions. As they are most frequently asked, you may find that they are also very useful.

1. How to extract numbers from a string?

A common use of regular expressions is to extract all the numbers in a string into a list of integers.

In Java, \d matches any single digit (0-9). Using the predefined character classes whenever possible makes your code easier to read and eliminates errors introduced by malformed character classes. Please refer to Predefined character classes for more details. Note the backslash in \d: if you use an escaped construct within a string literal, you must precede the backslash with another backslash for the string to compile. That is why we need to write \\d.

List<Integer> numbers = new LinkedList<Integer>();
Pattern p = Pattern.compile("\\d+");
Matcher m = p.matcher(str);
while (m.find()) {
    numbers.add(Integer.parseInt(m.group()));
}

2. How to split Java String by newlines?

There are at least three different ways to represent a line break, depending on the operating system the text comes from.
\r represents CR (Carriage Return), which was used in classic Mac OS (before OS X)
\n means LF (Line Feed), used in Unix, Linux, and macOS
\r\n means CR + LF, used in Windows

Therefore, the most straightforward way to split a string str into lines, which handles both LF and CR + LF, is

String[] lines = str.split("\\r?\\n");

But if you don’t want empty lines, you can use the following, which is also my favourite way:

String[] lines = str.split("[\\r\\n]+");

Another option is to split on the line separator of the platform the program runs on, as follows. This only works when the text uses that platform's convention, and remember that you will still get empty lines if two newline characters are placed side by side.

String[] lines = str.split(System.getProperty("line.separator"));
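The first two patterns above can be compared side by side. This is only a minimal sketch: the class name SplitLinesDemo, the helper names, and the sample text are made up for illustration.

```java
import java.util.Arrays;

public class SplitLinesDemo {
    // Splits on LF or CR + LF; empty lines are kept as empty strings.
    static String[] splitKeepEmpty(String text) {
        return text.split("\\r?\\n");
    }

    // Splits on any run of CR/LF characters, so empty lines disappear.
    static String[] splitSkipEmpty(String text) {
        return text.split("[\\r\\n]+");
    }

    public static void main(String[] args) {
        // Sample text mixing Windows and Unix line breaks (illustrative only).
        String text = "one\r\ntwo\n\nthree";
        System.out.println(Arrays.toString(splitKeepEmpty(text))); // [one, two, , three]
        System.out.println(Arrays.toString(splitSkipEmpty(text))); // [one, two, three]
    }
}
```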

3. Importance of Pattern.compile()

A regular expression, specified as a string, must first be compiled into an instance of the Pattern class. The Pattern.compile() method is the only way to create such an instance. A typical invocation sequence is thus

Pattern p = Pattern.compile("a*b");
Matcher matcher = p.matcher("aaaaab");
assert matcher.matches() == true;

Essentially, Pattern.compile() transforms a regular expression into a finite state machine (see Compilers: Principles, Techniques, and Tools (2nd Edition)). All of the state involved in performing a match, however, resides in the matcher. This way, the Pattern p can be reused, and many matchers can share the same pattern.

Matcher anotherMatcher = p.matcher("aab");
assert anotherMatcher.matches() == true;

The Pattern.matches() method is defined as a convenience for when a regular expression is used just once. It still calls compile() implicitly to obtain a Pattern instance, then matches the string against it. Therefore,

boolean b = Pattern.matches("a*b", "aaaaab");

is equivalent to the first code above, though for repeated matches it is less efficient since it does not allow the compiled pattern to be reused.
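To illustrate the reuse described above, here is a small sketch that compiles the pattern once and creates a fresh Matcher for each input. The class name PatternReuseDemo, the constant A_STAR_B, and the sample inputs are assumptions made for this example.

```java
import java.util.regex.Pattern;

public class PatternReuseDemo {
    // Compiled once and shared; each call to matcher() creates independent match state.
    private static final Pattern A_STAR_B = Pattern.compile("a*b");

    // Counts how many inputs match the whole pattern.
    static int countMatches(String[] inputs) {
        int count = 0;
        for (String input : inputs) {
            if (A_STAR_B.matcher(input).matches()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String[] inputs = {"aaaaab", "aab", "b", "abc"};
        System.out.println(countMatches(inputs)); // 3
    }
}
```

Calling Pattern.matches("a*b", input) inside the loop would give the same count, but would recompile the expression on every iteration.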

4. How to escape text for regular expression?

In general, regular expressions use \ to escape constructs, but it is painful to precede the backslash with another backslash for the Java string to compile. There is another way to pass a string literal, such as $5, to a Pattern. Instead of writing \\$5 or [$]5, we can type

Pattern.quote("$5");
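As a rough illustration of the difference, the sketch below searches for the literal $5 with and without Pattern.quote(). The class name QuoteDemo and the helper containsLiteral are hypothetical names chosen for this example.

```java
import java.util.regex.Pattern;

public class QuoteDemo {
    // Returns true if text contains the literal string, with no regex interpretation.
    static boolean containsLiteral(String text, String literal) {
        return Pattern.compile(Pattern.quote(literal)).matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "It costs $5 today";
        // Quoted: "$5" is treated as two ordinary characters.
        System.out.println(containsLiteral(text, "$5")); // true
        // Unquoted: "$" is an end-of-input anchor, so nothing can follow it.
        System.out.println(Pattern.compile("$5").matcher(text).find()); // false
    }
}
```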

5. Why does String.split() need pipe delimiter to be escaped?

String.split() splits a string around matches of the given regular expression. Java regular expressions support special characters, called metacharacters, that affect the way a pattern is matched. | is one such metacharacter; it is used to match one regular expression out of several alternatives. For example, A|B means either A or B. Please refer to Alternation with The Vertical Bar or Pipe Symbol for more details. Therefore, to use | as a literal, you need to escape it by adding \ in front of it, which is written \\| in a Java string.
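A short sketch of the effect, assuming a made-up class name PipeSplitDemo and sample input: the escaped delimiter yields the three fields, while the unescaped one is an empty alternation that matches between characters and so does not.

```java
import java.util.Arrays;

public class PipeSplitDemo {
    // Splits on a literal pipe character.
    static String[] splitOnPipe(String text) {
        return text.split("\\|");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitOnPipe("a|b|c"))); // [a, b, c]
        // Unescaped: the result is not the three fields we wanted.
        System.out.println(Arrays.toString("a|b|c".split("|")));
    }
}
```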

6. How can we match a^n b^n with Java regex?

This is the language of all non-empty strings consisting of some number of a's followed by an equal number of b's, like ab, aabb, and aaabbb. This language is generated by the context-free grammar S → aSb | ab and can be shown to be non-regular.

However, Java regex implementations can recognize more than just regular languages. That is, they are not “regular” by the formal language theory definition. Lookahead combined with a self-referencing group makes this possible. Here I give the final regular expression first, then explain it a little. For a comprehensive explanation, I refer you to How can we match a^n b^n with Java regex.

Pattern p = Pattern.compile("(?x)(?:a(?= a*(\\1?+b)))+\\1");
// true
System.out.println(p.matcher("aaabbb").matches());
// false
System.out.println(p.matcher("aaaabbb").matches());
// false
System.out.println(p.matcher("aaabbbb").matches());
// false
System.out.println(p.matcher("caaabbb").matches());

Instead of explaining the syntax of this complex regular expression, I would rather say a little about how it works.
In the first iteration, it stops at the first a, then looks ahead (after skipping some as by using a*) to see whether there is a b. This is achieved by (?:a(?= a*(\\1?+b))). Because group 1 has not captured anything yet, the optional \\1?+ matches nothing, so \1, the self-reference, captures what the innermost parenthesized group matched, which is a single b in the first iteration.
In the second iteration, the expression stops at the second a, then looks ahead (again skipping as) to see if there is a b. But this time, \\1?+b is actually equivalent to bb, therefore two bs have to be matched. If so, \1 is changed to bb after the second iteration.
In the nth iteration, the expression stops at the nth a and checks whether there are n bs ahead.

In this way, the expression counts the number of as and matches only if they are followed by the same number of bs.

7. How to replace 2 or more spaces with a single space in a string and delete leading spaces only?

String.replaceAll() replaces each substring that matches the given regular expression with the given replacement. “2 or more spaces” can be expressed by the regular expression [ ]+ (strictly, one or more spaces, but replacing a single space with a single space changes nothing). The code below uses [\\s]+ instead, which treats tabs and other whitespace the same way. Note that the solution does not remove leading and trailing whitespace entirely: each run is reduced to a single space. If you would like it deleted, you can add String.trim() to the pipeline.

String line = "  aa bbbbb   ccc     d  ";
// " aa bbbbb ccc d "
System.out.println(line.replaceAll("[\\s]+", " "));
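The question also asks to delete leading spaces only, which the snippet above does not do. One possible sketch is to strip the leading run first and then collapse the remaining runs; the class name SpaceCleanupDemo and the method clean are hypothetical names for this example.

```java
public class SpaceCleanupDemo {
    // Removes leading spaces, then collapses runs of 2 or more spaces into one.
    // Trailing spaces are collapsed but not removed.
    static String clean(String line) {
        return line.replaceAll("^ +", "").replaceAll(" {2,}", " ");
    }

    public static void main(String[] args) {
        String line = "  aa bbbbb   ccc     d  ";
        // Brackets make the remaining trailing space visible.
        System.out.println("[" + clean(line) + "]"); // [aa bbbbb ccc d ]
    }
}
```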

8. How to determine if a number is a prime with regex?

public static void main(String[] args) {
    // false
    System.out.println(prime(1));
    // true
    System.out.println(prime(2));
    // true
    System.out.println(prime(3));
    // true
    System.out.println(prime(5));
    // false
    System.out.println(prime(8));
    // true
    System.out.println(prime(13));
    // false
    System.out.println(prime(14));
    // false
    System.out.println(prime(15));
}

public static boolean prime(int n) {
    return !new String(new char[n]).matches(".?|(..+?)\\1+");
}

The function first generates a string of n characters and checks whether that string matches .?|(..+?)\\1+. If n is prime, matches() returns false and the ! reverses the result.

The first part, .?, makes sure that 0 and 1 are not reported as prime. The magic is in the second part, where a backreference is used. (..+?)\\1+ first matches a group of two or more characters, then repeats that group one or more times with \\1+.

By definition, a prime number is a natural number greater than 1 that has no positive divisors other than 1 and itself. That means if a = n*m with n and m both greater than 1, then a is not prime. n*m can be read as “repeat n m times”, and that is exactly what the regular expression does: it matches n characters by using (..+?), then repeats them until there are m copies by using \\1+. Therefore, if the pattern matches, the number is not prime; otherwise it is. Remember that ! reverses the result.
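To make the “repeat n m times” reading concrete, the following sketch reports the length of the captured group when the second part of the pattern matches. Since ..+? is reluctant, it tries group lengths 2, 3, 4, ... in order, so the captured length is the smallest divisor greater than 1. The class name PrimeRegexDemo and the method smallestFactor are assumptions for this example, not part of the original function.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PrimeRegexDemo {
    // Only the second alternative of the prime test: a group repeated at least twice.
    private static final Pattern COMPOSITE = Pattern.compile("(..+?)\\1+");

    // Returns the length of the captured group (the smallest divisor >= 2),
    // or -1 if the pattern does not match, e.g. when n is prime.
    static int smallestFactor(int n) {
        Matcher m = COMPOSITE.matcher(new String(new char[n]));
        return m.matches() ? m.group(1).length() : -1;
    }

    public static void main(String[] args) {
        System.out.println(smallestFactor(15)); // 3
        System.out.println(smallestFactor(8));  // 2
        System.out.println(smallestFactor(13)); // -1
    }
}
```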

9. How to split a comma-separated string but ignoring commas in quotes?

You have reached the point where regular expressions break down. It is better and neater to write a simple splitter and handle special cases as you wish.

Alternatively, you can mimic the operation of a finite state machine by using a switch statement or if-else. Below is a snippet of code.

public static void main(String[] args) {
    String line = "aaa,bbb,\"c,c\",dd;dd,\"e,e";
    List<String> toks = splitComma(line);
    // Prints the tokens aaa, bbb, "c,c", dd;dd and "e,e on separate lines.
    for (String t : toks) {
        System.out.println("> " + t);
    }
}

private static List<String> splitComma(String str) {
    int start = 0; // index where the current token begins
    List<String> toks = new ArrayList<String>();
    boolean withinQuote = false; // true while inside a double-quoted section
    for (int end = 0; end < str.length(); end++) {
        char c = str.charAt(end);
        switch (c) {
        case ',':
            // A comma ends the token only when it is outside quotes.
            if (!withinQuote) {
                toks.add(str.substring(start, end));
                start = end + 1;
            }
            break;
        case '\"':
            withinQuote = !withinQuote;
            break;
        }
    }
    // Add whatever remains after the last separator.
    if (start < str.length()) {
        toks.add(str.substring(start));
    }
    return toks;
}

10. How to use backreferences in Java Regular Expressions

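Backreferences are another useful feature of Java regular expressions. A backreference such as \1 matches the same text that the corresponding capturing group matched earlier in the pattern.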
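As one small illustration, a backreference can detect a word that is immediately repeated. This is only a sketch: the class name BackreferenceDemo, the method removeRepeatedWords, and the sample sentence are made up for this example.

```java
public class BackreferenceDemo {
    // Collapses an immediately repeated word ("is is") into a single occurrence.
    // (\\w+) captures a word, \\1 requires the same word again, $1 keeps one copy.
    static String removeRepeatedWords(String text) {
        return text.replaceAll("\\b(\\w+)\\s+\\1\\b", "$1");
    }

    public static void main(String[] args) {
        System.out.println(removeRepeatedWords("this is is a test")); // this is a test
    }
}
```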
