
Some Questions About Regular Expressions

2011-11-25 15:19
From: http://www.greenend.org.uk/rjk/2002/06/regexp.html

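I came across a problem: write a regular expression that matches strings containing 张三 (Zhang San) but not 李四 (Li Si) or 王五 (Wang Wu). I spent a long time trying under Linux without success.

In the end I found that GNU grep does not support collating elements. I then wrote some code using GNU regex with ERE, and it did not seem to support collating elements either.

yuanqingyu0123 said that in JSP one can use

'/^[^李四,^王五]*[张三]+[^李四,^王五]*$/g'

and what I wrote, modeled on that, was

'^[^[.李四.],^[.王五.]]*[.张三.]+[^[.李四.],^[.王五.]]*$'

However, I have no environment in which to verify my regexp. An article on the subject is reposted below.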
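
A note on the two patterns above: a bracket expression matches one character (or one collating element) at a time, so [^李四,^王五] excludes the individual characters 李, 四, 王, 五, the comma and ^, not the names as whole strings. In engines that support the negative lookahead (?!re) listed in the table below, the requirement can be approached differently. The following is a minimal Python sketch, not something from the original discussion; the helper name mentions_only_zhang and the sample strings are hypothetical, and it assumes the check applies to the whole string.

```python
import re

# Assumption: "contains 张三 but neither 李四 nor 王五" is checked on the whole string.
# (?!.*(?:李四|王五)) fails at the start if either forbidden name occurs anywhere;
# .*张三 then requires the wanted name. re.S lets "." cross newlines.
WANTED = re.compile(r"^(?!.*(?:李四|王五)).*张三", re.S)

# The bracket-based attempt quoted above, kept only for comparison.
BRACKETS = re.compile(r"^[^李四,^王五]*[张三]+[^李四,^王五]*$")


def mentions_only_zhang(text: str) -> bool:
    """Hypothetical helper: True if text has 张三 and no 李四 or 王五."""
    return WANTED.search(text) is not None


if __name__ == "__main__":
    for sample in ["张三来了", "张三和李四", "只有王五", "三"]:
        # Columns: sample, lookahead result, bracket-pattern result
        print(sample, mentions_only_zhang(sample), bool(BRACKETS.search(sample)))
```

The last sample shows why the bracket version is not equivalent: it accepts a lone 三, because [张三] means "either 张 or 三".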


Regexp Syntax Summary

This table summarizes the meaning of various strings in different regexp syntaxes. It is intended as a quick reference, rather than a tutorial or specification. Please report any errors.
String                        Meaning (varies among GNU grep, BRE (grep), ERE (egrep), GNU Emacs, Perl, Python, Tcl)
.                             Any character; any character except \0; any character except \n
[...]                         Bracket expression; character set; character class
\(re\)                        Subexpression; grouping
re\{...\}                     Match re multiple times
(re)                          Subexpression; grouping
re{...}                       Match re multiple times
re{...}?                      Non-greedy {}
\digit                        Back-reference
^                             Start of line
$                             End of line
re?                           re 0 or 1 times
re*                           re 0 or more times
re+                           re one or more times
l|r                           l or r
*?                            Non-greedy *
+?                            Non-greedy +
??                            Non-greedy ?
\A                            Start of string
\b                            Either end of word
\B                            Not either end of word; synonym for \
\cC                           Any in category C
\CC                           Any not in category C
\C                            Any octet
\d                            Digit
\D                            Non-digit
\G                            At pos()
\m                            Start of word
\M                            End of word
\pproperty, \p{property}      Unicode property
\Pproperty, \P{property}      Not Unicode property
\sC                           Any with syntax C
\SC                           Any with syntax not C
\s                            Whitespace
\S                            Non-whitespace
\w                            Same as [[:alnum:]]; same as \sw; alphanumeric and _
\W                            Same as [^[:alnum:]]; same as \Sw; not alphanumeric or _
\X                            Combining sequence
\y                            Either end of word
\Y                            Not either end of word
\Z                            End of string/last line; end of string
\z                            End of string
\`                            Start of buffer/string
\'                            End of buffer/string
\<                            Start of word
\>                            End of word
re\?                          re 0 or 1
re\+                          re 1 or more
l\|r                          l or r
(?#text)                      Comment, ignored
(?modifiers)                  Embedded modifiers
(?modifiers:re)               Shy grouping + modifiers
(?:re)                        Shy grouping
\(?:...\)                     Shy grouping
(?=re)                        Lookahead
(?!re)                        Negative lookahead
(?<=p)                        Lookbehind
(?<!p)                        Negative lookbehind
(?{code}), (??{code})         Embedded Perl
(?>re)                        Independent expression
(?(cond)re), (?(cond)re|re)   Conditional expression
(?P<name>re)                  Symbolic grouping
(?P=name)                     Symbolic backref


Who Uses What?

BRE refers to POSIX "basic regular expressions" and ERE is POSIX "extended regular expressions".
grep is supposed to use BREs, except that grep -E uses EREs. (GNU grep fits some extensions in where POSIX leaves the behaviour unspecified.) egrep uses EREs. grep -F doesn't use regexps at all, of course.
ed uses BREs. ex and vi use BREs but additionally support \< and \> as described above, and use ~ to match the replacement part of the previous substitution.
expr uses BREs with all patterns implicitly anchored at the start.
awk is supposed to use EREs, plus the extra C-style escapes \\, \a, \b, \f, \n, \r, \t, \v with their usual meanings.
sed is supposed to use BREs, plus \n with its usual meaning.
lex is also supposed to use EREs with some extensions: "..." quotes everything inside it (backslash escapes are recognized); an initial <state> matches a start condition; r/x matches r only when followed by x; and {name} matches the value of a substitution symbol. A variety of escape sequences, including the usual C ones, are recognized. Possibly this deserves a new column.
regcomp uses BREs by default but can also use EREs. It has a variety of other options which modify the syntax slightly.
Boost's regex++ supports a variety of syntaxes.
PCRE is almost the same as Perl, though it doesn't support the embedded Perl feature and the man page lists a number of other differences.
Vim has enough differences and extensions that it perhaps deserves a column (or two) to itself.


Subexpressions, Grouping and Back-References

Subexpressions or groups are surrounded by ( and ), or sometimes \( and \). They serve two purposes: firstly they override the precedence rules of other operators, and secondly they "capture" part of the text matched by a regexp. The captured text can then be used later on in the regexp via the \digit syntax (this is called a back-reference) or outside the regexp to extract the appropriate part of a string.
"Shy grouping" has the precedence-overriding feature but not the capturing feature.
"Symbolic grouping" allows groups to be identified by name rather than number.

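The three kinds of grouping described above can be tried out in Python, whose re module accepts the (re), (?:re), (?P<name>re) and (?P=name) forms from the table. This is only an illustrative sketch; the pattern names and sample strings are made up for the example.

```python
import re

# Capturing group with a back-reference: \1 must repeat the text of group 1.
doubled = re.compile(r"\b(\w+) \1\b")

# Shy (non-capturing) grouping: (?:ab)+ repeats "ab" but only (c) is captured.
shy = re.compile(r"(?:ab)+(c)")

# Symbolic grouping and symbolic back-reference: the closing quote must be
# the same character that the group named "quote" matched.
quoted = re.compile(r"(?P<quote>['\"])(?P<body>.*?)(?P=quote)")

if __name__ == "__main__":
    # Extract the repeated word via its group number.
    print(doubled.search("this is is a test").group(1))
    # Number of capturing groups, and the text of the only one.
    print(shy.groups, shy.match("ababc").group(1))
    # Extract by name instead of by number.
    print(quoted.search("say 'hi' now").group("body"))
```

Using names instead of numbers means the pattern can gain or lose other groups without the extraction code having to be renumbered.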

Match Multiple Times

The syntax of this varies a bit; sometimes you use \{ and \}, and sometimes you use { and }. However the idea is the same:

RE{N} will match RE exactly N times.
RE{N,} will match RE N or more times.
RE{N,M} will match RE between N and M times (inclusive).
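
The three forms can be checked with a small Python sketch, since Python uses the unescaped { and } spelling. The helper name repeats is hypothetical; it tests whether a pattern matches the whole of a string, so the counts are easy to see.

```python
import re


def repeats(pattern: str, text: str) -> bool:
    """Hypothetical helper: True if pattern matches the whole of text."""
    return re.fullmatch(pattern, text) is not None


if __name__ == "__main__":
    # RE{N}: exactly N times.
    print(repeats(r"a{3}", "aaa"), repeats(r"a{3}", "aaaa"))
    # RE{N,}: N or more times.
    print(repeats(r"a{2,}", "a"), repeats(r"a{2,}", "aaaaa"))
    # RE{N,M}: between N and M times, inclusive.
    print(repeats(r"a{2,4}", "aaaa"), repeats(r"a{2,4}", "aaaaa"))
    # The non-greedy form re{...}? from the table stops at the minimum.
    print(re.match(r"a{2,4}", "aaaa").group(), re.match(r"a{2,4}?", "aaaa").group())
```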

It is worth noting that the GNU Grep manual says:
Traditional `egrep' did not support the `{' metacharacter, and some `egrep' implementations support `\{' instead, so portable scripts should avoid `{' in `egrep' patterns and should use `[{]' to match a literal `{'.


Bracket Expressions

This refers to expressions in [square brackets], for which POSIX defines a complicated syntax all of their own.
Firstly, if the first character after the [ is a ^ (caret) then the sense of the match is reversed.
The rest of the bracket expression consists of a sequence of elements selected from the following list. The bracket expression as a whole matches any character (or character sequence) that is matched by at least one of them (or is matched by none of them, if an initial ^ was used).
1. Collating symbols. These look like [.element.], where element is a collating element (i.e. a symbolic name for a multi-character string), and match the value of the collating element in the current locale. This doesn't seem to work in GNU grep.
2. Equivalence classes. These look like [=element=], where element is a collating element. They match any collating element (single or multiple characters) which has the same primary weight as element, i.e. if they appear in the same place in the current locale's collation sequence. This doesn't seem to work in GNU grep.
3. Character classes. These look like [:class:], where class is the name of the character class to match. The following character classes exist in all locales:
[:alnum:] [:alpha:] [:blank:] [:cntrl:] [:digit:] [:graph:] [:lower:] [:print:] [:punct:] [:space:] [:upper:] [:xdigit:]
4. Range expressions. These look like start-end where start and end are either single characters or collating symbols. The behaviour is only specified in the POSIX locale, where they match all the characters between start and end inclusive.
5. Single characters. These match themselves.
To include a ], put it immediately after the opening [ or [^; if it occurs later it will close the bracket expression. The hyphen (-) is not treated as a range separator if it appears first or last, or as the endpoint of a range.
Emacs "character sets" are similar to bracket expressions, except that collating symbols, equivalence classes and character classes aren't supported.
Perl "character classes" are also similar. They support POSIX character class syntax (argh, confusing names!) and recognize, but don't support, collating symbols or equivalence classes.
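
The placement rules for ^, ] and - can be seen in a short Python sketch. Python's re module is used here only for illustration: it follows the same placement conventions, but it does not implement collating symbols, equivalence classes or the [:class:] names, so only ranges and single characters appear. The pattern names are made up for the example.

```python
import re

# "]" placed first in the brackets is a literal member of the set.
CLOSERS = re.compile(r"[])}]")

# A hyphen placed last is literal, so this set is "digits or -".
PHONE_CHARS = re.compile(r"[0-9-]+")

# A leading "^" reverses the sense: any character that is not a digit.
NON_DIGIT = re.compile(r"[^0-9]")

if __name__ == "__main__":
    # Find every closing bracket character.
    print(CLOSERS.findall("f(x)[0]{}"))
    # Find runs made of digits and hyphens.
    print(PHONE_CHARS.findall("tel 555-0100"))
    # Delete everything except digits.
    print(NON_DIGIT.sub("", "tel 555-0100"))
```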


GNU Grep and .

GNU Grep has slightly strange handling of . and newlines.
Firstly, the manual says that . matches "any single character". Superficially it appears not to match the newline character:
$ echo | grep .
$

The outcome is actually in keeping with standard and traditional behaviour for grep, where the newline is not included in the text to be matched. But that doesn't appear to be quite what's going on with the GNU version, as explicitly searching for a newline does produce a match:
$ echo | perl -e 'exec("/usr/bin/grep","\n");'

$

So is there a newline to match against or not?
The other case to consider is when the -z or --null-data option is used. In that case, . definitely does match a newline, exactly as the manual says:
$ perl -e 'print "\n\0";' | grep -z . | od -tx1
0000000 0a 00
0000002
$


Perl Variations

. and newlines

The /s modifier changes the meaning of . to match any character including \n.

Anchors

The /m modifier causes ^ and $ to match at the start and end of any line within the subject string, rather than just the start and end of the subject string.
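
Python's re module has analogous flags, re.S (DOTALL) and re.M (MULTILINE), so the effect of both modifiers can be sketched there. This illustrates the idea in Python and is not a statement about every detail of Perl's behaviour; the helper names and the sample text are hypothetical.

```python
import re

TEXT = "one\ntwo"


def dot_spans_newline(flags: int = 0) -> bool:
    """True if 'one.two' matches TEXT, i.e. '.' matched the newline."""
    return re.search(r"one.two", TEXT, flags) is not None


def whole_line_words(flags: int = 0) -> list:
    """Words that fill the span between ^ and $ under the given flags."""
    return re.findall(r"^\w+$", TEXT, flags)


if __name__ == "__main__":
    # Without re.S the dot refuses the newline; with it the match succeeds.
    print(dot_spans_newline(), dot_spans_newline(re.S))
    # Without re.M the anchors only see the ends of the whole string.
    print(whole_line_words(), whole_line_words(re.M))
```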

"Lookbehind" Matching

Perl's lookbehind matches, i.e. (?<=p) and (?<!p) only work for fixed-width patterns, not arbitrary regular expressions.
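
Python's re module has a similar fixed-width restriction on lookbehind, which makes it convenient for a quick sketch of what is and is not accepted. The helper name compiles and the currency strings are made up for the example.

```python
import re


def compiles(pattern: str) -> bool:
    """Hypothetical helper: True if the re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# Fixed-width lookbehind: digits preceded by exactly "USD ".
AFTER_USD = re.compile(r"(?<=USD )\d+")

# Fixed-width negative lookbehind: whole numbers not preceded by "USD ".
NOT_AFTER_USD = re.compile(r"(?<!USD )\b\d+")

if __name__ == "__main__":
    print(AFTER_USD.findall("USD 42, EUR 7"))
    print(NOT_AFTER_USD.findall("USD 42, EUR 7"))
    # A variable-width lookbehind such as "USD +" is rejected at compile time.
    print(compiles(r"(?<=USD )\d+"), compiles(r"(?<=USD +)\d+"))
```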


Sources

The POSIX regular expression specification can be found at http://www.opengroup.org/onlinepubs/007904975/basedefs/xbd_chap09.html.
For the regexp languages used by particular programs, I looked at the documentation for GNU Grep 2.4.2; GNU Emacs 21.2.1; Perl 5.6.1; Python 2.2.1; and Tcl 8.3.3.
All errors are my own!