Some Questions About Regular Expressions
2011-11-25 15:19
From: http://www.greenend.org.uk/rjk/2002/06/regexp.html
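I came across a problem: write a regular expression that matches strings containing 张三 (Zhang San) but not 李四 (Li Si) or 王五 (Wang Wu). I tried for a long time on Linux without getting anywhere.
In the end I found that GNU grep does not support collating elements. I then wrote some code against GNU regex using EREs, and that does not seem to support collating elements either.
yuanqingyu0123 said that in JSP you can use
'/^[^李四,^王五]*[张三]+[^李四,^王五]*$/g'
and my own attempt, modeled on that, was
'^[^[.李四.],^[.王五.]]*[.张三.]+[^[.李四.],^[.王五.]]*$'
However, I have no environment in which to verify my regular expression. Below I reprint an article.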
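A bracket expression matches a single character from a set, so the attempts above probably do not exclude the whole names: `[^李四,^王五]` rejects any one of the characters 李, 四, 王, 五 (plus the comma and caret), and `[张三]+` accepts either 张 or 三 on its own. In syntaxes that offer negative lookahead, `(?!re)` in the table below, the requirement can be written directly. The following is only a sketch in Python; the helper name `contains_only_zhang` is made up for illustration.

```python
import re

# Hypothetical helper: true when text contains 张三 but neither 李四 nor 王五.
# (?!.*(?:李四|王五)) is a negative lookahead tested at the start of the string;
# re.DOTALL lets "." cross newlines so multi-line input is covered too.
PATTERN = re.compile(r"^(?!.*(?:李四|王五)).*张三", re.DOTALL)

def contains_only_zhang(text: str) -> bool:
    return PATTERN.search(text) is not None

print(contains_only_zhang("今天张三来了"))    # True
print(contains_only_zhang("张三和李四来了"))  # False
print(contains_only_zhang("王五来了"))        # False
```

This relies on lookahead, which POSIX BREs and EREs do not provide, so it is not a drop-in pattern for plain grep.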
Regexp Syntax Summary
This table summarizes the meaning of various strings in different regexp syntaxes. It is intended as a quick reference, rather than a tutorial or specification. Please report any errors.
String | GNU grep | BRE (grep) | ERE (egrep) | GNU Emacs | Perl | Python | Tcl |
. | Any character | Any character except \0 | Any character except \n | Any character | |||
[...] | Bracket Expression | Character Set | Character Class | Bracket Expression | |||
\(re\) | Subexpression | Grouping | |||||
re\{...\} | Match re multiple times | Match re multiple times | |||||
(re) | Subexpression | Grouping | |||||
re{...} | Match re multiple times | Match re multiple times | |||||
re{...}? | Nongreedy {} | ||||||
\digit | Back-reference | ||||||
^ | Start of line | ||||||
$ | End of line | ||||||
re? | re 0 or 1 times | ||||||
re* | re 0 or more times | ||||||
re+ | re one or more times | ||||||
l|r | l or r | l or r | |||||
*? | Non-greedy * | ||||||
+? | Non-greedy + | ||||||
?? | Non-greedy ? | ||||||
\A | Start of string | ||||||
\b | Either end of word | Either end of word | |||||
\B | Not either end of word | Not either end of word | Synonym for \ | ||||
\cC | Any in category C | ||||||
\CC | Any not in category C | ||||||
\C | Any octet | ||||||
\d | Digit | ||||||
\D | Non-digit | ||||||
\G | At pos() | ||||||
\m | Start of word | ||||||
\M | End of word | ||||||
\pproperty \p{property} | Unicode property | ||||||
\Pproperty \P{property} | Not unicode property | ||||||
\sC | Any with syntax C | ||||||
\SC | Any with syntax not C | ||||||
\s | Whitespace | ||||||
\S | Non-whitespace | ||||||
\w | Same as [[:alnum:]] | Same as \sw | Alphanumeric and _ | ||||
\W | Same as [^[:alnum:]] | Same as \Sw | Not alphanumeric or _ | ||||
\X | Combining sequence | ||||||
\y | Either end of word | ||||||
\Y | Not either end of word | ||||||
\Z | End of string/last line | End of string | |||||
\z | End of string | ||||||
\` | Start of buffer/string | ||||||
\' | End of buffer/string | ||||||
\< | Start of word | Start of word | |||||
\> | End of word | End of word | |||||
re\? | re 0 or 1 | ||||||
re\+ | re 1 or more | ||||||
l\|r | l or r | l or r | |||||
(?#text) | Comment, ignored | ||||||
(?modifiers) | Embedded modifiers | ||||||
(?modifiers:re) | Shy grouping + modifiers | ||||||
(?:re) | Shy grouping | ||||||
\(?:...\) | Shy grouping | ||||||
(?=re) | Lookahead | ||||||
(?!re) | Negative lookahead | ||||||
(?<=p) | Lookbehind | ||||||
(?<!p) | Negative lookbehind | ||||||
(?{code}) (??{code}) | Embedded Perl | ||||||
(?>re) | Independent expression | ||||||
(?(cond)re) (?(cond)re|re) | Condition expression | ||||||
(?P<name>re) | Symbolic grouping | ||||||
(?P=name) | Symbolic backref | ||||||
String | GNU grep | BRE (grep) | ERE (egrep) | GNU Emacs | Perl | Python | Tcl |
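As a small illustration of the non-greedy rows in the table (`*?` and `re{...}?`), here is a sketch using Python's `re` module. The sample strings are invented for the example.

```python
import re

text = "<b>bold</b> and <i>italic</i>"

# Greedy * takes as much as it can; non-greedy *? stops at the first ">".
greedy = re.search(r"<.*>", text).group(0)
lazy = re.search(r"<.*?>", text).group(0)
print(greedy)  # <b>bold</b> and <i>italic</i>
print(lazy)    # <b>

# The same idea applies to counted repetition: {2,4}? prefers the minimum.
counted_greedy = re.search(r"a{2,4}", "aaaa").group(0)
counted_lazy = re.search(r"a{2,4}?", "aaaa").group(0)
print(counted_greedy, counted_lazy)  # aaaa aa
```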
Who Uses What?
BRE refers to POSIX "basic regular expressions" and ERE is POSIX "extended regular expressions".
grep is supposed to use BREs, except that grep -E uses EREs. (GNU grep fits some extensions in where POSIX leaves the behaviour unspecified.) egrep uses EREs. grep -F doesn't use regexps at all, of course.
ed uses BREs. ex and vi use BREs but additionally support \< and \> as described above, and use ~ to match the replacement part of the previous substitution.
expr uses BREs with all patterns implicitly anchored at the start.
awk is supposed to use EREs, plus the extra C-style escapes \\, \a, \b, \f, \n, \r, \t, \v with their usual meanings.
sed is supposed to use BREs, plus \n with its usual meaning.
lex is also supposed to use EREs with some extensions: "..." quotes everything inside it (backslash escapes are recognized); an initial <state> matches a start condition; r/x matches r only when followed by x; and {name} matches the value of a substitution symbol. A variety of escape sequences, including the usual C ones, are recognized. Possibly this deserves a new column.
regcomp uses BREs by default but can also use EREs. It has a variety of other options which modify the syntax slightly.
Boost's regex++ supports a variety of syntaxes.
PCRE is almost the same as Perl, though it doesn't support the embedded Perl feature and the man page lists a number of other differences.
Vim has enough differences and extensions that it perhaps deserves a column (or two) to itself.
Subexpressions, Grouping and Back-References
Subexpressions or groups are surrounded by ( and ), or sometimes \( and \). They serve two purposes: firstly they override the precedence rules of other operators, and secondly they "capture" part of the text matched by a regexp. This can then be used later on in the regexp via the \digit syntax (this is called a back-reference) or outside the regexp to extract the appropriate part of a string.
"Shy grouping" has the precedence-overriding feature but not the capturing feature.
"Symbolic grouping" allows groups to be identified by name rather than number.
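The three kinds of grouping can be sketched with Python's `re` module, which uses the `(?:re)`, `(?P<name>re)` and `(?P=name)` forms listed in the table. The sample strings and the group name `quote` are made up for illustration.

```python
import re

# Capturing group plus back-reference: \1 must repeat what group 1 matched.
doubled = re.search(r"\b(\w+) \1\b", "it was was a typo")
print(doubled.group(1))  # was

# Shy (non-capturing) group: overrides precedence but captures nothing.
shy = re.match(r"(?:ab)+(c)", "ababc")
print(shy.groups())  # ('c',)

# Symbolic (named) group and symbolic back-reference.
named = re.match(r"(?P<quote>['\"]).*?(?P=quote)", "'hello' world")
print(named.group(0))  # 'hello'
```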
Match Multiple Times
The syntax of this varies a bit; sometimes you use \{ and \}, and sometimes you use { and }. However the idea is the same:
RE{N} will match RE exactly N times.
RE{N,} will match RE N or more times.
RE{N,M} will match RE between N and M times (inclusive).
It is worth noting that the GNU Grep manual says:
Traditional `egrep' did not support the `{' metacharacter, and some `egrep' implementations support `\{' instead, so portable scripts should avoid `{' in `egrep' patterns and should use `[{]' to match a literal `{'.
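A quick sketch of the three counted forms, using Python's `re` module (which takes the unescaped { and } spelling). The variable names are only for illustration.

```python
import re

exactly_3 = re.compile(r"\d{3}")
three_or_more = re.compile(r"\d{3,}")
two_to_four = re.compile(r"\d{2,4}")

print(bool(exactly_3.fullmatch("123")))        # True
print(bool(exactly_3.fullmatch("1234")))       # False
print(bool(three_or_more.fullmatch("12345")))  # True
print(bool(two_to_four.fullmatch("1")))        # False
print(bool(two_to_four.fullmatch("1234")))     # True

# A literal "{" written as a bracket expression, as the grep manual advises.
literal_brace = re.search(r"[{]", "a{b").group(0)
print(literal_brace)  # {
```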
Bracket Expressions
This refers to expressions in [square brackets], for which POSIX defines a complicated syntax all of their own.
Firstly, if the first character after the [ is a ^ (caret) then the sense of the match is reversed.
The rest of the bracket expression consists of a sequence of elements selected from the following list. The bracket expression as a whole matches any character (or character sequence) that is matched by at least one of them (or is matched by none of them, if an initial ^ was used).
1. Collating symbols. These look like [.element.], where element is a collating element (i.e. a symbolic name for a multi-character string), and match the value of the collating element in the current locale. This doesn't seem to work in GNU grep.
2. Equivalence classes. These look like [=element=], where element is a collating element. They match any collating element (single or multiple characters) which has the same primary weight as element, i.e. if they appear in the same place in the current locale's collation sequence. This doesn't seem to work in GNU grep.
3. Character classes. These look like [:class:], where class is the name of the character class to match. The following character classes exist in all locales:
[:alnum:] [:alpha:] [:blank:] [:cntrl:] [:digit:] [:graph:] [:lower:] [:print:] [:punct:] [:space:] [:upper:] [:xdigit:]
4. Range expressions. These look like start-end where start and end are either single characters or collating symbols. The behaviour is only specified in the POSIX locale, where they match all the characters between start and end inclusive.
5. Single characters. These match themselves.
To include a ], put it immediately after the opening [ or [^; if it occurs later it will close the bracket expression. The hyphen (-) is not treated as a range separator if it appears first or last, or as the endpoint of a range.
Emacs "character sets" are similar to bracket expressions, except that collating symbols, equivalence classes and character classes aren't supported.
Perl "character classes" are also similar. They support POSIX character class syntax (argh, confusing names!) and recognize, but don't support, collating symbols or equivalence classes.
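The rules about ^, ] and - can be tried out in Python, with the caveat that Python's `re` module is not a POSIX implementation: it follows the same placement rules for these three characters but does not implement [.element.], [=element=] or [:class:]. A minimal sketch with invented sample strings:

```python
import re

# A leading ^ reverses the sense of the set.
negated = re.findall(r"[^abc]", "aXbYc")
print(negated)  # ['X', 'Y']

# "]" placed first (or right after ^) is a literal member of the set.
bracket_first = re.findall(r"[]x]", "a]x")
bracket_negated = re.findall(r"[^]x]", "a]x")
print(bracket_first, bracket_negated)  # [']', 'x'] ['a']

# "-" in the middle forms a range; placed last it is a literal hyphen.
as_range = re.findall(r"[a-c]", "a-bd")
as_literal = re.findall(r"[ac-]", "a-bd")
print(as_range, as_literal)  # ['a', 'b'] ['a', '-']
```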
GNU Grep and .
GNU Grep has slightly strange handling of . and newlines.
Firstly, the manual says that . matches "any single character". Superficially it appears not to match the newline character:
$ echo | grep .
$
The outcome is actually in keeping with standard and traditional behaviour for grep, where the newline is not included in the text to be matched. But that doesn't appear to be quite what's going on with the GNU version, as explicitly searching for a newline does produce a match:
$ echo | perl -e 'exec("/usr/bin/grep","\n");'

$
So is there a newline to match against or not?
The other case to consider is when the -z or --null-data option is used. In that case, . definitely does match a newline, exactly as the manual says:
$ perl -e 'print "\n\0";' | grep -z . | od -tx1
0000000 0a 00
0000002
$
Perl Variations
. and newlines
The /s modifier changes the meaning of . to match any character including \n.
Anchors
The /m modifier causes ^ and $ to match at the start and end of any line within the subject string rather than just the start and end of the subject string.
"Lookbehind" Matching
Perl's lookbehind matches, i.e. (?<=p) and (?<!p) only work for fixed-width patterns, not arbitrary regular expressions.
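Python's `re` module has close counterparts to these Perl behaviours: the `re.DOTALL` and `re.MULTILINE` flags play the roles of /s and /m, and lookbehind is likewise restricted to fixed-width patterns. The sketch below is in Python rather than Perl, so treat it as an analogy and not as a statement about Perl itself.

```python
import re

text = "one\ntwo"

# Without DOTALL, "." does not match the newline; with it, it does.
dot_default = re.search(r"one.two", text)
dot_all = re.search(r"one.two", text, re.DOTALL)
print(dot_default is None, dot_all is not None)  # True True

# Without MULTILINE, ^ and $ anchor only at the ends of the whole string.
lines_default = re.findall(r"^\w+$", text)
lines_multi = re.findall(r"^\w+$", text, re.MULTILINE)
print(lines_default, lines_multi)  # [] ['one', 'two']

# Fixed-width lookbehind compiles; a variable-width one is rejected.
fixed = re.search(r"(?<=ab)c", "abc").group(0)
try:
    re.compile(r"(?<=a+)b")
    variable_width_ok = True
except re.error:
    variable_width_ok = False
print(fixed, variable_width_ok)  # c False
```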
Sources
The POSIX regular expression specification can be found at http://www.opengroup.org/onlinepubs/007904975/basedefs/xbd_chap09.html.
For the regexp languages used by particular programs, I looked at the documentation for GNU Grep 2.4.2; GNU Emacs 21.2.1; Perl 5.6.1; Python 2.2.1; and Tcl 8.3.3.
All errors are my own!