New in SAS 9: Using Perl Regular Expressions in the DATA Step
2004-09-22 14:04
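Starting with version 9, SAS supports Perl regular expressions (Perl 5.6.1 syntax), which makes data validation much simpler and more reliable.
Before regular expressions (RE) were available, strings could only be handled with functions such as index, substr, and tranwrd. These functions are inflexible and relatively inefficient when the text being processed varies.
SAS 9 therefore introduced RE to make string validation, replacement, and extraction convenient.
A regexp is built from ordinary characters plus a set of special characters called metacharacters, each of which stands for a particular matching rule. For details, see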
http://www.perldoc.com/perl5.6.1/pod/perlre.html
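To give a feel for what metacharacters do, here is a minimal sketch of my own (not from the SAS site). The pattern and the three test values are made up for illustration. PRXMATCH returns the position where the match starts, or 0 when nothing matches.

```sas
data _null_;
   length text $ 20;
   /* \d matches one digit; {3} repeats the preceding item three times;
      ^ and $ anchor the match to the start and end of the string */
   re = prxparse('/^\d{3}-\d{4}$/');
   do text = '555-1234', '5551234', 'abc-1234';
      /* TRIM removes the trailing blanks SAS pads onto character values,
         so that $ can match right after the last digit */
      pos = prxmatch(re, trim(text));
      putlog text= pos=;
   end;
run;
```

Only the first value has three digits, a hyphen, and four digits with nothing else around them, so it is the only one for which pos is nonzero.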
Some usage examples follow:
1. Validating phone numbers in customer data
data _null_;
   retain re;
   length first last home business $ 16;
   if _N_ = 1 then do;
      /* Phone pattern 1: (XXX) XXX-XXXX */
      paren = "\([2-9]\d\d\) ?[2-9]\d\d-\d\d\d\d";
      /* Phone pattern 2: XXX-XXX-XXXX */
      dash = "[2-9]\d\d-[2-9]\d\d-\d\d\d\d";
      /* Combine the two patterns with the | metacharacter */
      regexp = "/(" || paren || ")|(" || dash || ")/";
      /* Compile the pattern; PRXPARSE returns missing if it is invalid */
      re = prxparse(regexp);
      if missing(re) then do;
         putlog "ERROR: Invalid regexp " regexp;
         stop;
      end;
   end;
   input first last home business;
   /* Run the match; PRXMATCH returns 0 when there is no match */
   if ^prxmatch(re, home) then
      putlog "NOTE: Invalid home phone number for " first last home;
   if ^prxmatch(re, business) then
      putlog "NOTE: Invalid business phone number for " first last business;
datalines;
Jerome Johnson (919)319-1677 (919)846-2198
Romeo Montague 800-899-2164 360-973-6201
Imani Rashid (508)852-2146 (508)366-9821
Palinor Kent . 919-782-3199
Ruby Archuleta . .
Takei Ito 7042982145 .
Tom Joad 209/963/2764 2099-66-8474
;
The output is:
NOTE: Invalid home phone number for Palinor Kent
NOTE: Invalid home phone number for Ruby Archuleta
NOTE: Invalid business phone number for Ruby Archuleta
NOTE: Invalid home phone number for Takei Ito 7042982145
NOTE: Invalid business phone number for Takei Ito
NOTE: Invalid home phone number for Tom Joad 209/963/2764
NOTE: Invalid business phone number for Tom Joad 2099-66-8474
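One thing to keep in mind about the pattern in this example: it has no anchors, so PRXMATCH succeeds whenever a valid number appears anywhere inside the value. If stricter checking is wanted, the pattern can be anchored. The following is a minimal sketch of my own under that assumption. It is not part of the SAS sample, and the three test values are made up.

```sas
data _null_;
   length phone $ 16;
   /* Same two formats as in example 1, but anchored: ^ at the start and
      \s*$ at the end (\s* tolerates the trailing blanks that SAS pads
      onto character values) */
   re = prxparse('/^(\([2-9]\d\d\) ?[2-9]\d\d-\d\d\d\d|[2-9]\d\d-[2-9]\d\d-\d\d\d\d)\s*$/');
   do phone = '919-782-3199', '919-782-31999', 'x919-782-3199';
      if prxmatch(re, phone) then putlog 'valid:   ' phone;
      else putlog 'invalid: ' phone;
   end;
run;
```

With the unanchored pattern all three values would pass, because each contains a well-formed number. With the anchors, only the first one does.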
2. Replacing text: change < to &lt; and > to &gt;
data _null_;
   retain lt_re gt_re;
   if _N_ = 1 then do;
      /* Substitution patterns have the form s/regex/replacement text/ */
      lt_re = prxparse('s/</&lt;/');
      gt_re = prxparse('s/>/&gt;/');
      if missing(lt_re) or missing(gt_re) then do;
         putlog "ERROR: Invalid regexp.";
         stop;
      end;
   end;
   input;
   /* Apply each substitution to the input line; -1 replaces every occurrence */
   call prxchange(lt_re, -1, _infile_);
   call prxchange(gt_re, -1, _infile_);
   put _infile_;
datalines4;
The bracketing construct ( ... ) creates capture buffers.
To refer to the digit'th buffer use \<digit> within the match.
Outside the match use "$" instead of "\". (The \<digit>
notation works in certain circumstances outside the match.
See the warning below about \1 vs $1 for details.) Referring
back to another part of the match is called a backreference.
;;;;
The output is:
The bracketing construct ( ... ) creates capture buffers.
To refer to the digit'th buffer use \&lt;digit&gt; within the match.
Outside the match use "$" instead of "\". (The \&lt;digit&gt;
notation works in certain circumstances outside the match.
See the warning below about \1 vs $1 for details.) Referring
back to another part of the match is called a backreference.
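The same replacement can also be applied to an ordinary character variable instead of _infile_. The sketch below is my own, not from the SAS site. It assumes a SAS 9 release that provides the function form of PRXCHANGE (to my knowledge 9.1 and later), and the sample text is made up.

```sas
data _null_;
   length text escaped $ 200;
   text = 'if a < b and b > c then';
   /* PRXCHANGE(pattern, times, source): times = -1 replaces every occurrence.
      Single quotes keep SAS from treating &lt and &gt as macro references. */
   escaped = prxchange('s/</&lt;/', -1, text);
   escaped = prxchange('s/>/&gt;/', -1, escaped);
   put escaped=;
run;
```

The function form returns the changed string and leaves the source variable untouched, which is convenient when the original value is still needed.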
3. Extracting text from customer records: pulling the area code out of each business phone number
data _null_;
   retain re areacode_re;
   length first last home business $ 16;
   length areacode $ 3;
   if _N_ = 1 then do;
      /* (XXX) XXX-XXXX, with the area code in its own capture buffer */
      paren = "\(([2-9]\d\d)\) ?[2-9]\d\d-\d\d\d\d";
      /* XXX-XXX-XXXX, with the area code in its own capture buffer */
      dash = "([2-9]\d\d)-[2-9]\d\d-\d\d\d\d";
      /* Combine two phone patterns into one with a | */
      regexp = "/(" || paren || ")|(" || dash || ")/";
      re = prxparse(regexp);
      if missing(re) then do;
         putlog "ERROR: Invalid regexp " regexp;
         stop;
      end;
      areacode_re = prxparse("/828|336|704|910|919|252/");
      if missing(areacode_re) then do;
         putlog "ERROR: Invalid area code regexp";
         stop;
      end;
   end;
   input first last home business;
   if ^prxmatch(re, home) then
      putlog "NOTE: Invalid home phone number for " first last home;
   if prxmatch(re, business) then do;
      /* Number of the last capture buffer that took part in the match */
      which_format = prxparen(re);
      /* Get that buffer's position and length, then extract the text */
      call prxposn(re, which_format, pos, len);
      areacode = substr(business, pos, len);
      /* Print the record only if the extracted area code is in the list */
      if prxmatch(areacode_re, areacode) then
         put "In North Carolina: " first last business;
   end;
   else
      putlog "NOTE: Invalid business phone number for " first last business;
datalines;
Jerome Johnson (919)319-1677 (919)846-2198
Romeo Montague 800-899-2164 360-973-6201
Imani Rashid (508)852-2146 (508)366-9821
Palinor Kent 704-782-4673 704-782-3199
Ruby Archuleta 905-384-2839 905-328-3892
Takei Ito 704-298-2145 704-298-4738
Tom Joad 515-372-4829 515-389-2838
;
The output is:
In North Carolina: Jerome Johnson (919)846-2198
In North Carolina: Palinor Kent 704-782-3199
In North Carolina: Takei Ito 704-298-4738
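The extraction step in this example uses CALL PRXPOSN followed by SUBSTR. As a possible shortcut, later SAS 9 releases (9.1 and up, as far as I know) also offer a function form of PRXPOSN that returns the captured text directly. The sketch below is my own simplified variant, not the SAS sample: it uses only two capture buffers and two made-up test values.

```sas
data _null_;
   length phone $ 16 areacode $ 3;
   /* Buffer 1 holds the area code of the (XXX) format,
      buffer 2 holds the area code of the XXX-XXX-XXXX format */
   re = prxparse('/\(([2-9]\d\d)\) ?[2-9]\d\d-\d\d\d\d|([2-9]\d\d)-[2-9]\d\d-\d\d\d\d/');
   do phone = '(919)846-2198', '704-782-3199';
      if prxmatch(re, phone) then do;
         /* PRXPAREN gives the number of the last capture buffer that matched */
         buf = prxparen(re);
         /* The function form of PRXPOSN returns the captured text itself */
         areacode = prxposn(re, buf, phone);
         putlog phone= areacode=;
      end;
   end;
run;
```

Because only one side of the | can match, PRXPAREN tells which format was found, and the matching buffer always contains the three-digit area code.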
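The source code above comes from the SAS website. I have only added a few comments to help first-time readers; for details, please refer to the SAS website.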