
Perl: looping over class selectors to scrape content

2016-05-26 18:03
jrhmpt01:/root/lwp/0526# cat 0526.txt
<div class="TXD_sy_title"><span class="TXD_sy_text_1">天下金专区</span> <span class="TXD_sy_text_2">投资期限自选  可进行债权转让  100元起投  每月还息,到期还本</span><span class="TXD_sy_text_3" style="float: right"><a href="/AnJuJinIntroduce.html" target="_blank">产品介绍 ></a>    <a href="/AnJuJinIndex.html" target="_blank" class="grey">更多项目 ></a></span></div>
<div class="anjlist" id="txjDiv">
<ul class="altitle TXD_top_title">
<li class="alcw1 TXD_top_title1">项目名称</li>
<li class="alcw2">投资金额</li>
<li class="alcw3">剩余投资期限</li>
<li class="alcw4">预期年化收益</li>
<li class="alcw4">进度</li>
<li class="alcw5">起投金额</li>
<li class="alcw6">操作</li>
</ul>

<ul class="alcomment" style="overflow: visible;">
<li class="alcw1"><a target="_blank" href="/invest/fd6b88342c69470fb8ae9365589f78aa.html">天下金 201605253763</a></li>

<li class="alcw2">1,000,000.00元</li>
<li class="alcw3">27 天</li>

<li class="alcw4">5.5% </li>

<li class="alcw4 alcw41">
<div class="ajjbfb txdbfb bfb100">100<span>%</span></div>
</li>
<li class="alcw5">100.00元</li>
<li class="alcw6">
<div class="txdbtns4 mt27 ml40"><a href="/invest/fd6b88342c69470fb8ae9365589f78aa.html" target="_blank" class="txdpng">查看</a></div>
</li>
</ul>

jrhmpt01:/root/lwp/0526# cat a2.pl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Parse the saved HTML fragment into a tree that supports XPath queries.
my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_file("0526.txt");

# Step 1: loop over every <li> tag and collect the distinct values of its
# class attribute (the "class selector"), keeping first-seen order.
my @pages = $tree->find_by_tag_name('li');
my (@urlall, %seen);
foreach my $li (@pages) {
    my $class = $li->attr('class');
    next unless $class;
    print "\$_ is $class\n";
    push @urlall, $class unless $seen{$class}++;
}

print @urlall;
print "\n";

# Step 2: loop over the class values and, for each one, print the text of
# every <li> whose class attribute equals that exact string.
# In CSS the same class would be written with a leading dot, e.g. .alcw4
foreach my $var (@urlall) {
    my $url = qq(/html/body//li[\@class="$var"]);
    print "\$url is $url\n";
    my @total = $tree->findvalues($url);
    print @total;
    print "\n";
}

$tree->delete;
jrhmpt01:/root/lwp/0526# perl a2.pl
$_ is alcw1 TXD_top_title1
$_ is alcw2
$_ is alcw3
$_ is alcw4
$_ is alcw4
$_ is alcw5
$_ is alcw6
$_ is alcw1
$_ is alcw2
$_ is alcw3
$_ is alcw4
$_ is alcw4 alcw41
$_ is alcw5
$_ is alcw6
alcw1 TXD_top_title1alcw2alcw3alcw4alcw5alcw6alcw1alcw4 alcw41
$url is /html/body//li[@class="alcw1 TXD_top_title1"]
项目名称
$url is /html/body//li[@class="alcw2"]
投资金额1,000,000.00元
$url is /html/body//li[@class="alcw3"]
剩余投资期限27 天
$url is /html/body//li[@class="alcw4"]
预期年化收益进度5.5%
$url is /html/body//li[@class="alcw5"]
起投金额100.00元
$url is /html/body//li[@class="alcw6"]
操作查看
$url is /html/body//li[@class="alcw1"]
天下金 201605253763
$url is /html/body//li[@class="alcw4 alcw41"]
100%
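
In the output above, "alcw4" and "alcw4 alcw41" come out as two separate groups, because `@class="..."` compares the whole attribute string rather than individual class names. If the goal were to match every `<li>` that carries a given class token, the way a CSS `.alcw4` selector would, one common XPath 1.0 idiom is to pad the class list with spaces and test for the padded token. The sketch below is only an illustration under that assumption: the inline HTML is a shortened stand-in for 0526.txt, and `$token_xpath` is a name introduced here, not something from the original script.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Small inline fragment shaped like the rows in 0526.txt (values shortened).
my $html = <<'HTML';
<html><body>
<ul>
<li class="alcw4">5.5%</li>
<li class="alcw4 alcw41">100%</li>
<li class="alcw5">100.00</li>
</ul>
</body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_content($html);

# Exact comparison: matches only when the class attribute is exactly "alcw4".
my @exact = $tree->findvalues('/html/body//li[@class="alcw4"]');

# Token comparison: surround the normalized class list with spaces and look
# for " alcw4 ", so "alcw4 alcw41" matches but a lone "alcw41" would not.
my $token_xpath =
    q{/html/body//li[contains(concat(" ", normalize-space(@class), " "), " alcw4 ")]};
my @by_token = $tree->findvalues($token_xpath);

print "exact: @exact\n";
print "token: @by_token\n";

$tree->delete;
```

Whether exact or token matching is the better fit depends on the page: the exact form keeps the progress column ("alcw4 alcw41") apart from the yield column ("alcw4"), which is what the script above relies on.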