Web Crawler Crash Course (2): Page Parsing (Template-Based)

2013-08-15 10:58
Page-parsing techniques:
1. XPath tutorial
2. Regular expression tutorial

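XPath works by loading the HTML into a DOM tree and querying that tree; it is simple and easy to maintain.
I usually use regular expressions as a secondary extraction step: first locate the node with XPath, then use a regex to pull the exact value out of the located text.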
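
The locate-then-extract flow described above can be sketched as follows. This is a minimal illustration in Python using only the standard library, not the code behind this article. The sample markup and its `pub_date` id are made up for the example, and `xml.etree.ElementTree` only accepts well-formed markup and a subset of XPath, so real pages would need a tolerant HTML parser.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet standing in for a downloaded page.
PAGE = """<html><body>
<div id="content"><span id="pub_date">发布于 2013年08月15日 10:58</span></div>
</body></html>"""

root = ET.fromstring(PAGE)

# Step 1: locate the node with an XPath expression.
node = root.find(".//*[@id='pub_date']")
text = "".join(node.itertext())

# Step 2: use a regex on the located text to extract the exact value.
match = re.search(r"(\d+)年(\d+)月(\d+)日", text)
pub_date = "-".join(match.groups())
print(pub_date)
```

Keeping the regex confined to the text of one located node is what makes this easier to maintain than a single regex over the whole page.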

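XPath libraries:
For .NET, mainly HtmlAgilityPack.
For Java, mainly HtmlCleaner (a proxy is needed to reach its site) and jsoup (which selects primarily with CSS-style selectors).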

Below is a wrapper I wrote myself: HtmlParser.
Usage example:


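using System.IO;
using System.Text;

// Parse a saved, GBK-encoded page into a HotelInfo using the template save_xc.xml.
HtmlParser<HotelInfo> parser = new HtmlParser<HotelInfo>();
ParseConfig config = new ParseConfig("save_xc.xml");
String html = File.ReadAllText("xiecheng.txt", Encoding.GetEncoding("GBK"));
HotelInfo entity = parser.GetEntity(html, config);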
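
The source of `HtmlParser` is not shown here, so the following is only a guess at the general idea of filling a typed entity from extracted fields, sketched in Python with the standard library. It assumes that each template field name matches a property name on the entity; the `HotelInfo` fields and the `to_entity` helper are hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass
class HotelInfo:
    # Hypothetical subset of properties; the names mirror template field names.
    Name: str = ""
    Star: str = ""
    Address: str = ""


def to_entity(cls, values):
    """Build an entity from a {field name: extracted text} mapping,
    ignoring names the entity does not declare."""
    known = {f.name for f in fields(cls)}
    return cls(**{k: v for k, v in values.items() if k in known})


# Extracted values as a template-driven parser might produce them (made-up data).
extracted = {"Name": "Sample Hotel", "Star": "5", "Unknown": "dropped"}
hotel = to_entity(HotelInfo, extracted)
print(hotel)
```

Fields missing from the extraction keep their defaults, so a template can cover only part of the entity.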


Template example 1:

<?xml version="1.0" encoding="utf-8"?>
<template>
  <page>
    <save root=".">
      <field>
        <name>Title</name>
        <xpath>//div[@id='J_Article_Wrap']//h1</xpath>
      </field>
      <field>
        <name>PubTime</name>
        <xpath>//*[@id='pub_date']</xpath>
        <regex>
          <pattern>(\d+)年(\d+)月(\d+)日</pattern>
          <format>{0}-{1}-{2}</format>
        </regex>
      </field>
      <field>
        <name>Article</name>
        <xpath>//*[@id="artibody"]</xpath>
      </field>
    </save>
  </page>
</template>
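
To make the template's semantics concrete, here is a rough sketch of how a single-record `<save>` block could be applied: each `<field>` is located by its `<xpath>`, and an optional `<regex>` reshapes the located text through `<format>`. It is written in Python with the standard library and is an assumption about the behaviour, not the HtmlParser implementation. The sample page and the `extract` function are hypothetical, and `root="."` is treated as "the whole document".

```python
import re
import xml.etree.ElementTree as ET

# Condensed copy of the template above (XML declaration omitted).
TEMPLATE = r"""<template><page><save root=".">
  <field><name>Title</name><xpath>//div[@id='J_Article_Wrap']//h1</xpath></field>
  <field><name>PubTime</name><xpath>//*[@id='pub_date']</xpath>
    <regex><pattern>(\d+)年(\d+)月(\d+)日</pattern><format>{0}-{1}-{2}</format></regex>
  </field>
  <field><name>Article</name><xpath>//*[@id="artibody"]</xpath></field>
</save></page></template>"""

# Hypothetical well-formed page; real HTML needs a tolerant parser.
PAGE = """<html><body><div id="J_Article_Wrap">
  <h1>Sample headline</h1>
  <span id="pub_date">2013年08月15日 10:58</span>
  <div id="artibody"><p>Body text.</p></div>
</div></body></html>"""


def extract(page, template):
    doc = ET.fromstring(page)
    result = {}
    for field in ET.fromstring(template).iterfind("./page/save/field"):
        xpath = field.findtext("xpath")
        # ElementTree rejects absolute paths on an element, so anchor them at the root.
        node = doc.find("." + xpath if xpath.startswith("/") else xpath)
        if node is None:
            continue
        value = "".join(node.itertext()).strip()
        regex = field.find("regex")
        if regex is not None:
            m = re.search(regex.findtext("pattern"), value)
            if m is None:
                continue
            # {0}, {1}, ... in <format> refer to the regex capture groups.
            value = regex.findtext("format").format(*m.groups())
        result[field.findtext("name")] = value
    return result


record = extract(PAGE, TEMPLATE)
print(record)
```

Because the selectors live in the XML file, a site redesign only requires editing the template rather than the parsing code.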

Template example 2:

<?xml version="1.0" encoding="utf-8"?>
<template>
  <page>
    <save_m root="//tr[@id]">
      <field>
        <name>Price</name>
        <xpath>./td[@class='price']</xpath>
      </field>
    </save_m>
  </page>
</template>
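
Judging from the template, `save_m` appears to describe repeated records: `root` selects every matching node, and each field's relative XPath (starting with `./`) is evaluated against one of those nodes. A small Python sketch of that reading, using only the standard library and a made-up table, might look like this:

```python
import xml.etree.ElementTree as ET

# Hypothetical page: the header row has no id, so the root XPath skips it.
PAGE = """<html><body><table>
<tr><th>Room</th><th>Price</th></tr>
<tr id="r1"><td class="name">Standard</td><td class="price">328</td></tr>
<tr id="r2"><td class="name">Deluxe</td><td class="price">468</td></tr>
</table></body></html>"""

doc = ET.fromstring(PAGE)

# root="//tr[@id]": one record per <tr> that carries an id attribute.
rows = doc.findall(".//tr[@id]")

# Field XPath "./td[@class='price']" is resolved relative to each row.
prices = [row.findtext("./td[@class='price']") for row in rows]
print(prices)
```

Evaluating fields relative to each root node keeps the values of one row together instead of collecting every price on the page into an unrelated list.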

Template example 3:

<?xml version="1.0" encoding="utf-8"?>
<template>
  <page>
    <save root=".">
      <field>
        <name>Name</name>
        <xpath>//h1</xpath>
      </field>
      <field>
        <name>EngName</name>
        <xpath>//div[@class='name']/h2</xpath>
      </field>
      <field>
        <name>Star</name>
        <xpath>//div[@class='grade']/span/@title</xpath>
      </field>
      <field>
        <name>Address</name>
        <xpath>//div[@class='adress']</xpath>
      </field>
      <field>
        <name>Description</name>
        <xpath>//*[@id="htlDes"]</xpath>
      </field>
      <field>
        <name>Facility</name>
        <xpath position='outerhtml'>//div[@class="htl_info_table "]</xpath>
      </field>
      <field>
        <name>Policy</name>
        <xpath position='outerhtml'>//div[@class='detail_main']/div[@class="htl_info_table"]</xpath>
      </field>
      <field>
        <name>Traffic</name>
        <xpath position='outerhtml'>//div[@class='transSub'][1]/div[@class="htl_info_table"]</xpath>
      </field>
      <field>
        <name>Nearby</name>
        <xpath position='outerhtml'>//div[@class='transSub'][2]/div[@class="htl_info_table"]</xpath>
      </field>
    </save>
  </page>
</template>
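
This template adds two things: an XPath ending in `/@title`, which selects an attribute value, and `position='outerhtml'`, which is not explained here and is assumed below to mean "return the node's own markup instead of its text". The Python sketch that follows illustrates both under that assumption. The `select` helper and the sample page are hypothetical, and the trailing `/@attr` is handled by hand because `xml.etree.ElementTree` does not support attribute selection in paths.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a hotel detail page.
PAGE = """<html><body>
<div class="grade"><span title="5 stars">rating</span></div>
<div class="htl_info_table"><ul><li>Wi-Fi</li></ul></div>
</body></html>"""


def select(doc, xpath, position="text"):
    # Split off a trailing "/@attr" step, since ElementTree cannot evaluate it.
    attr = None
    if "/@" in xpath:
        xpath, attr = xpath.rsplit("/@", 1)
    node = doc.find("." + xpath if xpath.startswith("/") else xpath)
    if node is None:
        return None
    if attr is not None:
        return node.get(attr)
    if position == "outerhtml":
        # Assumed meaning: serialize the element itself, tags included.
        return ET.tostring(node, encoding="unicode").strip()
    return "".join(node.itertext()).strip()


doc = ET.fromstring(PAGE)
star = select(doc, "//div[@class='grade']/span/@title")
facility = select(doc, "//div[@class='htl_info_table']", position="outerhtml")
print(star)
print(facility)
```

Note that XPath compares `@class` as a whole string, so the `"htl_info_table "` selector in the Facility field, with its trailing space, only matches a class attribute written exactly that way.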