
Open source: a parallel crawler with word segmentation, built on the 百万商业圈 .NET development framework

2011-07-10 23:25

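The crawler fetches pages in parallel and runs word segmentation on what it collects. In tests on 2MB bandwidth it averaged 3 seconds per 100 URLs.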
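To make the idea of parallel crawling concrete, here is a minimal Python sketch (the project itself is .NET, and this is not its code). It fetches a batch of URLs on a thread pool and records a failure per URL instead of aborting the batch. The names `fetch_bytes` and `crawl_parallel`, the worker count, and the timeout are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Union
from urllib.request import urlopen


def fetch_bytes(url: str, timeout: float = 10.0) -> bytes:
    """Download one URL and return the raw response body (hypothetical default fetcher)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def crawl_parallel(
    urls: Iterable[str],
    fetch: Callable[[str], bytes] = fetch_bytes,
    max_workers: int = 16,
) -> Dict[str, Union[bytes, Exception]]:
    """Fetch all URLs concurrently; map each URL to its body or to the exception it raised."""

    def safe_fetch(url: str) -> Union[bytes, Exception]:
        try:
            return fetch(url)
        except Exception as exc:  # one bad URL must not abort the whole batch
            return exc

    url_list = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each URL with its own result
        return dict(zip(url_list, pool.map(safe_fetch, url_list)))
```

Because the work is dominated by network waits, threads are usually enough here; passing a different `fetch` function makes the batch logic easy to try without touching the network.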

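Features: automatic detection of all kinds of encodings, automatic conversion between encodings, automatic decompression of compressed pages, detailed and comprehensive collected information, very stable operation, and more.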
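As a rough illustration of encoding detection, conversion, and decompression, the Python sketch below decompresses a gzip or deflate body, looks for a charset in the Content-Type header and then in the first bytes of the page, and decodes to text. It is a sketch under assumptions, not the project's implementation. In particular, decoding GBK and GB2312 pages as GB18030 is my guess; it happens to be consistent with the sample output, which reports GB18030 for pages that declare GBK or GB2312.

```python
import gzip
import re
import zlib

# Accept both "charset=utf-8" and the malformed "charset:utf-8" seen in real headers.
CHARSET_RE = re.compile(r"charset\s*[=:]\s*[\"']?([\w.-]+)", re.IGNORECASE)


def decompress(body: bytes, content_encoding: str = "") -> bytes:
    """Undo gzip/deflate compression; return the body unchanged otherwise."""
    encoding = content_encoding.strip().lower()
    if encoding == "gzip" or body[:2] == b"\x1f\x8b":  # gzip magic number
        return gzip.decompress(body)
    if encoding == "deflate":
        try:
            return zlib.decompress(body)
        except zlib.error:
            return zlib.decompress(body, -zlib.MAX_WBITS)  # raw deflate, no zlib header
    return body


def detect_charset(content_type: str, body: bytes) -> str:
    """Prefer the header charset, then a charset declared near the top of the page."""
    match = CHARSET_RE.search(content_type or "")
    if not match:
        head = body[:2048].decode("ascii", errors="ignore")
        match = CHARSET_RE.search(head)
    return match.group(1).lower() if match else "utf-8"  # assumed fallback


def to_text(body: bytes, content_type: str = "", content_encoding: str = "") -> str:
    """Decompress and decode a response body to a Python string."""
    raw = decompress(body, content_encoding)
    charset = detect_charset(content_type, raw)
    if charset in ("gb2312", "gbk"):
        charset = "gb18030"  # assumption: decode with the superset encoding
    try:
        return raw.decode(charset, errors="replace")
    except LookupError:  # unknown charset label
        return raw.decode("utf-8", errors="replace")
```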

For each link (doc.Url), the following information is collected:
Host name: doc.HostName
Content type: doc.ContentType
Encoding: doc.Encoding
MIME type: doc.MimeType
Server IP address: doc.ServerI…
Web server: doc.WebSer…
ProtocolVersion: doc.ProtProtocolVersion
URL region: doc.Area
URL country: doc.Country
PageRank (PR) of the URL: doc.PageRank
HTML source: doc.Html
URL title: doc.Title
Title segmentation result: doc.TitleFc
Keywords: doc.Keyword
Keywords segmentation result: doc.KeywordFc
Description: doc.Description
Description segmentation result: doc.DescriptionFc
Body text with HTML tags removed: doc.PlainText
Segmentation result of the body text with HTML tags removed: doc.PlainTextFc
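Several of these fields (title, keywords, description, and the tag-free body text) can be pulled out of the HTML in a single pass. The Python sketch below shows one way to do that with the standard library's `html.parser`. The class `PageFields` and the function `extract_fields` are hypothetical names, and nothing here is taken from the project's source.

```python
from html.parser import HTMLParser


class PageFields(HTMLParser):
    """Collect title, meta keywords/description, and visible text from one page."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.title = ""
        self.keywords = ""
        self.description = ""
        self._in_title = False
        self._skip_depth = 0  # > 0 while inside <script> or <style>
        self._text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "meta":
            attributes = {key: (value or "") for key, value in attrs}
            name = attributes.get("name", "").lower()
            if name == "keywords":
                self.keywords = attributes.get("content", "")
            elif name == "description":
                self.description = attributes.get("content", "")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif not self._skip_depth and data.strip():
            self._text_parts.append(data.strip())

    @property
    def plain_text(self) -> str:
        """Visible text with tags, scripts, and styles removed."""
        return " ".join(self._text_parts)


def extract_fields(html: str) -> PageFields:
    """Parse an HTML string and return the populated field collector."""
    parser = PageFields()
    parser.feed(html)
    parser.close()
    return parser
```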

====================================================

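The following information is stored after each crawl. Is it comprehensive enough?
--No. --domain
--ContentType --Encoding
--MimeType --HostName
--IP --Web Server
--ProtocolVersion --country
--area --PR value
--page image url --url (sorted)
--title --title segmentation
--keyword --keyword segmentation
--description --description segmentation
--source --source length
--text with links --text with links length
--text with links segmentation --text without links
--text without links length --text without links segmentation
--all outbound links --outbound link count
--all internal links --internal link count
--inbound link count --depth level
--crawl status 0: not crawled 1: crawling 2: crawled
--crawl date --indexed or not --weight computed or not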
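The crawl status codes and the internal and outbound link lists in this record can be sketched in Python as follows. The status values 0, 1, and 2 come from the list above. Everything else is an assumption for illustration: the names `CrawlStatus` and `split_links`, and especially the rule that a link is "internal" when its host name equals the page's host name.

```python
from enum import IntEnum
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class CrawlStatus(IntEnum):
    """Status codes as listed in the record: 0, 1, 2."""
    NOT_CRAWLED = 0
    CRAWLING = 1
    DONE = 2


class _LinkCollector(HTMLParser):
    """Gather the href of every <a> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def split_links(page_url: str, html: str):
    """Return (internal, external) sorted lists of absolute http(s) links."""
    collector = _LinkCollector()
    collector.feed(html)
    page_host = urlsplit(page_url).hostname
    internal, external = set(), set()
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        parts = urlsplit(absolute)
        if parts.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar
        # Assumption: same host name means internal link
        (internal if parts.hostname == page_host else external).add(absolute)
    return sorted(internal), sorted(external)
```

The link counts in the record are then just `len(internal)` and `len(external)`; the inbound count needs the whole crawl graph and is not covered by this sketch.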

Everything about the CSDN site is laid bare, which is truly impressive. Sample crawl results:





Link: http://www.csdn.net/index.htm, collected information:
Host name:
Content type: text/html; charset=utf-8
Encoding: UTF-8
MIME type: text/html
Server IP address: 117.79.157.242
Web server: nginx/0.7.68
ProtocolVersion: 1.1
URL region: 联通
URL country: 江苏省
PageRank (PR) of the URL: 0
URL title: CSDN.NET - 全球最大中文IT社区,为IT专业技术人员提供最全面的信息传播和服务平台
Title segmentation result: "CSDN","NET","全球","全世界","最大","中文","IT","社区","为","IT","专业技术人员","提供","最","全面","的","信息","传播","传布","和","以及","服务","平台"
Keywords:
Keywords segmentation result:
Description:
Description segmentation result:

Link: http://www.codeproject.com/, collected information:
Host name:
Content type: text/html; charset=utf-8
Encoding: UTF-8
MIME type: text/html
Server IP address: 65.39.148.34
Web server: Microsoft-IIS/7.0
ProtocolVersion: 1.1
URL region: Peer 1 Network Inc
URL country: 美国
PageRank (PR) of the URL: 6
URL title: CodeProject - Your Development Resource
Title segmentation result: "CodeProject","Your","DevelopmentResource"
Keywords: Free source code, tutorials
Keywords segmentation result: "Free","source","codetutorials"
Description: Free source code and tutorials for Software developers and Architects.
Description segmentation result: "Free","source","code","and","tutorials","for","Software","developers","and","Architects"

Link: http://www.china.com/, collected information:
Host name:
Content type: text/html
Encoding: GB18030
MIME type: text/html
Server IP address: 125.39.101.17
Web server:
ProtocolVersion: 1.0
URL region: 联通
URL country: 天津市
PageRank (PR) of the URL: 7
URL title: 中华网 - 首页
Title segmentation result: "中华网","首页"
Keywords: 要闻,热点话题,互动娱乐,新闻,汽车,各地新闻,科技,体育,游戏,超级神算,论坛,博客,军事,旅游,娱乐,开运,求医,财经,文化,网页游戏,幼儿
Keywords segmentation result: " 要闻","热点话题","互动","娱乐","文娱","新闻","汽车","各地","新闻","科技","体育","游戏","超级","神算"," 论坛","博客","军事","旅游","游览","娱乐","文娱","开运","求医","财经","文化","网页","游戏","幼儿"
Description: 中华网以中国的市场为核心,致力为当地用户提供流动增值服务、网上娱乐及互联网服务。本公司亦推出网上游戏,及透过其门户网站提供包罗万有的网上产品及服务。
Description segmentation result: " 中华网","以","中国","的","市场","为","核心","致力","为","当地","用户","提供","流动","活动","增值"," 服务","网上","娱乐","文娱","及","互联网","服务","本","公司","亦","推出","网上游戏","及","透过"," 其","门户","网站","提供","包罗万有","的","网上","产品","及","服务"

Link: http://www.aibang.com/, collected information:
Host name:
Content type: text/html;charset:utf-8
Encoding: UTF-8
MIME type: text/html
Server IP address: 60.28.205.202
Web server: Apache/2.2.4 (Unix) PHP/5.2.3
ProtocolVersion: 1.1
URL region: 联通ADSL
URL country: 天津市
PageRank (PR) of the URL: 6
URL title: 爱帮网 - 电子地图,公交查询,团购·打折·优惠券,城市黄页
Title segmentation result: "爱","帮","网","电子地图","公交查询","团","购","打折","优惠券","城市","黄页"
Keywords: 电子地图,公交查询,团购,优惠券,黄页
Keywords segmentation result: "电子地图","公交查询","团","购","优惠券","黄页"
Description: 爱帮网是中国领先的本地生活搜索服务提供商。在全国265个城市,爱帮网为您提供最为详尽准确的电话、地址、电子地图等本地商户黄页信息,以及公交查询,团购打折,生活经验,影讯活动等本地生活类信息。
Description segmentation result: " 爱","帮","网","是","中国","领先","的","本地","生活","糊口","搜索","搜寻","服务","提供商","在","全国","265","个","城市","爱","帮","网","为","您","提供","最为","详尽","详实","准确","的","电话","地址","电子地图","等","本地","商户","黄页","信息","以及","和","公交查询","团","购","打折","生活","糊口","经验","影","讯","活动","流动","等","本地","生活","糊口","类","信息"

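Link: http://www.google.com.hk/, collected information:
Host name:
Content type: text/html; charset=UTF-8
Encoding: UTF-8
MIME type: text/html
Server IP address: 74.125.153.99
Web server: gws
ProtocolVersion: 1.1
URL region: 加利福尼亚州山景市谷歌公司
URL country: 美国
PageRank (PR) of the URL: 0
URL title: Google
Title segmentation result: "Google"
Keywords:
Keywords segmentation result:
Description:
Description segmentation result: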

Link: http://www.sina.com.cn/, collected information:
Host name:
Content type: text/html
Encoding: GB18030
MIME type: text/html
Server IP address: 61.172.201.195
Web server: Apache/2.0.54 (Unix)
ProtocolVersion: 1.0
URL region: 电信张江机房
URL country: 上海市
PageRank (PR) of the URL: 9
URL title: 新浪首页
Title segmentation result: "新浪","首页"
Keywords:
Keywords segmentation result:
Description: 新浪网为全球用户24小时提供全面及时的中文资讯,内容覆盖国内外突发新闻事件、体坛赛事、娱乐时尚、产业资讯、实用信息等,设有新闻、体育、娱乐、财经、科技、房产、汽车等30多个内容频道,同时开设博客、视频、论坛等自由互动交流空间。
Description segmentation result: " 新浪网","为","全球","全世界","用户","24","小时","提供","全面","及时","的","中文","资讯","内容","覆盖","笼盖","国内外","突发","新闻","事件","体坛","赛事","娱乐","文娱","时尚","产业","资讯","实用信息"," 等","设有","新闻","体育","娱乐","文娱","财经","科技","房产","汽车","等","30","多","个","内容","频道","同时","开设","博客","视频","论坛","等","自由","互动","交流","交换","空间"

Link: http://www.163.com/, collected information:
Host name:
Content type: text/html; charset=GBK
Encoding: GB18030
MIME type: text/html
Server IP address: 222.186.19.19
Web server: nginx
ProtocolVersion: 1.1
URL region: 电信ADSL
URL country: 江苏省镇江市
PageRank (PR) of the URL: 8
URL title: 网易
Title segmentation result: "网易"
Keywords: 网易,邮箱,游戏,新闻,体育,娱乐,女性,亚运,论坛,短信,数码,汽车,手机,财经,科技,相册
Keywords segmentation result: "网易","邮箱","游戏","新闻","体育","娱乐","文娱","女性","亚运","论坛","短信","数码","汽车","手机","财经","科技","相册"
Description: 网易是中国领先的互联网技术公司,为用户提供免费邮箱、游戏、搜索引擎服务,开设新闻、娱乐、体育等30多个内容频道,及博客、视频、论坛等互动交流,网聚人的力量。
Description segmentation result: " 网易","是","中国","领先","的","互联网","技术","公司","为","用户","提供","免费邮箱","游戏","搜索引擎"," 服务","开设","新闻","娱乐","文娱","体育","等","30","多","个","内容","频道","及","博客","视频","论坛","等","互动","交流","交换","网","聚","人","的","力量","气力"

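Link: http://www.sohu.com/, collected information:
Host name:
Content type: text/html
Encoding: GB18030
MIME type: text/html
Server IP address: 61.135.181.167
Web server: SWS
ProtocolVersion: 1.1
URL region: 联通ADSL
URL country: 北京市
PageRank (PR) of the URL: 8
URL title: 搜狐-中国最大的门户网站
Title segmentation result: "搜狐","中国","最大","的","门户","网站"
Keywords:
Keywords segmentation result:
Description:
Description segmentation result: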

Link: http://www.qq.com/, collected information:
Host name:
Content type: text/html; charset=GB2312
Encoding: GB18030
MIME type: text/html
Server IP address: 60.28.14.190
Web server: squid/3.0
ProtocolVersion: 1.1
URL region: 联通ADSL
URL country: 天津市
PageRank (PR) of the URL: 8
URL title: 腾讯首页
Title segmentation result: "腾讯","首页"
Keywords:
Keywords segmentation result:
Description: 腾讯网(www.QQ.com)是中国浏览量最大的中文门户网站,是腾讯公司推出的集新闻信息、互动社区、娱乐产品和基础服务为一体的大型综合门户网站。腾讯网服务于全球华人用户,致力成为最具传播力和互动性,权威、主流、时尚的互联网媒体平台。通过强大的实时新闻和全面深入的信息资讯服务,为中国数以亿计的互联网用户提供富有创意的网上新生活。
Description segmentation result: " 腾讯","网","www","QQ","com","是","中国","浏览","阅读","涉猎","量","最大","的","中文","门户","网站","是","腾讯","公司","推出","的","集","新闻","信息","互动","社区","娱乐","文娱","产品"," 和","以及","基础","服务","为","一体","的","大型","综合","门户","网站","腾讯","网","服务","于","全球华人","用户","致力","成为","最","具","传播","传布","力","和","以及","互动","性","权威","主流","时尚","的","互联网","媒体","平台","通过","强大","的","实时","新闻","和","以及","全面","深入","深刻"," 的","信息资讯","服务","为","中国","数","以","亿","计","的","互联网","用户","提供","富有","创意"," 的","网上","新生活"
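In the sample output, a segmented word is sometimes followed by a near-synonym (for example "全球","全世界" or "传播","传布"). The post does not say which segmentation algorithm is used, so the Python sketch below substitutes a well-known technique, dictionary-based forward maximum matching, and appends a synonym when one is listed. The function name `segment`, the toy dictionary, and the synonym table are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Set


def segment(
    text: str,
    dictionary: Set[str],
    synonyms: Optional[Dict[str, str]] = None,
    max_len: int = 4,
) -> List[str]:
    """Forward maximum matching: at each position take the longest dictionary word.

    Runs of ASCII letters and digits are kept together as one token, punctuation
    and whitespace are dropped, and unknown characters become single tokens.
    """
    synonyms = synonyms or {}
    tokens: List[str] = []
    i = 0
    while i < len(text):
        char = text[i]
        if not char.isalnum():  # punctuation or whitespace
            i += 1
            continue
        if char.isascii():
            j = i
            while j < len(text) and text[j].isascii() and text[j].isalnum():
                j += 1
            word = text[i:j]
        else:
            word = char  # fallback: a single character
            for size in range(min(max_len, len(text) - i), 1, -1):
                if text[i:i + size] in dictionary:
                    word = text[i:i + size]
                    break
        tokens.append(word)
        if word in synonyms:
            tokens.append(synonyms[word])  # emit the synonym right after the word
        i += len(word)
    return tokens


# Toy data, not the project's dictionary files
words = {"全球", "最大", "中文", "社区", "新浪", "首页"}
print(segment("全球最大中文IT社区", words, {"全球": "全世界"}))
```

A real dictionary would be far larger and is presumably what the App_Data files provide, though the post does not describe their format.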

Download link 1 (latest debug build): (Bwsyq.ParallelSpider.SourceCode.rar) http://ishare.iask.sina.com.cn/f/17042417.html

Download link 2 (latest debug build): (Bwsyq.ParallelSpider.SourceCode.rar) http://www.everbox.com/f/cUrYrgvhfEjxNPVJad6stOqSxA

For more details, see: http://hi.baidu.com/earthsearch/home

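Note: before building and running, create the directory

"D:\分词\" (or change <KeyInfo Key="PhysicaPath" Value="D:\分词\"/> in web.config to a path of your choice)

Then copy the entire App_Data folder from the source code into the "D:\分词\" directory.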
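For readers who want to check the configured path from a script, the Python sketch below reads `KeyInfo` entries out of a config file's XML. Only the `KeyInfo` element with its `Key` and `Value` attributes comes from the note above. The surrounding `<configuration>` wrapper in the sample and the function name `read_key_info` are assumptions, since the post does not show the rest of web.config.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def read_key_info(config_xml: str) -> Dict[str, str]:
    """Map each KeyInfo element's Key attribute to its Value attribute."""
    root = ET.fromstring(config_xml)
    # iter() walks the whole tree, so the element's nesting depth does not matter
    return {
        element.get("Key", ""): element.get("Value", "")
        for element in root.iter("KeyInfo")
    }


# Hypothetical minimal config; only the KeyInfo line is taken from the post
SAMPLE_CONFIG = r'''<configuration>
  <KeyInfo Key="PhysicaPath" Value="D:\分词\"/>
</configuration>'''

settings = read_key_info(SAMPLE_CONFIG)
print(settings["PhysicaPath"])
```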

Development environment: Windows 7 + VS2008
If you have any questions, contact me directly: QQ 99923309, QQ group: 74965947. Join the QQ group to discuss more examples.