Nutch 2.2.1 MySQL table creation statement
2015-08-01 21:45
CREATE TABLE `webpage` (
`id` varchar(250) NOT NULL,
`headers` blob,
`text` mediumtext,
`status` int(11) DEFAULT NULL,
`markers` blob,
`parseStatus` blob,
`modifiedTime` bigint(20) DEFAULT NULL,
`prevModifiedTime` bigint(20) DEFAULT NULL,
`score` float DEFAULT NULL,
`typ` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
`batchId` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
`baseUrl` varchar(767) DEFAULT NULL,
`content` longblob,
`title` varchar(2048) DEFAULT NULL,
`reprUrl` varchar(767) DEFAULT NULL,
`fetchInterval` int(11) DEFAULT NULL,
`prevFetchTime` bigint(20) DEFAULT NULL,
`inlinks` mediumblob,
`prevSignature` blob,
`outlinks` mediumblob,
`fetchTime` bigint(20) DEFAULT NULL,
`retriesSinceFetch` int(11) DEFAULT NULL,
`protocolStatus` blob,
`signature` blob,
`metadata` blob,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 ROW_FORMAT=COMPRESSED;
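id
Primary key, derived from the page URL (format: reversed domain name:protocol:port and path). Because the key is the URL itself, Nutch 2 can only store the current state of a page, not its history.
headers
The standard HTTP headers, which may contain non-printable characters. Fields such as Last-Modified can be used to decide whether a page needs to be refreshed (a HEAD request is enough; the whole page does not have to be downloaded).
text
All parsed text fields merged into one (UTF-8), intended for plain search. Since search is now usually handled by Solr, this field is of little use.
status
Records the fetch status. The codes mean:
1 unfetched (links not yet fetched due to limits set in regex-urlfilter.txt, -TopN crawl parameters, etc.)
2 fetched (page was successfully fetched)
3 gone (that page no longer exists)
4 redir_temp (temporary redirection; see reprUrl below for more details)
5 redir_perm (permanent redirection; see reprUrl below for more details)
34 retry
38 not modified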
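The row-key format described under id and the status codes listed above can be sketched in a few lines of Python. This is only an illustration that follows the format stated in this post (reversed domain name:protocol:port and path); the names reverse_url, status_name and STATUS_NAMES are made up for this example and are not part of Nutch, whose own helper may differ in details.

```python
from urllib.parse import urlsplit

# Status codes exactly as listed in this post (values of the `status` column).
STATUS_NAMES = {
    1: "unfetched",
    2: "fetched",
    3: "gone",
    4: "redir_temp",
    5: "redir_perm",
    34: "retry",
    38: "not modified",
}


def reverse_url(url: str) -> str:
    """Build a row key: reversed host, then protocol, optional port, then path.

    Hypothetical helper following the format described in the text.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    # "bar.foo.com" -> "com.foo.bar"
    reversed_host = ".".join(reversed(host.split(".")))
    key = f"{reversed_host}:{parts.scheme}"
    # The port is only included when the URL names one explicitly.
    if parts.port is not None:
        key += f":{parts.port}"
    path = parts.path
    if parts.query:
        path += "?" + parts.query
    if path and not path.startswith("/"):
        path = "/" + path
    return key + path


def status_name(code: int) -> str:
    """Translate a numeric status into its name, keeping unknown codes visible."""
    return STATUS_NAMES.get(code, f"unknown({code})")


if __name__ == "__main__":
    print(reverse_url("http://bar.foo.com:8983/to/index.html?a=b"))
    print(status_name(2))
```

Because the host comes first in reversed form, all pages of one domain share a common key prefix, so they sort next to each other in the table.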
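markers
Markers set by the individual jobs (for example: dist, _injmrk_, _updmrk_, _ftcmrk_, _gnmrk_, __prsmrk__).
parseStatus
Parse status; NULL until the parse job has run (see ParseStatusCodes.html).
modifiedTime
Last modification time.
score
Page importance (PR); Nutch 2.2.1 uses the OPIC algorithm.
typ
Content type (for example application/xhtml+xml).
batchId
Batch ID, produced by generate as (curTime/1000) + "-" + randomSeed. The fetch step can be restricted to a specific batchId.
baseUrl
Used to resolve relative links in the page source to absolute URLs. Usually this is the address of the page itself; after a redirect it is the final address.
content
The complete page source, unprocessed (the character set is not converted either).
title
Content of the title tag (converted to UTF-8).
reprUrl
Redirect URL; it is fetched in the next round rather than followed immediately.
fetchInterval
Fetch interval in seconds; the default is 2592000 (30 days).
prevFetchTime
Time of the previous fetch.
inlinks
Inbound links (URL + link text).
prevSignature
Page signature at the previous update.
outlinks
Outbound links (URL + link text).
fetchTime
Time of the next fetch, usually one month after the last one.
retriesSinceFetch
Number of retries.
signature
Page signature, used to decide whether the page has changed. The default implementation, org.apache.nutch.crawl.MD5Signature, uses the MD5 of content. The alternative, org.apache.nutch.crawl.TextProfileSignature, extracts the text from content, tokenizes it, sorts the tokens, and so on, and then computes the MD5 of the result.
metadata
Custom metadata, which can be added in the seed file, for example: http://xxxx/xxx.html \t type=news
protocolStatus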
ACCESS_DENIED 17
BLOCKED 23
EXCEPTION 16
FAILED 2
GONE 11
MOVED 12
NOTFETCHING 20
NOTFOUND 14
NOTMODIFIED 21
PROTO_NOT_FOUND 10
REDIR_EXCEEDED 19
RETRY 15
ROBOTS_DENIED 18
SUCCESS 1
TEMP_MOVED 13
WOULDBLOCK 22
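Two of the fields described above, batchId and signature, follow rules simple enough to sketch in Python. This is a minimal illustration based only on the descriptions in this post, not the Nutch implementation: the function names make_batch_id, md5_signature and has_changed are hypothetical, and the real MD5Signature class may treat edge cases (such as empty content) differently.

```python
import hashlib
from typing import Optional


def make_batch_id(cur_time_ms: int, random_seed: int) -> str:
    """Format a batch ID as (curTime/1000) + "-" + randomSeed.

    cur_time_ms is assumed to be a timestamp in milliseconds, so the integer
    division yields whole seconds.
    """
    return f"{cur_time_ms // 1000}-{random_seed}"


def md5_signature(content: bytes) -> bytes:
    """Return the MD5 digest of the raw page content (16 bytes)."""
    return hashlib.md5(content).digest()


def has_changed(prev_signature: Optional[bytes], signature: bytes) -> bool:
    """Treat a page as changed when there is no earlier signature or it differs."""
    return prev_signature is None or prev_signature != signature


if __name__ == "__main__":
    print(make_batch_id(1438436700000, 42))
    old = md5_signature(b"<html>v1</html>")
    new = md5_signature(b"<html>v2</html>")
    print(has_changed(old, new))
```

Hashing the raw bytes means any change to the page, including markup-only changes, yields a new signature; that is the trade-off against the text-based TextProfileSignature approach described above.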