PHP正则采集类,采集moviereleased.net站点,可以无限扩展
2012-01-10 10:35
357 查看
<?php define('COOKIE_PATH',dirname(__FILE__).'/cookie.txt'); class Collection { private $_url; private $_regex; private $_match_href = array(); private $_cookie_file = COOKIE_PATH; private $_login_url; private $_detail = array(); function __construct() { } function setUrl($url) { if(!is_array($url)){ return false; } $this->_url = $url['target_url']; if($url['login_url']){ $this->_login_url = $url['login_url']; } } function setRegex($regex) { if(!is_array($regex)){ return false; } $this->_regex = $regex; } function userLogin($user_data) { if(!file_exists($this->_cookie_file)){ file_put_contents($this->_cookie_file,''); } $this->getData($this->_login_url, $user_data, $this->_cookie_file); } function matchLink() { $data = $this->getData($this->_url); preg_match_all(array_shift($this->_regex),$data,$match); $second = array_shift($this->_regex); foreach($match as $key=>$value){ if(is_int($key)) continue; foreach($value as $v){ preg_match($second,$v,$matched); $this->_match_href[] = $matched['href']; } } return $this->_match_href; } function matchDetail($user_data=false) { $this->matchLink(); if($user_data){ $this->userLogin($user_data); } if(empty($this->_match_href)) return false; foreach($this->_match_href as $m){ $detail = $this->getData($m,false, $this->_cookie_file); foreach($this->_regex as $key=>$val){ preg_match($this->_regex[$key],$detail,$match); $this->_detail[$key][] = $match[$key]; } } return $this->_detail; } function getData($url, $data=false,$cookie_file=false,$timeout=3) { $ch = curl_init(); curl_setopt($ch, CURLOPT_URL, $url); curl_setopt($ch, CURLOPT_HEADER, 0); curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout); if($data){ curl_setopt($ch, CURLOPT_POST, true); curl_setopt($ch, CURLOPT_POSTFIELDS, $data); } if($cookie_file){ curl_setopt($ch, CURLOPT_COOKIEFILE,$cookie_file); curl_setopt($ch, CURLOPT_COOKIEJAR,$cookie_file); } $data = curl_exec($ch); curl_close($ch); return $data; } } $c = new Collection(); $regex = array( 'list_h2'=>'/<h2\s*class="\s*posttitle\s*"\s*>(?<link>.*?)<\/h2>/is', 'alink'=>'/<a\s*href="(?<href>.*?)">.*?<\/a>/is', 'post_title'=>'/<h2\s*class="\s*posttitle\s*"\s*><a.*?href=".*?".*?>(?<post_title>.*?)<\/a><\/h2>/is', 'post_content'=>'/<div\s*class="postcontent"\s*><p>(?<post_content>.*?)<div\s*class="wumii-hook">/is', 'post_img'=>'/<div\s*class="postcontent"\s*><p><a\s*href="(?<large_img>.*?)"\s*><img.*?src="(?<post_img>.*?)".*?><\/a>/is', 'review'=>'/<li.*?class="comment\s*byuser\s*comment-author-admin.*?".*?>(?<review>.*?)<\/li>/is', ); $url = array('target_url'=>'http://moviereleased.net/','login_url'=>'http://moviereleased.net/wp-login.php'); $c->setRegex($regex); $c->setUrl($url); $user_data = 'log=testuser&pwd=testuser'; $data = $c->matchDetail($user_data); print_r($data['post_img']);
相关文章推荐
- PHP扩展curl和正则表达式轻松采集新闻
- PHP中一些可以替代正则表达式函数的字符串操作函数
- [VB.NET]正则表达式可以先在VB.net 的FORM中吗?怎么写?
- system\classes\Kohana.php 空的核心扩展文件也可以把它复制到application\classes下面写自己的扩展
- PHP扩展之文本处理(二)——PCRE正则表达式语法15——性能
- PHP采集天猫商品列表,正则表达式匹配店铺名称和商品ID
- ASP.NET 4中的可扩展输出缓存(可以缓存页面/控件等)
- php正则采集小例子
- PHP网页数据正则采集
- php中将文中关键词高亮显示,快捷方式可以用正则
- PHP 正则表达式 获取网页charset 编码 ,可以获取任意网页charset(代码备份)
- PHP正则表达式替换站点关键字链接后空白的解决方法
- ASP.NET MVC 可以扩展的13个地方
- PHP扩展之文本处理(二)——PCRE正则表达式语法1——分隔符
- PHP5.6对命名空间的扩展,use可以导入函数与常量空间
- centos7.0 可以访问HTML文件,不能访问PHP文件,因为php-fpm没有扩展包
- QueryList 4.0 简洁、优雅、可扩展的PHP采集工具(爬虫)