
Adding a Google-Based "Site Search Engine"

2008-03-03 13:07

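Because this client could only provide shared virtual hosting as the project's runtime platform, we could not install a Chinese word-segmentation component, and the site search engine we had developed in-house could not work at full strength (mainly because it could not extract keywords automatically; the index could only be built from designated index fields plus manually entered tags). We therefore decided to switch to Google's site search (Google In-Site-Search) as the project's primary search engine.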

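Google's site search is essentially the site: operator commonly used in SEO. The site-search endpoint is http://www.google.com/custom, and its core parameters are q, sitesearch and start. For what these parameters mean, see the Google parameter table below.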
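To make the role of the three core parameters concrete, here is a minimal sketch that assembles such a URL. The helper name build_search_url() is hypothetical and the sample values are only illustrative; http_build_query() is a standard PHP function.

```php
<?php
// Sketch: build a Google custom-search URL from the three core parameters.
// build_search_url() is a hypothetical helper, not part of any Google API.
function build_search_url($query, $site, $start = 0)
{
    $params = array(
        'q'          => $query,       // search string
        'sitesearch' => $site,        // restrict results to this domain
        'start'      => (int) $start, // zero-based offset of the first result
    );
    // http_build_query() URL-encodes each value (a space becomes "+").
    return 'http://www.google.com/custom?' . http_build_query($params, '', '&');
}

echo build_search_url('site search', 'nfxm.com', 10), "\n";
// http://www.google.com/custom?q=site+search&sitesearch=nfxm.com&start=10
```

Letting http_build_query() do the encoding avoids hand-built query strings with unescaped characters.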

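It is worth stressing that, in my actual use, the q parameter only accepted strings encoded as form_url_encoded and no other encoding (I don't know whether the cause was my browser or Google itself). The query therefore has to be submitted through a FORM using the GET method, and the FORM must set enctype="application/x-www-form-urlencoded".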
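On the PHP side, the same form encoding can be reproduced with urlencode(). The short sketch below shows what the encoded value looks like; the keyword is taken from the article's demo URL, and the file is assumed to be saved as UTF-8.

```php
<?php
// urlencode() produces the form encoding described above:
// non-ASCII bytes become %XX and a space becomes "+".
$keyword = '对虾'; // sample keyword; this file is assumed to be saved as UTF-8

echo urlencode($keyword), "\n";        // %E5%AF%B9%E8%99%BE
echo urlencode('google search'), "\n"; // google+search

// rawurlencode() follows RFC 3986 instead and writes the space as %20.
echo rawurlencode('google search'), "\n"; // google%20search
```

Because PHP decodes $_GET values automatically, a keyword read from $_GET has to be encoded again like this before it is placed into another URL.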

With the interface defined, the next step is to send an HTTP request to it with file_get_contents(), fetch the returned HTML, and move on to parsing.
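As a rough sketch of that request step, the helper below wraps file_get_contents() with an explicit timeout. The name fetch_html() and the 5-second value are assumptions made for illustration. Fetching over HTTP this way also requires allow_url_fopen to be enabled, which is worth checking on shared hosting.

```php
<?php
// Sketch: fetch a URL with an explicit timeout instead of relying on defaults.
// fetch_html() is a hypothetical helper; the 5-second timeout is an assumed value.
function fetch_html($url, $timeout = 5)
{
    $context = stream_context_create(array(
        'http' => array(
            'method'  => 'GET',
            'timeout' => $timeout, // seconds to wait for the remote host
        ),
    ));
    // @ silences the warning; failure is reported through the return value.
    $html = @file_get_contents($url, false, $context);
    // Treat an empty body the same as a failed request.
    return ($html === false || $html === '') ? null : $html;
}
```

A call such as fetch_html($query_url) then yields either the page source or null, so the caller needs only one failure check.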

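The core code of the example follows. It may need adjusting to your circumstances; I have annotated the critical spots, and if something goes wrong, those are usually where the problem lies.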

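<?php

// Parse Google's search results and keep only the useful part locally

// Build the search conditions, send the query to Google and fetch the result

$q = isset($_GET['q']) ? $_GET['q'] : '';

$query_url_proto = 'http://www.google.cn/custom?ie=UTF-8&sitesearch=nfxm.com&q=[#q]&start=[#PAGE]&complete=1&hl=zh-CN&newwindow=1&cof=&sa=N';

// Accept only a non-negative integer offset; fall back to 0 otherwise
$offset = (isset($_GET['start']) && ctype_digit((string) $_GET['start'])) ? $_GET['start'] : '0';

// PHP has already decoded $_GET['q'], so it is URL-encoded again before being placed in the query
$query_url = str_replace('[#PAGE]', $offset, str_replace('[#q]', urlencode($q), $query_url_proto));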

$content = @file_get_contents($query_url);

if ($content === false || strlen($content) <= 0) {
    // The request failed or returned an empty body
    $result = ' <div class="Item" id="Notfound">
<h3>The target host did not respond in time</h3>
<br />
Suggestion:
<ul>
<li><a href="javascript:window.location.reload()">Reload this page</a></li>
</ul>
</div>';
} else {

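    // Start marker of the fragment:
    // </form></table></td></tr></table></div>
    // End marker of the fragment:
    // </div></div>

    // Main parsing step. The pattern contains "/", so "#" is used as the delimiter
    $patten = '#</form></table></td></tr></table></div>(.*)</div></div>#';

    /* If no result can be extracted, Google may have changed the page source; adjust this regular expression to match the actual markup. */

    $bool = preg_match($patten, $content, $match);

    // Result, converted from GB2312 to UTF-8; empty when the pattern did not match
    $result = $bool ? (string) iconv('gb2312', 'utf-8', $match[1]) : '';

    $content = null; // free memory
    $match = null; // free memory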

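    if (strlen($result) <= 0) {
        // htmlspecialchars() keeps the user-supplied query from injecting markup
        $result = ' <div class="Item" id="Notfound">
<h3>No pages matched your query "<strong>' . htmlspecialchars($q, ENT_QUOTES, 'UTF-8') . '</strong>".</h3>
<br />
Suggestions:
<ul>
<li>Check your search terms for typos.</li>
<li>Try different search terms.</li>
<li>Try more common terms.</li>
<li>Use fewer search terms.</li>
</ul>
</div>';
    } else {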

        // Strip data we do not need:
        // the "网页快照" (cached page) and "类似网页" (similar pages) link texts
        $result = str_replace(array('网页快照', '类似网页'), '', $result);

        // Remove Google's own pagination markup
        // Pagination images
        $patten_slice_page_img_first = '<img src=/intl/zh-CN/nav_first.gif width=18 height=26 alt=""><br>';
        $patten_slice_page_img_next = '<img src=/intl/zh-CN/nav_next.gif width=100 height=26 alt="" border=0><br>';
        $patten_slice_page_img_preview = '<img src=/intl/zh-CN/nav_previous.gif width=68 height=26 alt="" border=0><br>';
        $patten_slice_page_img_current = '<img src=/intl/zh-CN/nav_current.gif width=16 height=26 alt=""><br>';
        $patten_slice_page_img_page = '<img src=/intl/zh-CN/nav_page.gif width=16 height=26 alt="" border=0><br>';

        $result = str_replace(array($patten_slice_page_img_page, $patten_slice_page_img_current, $patten_slice_page_img_preview, $patten_slice_page_img_next, $patten_slice_page_img_first), '', $result);

        $result = str_replace('<table border=0 cellpadding=0 width=1% cellspacing=0 align=center>', '<table border=0 cellpadding=5 cellspacing=0 align=center>', $result);

        // Point the pagination links at our own search page
        $result = str_replace('/custom?', './?m=news&', $result);

        // Ad links
        $result = str_replace('/aclk?', 'http://www.google.cn/aclk?', $result);

        // Rewrite the "related searches" links
        $result = str_replace('class=rsl', 'class=rsl target=_blank', str_replace('/search?', 'http://www.google.cn/search?', $result));

}

}

// $result now holds the final markup; output it anywhere on your own page

Live demo: http://www.nfxm.com/search/?m=news&q=%E5%AF%B9%E8%99%BE
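The extraction step in the listing can be tried in isolation. The sketch below runs the same kind of marker-based pattern against a made-up page; the sample HTML is invented, and the "s" modifier is added here on the assumption that the fragment spans several lines.

```php
<?php
// Sketch: cut the fragment between two known markers out of an HTML page.
// $page is a made-up stand-in for the HTML Google returns.
$page = "<div>header</div></form></table></td></tr></table></div>\n"
      . "<p>result 1</p>\n<p>result 2</p>\n"
      . "</div></div><div>footer</div>";

// "#" is the delimiter, so the "/" in the closing tags needs no escaping;
// the "s" modifier lets "." match newlines as well.
$pattern = '#</form></table></td></tr></table></div>(.*)</div></div>#s';

if (preg_match($pattern, $page, $match)) {
    $fragment = trim($match[1]);
} else {
    $fragment = ''; // markers not found: the page layout probably changed
}

echo $fragment, "\n";
```

Since (.*) is greedy, the capture runs to the last occurrence of the end marker, which is worth keeping in mind when the markers are adjusted.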

Variable	Value	Description
q	$query (your query)	The string to search for
start	0 to the total number of results	The offset at which the displayed results begin. This is in fact the parameter Google uses for paging; Google has no page parameter
num/maxResults	1 -- 100	Number of results per page
filter	0 or 1	Whether to filter out similar results: 1 = yes, 0 = no. With 1, Google offers at the bottom of the results the option to repeat the search with the omitted results included
restrict	"Restriction code". Examples: countryAF (Afghanistan) countryAR (Argentina) countryAU (Australia) countryBE (Belgium) countryBM (Bermuda)...	Restricts results to a particular country (determined by IP; because of the nature of IP addresses, Google may get this wrong). Google also has these 4 special topic restrictions: US government unclesam; GNU-Linux linux; Macintosh mac; FreeBSD bsd
hl	"Interface language code"	The main codes currently are: af, sq, am, ar, az, eu, be, bn, bh, xx-bork, bs, br, bg, ca, zh-CN, zh-TW, hr, cs, da, nl, xx-elmer, en, eo, et, fo, tl, fi, fr, fy, gl, ka, de, el, gn, gu, xx-hacker, iw, hi, hu, is, id, ia, ga, it, ja, jw, kn, xx-klingon, ko, ky, la, lv, lt, mk, ms, ml, mt, mr, ne, no, nn, oc, or, fa, xx-piglatin, pl, pt-BR, pt-PT, pa, ro, ru, gd, sr, sh, st, si, sk, sl, es, su, sw, sv, ta, te, th, ti, tr, tk, tw, uk, ur, uz, vi, cy, xh, yi, zu.
lr	Language restriction code	Language restriction. Only shows results in the specified language. Codes: Arabic lang_ar; Chinese (Simplified) lang_zh-CN; Chinese (Traditional) lang_zh-TW; Czech lang_cs; Danish lang_da; Dutch lang_nl; English lang_en; Estonian lang_et; Finnish lang_fi; French lang_fr; German lang_de; Greek lang_el; Hebrew lang_iw; Hungarian lang_hu; Icelandic lang_is; Italian lang_it; Japanese lang_ja; Korean lang_ko; Latvian lang_lv; Lithuanian lang_lt; Norwegian lang_no; Portuguese lang_pt; Polish lang_pl; Romanian lang_ro; Russian lang_ru; Spanish lang_es; Swedish lang_sv; Turkish lang_tr
ie	UTF-8	The input encoding of Web searches. Google suggests UTF-8
oe	UTF-8	The output encoding of Web searches. Google suggests UTF-8
as_epq	Exact phrase	Advanced search: "with the exact phrase". The value is submitted as an exact phrase. It is no longer necessary to surround the phrase with quotes.
as_ft	i = include file type; e = exclude file type	Advanced search: File format: Only | Don't.... Include or exclude the file type indicated by as_filetype (see below)
as_filetype	a file extension	Advanced search: File Format: ....return results of the file format. Include or exclude this file type as indicated by the value of as_ft (see above)
as_qdr	m3 = past 3 months; m6 = past 6 months; y = past year	Advanced search: Date Return web pages updated in the.... Locate pages updated within the specified timeframe
as_nlo	low number	Find numbers between as_nlo and as_nhi
as_nhi	high number	Find numbers between as_nlo and as_nhi
as_oq	a list of words	Find at least one among the words of the list
as_occt	any = anywhere; title = title of page; body = text of page; url = in the page URL; links = in links to the page	Advanced search: Occurrences Return results where my terms occur.... Find search term in a specific page location
as_dt	i = only include site or domain; e = exclude site or domain	Advanced search: Domain: Only | Don't.... Include or exclude searches from the domain specified by as_sitesearch (see below)
as_sitesearch	domain or site	Advanced search: Domain: ...return results from the site or domain. Include or exclude this domain or site as specified by as_dt (see above)
safe	active = use SafeSearch; off = disable SafeSearch	Whether to use "safe search" (automatic filtering)
as_rq	URL	Locate pages similar to this URL
as_lq	URL	Locate pages that link to this URL.
newwindow	0 or 1	Whether clicked results open in a new window: 1 = yes, 0 = no
c2coff	a number	Meaning not yet known.
sa	/(N|.*)/	Meaning not yet known.
btnG	“搜索”	Meaning not yet known. The only value encountered is the two characters “搜索” ("Search")
pwst	1	Meaning not yet known.
oi	lrtip8	Meaning not yet known.
sitesearch	site to search	Searches within the domain given by sitesearch
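Since the table notes that Google pages with start rather than a page parameter, a page number has to be converted into an offset. A minimal sketch follows; the helper page_to_start() and the page size of 10 are assumptions for illustration.

```php
<?php
// Sketch: translate a 1-based page number into Google's zero-based "start" offset.
// page_to_start() and the page size of 10 are assumptions for illustration.
function page_to_start($page, $perPage = 10)
{
    $page = max(1, (int) $page); // clamp invalid or missing page numbers to 1
    return ($page - 1) * $perPage;
}

foreach (array(1, 2, 5) as $page) {
    $params = array('q' => 'test', 'start' => page_to_start($page), 'num' => 10);
    echo $page, ' => ', http_build_query($params, '', '&'), "\n";
}
// 1 => q=test&start=0&num=10
// 2 => q=test&start=10&num=10
// 5 => q=test&start=40&num=10
```

The page size passed to page_to_start() should match the num value sent to Google, otherwise pages will overlap or skip results.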
