
Hands-On: Building a Search Engine in Java

2015-09-30 17:18

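Most readers will have at least a passing familiarity with search engines. While building a campus news app I crawled a large amount of data that seemed well suited to a search feature, so I read the relevant chapter of Programming Collective Intelligence (《集体智慧编程》) and tried it out myself. The Java natural language processing library I used for the news crawler can also parse user queries, so I ported the book's Python code to Java and wrote a content-based search engine. (Voice-over: you only moved the bricks, so don't flatter yourself!) Here I want to share how it was built and what it produces, and I welcome feedback from anyone with the same interest.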

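What is a search engine? Roughly: a database stores a set of items and their attributes, and a client queries it. On each query the engine runs a series of algorithms to compute a score for every item, sorts the items by score (the higher the score, the better the item matches what the user wants), and returns the resulting list to the client.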

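A content-based search engine computes each item's score from the keywords the user enters and the attribute information stored for the item.
So far I have implemented only the three most basic ways of scoring an item (called a document below): how often the keywords occur, where they occur in the document, and how far apart they are in the document. Further work can bring in many other signals, such as the number and quality of a web document's internal links.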

This post explains how to build a content-based search engine in Java. The architecture is simple and has two parts:

Backend database

Search engine

1. Building the backend database

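The raw files produced by the crawler contain more than we need, so the useful information has to be extracted and stored in a database management system for easy access. A content-based search engine only needs to keep each document's location (url), the words in the document, and the position of each word in the document. That leads to the following ER diagram: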

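URLlist to WordLocation and WordList to WordLocation are both one-to-many relationships: a document can contain many words, and a word can appear in many documents and at many positions within one document.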
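The relationships above can be illustrated with a small in-memory stand-in for the three tables. This is only a sketch under assumptions: IndexSchemaSketch, WordLocationRow, addDocument and positionsOf are hypothetical names, ids are simply assigned in insertion order, and the real project stores these rows in a database rather than in collections.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** In-memory stand-in for URLlist, WordList and WordLocation (hypothetical names). */
final class IndexSchemaSketch {
    /** One WordLocation row: word wordId occurs at position location in document urlId. */
    static final class WordLocationRow {
        final int urlId;
        final int wordId;
        final int location;

        WordLocationRow(int urlId, int wordId, int location) {
            this.urlId = urlId;
            this.wordId = wordId;
            this.location = location;
        }
    }

    private final Map<String, Integer> urlIds = new HashMap<>();        // URLlist: url -> id
    private final Map<String, Integer> wordIds = new HashMap<>();       // WordList: word -> id
    private final List<WordLocationRow> locations = new ArrayList<>();  // WordLocation rows

    /** Returns the existing id for key, or assigns the next one (ids start at 1). */
    private static int idFor(Map<String, Integer> table, String key) {
        Integer id = table.get(key);
        if (id == null) {
            id = table.size() + 1;
            table.put(key, id);
        }
        return id;
    }

    /** Stores one WordLocation row per word, using 1-based positions. */
    void addDocument(String url, List<String> words) {
        int urlId = idFor(urlIds, url);
        int position = 0;
        for (String word : words) {
            position++;
            locations.add(new WordLocationRow(urlId, idFor(wordIds, word), position));
        }
    }

    /** Returns every position of word in the document at url (empty if either is unknown). */
    List<Integer> positionsOf(String url, String word) {
        List<Integer> result = new ArrayList<>();
        Integer urlId = urlIds.get(url);
        Integer wordId = wordIds.get(word);
        if (urlId == null || wordId == null) {
            return result;
        }
        for (WordLocationRow row : locations) {
            if (row.urlId == urlId && row.wordId == wordId) {
                result.add(row.location);
            }
        }
        return result;
    }
}
```

One url row and one word row can each be referenced by many WordLocation rows, which is exactly the pair of one-to-many relationships described above.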

To see how this data is extracted from the raw files, we turn to the package and class structure diagram.

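How BO-layer entities exchange data with the DC layer was covered in an earlier post (the Java implementation relies on reflection; interested readers can click through to it). This time the focus is on a few methods of BOManager, the core business-logic class of the BO layer: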

isIndexed(String):boolean

addToIndex(MyUrlInfo):void

loadData2DB():void

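// Walks the raw files and calls addToIndex to build the index
public void loadData2DB()
{
    // Local folder that holds the raw files
    File file0 = new File(FilePathStaticString.originFilePath+"\\");
    String[] inFile = file0.list();
    // list() returns null when the folder is missing or unreadable
    if(inFile==null)
        return;
    for(int i=0;i<inFile.length;i++)
    {
        fi.filename = inFile[i];
        // Local path of the current file in the folder
        File file = new File(FilePathStaticString.originFilePath+"\\"+fi.filename);
        fi.j = Integer.valueOf(0);
        // Keep reading until fi.j equals the length of the file
        while(file.length()!=fi.j)
        {
            // Use the file agent to read one news item as an entity
            FileHolder ent = fa.getFileEntity(fi);
            // FileHolder is a container for the entity, designed to keep the BO and DC layers loosely coupled
            MyUrlInfo urlinfo = new MyUrlInfo(ent.getUrl(),ent.getTime(),ent.getTitle(),ent.getImgID(),ent.getContent(),0);
            // Add the item to the database index
            addToIndex(urlinfo);
        }
    }
}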
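// Splits the document text into words
// and indexes the relationship between the url and each word
private void addToIndex(MyUrlInfo urlInfo)
{
    // Skip the document if its word positions have already been recorded
    if(isIndexed(urlInfo.getM_sUrl()))
        return;
    // Otherwise build the index for it.
    // getEntityID first looks the entity up in the database and returns its ID if it exists;
    // if it does not exist, it inserts the entity and returns the new ID
    int urlid = getEntityID(urlInfo);
    // Natural language processing: word segmentation
    List<String> words = fa.getSeperateWord(urlInfo.getM_sContent());
    MyWordLocation wl = new MyWordLocation();
    // i is the 1-based position of the word in the document
    int i=0;
    for(String word : words)
    {
        i++;
        int wordid = getEntityID(new MyWord(word));
        wl.setM_iUrlID(urlid);
        wl.setM_iWordID(wordid);
        wl.setM_iLocation(i);
        wl.InsertEntity2DB();
    }
}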
// Returns false if the url has not been stored in the database yet
// Returns false if the url has no word index yet
// Returns true otherwise
private boolean isIndexed(String url)
{
    // Note: url is concatenated into the SQL text as-is, so a url containing a quote will break the query
    String sql = "select * from urllist where url = '"+url+"'";
    MyUrlInfo urlinfo = new MyUrlInfo();
    urlinfo.LoadEntity(sql);
    // If the database has no such record, the id of the loaded entity is null
    if(urlinfo.getM_iID()!=null)
    {
        int urlid = urlinfo.getM_iID();
        sql = "select * from wordLocation where urlid = "+urlid;
        MyWordLocation wordlocation = new MyWordLocation();
        wordlocation.LoadEntity(sql);
        if(wordlocation.getM_iID()!=null)
            return true;
    }
    return false;
}

2. The search engine

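The backend database is usually built offline; the search engine is the part that interacts with users in real time. We again use the package and class structure diagram to look at how it works internally.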

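Because the business layer and data layer are structured much like those used to build the database, the attributes and methods of AEntity, WordList, UrlList, WordLocation and DBAgent are not listed in detail. BOManager provides the interface for handling client requests and returning results. The focus here is on a few methods of the UrlRankUtil class in the Util package: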

doGetSortedMap

doGetTotalScores

doGetFrequencyScore

doGetLocationScore

doGetDistanceScore

normalizeScore

doGetWeight

For example, suppose the user enters the query “研究生开学” (graduate students, start of term).

// Counts how often “研究生” and “开学” occur in the documents that contain both words.
// The database lookup returns the documents in which both words appear; the more rows a document has, the higher its term frequency and the higher its score.
// list: List<Map<urlid, List<wordlocation>>>
private Map<Integer,Double> doGetFrequencyScore(List<Map<Integer,List<Integer>>> list)
{
    // frequency<urlid, count>
    Map<Integer,Double> frequency = new HashMap<Integer,Double>();
    // Initialize every urlid to 0
    for(Map<Integer,List<Integer>> map:list)
        for(Entry<Integer,List<Integer>>ent:map.entrySet())
        {
            frequency.put(ent.getKey(), 0.0);
        }
    // Count how many times each urlid appears in the lookup results
    for(Map<Integer,List<Integer>> map:list)
        for(Entry<Integer,List<Integer>>ent:map.entrySet())
        {
            frequency.put(ent.getKey(), frequency.get(ent.getKey())+1);
        }
    return frequency;
}
// Finds the earliest position of the target words in each document.
// The earlier the words appear, the more important they are to the document and the higher its score.
// list: List<Map<urlid, List<wordlocation>>>
private Map<Integer,Double> doGetLocationScore(List<Map<Integer,List<Integer>>> list)
{
    // location<urlid, locsum>
    Map<Integer,Double> location = new HashMap<Integer,Double>();
    // Initialize every urlid to a large sentinel value
    for(Map<Integer,List<Integer>> map:list)
        for(Entry<Integer,List<Integer>>ent:map.entrySet())
        {
            location.put(ent.getKey(),1000000.0);
        }
    // For each document, keep the combination of target-word positions with the smallest (earliest) sum
    for(Map<Integer,List<Integer>> map:list)
        for(Entry<Integer,List<Integer>>ent:map.entrySet())
        {
            double loc = getSumOfList(ent.getValue());
            if(loc<location.get(ent.getKey()))
                location.put(ent.getKey(), loc);
        }
    return location;
}
// Finds, for each document, the smallest distance between the target words.
// The closer the words are to each other, the closer the document is to what the user is looking for and the higher its score.
// list: List<Map<urlid, List<wordlocation>>>
private Map<Integer,Double> doGetDistanceScore(List<Map<Integer,List<Integer>>> list)
{
    // minDistance<urlid, dis>
    Map<Integer,Double> minDistance = new HashMap<Integer,Double>();
    boolean isWordLenTooShort = false;
    // Initialize minDistance; with one query word or fewer, every document gets the same score
    for(Map<Integer,List<Integer>> map:list)
        for(Entry<Integer,List<Integer>>ent:map.entrySet())
        {
            if(ent.getValue().size()<=1)
            {
                isWordLenTooShort = true;
                minDistance.put(ent.getKey(),1.0);
            }
            else
                minDistance.put(ent.getKey(),1000000.0);
        }
    if(isWordLenTooShort)
        return minDistance;
    // For each document, keep the combination in which adjacent target words are closest together
    for(Map<Integer,List<Integer>> map:list)
        for(Entry<Integer,List<Integer>>ent:map.entrySet())
        {
            // Sum of the gaps between consecutive query words in this combination
            double sumDis = 0.0;
            for(int i=1;i<ent.getValue().size();i++)
                sumDis+=Math.abs(ent.getValue().get(i)-ent.getValue().get(i-1));
            // Keep the smallest total distance seen so far
            if(sumDis<minDistance.get(ent.getKey()))
                minDistance.put(ent.getKey(), sumDis);
        }
    return minDistance;
}
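// Sorts the documents by score from highest to lowest and returns the sorted result (the top n are later extracted on demand and returned to the client)
private Map<Integer, Double> doGetSortedMap(Map<Integer, Double> oldMap)
{
    // Copy the map entries into a list so they can be sorted
    ArrayList<Map.Entry<Integer, Double>> list = new ArrayList<Map.Entry<Integer, Double>>(oldMap.entrySet());
    Collections.sort(list, new Comparator<Map.Entry<Integer, Double>>() {
        @Override
        public int compare(Entry<Integer, Double> o1,
                Entry<Integer, Double> o2) {
            // Compare o2 against o1 so that higher scores come first;
            // comparing o1 against o2 would sort from lowest to highest
            return o2.getValue().compareTo(o1.getValue());
        }
    });
    // LinkedHashMap preserves the sorted insertion order
    Map<Integer, Double> newMap = new LinkedHashMap<Integer, Double>();
    for (int i = 0; i < list.size(); i++) {
        newMap.put(list.get(i).getKey(), list.get(i).getValue());
    }
    return newMap;
}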
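To make the List<Map<urlid, List<wordlocation>>> input more concrete, here is a minimal, self-contained sketch of the three scores on toy data. It is not the UrlRankUtil code: MatchRowsSketch, Row and rowsFor are hypothetical names, and it assumes that each match row holds one document id plus one position per query word, and that a two-word query pairs every position of the first word with every position of the second.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy version of the three content-based scores (hypothetical names throughout). */
final class MatchRowsSketch {
    /** One match row: a document id plus one position per query word. */
    static final class Row {
        final int urlId;
        final List<Integer> locations;

        Row(int urlId, List<Integer> locations) {
            this.urlId = urlId;
            this.locations = locations;
        }
    }

    /** Pairs every position of the first word with every position of the second. */
    static List<Row> rowsFor(int urlId, List<Integer> first, List<Integer> second) {
        List<Row> rows = new ArrayList<>();
        for (int a : first) {
            for (int b : second) {
                rows.add(new Row(urlId, Arrays.asList(a, b)));
            }
        }
        return rows;
    }

    /** Frequency: number of match rows per document (higher is better). */
    static Map<Integer, Double> frequency(List<Row> rows) {
        Map<Integer, Double> score = new HashMap<>();
        for (Row r : rows) {
            score.merge(r.urlId, 1.0, Double::sum);
        }
        return score;
    }

    /** Location: smallest sum of positions per document (lower is better). */
    static Map<Integer, Double> location(List<Row> rows) {
        Map<Integer, Double> score = new HashMap<>();
        for (Row r : rows) {
            double sum = 0;
            for (int loc : r.locations) {
                sum += loc;
            }
            score.merge(r.urlId, sum, Math::min);
        }
        return score;
    }

    /** Distance: smallest total gap between adjacent query words (lower is better). */
    static Map<Integer, Double> distance(List<Row> rows) {
        Map<Integer, Double> score = new HashMap<>();
        for (Row r : rows) {
            double gap = 0;
            for (int i = 1; i < r.locations.size(); i++) {
                gap += Math.abs(r.locations.get(i) - r.locations.get(i - 1));
            }
            score.merge(r.urlId, gap, Math::min);
        }
        return score;
    }

    public static void main(String[] args) {
        List<Row> rows = new ArrayList<>();
        rows.addAll(rowsFor(1, Arrays.asList(2, 10), Arrays.asList(3)));
        rows.addAll(rowsFor(2, Arrays.asList(5), Arrays.asList(50)));
        System.out.println("frequency: " + frequency(rows));
        System.out.println("location:  " + location(rows));
        System.out.println("distance:  " + distance(rows));
    }
}
```

In the toy data, document 1 has the first word at positions 2 and 10 and the second word at position 3, so its rows are [2, 3] and [10, 3]. That gives it a frequency of 2, a location score of 5 and a distance score of 1, while document 2 (positions 5 and 50) gets 1, 55 and 45.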

This is already enough to produce some results, but it still needs refinement, such as normalization and weighting.

// Normalization to (0,1]
// Features usually have different scales because they mean different things in practice, so the document scores produced by the different methods are normalized into the range (0,1]
// score: Map<urlid, score>
// isSmallBetter: whether smaller values are better (distance: smaller is better; term frequency: larger is better)
private Map<Integer,Double> normalizeScore(Map<Integer,Double> score,boolean isSmallBetter)
{
    Map<Integer,Double> normScore = new HashMap<Integer,Double>();
    double vsmall = 0.0001; // avoid division by zero
    // The normalization differs depending on isSmallBetter
    if(isSmallBetter)
    {
        // Smaller is better: the document with the minimum value gets 1
        double minVal = (double) getMinValue(score);
        for(Entry<Integer,Double> ent:score.entrySet())
            normScore.put(ent.getKey(),minVal/Math.max(ent.getValue(), vsmall));
    }
    else
    {
        // Larger is better: the document with the maximum value gets 1
        double maxVal = (double) getMaxValue(score);
        if(maxVal==0)maxVal = vsmall;
        for(Entry<Integer,Double> ent:score.entrySet())
            normScore.put(ent.getKey(),ent.getValue()/maxVal);
    }
    return normScore;
}
// To combine several scoring methods, each method is given its own weight
private Map<Double,Map<Integer,Double>> doGetWeight(List<Map<Integer,List<Integer>>> list)
{
    // weight<weight, <urlid, score>>
    Map<Double,Map<Integer,Double>> weight = new HashMap<Double,Map<Integer,Double>>();
    Map<Integer,Double> frequency = doGetFrequencyScore(list);
    Map<Integer,Double> location = doGetLocationScore(list);
    Map<Integer,Double> distance =doGetDistanceScore(list);
    // Weights can be assigned as needed. A map cannot hold duplicate keys, so the three nearly equal
    // keys below are the only way to weight the methods "equally" with this structure
    weight.put(1.0, normalizeScore(frequency,false));
    weight.put(1.01, normalizeScore(location,true));
    weight.put(1.02, normalizeScore(distance,true));
    return weight;
}
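Keying the map by weight is what forces the 1.0 / 1.01 / 1.02 workaround. One possible alternative, shown here only as a sketch, is to keep a list of (weight, scores) pairs so that two methods can share the same weight, then sum the weighted scores and rank the documents. WeightedScoreSketch, Weighted, total and rank are hypothetical names, the normalization mirrors the idea of normalizeScore rather than reusing it, and the input scores are assumed to be positive.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Weighted combination of normalized scores (hypothetical names throughout). */
final class WeightedScoreSketch {
    /** A weight paired with one normalized score map; equal weights are allowed. */
    static final class Weighted {
        final double weight;
        final Map<Integer, Double> scores;

        Weighted(double weight, Map<Integer, Double> scores) {
            this.weight = weight;
            this.scores = scores;
        }
    }

    /** Rescales positive scores so that the best document gets 1.0. */
    static Map<Integer, Double> normalize(Map<Integer, Double> scores, boolean smallIsBetter) {
        double tiny = 0.0001; // guards against division by zero
        Map<Integer, Double> out = new HashMap<>();
        if (scores.isEmpty()) {
            return out;
        }
        if (smallIsBetter) {
            double min = Collections.min(scores.values());
            for (Map.Entry<Integer, Double> e : scores.entrySet()) {
                out.put(e.getKey(), min / Math.max(e.getValue(), tiny));
            }
        } else {
            double max = Math.max(Collections.max(scores.values()), tiny);
            for (Map.Entry<Integer, Double> e : scores.entrySet()) {
                out.put(e.getKey(), e.getValue() / max);
            }
        }
        return out;
    }

    /** Adds weight * score for every document across all scoring methods. */
    static Map<Integer, Double> total(List<Weighted> parts) {
        Map<Integer, Double> total = new HashMap<>();
        for (Weighted p : parts) {
            for (Map.Entry<Integer, Double> e : p.scores.entrySet()) {
                total.merge(e.getKey(), p.weight * e.getValue(), Double::sum);
            }
        }
        return total;
    }

    /** Returns document ids ordered from highest to lowest total score. */
    static List<Integer> rank(Map<Integer, Double> total) {
        List<Map.Entry<Integer, Double>> entries = new ArrayList<>(total.entrySet());
        entries.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
        List<Integer> ids = new ArrayList<>();
        for (Map.Entry<Integer, Double> e : entries) {
            ids.add(e.getKey());
        }
        return ids;
    }

    public static void main(String[] args) {
        Map<Integer, Double> frequency = new HashMap<>();
        frequency.put(1, 2.0);
        frequency.put(2, 1.0);
        Map<Integer, Double> distance = new HashMap<>();
        distance.put(1, 1.0);
        distance.put(2, 4.0);
        List<Weighted> parts = new ArrayList<>();
        parts.add(new Weighted(1.0, normalize(frequency, false)));
        parts.add(new Weighted(1.0, normalize(distance, true)));
        System.out.println(rank(total(parts)));
    }
}
```

With the toy values in main, document 1 scores 1.0 on both methods and document 2 scores 0.5 and 0.25, so document 1 ranks first even though both methods carry the same weight.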

Let's look at the output:

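In each key-value pair, the left side of the equals sign is the document ID (urlid) and the right side is the document's weighted score. The highest-scoring document (web page) is shown below. [Note: the crawler mainly targets the Sichuan University news site, so it is no surprise that the query “研究生开学” returns President Xie's opening ceremony speech!]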

However, this search engine is not very smart and leaves room for improvement:

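The query text is not run through any natural language processing; it is simply split on spaces and searched using the words as typed.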

When the query contains several words, a document is returned only if it contains all of them.

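The scoring methods are fairly limited; adding signals such as internal links and user interaction may be needed to find results that better match what users want.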
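As one possible direction for the second item, a document could be allowed to match any of the query words and then be ranked by how many of them it contains. The sketch below is only an illustration of that idea, not part of the project: AnyWordMatchSketch and matchedWordCounts are hypothetical names, the postings map (word to the set of document ids containing it) is an assumed input, and the English words stand in for the segmented query terms.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Matches documents that contain at least one query word (hypothetical names). */
final class AnyWordMatchSketch {
    /** Counts, per document id, how many distinct query words the document contains. */
    static Map<Integer, Integer> matchedWordCounts(Map<String, Set<Integer>> postings,
                                                   List<String> queryWords) {
        Map<Integer, Integer> counts = new HashMap<>();
        // LinkedHashSet drops repeated query words so each word is counted once
        for (String word : new LinkedHashSet<>(queryWords)) {
            Set<Integer> docs = postings.get(word);
            if (docs == null) {
                continue; // word does not occur in any document
            }
            for (int id : docs) {
                counts.merge(id, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

A document that contains both words would get a count of 2 and one that contains only one of them a count of 1; that count could then be normalized and weighted alongside the other scores.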
