
swish-e code analysis: the indexing part (5)

2009-10-08 22:19

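In the previous section, the getentry function looked the word up in the hash table and, if it was not there, initialized a new ENTRY variable for it. The entry is then processed by addentry.

2.3.4 Analysis of the addentry function

Adding a word to the hash table falls into two cases: the word already exists, or the word is new.

Consider first a word that has not yet appeared in the hash table. In this case the frequency and the position information are written (the position information also carries the structure information, which makes the later compression step easier). The position record (tp) is first placed on the currentChunkLocationList linked list.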

/* Check for first time */
if(!e->tfrequency)
{
    /* create a location record */
    tp = (LOCATION *) new_location(idx);
    tp->filenum = filenum;
    tp->frequency = 1;
    tp->metaID = metaID;
    tp->posdata[0] = SET_POSDATA(position,structure);
    tp->next = NULL;
    e->currentChunkLocationList = tp;
    e->tfrequency = 1;
    e->u1.last_filenum = filenum;

    return;
}

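addentry code fragment 1

If the entry is being indexed for the first time, its frequency is still 0 after getentry has run, so a LOCATION is initialized to hold the position information;

The file's filenum, the metaID and the actual position are written into the LOCATION, the LOCATION is attached to the entry, and the frequency is recorded as well;

After getentry and addentry have run, the word carries its meta, frequency and position information, and the word is stored in the hash table.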
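How a position and its structure flags can share one posdata slot can be sketched as follows. This is only an illustration under an assumed layout (low 8 bits for the structure flags, the remaining bits for the word position); the function names pack_posdata, posdata_position and posdata_structure are hypothetical and are not the swish-e macros themselves.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout for illustration only: low 8 bits = structure flags,
 * upper 24 bits = word position. Not copied from the swish-e headers. */
#define STRUCTURE_BITS 8u
#define STRUCTURE_MASK 0xFFu

/* Combine a word position and its structure flags into one value. */
static uint32_t pack_posdata(uint32_t position, uint32_t structure)
{
    return (position << STRUCTURE_BITS) | (structure & STRUCTURE_MASK);
}

/* Recover the word position from a packed value. */
static uint32_t posdata_position(uint32_t posdata)
{
    return posdata >> STRUCTURE_BITS;
}

/* Recover the structure flags from a packed value. */
static uint32_t posdata_structure(uint32_t posdata)
{
    return posdata & STRUCTURE_MASK;
}

/* Small self-check: packing and unpacking round-trips both fields. */
static void posdata_demo(void)
{
    uint32_t p = pack_posdata(42u, 0x09u);
    assert(posdata_position(p) == 42u);
    assert(posdata_structure(p) == 0x09u);
}
```

Keeping both fields in a single integer means one array per LOCATION is enough, at the cost of limiting the position to the bits that are left over.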

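For a word that already appears in the hash table, we must check whether this occurrence belongs to the same field (meta) and the same file as an existing record. If it does, the position is appended directly to that LOCATION; otherwise a new LOCATION structure has to be added to hold it. Because the number of positions in a LOCATION keeps growing, the storage for the position array has to be enlarged as needed.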

/* Word found -- look for same metaID and filename */
/* $$$ To do it right, should probably compare the structure, too */
/* Note: filename not needed due to compress we are only looking at the current file */
/* Oct 18, 2001 -- filename is needed since merge adds words in non-filenum order */
tp = e->currentChunkLocationList;
found = 0;
while (tp != e->currentlocation)
{
    if(tp->metaID == metaID && tp->filenum == filenum)
    {
        found = 1;
        break;
    }
    tp = tp->next;
}
-------------------------
/* Otherwise, found matching LOCATION record (matches filenum and metaID) */
/* Just add the position number onto the end by expanding the size of the LOCATION record */
/* 2001/08 jmruiz - Much better memory usage occurs if we use MemZones */
/* MemZone will be reset when the doc is completely proccesed */
newtp = add_position_location(tp, idx, tp->frequency);
if(newtp != tp)
{
    if(e->currentChunkLocationList == tp)
        e->currentChunkLocationList = newtp;
    else
        for(prevtp = e->currentChunkLocationList;;prevtp = prevtp->next)
        {
            if(prevtp->next == tp)
            {
                prevtp->next = newtp;
                break;
            }
        }
    tp = newtp;
}
tp->posdata[tp->frequency++] = SET_POSDATA(position,structure);

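addentry code fragment 2

If this is not the entry's first occurrence, the existing entry data has to be updated. The entry's LOCATION records are searched for one whose metaID and filenum both match, i.e. whether the keyword has already occurred in the same property of the same file (for example, in the text of one file). If so, only the position array of that one LOCATION needs to be updated;

If the posdata array of the LOCATION has to be extended, memory may be reallocated, which is why the code relinks the list whenever add_position_location returns a pointer different from tp;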
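The grow-and-relink step can be sketched in isolation. The following is a minimal illustration, not the swish-e implementation: it swaps the MemZone allocator for plain realloc, and the type Loc, the capacity field and the functions loc_new, loc_add_position and loc_free_all are all hypothetical names introduced here.

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified stand-in for a LOCATION record (hypothetical field set). */
typedef struct Loc {
    struct Loc *next;
    int filenum;
    int metaID;
    int frequency;          /* number of used slots in posdata */
    int capacity;           /* number of allocated slots in posdata */
    unsigned int posdata[]; /* flexible array member */
} Loc;

/* Allocate an empty record with room for `capacity` positions; NULL on failure. */
static Loc *loc_new(int filenum, int metaID, int capacity)
{
    Loc *l = malloc(sizeof *l + (size_t)capacity * sizeof l->posdata[0]);
    if (l == NULL)
        return NULL;
    l->next = NULL;
    l->filenum = filenum;
    l->metaID = metaID;
    l->frequency = 0;
    l->capacity = capacity;
    return l;
}

/* Append one position to tp, which must be on the list starting at *head.
 * When the record is full it is enlarged; realloc may move it, so the link
 * that pointed at the old record (the head or a predecessor's next) is patched.
 * Returns the possibly moved record, or NULL if allocation fails
 * (in that case the list is left unchanged). */
static Loc *loc_add_position(Loc **head, Loc *tp, unsigned int pos)
{
    if (tp->frequency == tp->capacity) {
        /* Locate the incoming link before the record can move. */
        Loc **link = head;
        while (*link != tp)
            link = &(*link)->next;

        int newcap = tp->capacity ? tp->capacity * 2 : 1;
        Loc *grown = realloc(tp, sizeof *grown + (size_t)newcap * sizeof grown->posdata[0]);
        if (grown == NULL)
            return NULL;
        grown->capacity = newcap;
        *link = grown;
        tp = grown;
    }
    tp->posdata[tp->frequency++] = pos;
    return tp;
}

/* Release every record on the list. */
static void loc_free_all(Loc *head)
{
    while (head != NULL) {
        Loc *next = head->next;
        free(head);
        head = next;
    }
}
```

Using a pointer to the incoming link handles the head of the list and interior records with the same code, where the original fragment treats the two cases separately.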

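If this occurrence of the entry is in a different place, i.e. not in the same filenum and the same metaID, a new LOCATION structure has to be added to the ENTRY's currentChunkLocationList, i.e. a separate location record is added;

Each LOCATION holds the positions of one word within the same meta and the same file.

If the occurrence is not in the same meta or the same file, another LOCATION structure has to be created to hold the new position information.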

if(!found)
{
    /* create the new LOCATION entry */
    tp = (LOCATION *) new_location(idx);
    tp->filenum = filenum;
    tp->frequency = 1;            /* count of times this word in this file:metaID */
    tp->metaID = metaID;
    tp->posdata[0] = SET_POSDATA(position,structure);
    /* add the new LOCATION onto the array */
    tp->next = e->currentChunkLocationList;
    e->currentChunkLocationList = tp;
    /* Count number of different files that this word is used in */
    if ( e->u1.last_filenum != filenum )
    {
        e->tfrequency++;
        e->u1.last_filenum = filenum;
    }
    return; /* all done */
}

addentry code fragment 3
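The three fragments can be read together as a single routine. The sketch below is a simplified reconstruction for illustration, not the swish-e function: Entry, Loc, entry_add and entry_free are hypothetical names, the position array has a fixed size (MAX_POS) instead of growing, and the search walks the whole list to NULL rather than stopping at e->currentlocation.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_POS 16 /* fixed capacity keeps the sketch short (assumption) */

/* Positions of one word within one (filenum, metaID) pair. */
typedef struct Loc {
    struct Loc *next;
    int filenum;
    int metaID;
    int frequency;              /* occurrences in this file:metaID */
    unsigned int posdata[MAX_POS];
} Loc;

/* One word of the index. */
typedef struct Entry {
    Loc *locations;             /* plays the role of currentChunkLocationList */
    int tfrequency;             /* number of different files containing the word */
    int last_filenum;           /* last file counted in tfrequency */
} Entry;

/* Record one occurrence. Returns 0 on success, -1 if memory runs out
 * or the fixed-size position array is already full. */
static int entry_add(Entry *e, int filenum, int metaID, unsigned int posdata)
{
    Loc *tp;

    /* Look for a record with the same file and the same metaID. */
    for (tp = e->locations; tp != NULL; tp = tp->next)
        if (tp->filenum == filenum && tp->metaID == metaID)
            break;

    if (tp != NULL) {
        /* Same file and meta: append the position to the existing record. */
        if (tp->frequency == MAX_POS)
            return -1;
        tp->posdata[tp->frequency++] = posdata;
        return 0;
    }

    /* First occurrence overall, or a new file/meta: create a record
     * and push it onto the front of the list. */
    tp = malloc(sizeof *tp);
    if (tp == NULL)
        return -1;
    tp->filenum = filenum;
    tp->metaID = metaID;
    tp->frequency = 1;
    tp->posdata[0] = posdata;
    tp->next = e->locations;
    e->locations = tp;

    /* Count each file once, even when the word shows up under several metaIDs. */
    if (e->tfrequency == 0 || e->last_filenum != filenum) {
        e->tfrequency++;
        e->last_filenum = filenum;
    }
    return 0;
}

/* Release all records of an entry. */
static void entry_free(Entry *e)
{
    while (e->locations != NULL) {
        Loc *next = e->locations->next;
        free(e->locations);
        e->locations = next;
    }
}
```

Note that tfrequency counts files, not occurrences: a second metaID in the same file adds a LOCATION but leaves tfrequency unchanged, which matches the last_filenum check in fragment 3.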
