
String Matching Algorithms: The Boyer-Moore-Horspool Algorithm

2013-12-01 09:56
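The Boyer-Moore-Horspool algorithm, also called the Horspool algorithm, was designed by Nigel Horspool in 1980 as a simplification of the Boyer-Moore (BM) algorithm. BM's good-suffix rule is hard to understand, and proofs of its efficiency and correctness were still unresolved at the time, so the Horspool algorithm uses only BM's bad-character rule.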

Borrowing the idiom "find a needle in a haystack", the task here is to find the string needle inside the string haystack (needle plays the role of the pattern). Assume haystack has length n and needle has length m.

Basic principle:

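Like Boyer-Moore, the Horspool algorithm compares characters from right to left, but it modifies the bad-character rule. When a mismatched character is encountered during the right-to-left comparison:

BM shift rule: the mismatched haystack character is aligned with its rightmost occurrence in needle;

Horspool shift rule: the haystack character currently aligned with the last character of needle is aligned with its rightmost occurrence in needle (not counting needle's last position), regardless of where the mismatch occurred;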
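To make the contrast between the two rules concrete, here is a minimal sketch that computes each shift by scanning needle directly. The names horspool_shift and bm_bad_char_shift are hypothetical helpers introduced only for illustration, and the BM function models only the simple bad-character rule (no good-suffix rule), under the assumption that the mismatch happened at needle index j.

```c
#include <stddef.h>

/* Hypothetical helper: Horspool shift for byte c, where c is the haystack
 * byte aligned with the LAST byte of needle. Only needle[0..m-2] is
 * searched, so the result is always at least 1 when m >= 1. */
size_t horspool_shift(const unsigned char *needle, size_t m, unsigned char c)
{
    size_t i;
    for (i = m; i > 1; i--) {
        /* candidate index i-2 runs from m-2 down to 0 */
        if (needle[i - 2] == c)
            return m - i + 1;   /* distance from index i-2 to index m-1 */
    }
    return m;                   /* c does not occur: skip the whole needle */
}

/* Hypothetical helper: simple BM bad-character shift. c is the mismatched
 * haystack byte and j is the needle index where the mismatch happened. */
size_t bm_bad_char_shift(const unsigned char *needle, size_t m,
                         size_t j, unsigned char c)
{
    size_t i;
    for (i = m; i > 0; i--) {
        if (needle[i - 1] == c)             /* rightmost occurrence: i-1 */
            return (i - 1 < j) ? j - (i - 1) : 1;
    }
    return j + 1;               /* c absent: move just past the mismatch */
}
```

For needle "abcd" aligned with the window "azcd", the mismatch is at j = 1 on 'z': the simple BM rule gives a shift of 2, while the Horspool rule looks at the window's last byte 'd' and gives 4.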

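The bad-character skip table is initialized much as in BM, except that the last character of needle is left out when filling it. Once the principle is clear, the code is easy to follow;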
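As a small worked example of the table, the sketch below fills a skip table for a given needle. The function name build_skip_table is a hypothetical helper, not part of any library; it assumes one table entry per possible byte value.

```c
#include <stddef.h>
#include <limits.h>

/* Hypothetical helper: table[c] becomes the Horspool shift for byte c. */
void build_skip_table(const unsigned char *needle, size_t m,
                      size_t table[UCHAR_MAX + 1])
{
    size_t i;

    /* Bytes that never occur in needle allow a shift of the full length. */
    for (i = 0; i <= UCHAR_MAX; i++)
        table[i] = m;

    /* Every position except the last one: later (more rightward)
     * occurrences overwrite earlier ones, so the smallest shift wins. */
    for (i = 0; i + 1 < m; i++)
        table[needle[i]] = m - 1 - i;
}
```

For the needle "AT-THAT" (m = 7) this yields A = 1, T = 3, '-' = 4, H = 2, and 7 for every other byte.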

The implementation follows:

#include <stdio.h>
#include <string.h>       //strlen
#include <limits.h>       //UCHAR_MAX

/* Returns a pointer to the first occurrence of "needle"
 * within "haystack", or NULL if not found. Works like
 * memmem() or strstr().
 */

/* Note: In this example needle is a C string. The terminating
 * 0x00 must not be counted, so you could call this example with
 * boyermoore_horspool_memmem(haystack, hlen,
 *     (const unsigned char *)"abc", sizeof("abc") - 1)
 */
const unsigned char *
boyermoore_horspool_memmem(const unsigned char* haystack, size_t hlen,
                           const unsigned char* needle,   size_t nlen)
{
    size_t scan = 0;
    size_t bad_char_skip[UCHAR_MAX + 1]; /* Officially called:
                                          * bad character shift */

    /* Sanity checks on the parameters (size_t is unsigned, so only
     * a zero length needs to be rejected) */
    if (nlen == 0 || !haystack || !needle)
        return NULL;

    /* ---- Preprocess ---- */
    /* Initialize the table to default value */
    /* When a character is encountered that does not occur
     * in the needle, we can safely skip ahead for the whole
     * length of the needle.
     */
    for (scan = 0; scan <= UCHAR_MAX; scan = scan + 1)
        bad_char_skip[scan] = nlen;

    /* C arrays have the first byte at [0], therefore:
     * [nlen - 1] is the last byte of the array. */
    size_t last = nlen - 1;

    /* Then populate it with the analysis of the needle
     * (the last byte of the needle is deliberately excluded) */
    for (scan = 0; scan < last; scan = scan + 1)
        bad_char_skip[needle[scan]] = last - scan;

    /* ---- Do the matching ---- */

    /* Search the haystack, while the needle can still be within it. */
    while (hlen >= nlen)
    {
        /* scan from the end of the needle */
        for (scan = last; haystack[scan] == needle[scan]; scan = scan - 1)
        {
            if (scan == 0) /* If the first byte matches, we've found it. */
                return haystack;
        }

        /* otherwise, we need to skip some bytes and start again.
           Note that the skip value is taken from the haystack byte aligned
           with the last byte of needle, no matter where the mismatch
           occurred. So if needle is "abcd" and that byte is 'd', the skip
           is 4, and for "abcdd" we again skip on 'd' but the value is only 1.
           The alternative of skipping based on the mismatched character
           is slower in the normal case (e.g. finding "abcd" in
           "...azcd..." gives 4 by using 'd' but only 4-2==2 using 'z'). */
        hlen     -= bad_char_skip[haystack[last]];
        haystack += bad_char_skip[haystack[last]];      /* this is the main difference from BM's bad-character rule */
    }

    return NULL;
}

int main(void)
{
    char haystack[80] = "WHICH-FINALLY-HALTS.--AT-THAT-POINT";
    char needle[80] = "AT-THAT";
    const unsigned char* find_str = NULL;

    find_str = boyermoore_horspool_memmem((const unsigned char *)haystack, strlen(haystack),
                                          (const unsigned char *)needle, strlen(needle));
    if (NULL != find_str)
    {
        /* prints the haystack from the match position onward */
        printf("Found pattern at: %s\n", (const char *)find_str);
    }
    else
    {
        printf("Pattern not found!\n");
    }
    return 0;
}
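
To see how few window positions the search actually visits, the following sketch is an instrumented variant of the same idea. The function horspool_trace and its tries parameter are hypothetical names introduced here; it returns an offset instead of a pointer and is meant only as an illustration, not as a replacement for the listing above.

```c
#include <stddef.h>
#include <limits.h>

/* Hypothetical instrumented variant: returns the offset of the first match,
 * or (size_t)-1 if there is none, and stores in *tries the number of
 * window alignments that were examined. */
size_t horspool_trace(const unsigned char *hay, size_t n,
                      const unsigned char *needle, size_t m, size_t *tries)
{
    size_t skip[UCHAR_MAX + 1];
    size_t i, pos = 0;

    *tries = 0;
    if (m == 0 || n < m)
        return (size_t)-1;

    /* Build the skip table, excluding the last byte of needle. */
    for (i = 0; i <= UCHAR_MAX; i++)
        skip[i] = m;
    for (i = 0; i + 1 < m; i++)
        skip[needle[i]] = m - 1 - i;

    while (pos <= n - m) {
        (*tries)++;
        /* Compare right to left within the current window. */
        i = m - 1;
        while (hay[pos + i] == needle[i]) {
            if (i == 0)
                return pos;
            i--;
        }
        /* Shift by the entry for the window's last byte. */
        pos += skip[hay[pos + m - 1]];
    }
    return (size_t)-1;
}
```

With the haystack and needle used in main above, this sketch should report a match at offset 22 after examining 6 alignments (offsets 0, 7, 11, 14, 18, 22), whereas a naive search that advances one position at a time would examine 23.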