
String Matching Algorithms: "Boyer Moore"

2013-11-29 20:20

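The Boyer-Moore string search algorithm is a highly efficient string searching algorithm. It was designed by Bob Boyer and J Strother Moore and published in 1977; the original definition was already given in 1975, with the construction algorithm and the proof of the algorithm following later.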

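First, some definitions:

1. pattern is the pattern string, of length patLen;

2. Text is the string being searched, of length n;

3. j is the position in pattern of the character that currently fails to match (0 ≤ j ≤ patLen - 1);

4. m is the number of characters already matched (0 ≤ m < patLen), so j + m = patLen - 1;

5. Δ(*) is the position in pattern of the rightmost occurrence of the character *, where * can be any character;

Many references explain the algorithm with array positions starting at 1; to stay close to the code, every index here starts at 0.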

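Let us start with the bad character rule.

I. Bad character rule: align the mismatched text character with its rightmost occurrence in pattern; if the character does not occur in pattern, skip past it entirely.

> Case 1: the mismatched character does not occur in pattern (see the figure):

Text pointer moves right by: patLen, which places it at the right end of the shifted pattern;

Pattern moves right by: patLen - m;

> Case 2: the mismatched character does occur in pattern. There are two sub-cases:

a>. The rightmost occurrence of the character in pattern lies to the left of the mismatch position (see the figure; there patLen = 7, the mismatched character is '-', Δ('-') = 2 and m = 2):

Text pointer moves right by: j - Δ('-') + m = (j + m) - Δ('-') = (patLen - 1) - Δ('-') = (7 - 1) - 2 = 4

Pattern moves right by: pointer offset - m = 4 - 2 = 2;

b>. The rightmost occurrence of the character in pattern lies to the right of the mismatch position (see the figure; there the mismatched character is 'T', Δ('T') = 6 and m = 2):

Text pointer moves right by: (patLen - 1) - Δ('T') = (7 - 1) - 6 = 0

Pattern moves right by: pointer offset - m = 0 - 2 = -2

The pattern would move backwards here, which must never happen; in this case simply move the pattern right by 1.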

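Summing up the three cases, we define the bad character function delta1() as the offset of the text pointer:

delta1(*) = patLen; (the mismatched character does not occur in pattern)

= patLen - 1 - Δ(*); (the mismatched character occurs in pattern, and its rightmost occurrence lies to the left of the mismatch position)

= m + 1; (the mismatched character occurs in pattern, and its rightmost occurrence lies to the right of the mismatch position; the pattern moves right by 1, so the text pointer moves by m + 1)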

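II. Good suffix rule: take the part that has already matched (subpat) and look in pattern for another substring that matches subpat completely or partially; align with it directly and avoid useless shifts.

A few conventions first:

1. Let $ be a character that does not occur in pattern, and let pat[i] = $ for i < 0;

2. Two sequences [c1 ... cn] and [d1 ... dn] unify if and only if, for every j (1 ≤ j ≤ n), cj = dj or cj = $ or dj = $;

3. rpr(j) (rightmost plausible reoccurrence) is the position of the rightmost plausible reoccurrence of subpat (pat[j+1 ~ patLen-1]): the largest k such that [pat[j+1] ... pat[patLen-1]] and [pat[k] ... pat[k + patLen - j - 2]] unify and either k ≤ 0 or pat[k-1] != pat[j].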

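The figure above lists the rpr() values computed for the pattern "ABXYCDEXY" (for j = 0 ... 8 they are -8, -7, -6, -5, -4, -3, 2, -1, 8). Let us walk through some of them:

a>. j = 8: the matched substring pat[j+1 ... patLen-1] is empty. By the definition of rpr(), the rightmost position at which the empty string can plausibly reoccur is k = 8 (pat[7] = 'X' != pat[8] = 'Y'), so rpr(8) = 8.

b>. j = 7: subpat is "Y". pat[3 ~ 3] = subpat with k = 3 > 0, but pat[k-1] == pat[j] == 'X', so the condition fails. Continuing to the left there is no other 'Y', so subpat can only be placed at position -1, just before the head of pattern: rpr(7) = -1.

c>. j = 6: subpat is "XY". pat[2 ~ 3] = subpat and pat[k-1] != pat[j] ('B' != 'E'), so rpr(6) = 2.

d>. j = 5: subpat is "EXY". No substring of pattern unifies with subpat, so it can only lie entirely before the head of pattern: rpr(5) = -3;

The remaining values follow in the same way; the cases above cover every way an rpr() value can arise. One regularity stands out:

rpr(patLen-1) = patLen-1 whenever the last two characters of pattern differ.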

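This gives the shift of the good suffix rule, which aligns pat[k] with the text position that was under pat[j+1]:

Pattern moves right by: j + 1 - rpr(j)

Text pointer moves right by: m + j + 1 - rpr(j) = (patLen - 1 - j) + j + 1 - rpr(j) = patLen - rpr(j)

We therefore define the good suffix offset as:

delta2(j) = patLen - rpr(j); (0 ≤ j < patLen)

* Readers who have seen other descriptions of the BM algorithm may have come across delta2(j) = patLen + 1 - rpr(j). As noted at the start, array indices here begin at 0, so every rpr(j) is 1 smaller than in a 1-based description and the two formulas give the same offset.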

The complete implementation follows:

// Note: the listing uses new/delete and bool, so compile it as C++.
#include <stdio.h>   // printf()
#include <string.h>  // strlen()
#include <stdlib.h>  // NULL

#define ALPHABET_SIZE (1 << (sizeof(char)*8))

// Enable any/all to trace intermediate results
//#define TRACE_DELTA1
//#define TRACE_DELTA2
//#define TRACE_BM

#if defined TRACE_DELTA1 || defined TRACE_DELTA2 || defined TRACE_BM
#include <ctype.h>   // toupper()
#endif

void calc_delta1(const char *pat, int patlen, int delta1[])
{
    int j = 0;

    // Characters that do not occur in pat move the text pointer by patlen
    for (j = 0; j < ALPHABET_SIZE; j++)
        delta1[j] = patlen;

    for (j = 0; j < patlen; j++)
    {
        // By scanning pat from left to right, the final value in
        // delta1[char] corresponds to the *rightmost* occurrence of
        // char in pat. Index through unsigned char so that bytes
        // >= 0x80 cannot yield a negative index where char is signed.
        delta1[(unsigned char)pat[j]] = patlen - 1 - j;
    }

#ifdef TRACE_DELTA1
    printf("Starting dump delta1[]>>>>>>>>>>>>>>>>>>>>>>>>>\n");
    for (j = 0; j < ALPHABET_SIZE; j++)
    {
        if (delta1[j] != patlen)
        {
            printf("       %c:%d\n", (char)j, delta1[j]);
        }
    }
    printf("  others:%d\n", patlen);
#endif
}

void calc_delta2(const char *pat, int patlen, int *delta2)
{
    int i = 0, j = 0, s = 0, m = 0;
    // rpr[j]: where we can find the rightmost plausible reoccurrence of pat[j+1 .. patlen-1]
    int *rpr = new int[patlen];

    // Mark each uninitialized rpr value with a large negative index
    const int def = -2*patlen;
    for (i = 0; i != patlen; i++)
    {
        rpr[i] = def;
    }

    // r: number of uninitialized entries in rpr[]
    int r = patlen;

    // Scan pattern from right to left until all rpr[] are initialized.
    // s: scan position.
    // Examine every candidate reoccurrence pat[s-m .. s-1], i.e. every
    // substring that ends just before pat[s], including the empty one (m == 0)
    for (s = patlen - 1; r > 0; s--)
    {
        // m: length of the candidate pat[s-m .. s-1]
        for (m = 0; m <= patlen - 1 && r > 0; m++)
        {
            // Introduce j and k (as used in the BM paper)
            // j: index of the character just left of the suffix (the mismatch position)
            int j = patlen - m - 1;
            // k: index of leftmost character of (possible) reoccurrence.
            int k = s - m;

#ifdef TRACE_DELTA2
            const int indent = patlen;
            printf("\ns:%d m:%d j:%d k:%d\n", s, m, j, k);
            printf("p  :%*s%s\n", indent, "", pat);
            printf("j  :%*s%*.*s\n", indent+j, "", m+1, m+1, &pat[j] );
            printf("k-1:%*s", indent+k-1, "");
            for (int n = 0; n <= m; n++)
            {
                printf("%c", (k-1+n < 0 ? pat[j+n] : pat[k-1+n]) );
            }
            printf("\n");
#endif

            // The m characters pat[j+1 .. j+m] match pat[k .. k+m-1].
            // Compare pat[j] to pat[k-1].
            // Match: extend the substring to the left by increasing m
            // Mismatch: terminate the substring and check if plausible RPR

            bool mismatch = false;
            if (k > 0)
            {
                if (pat[j] == pat[k-1]) // extend substring
                    continue;
                mismatch = true;
            }
            // else the preceding char, pat[k-1], lies to the left of pat[0],
            // which terminates the substring

            // We have a match of m (possibly zero) characters:
            // pat[j+1 .. j+m] matches pat[k .. k+m-1] and
            // either pat[j] != pat[k-1] or k <= 0.
            // So rpr[j] = k (unless rpr[j] is already > k)
            if (rpr[j] < k)
            {
#ifdef TRACE_DELTA2
                printf("2  :%*s %c %*.*s %*s s:%d m:%d j:%d k:%d r:%d\n",
                       indent+j, "",
                       toupper((unsigned char)pat[j]),
                       m, m, &pat[j+1],
                       (patlen-j-1-m), "",
                       s, m, j, k, r);
#endif
                rpr[j] = k;
                r--;
            }
#ifdef TRACE_DELTA2
            else
            {
                printf("rpr[%d]=%d already inited\n", j, rpr[j]);
            }
#endif

            // Once we have a mismatch (pat[j] != pat[k-1]) it is fruitless
            // to examine longer candidates at this scan position s: none of
            // them can be the rightmost plausible reoccurrence of a longer
            // terminal substring pat[j+1 .. patlen-1]
            if (mismatch)
            {
                break;
            }
        }
    }

    for (j = 0; j != patlen; j++)
    {
        delta2[j] = patlen - rpr[j];
    }

#ifdef TRACE_DELTA2
    printf("R:"); // trace rpr[] values
    for (j = 0; j != patlen; j++)
    {
        printf(" %3d", rpr[j] );
    }

    printf("\n");
    printf("D:"); // trace delta2[] values

    for (j = 0; j != patlen; j++)
    {
        printf(" %3d", delta2[j] );
    }
    printf("\n");
#endif

    delete [] rpr;
}

/*
 * Boyer-Moore search algorithm.
 * Returns a pointer to the first occurrence of pat in string, or NULL.
 */
const char *boyermoore_search(const char *string, const char *pat)
{
    int i = 0, j = 0, stringlen = 0;
    const char *result = NULL;

    int patlen = (int)strlen(pat);
    int *delta1 = NULL;
    int *delta2 = NULL;

    if (patlen == 0)
        goto out;

    stringlen = (int)strlen(string);
    if (patlen > stringlen)
        goto out;

    delta1 = new int[ALPHABET_SIZE];
    delta2 = new int[patlen];

#ifdef TRACE_BM
    printf("pattern: %s\n", pat);
#endif
    calc_delta1(pat, patlen, delta1);
    calc_delta2(pat, patlen, delta2);

#ifdef TRACE_BM
    printf("\nCalculating boyermoore_search>>>>>>>>>>>>>>>>>>>>>>>>>\n");
#endif

    // i: index of current string character
    for (i = patlen-1;;)
    {
        // string[stringlen] is the terminator, so stop once i reaches it
        if (i >= stringlen)
        {
            result = NULL;
            goto out;
        }

        // j: index of current pattern character
        j = patlen-1;
        for (;;)
        {
            if (string[i] != pat[j])
                break;

#ifdef TRACE_BM
            printf("p:%*s%*.*s%c%*.*s\n",
                   (i-j), "",
                   j, j, pat,
                   toupper((unsigned char)pat[j]), // mark matched char with upcase
                   patlen-j-1, patlen-j-1, &pat[j+1]);
#endif
            // Report a match only after pat[0] itself has been compared
            if (j == 0)
            {
                result = &string[i];
                goto out;
            }
            j--;
            i--;
        }

#ifdef TRACE_BM
        printf("p:%*s%*.*s%c%*.*s\n",
               (i-j), "",
               j, j, pat,
               '?', // mark mismatch char
               patlen-j-1, patlen-j-1, &pat[j+1]);
        printf("c:%s\n", string);
#endif
        // bc: "bad character" shift amount
        int bc = delta1[(unsigned char)string[i]];

        // gs: "good suffix" shift amount
        int gs = delta2[j];

#ifdef TRACE_BM
        printf("j:%d bc:%d gs:%d\n\n", j, bc, gs);
#endif
        // Advance by the larger of the two offsets (portable replacement for __max)
        i += (bc > gs) ? bc : gs;
    }

    /* common exit: release the tables (delete [] on NULL is a no-op) */
out:
    delete [] delta1;
    delete [] delta2;
    return result;
}

int main(void)
{
    const char src_str[] = "WHICH-FINALLY-HALTS.--AT-THAT-POINT";
    const char pat_str[] = "AT-THAT";
    const char *find_str = NULL;

    find_str = boyermoore_search(src_str, pat_str);
    if (NULL != find_str)
    {
        printf("\n Found pattern at: %s\n", find_str);
    }
    else
    {
        printf("Pattern not found!\n");
    }
    return 0;
}

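The Boyer-Moore algorithm is sublinear on average: a search usually inspects fewer than n characters of Text, and in the best case only about n / patLen. The worst case for finding the first occurrence is linear, O(patLen + n). The longer the pattern, the more efficient BM tends to be.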

References:
1. A Fast String Searching Algorithm

2. http://en.wikipedia.org/wiki/User:RMcPhillip/sandbox/boyer-moore
