A Simple String Matching Problem: Solving It with Hashing
2012-10-31 19:24
Pattern matching is the most fundamental algorithmic operation on text strings. This algorithm implements the find command available in any web browser or text
editor:
Problem: Substring Pattern Matching
Input: A text string t and a pattern string p.
Output: Does t contain the pattern p as a substring, and if so where?
Basic solution:
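#include <string.h>

/* Return the index of the first occurrence of p in t, or -1 if there is none. */
int findmatch(char *p, char *t){
    int i,j;    /* i: start position in t, j: offset into p */
    int m, n;   /* lengths of pattern and text */

    m = (int)strlen(p);
    n = (int)strlen(t);

    /* Try every position where the pattern could still fit. */
    for (i=0; i<=(n-m); i=i+1) {
        j=0;
        while ((j<m) && (t[i+j]==p[j]))
            j = j+1;
        if (j == m) return(i);   /* all m characters matched */
    }

    return(-1);
}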
Hashing and Strings
Hash tables are a very practical way to maintain a dictionary. They exploit the fact that looking an item up in an array takes constant time once you have its index. A
hash function is a mathematical function that maps keys to integers. We will use
the value of our hash function as an index into an array, and store our item at that
position.
The first step of the hash function is usually to map each key to a big integer.
Let α be the size of the alphabet on which a given string S is written. Let char(c)
be a function that maps each symbol of the alphabet to a unique integer from 0 to
α − 1. The function
$$H(S) = \sum_{i=0}^{|S|-1} \alpha^{|S|-(i+1)} \times \mathrm{char}(s_i)$$
maps each string to a unique (but large) integer by treating the characters of the
string as “digits” in a base-α number system.
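As a minimal sketch of this base-α idea, the function below evaluates the sum with Horner's rule. The names `ALPHA`, `char_code`, and `string_hash`, and the restriction to lowercase letters (so α = 26), are assumptions made for illustration and are not part of the original text. Note that a leading 'a' acts as a leading zero digit, so the value is only guaranteed unique among strings of the same length, and it wraps around modulo 2^64 once the string is longer than 13 characters.

```c
#include <assert.h>
#include <stddef.h>

#define ALPHA 26ULL  /* assumed alphabet size: lowercase 'a'..'z' */

/* char(c): map a lowercase letter to 0..ALPHA-1 (assumption of this sketch). */
static unsigned long long char_code(char c) {
    assert(c >= 'a' && c <= 'z');
    return (unsigned long long)(c - 'a');
}

/* H(S): read the len characters of s as digits of a base-ALPHA number,
   most significant digit first, using Horner's rule. */
unsigned long long string_hash(const char *s, size_t len) {
    unsigned long long h = 0;
    for (size_t i = 0; i < len; i++)
        h = h * ALPHA + char_code(s[i]);
    return h;
}
```

For instance, `string_hash("abc", 3)` evaluates 0·26² + 1·26 + 2.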
Efficient String Matching via Hashing
Strings are sequences of characters where the order of the characters matters, since ALGORITHM is different than LOGARITHM. Text strings are fundamental to a
host of computing applications, from programming language parsing/compilation,
to web search engines, to biological sequence analysis.
The primary data structure for representing strings is an array of characters.
This allows us constant-time access to the ith character of the string. Some auxiliary
information must be maintained to mark the end of the string—either a special
end-of-string character or (perhaps more usefully) a count of the n characters in
the string.
The most fundamental operation on text strings is substring search, namely:
Problem: Substring Pattern Matching
Input: A text string t and a pattern string p.
Output: Does t contain the pattern p as a substring, and if so where?
The simplest algorithm to search for the presence of pattern string p in text t
overlays the pattern string at every position in the text, and checks whether every
pattern character matches the corresponding text character. As demonstrated in
Section 2.5.3 (page 43), this runs in O(nm) time, where n = |t| and m = |p|.
This quadratic bound is worst-case. More complicated, worst-case linear-time
search algorithms do exist: see Section 18.3 (page 628) for a complete discussion.
But here we give a linear expected-time algorithm for string matching, called the
Rabin-Karp algorithm. It is based on hashing. Suppose we compute a given hash
function on both the pattern string p and the m-character substring starting from
the ith position of t. If these two strings are identical, clearly the resulting hash
values must be the same. If the two strings are different, the hash values will
almost certainly be different. These false positives should be so rare that we can
easily spend the O(m) time it takes to explicitly check the identity of two strings
whenever the hash values agree.
This reduces string matching to n−m+2 hash value computations (the n−m+1
windows of t, plus one hash of p), plus what should be a very small number of O(m)
time verification steps. The catch is that it takes O(m) time to compute a hash
function on an m-character string, and O(n) such computations seems to leave us
with an O(mn) algorithm again.
But let’s look more closely at our previously defined hash function, applied to
the m characters starting from the jth position of string S:
$$H(S, j) = \sum_{i=0}^{m-1} \alpha^{m-(i+1)} \times \mathrm{char}(s_{i+j})$$
What changes if we now try to compute H(S, j + 1)—the hash of the next
window of m characters? Note that m−1 characters are the same in both windows,
although this differs by one in the number of times they are multiplied by α. A
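little algebra reveals that
$$H(S, j+1) = \alpha \left( H(S, j) - \alpha^{m-1} \, \mathrm{char}(s_j) \right) + \mathrm{char}(s_{j+m})$$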
This means that once we know the hash value from the jth position, we can find
the hash value from the (j + 1)st position for the cost of two multiplications, one
addition, and one subtraction. This can be done in constant time (the value of
α^(m−1) can be computed once and used for all hash value computations). This math
works even if we compute H(S, j) mod M, where M is a reasonably large prime
number, thus keeping the size of our hash values small (less than M) even when the
pattern string is long.
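The constant-time window update can be sketched in C roughly as follows. This is one possible reading of the description above, not a reference implementation: the function name `rk_findmatch`, the byte alphabet (α = 256), and the fixed prime 1,000,000,007 are all assumptions chosen for illustration. A randomized variant would pick M at random instead of hard-coding it.

```c
#include <assert.h>
#include <string.h>

#define RK_ALPHA 256ULL          /* assumed alphabet: all byte values */
#define RK_M     1000000007ULL   /* assumed prime modulus, for illustration */

/* Return the first index where p occurs in t, or -1 if it does not occur. */
long rk_findmatch(const char *p, const char *t) {
    assert(p != NULL && t != NULL);
    size_t m = strlen(p), n = strlen(t);
    if (m > n) return -1;
    if (m == 0) return 0;

    /* hp = H(p) mod M, ht = H(t, 0) mod M, top = ALPHA^(m-1) mod M */
    unsigned long long hp = 0, ht = 0, top = 1;
    for (size_t i = 0; i < m; i++) {
        hp = (hp * RK_ALPHA + (unsigned char)p[i]) % RK_M;
        ht = (ht * RK_ALPHA + (unsigned char)t[i]) % RK_M;
        if (i > 0) top = (top * RK_ALPHA) % RK_M;
    }

    for (size_t j = 0; ; j++) {
        /* Hashes agree: spend O(m) to rule out a false positive. */
        if (hp == ht && memcmp(t + j, p, m) == 0) return (long)j;
        if (j + m >= n) break;   /* that was the last window */
        /* Slide: remove t[j], multiply by ALPHA, append t[j+m]. */
        unsigned long long drop = (top * (unsigned char)t[j]) % RK_M;
        ht = ((ht + RK_M - drop) % RK_M * RK_ALPHA
              + (unsigned char)t[j + m]) % RK_M;
    }
    return -1;
}
```

Adding `RK_M` before subtracting keeps the intermediate value non-negative, which matters because the arithmetic is unsigned.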
Rabin-Karp is a good example of a randomized algorithm (if we pick M in some
random way). We get no guarantee the algorithm runs in O(n+m) time, because we
may get unlucky and have the hash values regularly collide with spurious matches.
Still, the odds are heavily in our favor—if the hash function returns values uniformly
from 0 to M − 1, the probability of a false collision should be 1/M. This is quite
reasonable: if M ≈ n, there should only be one false collision per string, and if
M ≈ n^k for k ≥ 2, the odds are great we will never see any false collisions.
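To make these odds concrete, here is a short worked bound. It assumes, as the text does, that each non-matching window collides with the pattern's hash with probability 1/M. Since there are at most n − m + 1 windows, linearity of expectation gives:

```latex
\mathbb{E}[\text{false collisions}] \le \frac{n-m+1}{M}
\quad\Longrightarrow\quad
M \approx n:\ \mathbb{E} \lesssim 1,
\qquad
M \approx n^{2}:\ \mathbb{E} \lesssim \frac{1}{n}
```

For example, with n = 10^6 and M near 10^9, the expected number of false collisions over the whole text is about 10^-3.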
Duplicate Detection Via Hashing
The key idea of hashing is to represent a large object (be it a key, a string, or a substring) using a single number. The goal is a representation of the large object
by an entity that can be manipulated in constant time, such that it is relatively
unlikely that two different large objects map to the same value.
Hashing has a variety of clever applications beyond just speeding up search. I
once heard Udi Manber—then Chief Scientist at Yahoo—talk about the algorithms
employed at his company. The three most important algorithms at Yahoo, he said,
were hashing, hashing, and hashing.
Consider the following problems with nice hashing solutions:
• Is a given document different from all the rest in a large corpus? – A search
engine with a huge database of n documents spiders yet another webpage.
How can it tell whether this adds something new to the database, or
is just a duplicate page that exists elsewhere on the Web?
Explicitly comparing the new document D to all n documents is hopelessly
inefficient for a large corpus. But we can hash D to an integer, and compare
it to the hash codes of the rest of the corpus. Only when there is a collision
is D a possible duplicate. Since we expect few spurious collisions, we can
explicitly compare the few documents sharing the exact hash code with little
effort.
• Is part of this document plagiarized from a document in a large corpus? – A
lazy student copies a portion of a Web document into their term paper. “The
Web is a big place,” he smirks. “How will anyone ever find which one?”
This is a more difficult problem than the previous application. Adding, deleting,
or changing even one character from a document will completely change
its hash code. Thus the hash codes produced in the previous application
cannot help for this more general problem.
However, we could build a hash table of all overlapping windows (substrings)
of length w in all the documents in the corpus. Whenever there is a match of
hash codes, there is likely a common substring of length w between the two
documents, which can then be further investigated. We should choose w to
be long enough so such a co-occurrence is very unlikely to happen by chance.
The biggest downside of this scheme is that the size of the hash table becomes
as large as the documents themselves. Retaining a small but well-chosen
subset of these hash codes (say those which are exact multiples of 100) for
each document leaves us likely to detect sufficiently long duplicate strings.
• How can I convince you that a file isn’t changed? – In a closed-bid auction,
each party submits their bid in secret before the announced deadline. If you
knew what the other parties were bidding, you could arrange to bid $1 more
than the highest opponent and walk off with the prize as cheaply as possible.
Thus the “right” auction strategy is to hack into the computer containing
the bids just prior to the deadline, read the bids, and then magically emerge
the winner.
How can this be prevented? What if everyone submits a hash code of their
actual bid prior to the deadline, and then submits the full bid after the deadline?
The auctioneer will pick the largest full bid, but checks to make sure the
hash code matches that submitted prior to the deadline. Such cryptographic
hashing methods provide a way to ensure that the file you give me today is
the same as the original, because any changes to the file will result in changing
the hash code.
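The windowed scheme from the plagiarism example can be sketched as below. This is only an illustration under stated assumptions: the function name `window_fingerprints`, the byte alphabet, the fixed prime modulus, and the caller-supplied output buffer are all hypothetical choices, and a real system would store the retained codes in a hash table keyed by document. The parameter `keep_mod` plays the role of the "exact multiples of 100" rule.

```c
#include <assert.h>
#include <string.h>

#define FP_ALPHA 256ULL          /* assumed alphabet: all byte values */
#define FP_M     1000000007ULL   /* assumed prime modulus, for illustration */

/* Hash every window of length w in doc with a rolling hash and store the
   hash codes divisible by keep_mod in out[] (at most cap of them).
   Returns the number of codes stored. */
size_t window_fingerprints(const char *doc, size_t w,
                           unsigned long long keep_mod,
                           unsigned long long *out, size_t cap) {
    assert(doc != NULL && out != NULL && w > 0 && keep_mod > 0);
    size_t n = strlen(doc), count = 0;
    if (w > n) return 0;

    unsigned long long h = 0, top = 1;   /* top = ALPHA^(w-1) mod M */
    for (size_t i = 0; i < w; i++) {
        h = (h * FP_ALPHA + (unsigned char)doc[i]) % FP_M;
        if (i > 0) top = (top * FP_ALPHA) % FP_M;
    }
    for (size_t j = 0; ; j++) {
        /* Keep only the selected subset of window hashes. */
        if (h % keep_mod == 0 && count < cap) out[count++] = h;
        if (j + w >= n) break;           /* that was the last window */
        unsigned long long drop = (top * (unsigned char)doc[j]) % FP_M;
        h = ((h + FP_M - drop) % FP_M * FP_ALPHA
             + (unsigned char)doc[j + w]) % FP_M;
    }
    return count;
}
```

Because the same window text always yields the same code, a copied passage of length at least w produces matching entries in both documents' fingerprint sets, which can then be compared explicitly. Note that this simple polynomial hash is not cryptographic, so it would not be suitable for the auction example.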
Although the worst-case bounds on anything involving hashing are dismal, with
a proper hash function we can confidently expect good behavior. Hashing is a fundamental
idea in randomized algorithms, yielding linear expected-time algorithms
for problems otherwise Θ(n log n), or Θ(n^2) in the worst case.