Horspool Algorithm in C++11 (with Mixed Chinese and English Search)

2015-04-02 16:13

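Abstract:

This post presents an implementation of the Horspool algorithm, shows a usage example, introduces a very handy UTF-8 transcoding project, and gives a brief test report.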

Implementation:

#include <cstddef>
#include <fstream>
#include <iostream>
#include <iterator>
#include <sstream>
#include <string>
#include <unordered_map>
#include <utility>
#include "utf8.h"
using namespace std;

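// Horspool shift table: maps every character of the pattern except the last
// one to the distance from its rightmost occurrence to the end of the pattern.
// Characters that are not in the table shift by the full pattern length.
template <typename Key,typename Value>
class ShiftTable{
public:
    explicit ShiftTable(const std::u32string& pattern)
        : defaultShift_(static_cast<Value>(pattern.size())){
        auto last=pattern.rbegin();   // last character of the pattern
        auto head=pattern.rend();
        if(last==head) return;        // empty pattern: nothing to insert
        // Walk from right to left, skipping the last character.
        for(auto cur=last+1;cur!=head;++cur){
            // emplace keeps the first value inserted for a key,
            // which here is the rightmost occurrence.
            shiftTable_.emplace(*cur,static_cast<Value>(cur-last));
        }
    }
    Value operator [](const Key& key) const{
        auto cur=shiftTable_.find(key);
        if(cur!=shiftTable_.end())
            return cur->second;
        return defaultShift_;
    }
private:
    std::unordered_map<Key,Value> shiftTable_;
    Value defaultShift_;
};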

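// Returns the index (in code points) of the first occurrence of pattern
// in text, or -1 if there is none.
int HorspoolMatching(const std::u32string & pattern,const std::u32string & text){
    if(pattern.empty()||text.empty()) return -1;
    ShiftTable<char32_t,size_t> table(pattern);
    const size_t m=pattern.size();
    const size_t n=text.size();
    size_t i=m-1;   // position in text aligned with the last pattern character
    while(i<n){
        size_t k=0; // number of characters matched so far, right to left
        while(k<m&&pattern[m-1-k]==text[i-k]) ++k;
        if(k==m)
            return static_cast<int>(i-m+1);
        i+=table[text[i]];
    }
    return -1;
}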

The Horspool algorithm itself is not explained here; this post only shares an implementation.
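For readers who want a concrete picture of what the ShiftTable class stores, here is a minimal sketch over plain ASCII strings. The names buildShiftTable and shiftFor are hypothetical helpers introduced only for this illustration; they are not part of the implementation above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical helper: builds a Horspool shift table for an ASCII pattern.
// Each character among the first m-1 positions maps to the distance from its
// rightmost occurrence to the last position of the pattern.
std::unordered_map<char, std::size_t> buildShiftTable(const std::string& pattern) {
    assert(!pattern.empty());
    std::unordered_map<char, std::size_t> table;
    const std::size_t m = pattern.size();
    for (std::size_t j = 0; j + 1 < m; ++j) {
        // Later (more rightward) occurrences overwrite earlier ones.
        table[pattern[j]] = m - 1 - j;
    }
    return table;
}

// Hypothetical helper: looks up a shift, falling back to the pattern length
// for characters that do not occur in the first m-1 positions.
std::size_t shiftFor(const std::unordered_map<char, std::size_t>& table,
                     char c, std::size_t patternLength) {
    auto it = table.find(c);
    return it != table.end() ? it->second : patternLength;
}
```

For the pattern "BARBER" this gives A = 4, B = 2, E = 1, R = 3, and a shift of 6 for every other character.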

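The implementation uses std::u32string and requires the text to be UTF-32 (one element per Unicode code point), so that strings in any language can be searched.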
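The reason for working on code points rather than raw UTF-8 bytes can be seen in a small sketch: a Chinese character occupies three bytes in UTF-8 but a single element in a std::u32string. The function countCodePoints is a hypothetical helper for illustration, and it assumes the input is already valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a UTF-8 string by skipping
// continuation bytes (those of the form 10xxxxxx).
// Assumes valid UTF-8; no validation is performed.
std::size_t countCodePoints(const std::string& utf8Text) {
    std::size_t count = 0;
    for (unsigned char byte : utf8Text) {
        if ((byte & 0xC0) != 0x80) ++count;
    }
    return count;
}
```

For example, the UTF-8 string for "你是ab" is 8 bytes long but contains 4 code points, so an index returned by a search over the std::u32string is a character position, not a byte offset.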

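I strongly recommend a lightweight open-source UTF-8 transcoding library here; its project page was linked under the name utf8. The utf8.h header and the utf8::utf8to32 and utf8::utf32to8 calls used in this post come from it (this appears to be the UTF8-CPP library).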

A usage example, find and replace:

int main()
{

    // A fairly efficient, pure C++ way to read a whole file into a string
    ifstream filestream("/home/ron/input.in"); // the file must be saved as UTF-8 (on any operating system)
    if(!filestream)
    {
        cout<<"cannot open input file"<<endl;
        return 1;
    }
    stringstream ss;
    ss<<filestream.rdbuf();
    filestream.close(); // done reading; the same file is rewritten below

    string text(ss.str());
    string pattern="你是"; // stored as UTF-8, because this source file is saved as UTF-8 on Ubuntu

    std::u32string  text32;
    std::u32string  pattern32;
    utf8::utf8to32(text.begin(), text.end() , back_inserter(text32));
    utf8::utf8to32(pattern.begin(), pattern.end() , back_inserter(pattern32));

    string repWord="我"; // stored as UTF-8, because this source file is saved as UTF-8 on Ubuntu
    std::u32string repWord32;
    utf8::utf8to32(repWord.begin(), repWord.end() , back_inserter(repWord32));

    // Search the file contents for "你是"
    auto index=HorspoolMatching(pattern32,text32);
    if( index!=-1 )
    {
        cout<<"found it,at index "<<index<<endl;
        // Replace only the first character of the match,
        // which turns the first "你是" in the file into "我是"
        text32.replace(index,1,repWord32);
        ofstream ofilestream("/home/ron/input.in");
        ostream_iterator<char> out(ofilestream);
        utf8::utf32to8(text32.begin(),text32.end(),out);
    }
    else
    {
        cout<<"not found"<<endl;
    }

    return 0;
}

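The code above is a usage example. It is cross-platform: it does not matter whether the operating system itself supports UTF-8, because the program can transcode between UTF-8, UTF-16, and UTF-32. The only requirement is that the input file and the pattern string are UTF-8 encoded. UTF-8 is the dominant encoding for data exchanged over the network, and most systems support it.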

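We could support searching in arbitrary encodings, but that would complicate the problem. Who wants to sink endlessly into character encodings? Probably only the experts in that field.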

codecvt:

#include <codecvt>

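What is this header? It is the string transcoding facility introduced in C++11. Unfortunately, GCC has not implemented it as of this writing, which is surprising. Visual C++ 2010 and later should support it; interested readers can look into it on their own. Because my build environment is GCC on Ubuntu, I cannot use codecvt. There are other character encoding libraries, such as ICU, but they are large and cumbersome to use. I eventually found the lightweight utf8 project: copy the source files and it is ready to use. It works very well on Ubuntu, and equally well on Windows. Its only drawback is that it supports the UTF encodings only.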
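For toolchains that do ship this header, the same UTF-8 to UTF-32 conversion might look like the sketch below. It uses std::wstring_convert with std::codecvt_utf8, both from the C++11 standard library and both deprecated since C++17. The function names utf8ToUtf32 and utf32ToUtf8 are hypothetical wrappers introduced for this illustration.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical wrapper: UTF-8 bytes to UTF-32 code points via <codecvt>.
// from_bytes throws std::range_error if the input cannot be converted.
std::u32string utf8ToUtf32(const std::string& utf8Text) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> converter;
    return converter.from_bytes(utf8Text);
}

// Hypothetical wrapper: UTF-32 code points back to UTF-8 bytes.
std::string utf32ToUtf8(const std::u32string& utf32Text) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> converter;
    return converter.to_bytes(utf32Text);
}
```

With these two wrappers, the utf8::utf8to32 and utf8::utf32to8 calls in the usage example could in principle be swapped out, at the cost of depending on a deprecated part of the standard library.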

Testing:

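I compared this implementation with the standard library's find on several texts; the running times were nearly identical (the standard library was slightly faster). There is a GCC issue asking for find to be implemented with the Boyer-Moore algorithm, so my guess, and it is only a guess, is that GCC's find uses something like Horspool: fast and simple, but with no guarantee on worst-case complexity (Horspool's worst case is O(nm)).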
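A comparison of this kind could be set up roughly as follows. This is only a sketch of a possible harness, not the one used for the numbers above: StdFind and timeMicroseconds are hypothetical names, and StdFind mirrors the convention of HorspoolMatching by returning -1 for an empty pattern or text.

```cpp
#include <cassert>
#include <chrono>
#include <string>

// Hypothetical reference search built on std::u32string::find, returning
// -1 when nothing is found (or when either argument is empty).
int StdFind(const std::u32string& pattern, const std::u32string& text) {
    if (pattern.empty() || text.empty()) return -1;
    const auto pos = text.find(pattern);
    return pos == std::u32string::npos ? -1 : static_cast<int>(pos);
}

// Hypothetical helper: runs `search` (a callable returning int) `repeats`
// times and returns the total elapsed time in microseconds.
template <typename Search>
long long timeMicroseconds(Search search, int repeats) {
    assert(repeats > 0);
    volatile int sink = 0;  // keeps the compiler from discarding the calls
    const auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r) sink = search();
    const auto stop = std::chrono::steady_clock::now();
    (void)sink;
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}
```

One would first check that StdFind and HorspoolMatching return the same index for the same inputs, then pass a lambda wrapping each of them to timeMicroseconds with the same text, pattern, and repeat count.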

KMP:

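Why no KMP? KMP is too complex. Unless you have a particular fondness for algorithms, no programmer would pick an algorithm with comparable performance but a more complicated implementation. That said, the ideas behind KMP did influence later algorithms.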
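For comparison, a textbook-style KMP over std::u32string might look like the sketch below. The names buildFailure and KmpMatching are hypothetical, and the return convention (-1 for no match or empty input) is chosen to match HorspoolMatching. The extra failure table and the fallback loops are where the additional complexity comes from.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// fail[i] is the length of the longest proper prefix of pattern[0..i]
// that is also a suffix of pattern[0..i].
std::vector<std::size_t> buildFailure(const std::u32string& pattern) {
    std::vector<std::size_t> fail(pattern.size(), 0);
    std::size_t k = 0;  // length of the currently matched prefix
    for (std::size_t i = 1; i < pattern.size(); ++i) {
        while (k > 0 && pattern[i] != pattern[k]) k = fail[k - 1];
        if (pattern[i] == pattern[k]) ++k;
        fail[i] = k;
    }
    return fail;
}

// Returns the index of the first occurrence of pattern in text, or -1.
int KmpMatching(const std::u32string& pattern, const std::u32string& text) {
    if (pattern.empty() || text.empty()) return -1;
    const std::vector<std::size_t> fail = buildFailure(pattern);
    std::size_t k = 0;  // number of pattern characters matched so far
    for (std::size_t i = 0; i < text.size(); ++i) {
        // On a mismatch, fall back to the longest prefix that still matches.
        while (k > 0 && text[i] != pattern[k]) k = fail[k - 1];
        if (text[i] == pattern[k]) ++k;
        if (k == pattern.size()) return static_cast<int>(i + 1 - k);
    }
    return -1;
}
```

Unlike Horspool, this version scans the text left to right and never skips characters, but it guarantees linear time in the worst case.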

Given my limited expertise, corrections and criticism are welcome. Please credit the source when reposting. Thank you.
