
A Simple Implementation of the AC (Aho-Corasick) Automaton

2016-06-14 17:33
Problem Description (this problem comes from the ACM problem set, HDU 2222)

In modern times, search engines such as Google and Baidu have come into everybody's life.

Wiskey also wants to bring this feature to his image retrieval system.

Every image has a long description. When users type some keywords to find an image, the system matches the keywords against the image descriptions and shows the image for which the most keywords are matched.

To simplify the problem, you are given the description of an image and some keywords, and you should tell me how many keywords are matched.

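(Given a description, i.e. the target string, and some keywords, i.e. the pattern strings, output how many of the keywords occur in the description.)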

 

Input

The first line contains one integer, the number of test cases that follow.

Each case contains an integer N, the number of keywords, followed by N keywords. (N <= 10000)

Each keyword contains only the characters 'a'-'z', and its length is no more than 50.

The last line is the description, and its length is no more than 1000000.

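(The first line gives the number of test cases. Each case starts with an integer N, the number of keywords, i.e. the substrings to search for. Keywords may contain only letters in the range 'a'-'z' and are at most 50 characters long. The last line is the description, at most 1000000 characters long.)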

 

Output

Print how many keywords are contained in the description.

(Print the number of keywords that appear in the description.)

 

Sample Input

1

5

she

he

say

shr

her

yasherhs

 

Sample Output

3
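As a sanity check on what is being counted, the sample can be reproduced with a brute-force search. This is only a minimal reference sketch, not the approach of this article: `count_keywords_naive` is a hypothetical helper (the name is an assumption) that calls `std::string::find` once per keyword. It rescans the description for every keyword, so it is likely to be too slow at the stated limits, but it is handy for checking a faster solution on small inputs.

```cpp
#include <string>
#include <vector>

// Brute-force reference (hypothetical helper, not part of the article's code):
// counts how many entries of `keywords` occur at least once in `description`.
// Every list entry is tested on its own, so a keyword listed twice counts twice.
int count_keywords_naive(const std::vector<std::string>& keywords,
                         const std::string& description) {
    int matched = 0;
    for (const std::string& kw : keywords) {
        if (description.find(kw) != std::string::npos) {
            ++matched;
        }
    }
    return matched;
}
```

For the sample, the keywords she, he and her occur in yasherhs while say and shr do not, which gives the expected 3.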

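Note: the task is to write a program that finds which of many pattern strings occur in a target string. This is a typical application of the AC automaton: built from a Trie combined with the matching idea of KMP, it finds multiple pattern strings in one target string quickly and efficiently. Of the Trie, Baidu Baike says: "Its typical use is counting and sorting large numbers of strings (though not only strings), so it is often used by search engine systems for word-frequency statistics. Its advantages are that it minimizes unnecessary string comparisons and that lookups are more efficient than with a hash table." (quoted from Baidu Baike)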

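This solution builds the AC automaton from a Trie plus the KMP matching idea. The Trie is stored with dynamic allocation, and each node keeps the pointers to its children in an array indexed by letter, which makes lookups simple. A fail pointer is built for every node and is used to fall back when a match fails (on a mismatch, matching resumes from the previous node that still matches instead of restarting from the root). A small hand-written queue is used to traverse the Trie breadth-first (BFS) while the fail pointers are built.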
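The fail pointer generalizes the failure function of single-pattern KMP, so it may help to see that function on its own first. The sketch below is an illustration under assumptions rather than part of this article's code: `kmp_failure` is a hypothetical name, and it returns, for each position of one pattern, the length of the longest proper prefix that is also a suffix of the pattern up to that position.

```cpp
#include <string>
#include <vector>

// KMP failure function (hypothetical helper for illustration).
// fail[i] is the length of the longest proper prefix of pattern[0..i]
// that is also a suffix of pattern[0..i].
std::vector<int> kmp_failure(const std::string& pattern) {
    std::vector<int> fail(pattern.size(), 0);
    int k = 0;  // length of the border matched so far
    for (std::size_t i = 1; i < pattern.size(); ++i) {
        // On a mismatch, fall back to the next shorter border.
        while (k > 0 && pattern[i] != pattern[k]) {
            k = fail[k - 1];
        }
        if (pattern[i] == pattern[k]) {
            ++k;
        }
        fail[i] = k;
    }
    return fail;
}
```

In the AC automaton the same fallback is stored per Trie node: the fail pointer of a node leads to the node spelling the longest proper suffix of its string that is also a prefix of some pattern string.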

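Steps:

1.      Read the number of pattern strings (the number of substrings to match) from the user;

2.      Build the Trie from the pattern strings the user enters (Trie_insert);

3.      Build the fail pointers on the finished Trie, so that when a match fails the fail pointer leads back to the previous partial match (the KMP idea);

4.      Use the finished AC automaton to search the target string and display the result.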
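The steps above can be sketched compactly as follows. This is a minimal sketch under assumptions, not the listing of this article: it stores nodes in a `std::vector` and refers to them by index instead of using pointers, it uses `std::queue` for the BFS, and the names `Node`, `ends` and `count_keywords` are made up for illustration. Keywords are assumed to contain only 'a'-'z'.

```cpp
#include <array>
#include <queue>
#include <string>
#include <vector>

// Hypothetical index-based node; index 0 is the root.
struct Node {
    std::array<int, 26> next;  // transition per letter, -1 while unset
    int fail = 0;              // index of the fallback node
    int ends = 0;              // number of keywords ending at this node
    Node() { next.fill(-1); }
};

// Returns how many of the keywords occur in text (assumes keywords use 'a'-'z' only).
int count_keywords(const std::vector<std::string>& keywords, const std::string& text) {
    std::vector<Node> t(1);

    // Step 2: insert every keyword into the Trie.
    for (const std::string& kw : keywords) {
        int cur = 0;
        for (char ch : kw) {
            int c = ch - 'a';
            if (t[cur].next[c] == -1) {
                t[cur].next[c] = static_cast<int>(t.size());
                t.emplace_back();
            }
            cur = t[cur].next[c];
        }
        ++t[cur].ends;
    }

    // Step 3: BFS to set fail pointers and fill in missing transitions.
    std::queue<int> q;
    for (int c = 0; c < 26; ++c) {
        int v = t[0].next[c];
        if (v == -1) t[0].next[c] = 0;  // the root loops to itself
        else q.push(v);                 // depth-1 nodes keep fail == 0
    }
    while (!q.empty()) {
        int u = q.front();
        q.pop();
        for (int c = 0; c < 26; ++c) {
            int v = t[u].next[c];
            int viaFail = t[t[u].fail].next[c];  // already filled: fail is shallower
            if (v == -1) t[u].next[c] = viaFail;
            else { t[v].fail = viaFail; q.push(v); }
        }
    }

    // Step 4: scan the text; at each position walk the fail chain once.
    int matched = 0;
    int cur = 0;
    for (char ch : text) {
        if (ch < 'a' || ch > 'z') { cur = 0; continue; }  // no keyword can span it
        cur = t[cur].next[ch - 'a'];
        for (int v = cur; v != 0 && t[v].ends >= 0; v = t[v].fail) {
            matched += t[v].ends;
            t[v].ends = -1;  // mark as counted so each keyword counts once
        }
    }
    return matched;
}
```

Calling `count_keywords({"she", "he", "say", "shr", "her"}, "yasherhs")` on the sample gives 3. Marking a node with `ends = -1` lets the walk stop early, because everything further along that fail chain was counted when the node was first visited.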

AC_auto.h

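#ifndef AC_AUTO_H
#define AC_AUTO_H

#include <cstddef> // NULL

/*
MAX_CHILD: maximum number of children per node; 26 is the number of letters
MAX_SIZE: capacity of the queue and size of the input buffers in main();
it must be larger than MAX_CHILD. At 50 this demo only handles small inputs,
not the full limits of HDU 2222.
*/

#define MAX_CHILD 26
#define MAX_SIZE 50

struct TrieNode{
    int count;                 // number of inserted pattern strings passing through this node
    TrieNode *next[MAX_CHILD]; // next[c]: child for letter 'a' + c
    TrieNode *fail;            // fail pointer, set by construct_Fail
    bool exist;                // true if a pattern string ends at this node
    TrieNode() :count(0), fail(NULL), exist(false){ for (int i = 0; i < MAX_CHILD; i++)next[i] = NULL; }
};

bool Trie_insert(TrieNode *root, char *str);
bool Trie_search(TrieNode *root, char *str); // test helper: checks whether a pattern string has been inserted into the Trie
void construct_Fail(TrieNode *root); // builds the fail pointers and fills in every empty next[id], so none is NULL afterwards
int query_str(TrieNode *root, char *str); // runs the target string through the AC automaton whose fail pointers have been built

// Queue used for the BFS (level-order traversal) of the tree nodes
class queue{
public:
    queue() :front(0), rear(0){ elem = new TrieNode*[MAX_SIZE]; }
    ~queue(){ delete[] elem; }
    void makeEmpty();
    bool isEmpty();
    bool isFull();
    bool pop(TrieNode *&p);
    bool push(TrieNode *p);
private:
    TrieNode **elem;
    int front, rear; // rear is the index one past the last element
};

#endif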

AC_func.cpp
#include"AC_auto.h"
#include<iostream>
using namespace std;

// Circular queue: one slot is kept free, so at most MAX_SIZE - 1 nodes fit.
void queue::makeEmpty(){ rear = front; }
bool queue::isEmpty(){ return front == rear; }
bool queue::isFull(){ return (rear + 1) % MAX_SIZE == front; }

// Appends p; returns false (and drops p) if the queue is full.
bool queue::push(TrieNode *p){
    if (isFull())return false;
    elem[rear] = p;
    rear = (rear + 1) % MAX_SIZE;
    return true;
}

// Removes the oldest element into p; returns false if the queue is empty.
bool queue::pop(TrieNode *&p){
    if (isEmpty())return false;
    p = elem[front];
    front = (front + 1) % MAX_SIZE;
    return true;
}

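// Inserts the pattern string str into the Trie rooted at root.
// Returns false, leaving the Trie unchanged, if str contains a character outside 'a'-'z'.
bool Trie_insert(TrieNode *root, char *str){
    for (char *c = str; *c; ++c){
        if (*c < 'a' || *c > 'z')return false;
    }
    TrieNode *tail = root;
    for (char *p = str; *p; ++p){
        int id = *p - 'a';
        if (tail->next[id] == NULL){
            // new throws std::bad_alloc on failure, so no NULL check is needed
            tail->next[id] = new TrieNode;
        }
        tail = tail->next[id];
        tail->count++;
    }
    tail->exist = true;
    return true;
}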

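// Returns true if str was inserted as a complete pattern string.
// Only valid before construct_Fail, which fills in the NULL child pointers this check relies on.
bool Trie_search(TrieNode *root, char *str){
    TrieNode *tail = root;
    for (char *p = str; *p; ++p){
        int id = *p - 'a';
        if (id < 0 || id >= MAX_CHILD)return false;
        tail = tail->next[id];
        if (tail == NULL)return false;
    }
    return tail->exist;
}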

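// Builds the fail pointers by BFS and fills in every missing transition,
// so that afterwards no next[i] is NULL.
void construct_Fail(TrieNode *root){
    TrieNode *p;
    queue q;
    q.makeEmpty();
    root->fail = NULL;
    q.push(root);
    while (!q.isEmpty()){
        q.pop(p);

        for (int i = 0; i < MAX_CHILD; i++){
            // State reached on letter i after falling back from p once
            TrieNode *viaFail = (p == root) ? root : p->fail->next[i];
            if (p->next[i] == NULL){
                p->next[i] = viaFail; // missing edge: reuse the fallback state's transition
            }
            else{
                p->next[i]->fail = viaFail;
                q.push(p->next[i]);
            }
        }

    }
}

// Scans str with the automaton and returns how many distinct pattern strings occur in it.
// A matched node is unmarked once counted, so the automaton serves a single query.
int query_str(TrieNode *root, char *str){
    TrieNode *p = root;
    int count1 = 0;
    for (int i = 0; str[i]; i++){
        int id = str[i] - 'a';
        if (id < 0 || id >= MAX_CHILD){ // not a lowercase letter: no pattern can span it
            p = root;
            continue;
        }
        p = p->next[id];
        // Every node on the fail chain spells a suffix of the text read so far
        for (TrieNode *t = p; t != root; t = t->fail){
            if (t->exist){
                count1++;
                t->exist = false;
            }
        }
    }
    return count1;
}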


AC_auto.cpp
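#include"AC_auto.h"
#include<cstdlib>
#include<iomanip>
#include<iostream>
using namespace std;

int main(){
    system("color F0"); // Windows-only console color command
    cout << "\n" << endl;
    cout << "\t***************************************** " << endl;
    cout << "\t*\t\t\t\t\t*" << endl;
    cout << "\t*  This program builds an AC automaton  *" << endl;
    cout << "\t*  and counts the pattern strings that  *" << endl;
    cout << "\t*       occur in a target string        *" << endl;
    cout << "\t*\t\t\t\t\t*" << endl;
    cout << "\t***************************************** " << endl;
    cout << endl << endl;
    TrieNode *root = new TrieNode;
    int n = 0, count1 = 0;
    cout << "--Enter the number of pattern strings: ";
    cin >> n;
    if (!cin || n <= 0){
        cout << "--Invalid count." << endl;
        return 1;
    }
    char **str = new char*[n];
    char *astring = new char[MAX_SIZE];

    cout << "--Enter the pattern strings one by one (each shorter than " << MAX_SIZE << " letters):" << endl; // read the n pattern strings
    for (int i = 0; i < n; i++){
        str[i] = new char[MAX_SIZE];
        cin >> setw(MAX_SIZE) >> str[i]; // setw keeps the read inside the buffer
    }

    for (int i = 0; i < n; i++){
        Trie_insert(root, str[i]); // insert the pattern string into the Trie
    }
    // Check that each pattern string has been inserted
    /*for (int i = 0; i < n; i++){
    if (Trie_search(root, str[i]));
    }*/

    construct_Fail(root); // build the fail pointers

    cout << "--Enter the target string (shorter than " << MAX_SIZE << " letters):" << endl; // the multi-pattern matching starts here
    cin >> setw(MAX_SIZE) >> astring;
    count1 = query_str(root, astring);
    cout << "--The target string contains " << count1 << " pattern string(s).\n" << endl;
    return 0;
}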