您的位置：首页 > 编程语言 > PHP开发

php处理敏感词时遇到的相关编码问题

2017-02-02 17:00 369 查看

在过滤敏感词这一块，由于敏感词数量较多，这时候就需要先构建一个字典树(trie)，单纯的字典树占用空间较大，使用 Double-Array Trie 或者 Ternary Search Tree 可以在保证性能的同时节省一部分空间，但是敏感词基本不会很多，几千甚至上万个词基本没压力，所以就实现就选择先构建一个字典树，然后逐字做匹配。

<?php

class SensitiveWordFilter

{

private $dict;

private $dictPath;

public function __construct($dictPath)

{

$this->dict = array();

$this->dictPath = $dictPath;

$this->initDict();

}

private function initDict()

{

$handle = fopen($this->dictPath, 'r');

if (!$handle) {

throw new RuntimeException('open dictionary file error.');

}

while (!feof($handle)) {

$word = trim(fgets($handle, 128));

if (empty($word)) {

continue;

}

$uWord = $this->unicodeSplit($word);

$pdict = &$this->dict;

$count = count($uWord);

for ($i = 0; $i < $count; $i++) {

if (!isset($pdict[$uWord[$i]])) {

$pdict[$uWord[$i]] = array();

}

$pdict = &$pdict[$uWord[$i]];

}

$pdict['end'] = true;

}

fclose($handle);

}

public function filter($str, $maxDistance = 5)

{

if ($maxDistance < 1) {

$maxDistance = 1;

}

$uStr = $this->unicodeSplit($str);

$count = count($uStr);

for ($i = 0; $i < $count; $i++) {

if (isset($this->dict[$uStr[$i]])) {

$pdict = &$this->dict[$uStr[$i]];

$matchIndexes = array();

for ($j = $i + 1, $d = 0; $d < $maxDistance && $j < $count; $j++, $d++) {

if (isset($pdict[$uStr[$j]])) {

$matchIndexes[] = $j;

$pdict = &$pdict[$uStr[$j]];

$d = -1;

}

}

if (isset($pdict['end'])) {

$uStr[$i] = '*';

foreach ($matchIndexes as $k) {

if ($k - $i == 1) {

$i = $k;

}

$uStr[$k] = '*';

}

}

}

}

return implode($uStr);

}

public function unicodeSplit($str)

{

$str = strtolower($str);

$ret = array();

$len = strlen($str);

for ($i = 0; $i < $len; $i++) {

$c = ord($str[$i]);

if ($c & 0x80) {

if (($c & 0xf8) == 0xf0 && $len - $i >= 4) {

if ((ord($str[$i + 1]) & 0xc0) == 0x80 && (ord($str[$i + 2]) & 0xc0) == 0x80 && (ord($str[$i + 3]) & 0xc0) == 0x80) {

$uc = substr($str, $i, 4);

$ret[] = $uc;

$i += 3;

}

} else if (($c & 0xf0) == 0xe0 && $len - $i >= 3) {

if ((ord($str[$i + 1]) & 0xc0) == 0x80 && (ord($str[$i + 2]) & 0xc0) == 0x80) {

$uc = substr($str, $i, 3);

$ret[] = $uc;

$i += 2;

}

} else if (($c & 0xe0) == 0xc0 && $len - $i >= 2) {

if ((ord($str[$i + 1]) & 0xc0) == 0x80) {

$uc = substr($str, $i, 2);

$ret[] = $uc;

$i += 1;

}

}

} else {

$ret[] = $str[$i];

}

}

return $ret;

}

}

使用方法

<?php

require 'SensitiveWordFilter.php';

$filter = new SensitiveWordFilter(__DIR__ . '/sensitive_words.txt');

$filter->filter('这是一个敏感词', 10);

其中，unicodeSplit函数实现将从文件中获取的汉字，一个一个分离出来，那么，判断每个汉字占用几个字节，是该函数的关键所在，该算法针对UTF-8编码。

UTF-8格式字节

4中情况分别是：

1、一个字节: 0XXXXXXX，低7位为有效数据，内码是0x0~0x7f

2、两个字节: 110XXXXX 10YYYYYY，低5位+低6位为有效数据，内码是0x80~0x7ff

3、三个字节：1110XXXX 10YYYYYY 10ZZZZZZ，低4位+低6位+低6位为有效数据，内码是0x800~0xffff，除掉几个保留的特殊内码

4、四个字节: 11110AAA 10XXXXXX 10XXXXXX 10XXXXXX，低3位 +低6位 +低6位 +低6位为有效数据，内码是0x10000~0x10ffff

按前面1的个数就能区分每个字占了多少个字节。

这就是为什么unicodeSplit会出现0x80,0xf0,0xe0,0xc0,

内容来自用户分享和网络整理，不保证内容的准确性，如有侵权内容，可联系管理员处理

标签：

相关文章推荐

新的分享

章节导航