
Lua UTF-8 string operations: substring and indexing

2016-12-22 09:20

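To begin, here is an explanation quoted from the web.

UTF-8 is a variable-length byte encoding. If a character's UTF-8 encoding is a single byte, the highest bit of that byte is 0. If the encoding is multi-byte, the number of consecutive 1 bits at the top of the first byte gives the number of bytes in the encoding, and every following byte starts with 10. As originally designed, UTF-8 could use up to 6 bytes; RFC 3629 now limits it to 4 bytes, which is enough for every code point up to U+10FFFF.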

As a table:

1 byte   0xxxxxxx
2 bytes  110xxxxx 10xxxxxx
3 bytes  1110xxxx 10xxxxxx 10xxxxxx
4 bytes  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
5 bytes  111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
6 bytes  1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

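So in UTF-8 at most 31 bits are actually available to represent the character code: the bits marked x in the table above. Apart from the control bits (the leading 10 of each continuation byte, and so on), the x bits correspond one-to-one to the bits of the Unicode code point, in the same high-to-low order.

When converting a Unicode code point to UTF-8, first drop the leading zero bits, then pick the shortest UTF-8 form that can hold the remaining bits.

The characters of the basic ASCII set (Unicode is ASCII-compatible) therefore need only a single UTF-8 byte (7 bits).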
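The conversion rule described above can be sketched in Lua as follows. This is a minimal illustration, not a complete encoder: the name `utf8Encode` is hypothetical, only the 1 to 4 byte forms are handled, and the input is assumed to be a valid code point (no checks for surrogates or values above U+10FFFF). It uses plain arithmetic so that it should run on Lua 5.1 and later.

```lua
-- Hypothetical helper: encode a Unicode code point as a UTF-8 byte string.
-- Assumes cp is an integer in the range 0 .. 0x10FFFF.
local function utf8Encode(cp)
    if cp < 0x80 then
        -- Fits in 7 bits: 0xxxxxxx
        return string.char(cp)
    elseif cp < 0x800 then
        -- Fits in 11 bits: 110xxxxx 10xxxxxx
        return string.char(0xC0 + math.floor(cp / 0x40),
                           0x80 + cp % 0x40)
    elseif cp < 0x10000 then
        -- Fits in 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return string.char(0xE0 + math.floor(cp / 0x1000),
                           0x80 + math.floor(cp / 0x40) % 0x40,
                           0x80 + cp % 0x40)
    else
        -- Up to 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return string.char(0xF0 + math.floor(cp / 0x40000),
                           0x80 + math.floor(cp / 0x1000) % 0x40,
                           0x80 + math.floor(cp / 0x40) % 0x40,
                           0x80 + cp % 0x40)
    end
end

print(#utf8Encode(0x41))    --> 1
print(#utf8Encode(0x4E2D))  --> 3
```

Each branch picks the shortest form that can hold the code point, which is exactly the "drop the leading zeros, then choose the minimum length" rule.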

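For the problem discussed in the quoted source, the two bytes given in the code are:

Hexadecimal: C0 B1

Binary: 11000000 10110001

Compare them with the two-byte form:

110xxxxx 10xxxxxx

Extracting the corresponding Unicode bits gives:

00000 110001

This is not "standard" UTF-8: the payload bits of the first byte are all 0, so only 6 bits remain once the leading zeros are dropped. As explained above, a single UTF-8 byte is enough for this character. A sequence like this is known as an overlong encoding.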
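The bit extraction above can be reproduced with a few lines of Lua. This is only a sketch for the two-byte form; the function name `decodeTwoBytes` is hypothetical, and the bytes are assumed to already match the 110xxxxx 10xxxxxx pattern.

```lua
-- Hypothetical helper: decode a 2-byte UTF-8 sequence.
-- Returns the code point and whether the sequence is overlong,
-- i.e. whether a single byte would have been enough.
local function decodeTwoBytes(b1, b2)
    -- Remove the 110 prefix from b1 and the 10 prefix from b2,
    -- then join the 5 + 6 payload bits.
    local cp = (b1 - 0xC0) * 0x40 + (b2 - 0x80)
    return cp, cp < 0x80
end

local cp, overlong = decodeTwoBytes(0xC0, 0xB1)
print(cp, string.char(cp), overlong)  --> 49  1  true
```

The payload is 49, the ASCII code of the digit "1", so the shortest valid encoding is the single byte 0x31.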


The same table in decimal:

1 byte   0xxxxxxx: minimum 00000000 (decimal 0), maximum 01111111 (decimal 127)
2 bytes  110xxxxx 10xxxxxx: first byte minimum 11000000 (every x set to 0), decimal 192; maximum 11011111, decimal 223. Other bytes range from 10000000 to 10111111, decimal 128 to 191
3 bytes  1110xxxx 10xxxxxx 10xxxxxx: first byte minimum 11100000, decimal 224. Other bytes range from 10000000 to 10111111, decimal 128 to 191
4 bytes  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx: first byte minimum 11110000, decimal 240. Other bytes range from 10000000 to 10111111, decimal 128 to 191
5 bytes  111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx: first byte minimum 11111000, decimal 248. Other bytes range from 10000000 to 10111111, decimal 128 to 191
6 bytes  1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx: first byte minimum 11111100, decimal 252. Other bytes range from 10000000 to 10111111, decimal 128 to 191

From this, the value ranges of the first byte are:

1 byte   00000000 (0) to 01111111 (127)
2 bytes  11000000 (192) to 11011111 (223)
3 bytes  11100000 (224) to 11101111 (239)
4 bytes  11110000 (240) to 11110111 (247)
5 bytes  11111000 (248) to 11111011 (251)
6 bytes  11111100 (252) to 11111101 (253)

Other bytes  10000000 (128) to 10111111 (191)
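A quick way to check these decimal values is to let Lua convert the binary patterns itself. The sketch below relies on `tonumber` with base 2; the table name `firstByte` and the `ranges` result table are just illustrative.

```lua
-- Smallest and largest first byte for each encoding length, in binary.
local firstByte = {
    { 1, "00000000", "01111111" },
    { 2, "11000000", "11011111" },
    { 3, "11100000", "11101111" },
    { 4, "11110000", "11110111" },
    { 5, "11111000", "11111011" },
    { 6, "11111100", "11111101" },
}

-- ranges[length] = { lowest first byte, highest first byte } in decimal.
local ranges = {}
for _, row in ipairs(firstByte) do
    local low, high = tonumber(row[2], 2), tonumber(row[3], 2)
    ranges[row[1]] = { low, high }
    print(row[1], low, high)
end
```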

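Implementing UTF-8 operations in Lua

Approach: as shown above, a character's encoding can be split into a head (the first byte of the encoding) and a tail (all the remaining bytes).

Determining how many bytes a character occupies, from the value of its first byte:
1 byte: in the range 0 to 127
2 bytes: in the range 192 to 223
3 bytes: in the range 224 to 239
4 bytes: in the range 240 to 247
5 bytes: in the range 248 to 251
6 bytes: 252 or greater

One more key point: a byte in the decimal range 128 to 191 is a continuation byte, that is, part of the current character rather than the start of a new one.
The code is as follows: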

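-- Return the number of bytes in the character whose first byte is `coding`.
-- Returns 0 for a continuation byte (128..191), which never starts a character.
function byteNumber(coding)
    if coding <= 127 then
        return 1
    elseif coding < 192 then
        return 0
    elseif coding < 224 then
        return 2
    elseif coding < 240 then
        return 3
    elseif coding < 248 then
        return 4
    elseif coding < 252 then
        return 5
    else
        return 6
    end
end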

-- Return the substring from character n through character le (1-based, inclusive).
-- A missing or invalid n or le falls back to 1.
function string.utf8Sub(s, n, le)
    if type(s) ~= "string" then
        return nil
    end
    if type(n) ~= "number" or n < 1 then
        n = 1
    end
    if type(le) ~= "number" or le < 1 then
        le = 1
    end
    local index = 0
    local startIndex = #s + 1  -- stays past the end if n exceeds the character count
    local endIndex = #s        -- runs to the end if le exceeds the character count
    for i = 1, #s do
        local coding = string.byte(s, i)
        -- Continuation bytes (128..191) belong to the previous character.
        if coding < 128 or coding >= 192 then
            index = index + 1
            if index == n then
                startIndex = i
            end
            if index == le then
                endIndex = i + byteNumber(coding) - 1
                break
            end
        end
    end
    return string.sub(s, startIndex, endIndex)
end
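For comparison, here is a different way to take a substring by characters: split the string into a table of characters with a Lua pattern, then join the wanted range. This is a sketch of an alternative technique, not the article's method. The names `utf8Chars` and `utf8SubByTable` are hypothetical, n and le are assumed to be numbers, and the input is assumed to be valid UTF-8.

```lua
-- Hypothetical helper: split a UTF-8 string into a table of characters.
local function utf8Chars(s)
    local chars = {}
    -- One character = one non-continuation byte followed by
    -- zero or more continuation bytes (128..191).
    for ch in string.gmatch(s, "[^\128-\191][\128-\191]*") do
        chars[#chars + 1] = ch
    end
    return chars
end

-- Hypothetical helper: characters n through le, clamped to the string.
local function utf8SubByTable(s, n, le)
    local chars = utf8Chars(s)
    n = math.max(n, 1)
    le = math.min(le, #chars)
    if n > le then
        return ""
    end
    return table.concat(chars, "", n, le)
end

-- "a", a 3-byte character, "b", a 2-byte character
local sample = "a\228\184\173b\195\169"
print(#utf8Chars(sample))  --> 4
```

This version allocates a table for every call, so the byte-scanning loop is the better choice for long strings; the pattern approach is mainly easier to read.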

-- Return the n-th character (1-based), or nil if the string has fewer than n characters.
-- A missing or invalid n falls back to 1.
function string.utf8Index(s, n)
    if type(s) ~= "string" then
        return nil
    end
    if type(n) ~= "number" or n < 1 then
        n = 1
    end
    local index = 0
    for i = 1, #s do
        local coding = string.byte(s, i)
        -- Continuation bytes (128..191) belong to the previous character.
        if coding < 128 or coding >= 192 then
            index = index + 1
            if index == n then
                return string.sub(s, i, i + byteNumber(coding) - 1)
            end
        end
    end
    return nil
end

-- Return the length of the string in characters.
function string.utf8Len(s)
    if type(s) ~= "string" then
        return nil
    end
    local index = 0
    for i = 1, #s do
        local coding = string.byte(s, i)
        -- Count every byte that is not a continuation byte (128..191).
        if coding < 128 or coding >= 192 then
            index = index + 1
        end
    end
    return index
end
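The same count can be obtained without an explicit loop, because `string.gsub` returns the number of matches as its second result. The sketch below counts every byte that is not a continuation byte; the name `utf8LenByPattern` is hypothetical and the input is assumed to be valid UTF-8. On Lua 5.3 or later the standard `utf8` library also provides `utf8.len`.

```lua
-- Hypothetical helper: count characters by counting non-continuation bytes.
local function utf8LenByPattern(s)
    -- The replacement result is discarded; only the match count is used.
    local _, count = string.gsub(s, "[^\128-\191]", "")
    return count
end

-- "a", a 3-byte character, "b"
print(utf8LenByPattern("a\228\184\173b"))  --> 3
```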

-- The following variants are called as methods, so the string is not passed explicitly.
-- e.g. local str = "截取字符串"  str = str:utf8SelfSub(1, 2)  -- str is now "截取"
-- Each one delegates to the corresponding function above.
function string:utf8SelfSub(n, le)
    return string.utf8Sub(self, n, le)
end

function string:utf8SelfIndex(n)
    return string.utf8Index(self, n)
end

function string:utf8SelfLen()
    return string.utf8Len(self)
end
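A note on the two calling styles: string values in Lua share a metatable whose `__index` is the `string` table, so any function stored in `string` can be called either as `string.f(s, ...)` or as `s:f(...)`. The short sketch below shows this with a throwaway function; `utf8DemoByteCount` is a hypothetical name used only for the demonstration.

```lua
-- Hypothetical demo function, not part of the article's code.
function string.utf8DemoByteCount(s)
    return #s
end

local text = "abc"
-- Both call styles reach the same function with the same first argument.
local viaFunction = string.utf8DemoByteCount(text)
local viaMethod = text:utf8DemoByteCount()
print(viaFunction, viaMethod)  --> 3  3
```

In other words, a function defined as `function string.name(s, ...)` and one defined as `function string:name(...)` differ only in how the first parameter is written.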
