Python: Correctly Preprocessing Strings of Unknown Encoding (Unicode, UTF-8, GBK)

2016-04-09 01:44

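Computers only see binary data, and the bytes themselves carry no label saying which encoding produced them, so it is hard for a program to guess automatically how a string was encoded.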

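In practice, however, we often receive strings whose encoding is unknown. The usual goal is to decode all of them to Unicode first, do the processing on Unicode, and then encode the result into whatever format is needed.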
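The decode, process, encode workflow described above can be sketched as follows. This is a minimal illustration that assumes the source encoding is already known; the helper name `reencode` and the `strip()` processing step are hypothetical and stand in for whatever processing is actually required.

```python
# -*- coding: utf-8 -*-

def reencode(raw, source_encoding, target_encoding):
    """Decode bytes to Unicode, process the text, and encode it again."""
    text = raw.decode(source_encoding)    # bytes -> Unicode text
    text = text.strip()                   # all processing happens on Unicode text
    return text.encode(target_encoding)   # Unicode text -> bytes in the target encoding

# Sample data: the same characters stored as GBK bytes, with padding spaces.
gbk_bytes = u' 中文 '.encode('gbk')
utf8_bytes = reencode(gbk_bytes, 'gbk', 'utf-8')
```

Keeping the text as Unicode in the middle means the processing code never needs to know which encoding the input arrived in.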

To guess the encoding of the raw string, you can use the third-party chardet module.
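As a rough illustration of what chardet reports, the sketch below calls `chardet.detect` on a sample byte string. `chardet.detect` returns a dictionary that includes `'encoding'` and `'confidence'` keys; the wrapper name `detect_encoding`, the sample text, and the import guard (so the snippet still runs where chardet is not installed) are assumptions added here for illustration.

```python
# -*- coding: utf-8 -*-
try:
    import chardet   # third-party package: pip install chardet
except ImportError:
    chardet = None   # the sketch then reports "unknown" instead of failing

def detect_encoding(raw):
    """Return chardet's guess as (encoding, confidence), or (None, 0.0)."""
    if chardet is None or not raw:
        return None, 0.0
    result = chardet.detect(raw)
    return result['encoding'], result['confidence']

# Longer samples generally give chardet more evidence to work with.
sample = u'由于计算机只能识别二进制数据，所以很难自动猜出字符串的编码。'.encode('utf-8')
encoding, confidence = detect_encoding(sample)
print(encoding, confidence)
```

The result is a statistical guess, not a guarantee, so it is worth checking the confidence value and handling the case where the encoding is `None`.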

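I wrote the function below, which reads the contents of a file, converts them to Unicode, and returns the result together with the data's original encoding (for example 'utf-8').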

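import chardet

def readFile2UnicodeBuf(filename):
    """Return (unicode_text, original_encoding), or (None, None) on failure."""
    readstring = None
    oldCodingType = None
    try:
        # Binary mode always yields a byte string, so it always needs decoding;
        # the original isinstance(readstring, unicode) branch could never run.
        with open(filename, 'rb') as pf:
            readstring = pf.read()
        # chardet returns its best guess; 'encoding' is None if it cannot decide,
        # in which case decode() raises and the except branch below handles it.
        oldCodingType = chardet.detect(readstring)['encoding']
        readstring = readstring.decode(oldCodingType)
    except Exception:
        print('ERROR: read file fail: ' + filename)
        return None, None
    return readstring, oldCodingType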
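When detection fails, the function above simply gives up and returns `None, None`. One possible fallback, which is a different technique from chardet's statistical detection and not part of the original function, is to try a short list of likely encodings in order. The name `decode_with_candidates` and the default candidate list are assumptions chosen for this sketch.

```python
# -*- coding: utf-8 -*-

def decode_with_candidates(raw, candidates=('utf-8', 'gbk')):
    """Try each candidate in order; return (text, encoding) or (None, None)."""
    for name in candidates:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue  # bytes are not valid in this encoding, try the next one
    return None, None

# GBK bytes are rejected by the strict UTF-8 decoder, then accepted as GBK.
text, encoding = decode_with_candidates(u'中文'.encode('gbk'))
```

The order of the candidates matters: UTF-8 is strict and rejects most byte sequences that were not written as UTF-8, while GBK accepts many byte sequences and can silently produce garbled text, so the stricter encoding should come first.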
