Automatically Detecting GBK-Encoded Files and Converting Them to UTF-8

2016-07-21 11:20

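Sometimes code that was partly written on Windows has to be opened on Linux. Because the default encoding on Windows (in a Chinese locale) is GBK, those files show up as garbled text on Linux.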
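To make the symptom concrete, here is a minimal sketch using only the standard library. The sample string is an assumption chosen for illustration; it stands in for a Chinese comment in a source file saved as GBK.

```python
# Text as it might appear in a comment of a source file saved on Windows.
text = "中文注释"

# Bytes as stored on disk when the file is saved as GBK (2 bytes per character here).
gbk_bytes = text.encode("gbk")

# A strict UTF-8 decode of those bytes fails outright...
try:
    gbk_bytes.decode("utf-8")
    decoded_as_utf8 = True
except UnicodeDecodeError:
    decoded_as_utf8 = False

# ...and a lenient decode produces U+FFFD replacement characters, i.e. garbled text.
garbled = gbk_bytes.decode("utf-8", errors="replace")

# Decoding with the matching codec recovers the original text.
restored = gbk_bytes.decode("gbk")
```

The bytes themselves are intact; only the codec used to interpret them is wrong, which is why re-encoding the file as UTF-8 fixes the display.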

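Only some of the files are GBK-encoded, so converting everything wholesale is not an option: each file's encoding has to be guessed first, and only then can we decide whether to convert it.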
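As a rough illustration of "guess first, then decide", the sketch below classifies raw bytes by trial decoding with the standard library. This is not the chardet-based approach the article uses; `classify_bytes` is a hypothetical helper, and the check is only a heuristic, since many byte sequences that are not really GBK will still decode as GBK without error.

```python
def classify_bytes(data):
    """Return 'ascii', 'utf-8', 'gbk', or 'unknown' for a byte string.

    Hypothetical helper: a cheap trial-decoding check, not a statistical
    detector. The strictest codecs are tried first.
    """
    for codec in ("ascii", "utf-8", "gbk"):
        try:
            data.decode(codec)
            return codec
        except UnicodeDecodeError:
            continue
    return "unknown"


def needs_conversion(data):
    """Only bytes that fail as ASCII/UTF-8 but decode as GBK are candidates."""
    return classify_bytes(data) == "gbk"
```

Files that are already valid ASCII or UTF-8 are left alone under this rule, which is the point of detecting before converting.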

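The following Python code walks a directory, detects the encoding of each file of the specified types, and converts the files detected as GBK or GB2312 to UTF-8.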

import codecs

# Read a file with the given encoding; return '' if it cannot be decoded.
def ReadFile(filePath, encoding):
    with codecs.open(filePath, "r", encoding) as f:
        try:
            return f.read()
        except UnicodeDecodeError:
            return ''

# Write content to a file in the given encoding (overwrites the file).
def WriteFile(filePath, content, encoding):
    with codecs.open(filePath, "w", encoding) as f:
        f.write(content)

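import chardet

# Guess a file's encoding; return '' unless chardet's confidence exceeds 0.9.
def getEncode(filename):
    # chardet works on raw bytes, so open the file in binary mode
    with open(filename, 'rb') as f:
        result = chardet.detect(f.read())
    if result['confidence'] > 0.9:
        return result['encoding']
    else:
        print(result)
        return ''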

import os
import os.path

rootdir = os.path.curdir
encoding = ['GBK', 'GB2312']

for parent, dirnames, filenames in os.walk(rootdir):
    for filename in filenames:
        if filename.endswith('.h') or filename.endswith('.cpp'):
            path = os.path.join(parent, filename)
            encode = getEncode(path)
            if encode in encoding:
                content = ReadFile(path, encode)
                # An empty result means the file was empty or could not be decoded
                if content != '':
                    print('transform ' + filename + ' from ' + encode + ' to utf-8')
                    WriteFile(path, content, 'utf8')
            else:
                # Report anything that is neither a conversion target nor UTF-8/ASCII
                if encode == '':
                    encode = 'unknown'
                if encode != 'utf-8' and encode != 'ascii':
                    print('suspicious file ' + path + ' by ' + encode)

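The code uses the chardet library to detect file encodings. chardet is not part of the Python standard library and has to be installed separately (for example with pip install chardet).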
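For reference, a minimal sketch of calling chardet on its own. `chardet.detect` takes bytes and returns a dict with `encoding` and `confidence` keys, as the script above relies on. The wrapper name `detect_encoding`, the `min_confidence` default, and the `detector` parameter are assumptions made for this example; the `detector` hook exists only so the function can be exercised without chardet installed.

```python
def detect_encoding(data, min_confidence=0.9, detector=None):
    """Return the detected encoding name, or None if the guess is too weak.

    `detector` is a hypothetical hook for testing; by default
    chardet.detect is used.
    """
    if detector is None:
        import chardet  # third-party package: pip install chardet
        detector = chardet.detect
    result = detector(data)
    encoding = result.get("encoding")          # may be None for undetectable input
    confidence = result.get("confidence") or 0.0
    if encoding and confidence > min_confidence:
        return encoding
    return None
```

With chardet installed, a call might look like `detect_encoding(open(path, 'rb').read())`. Short inputs tend to yield low confidence, so it is worth reviewing files that come back as None by hand rather than converting them blindly.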
