
A Roundup of Fixes for Python Encoding Problems

2018-03-11 17:45

Preface

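Computers can only process numbers, so text must be converted to numbers before it can be processed. Early computers were designed with 8 bits to a byte, so the largest integer a single byte can hold is 255 (binary 11111111 = decimal 255). ASCII uses only the values 0 to 127 of that range to represent upper- and lowercase English letters, digits, and some symbols; this code table is called the ASCII encoding.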

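Of all character sets, the best known is probably the 7-bit character set called ASCII. The name is short for American Standard Code for Information Interchange, and it was designed for communication in American English. It consists of 128 characters: upper- and lowercase letters, the digits 0-9, punctuation marks, whitespace characters (such as newline and tab), and control characters (such as backspace and bell). Because it was designed for English, however, it runs into trouble with European text that uses diacritical marks (similar to the tone marks in Chinese pinyin).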
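As a small illustration of the 7-bit limit described above, the following sketch assumes Python 3 and uses made-up sample strings. It shows that ASCII text fits in values below 128 and that a letter with a diacritical mark cannot be encoded as ASCII.

```python
# Every ASCII character maps to a code between 0 and 127 (7 bits).
letter_code = ord("A")                      # code of the letter 'A'
ascii_bytes = "Hello, 123".encode("ascii")  # one byte per character
all_seven_bit = all(b < 128 for b in ascii_bytes)

# A letter with a diacritical mark falls outside ASCII.
try:
    "café".encode("ascii")
    accented_fits = True
except UnicodeEncodeError:
    accented_fits = False

print(letter_code, all_seven_bit, accented_fits)
```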

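Unicode was created to cover all of these scripts with a single character set. Similar characters from Chinese, Japanese, and Vietnamese were unified so that one code stands for the shared character in every language; with that unification, the original design expected 2 bytes to be enough for the writing systems of almost every region on Earth. (Unicode has since grown beyond 16 bits, so 2 bytes no longer cover every character.) It extends the ISO Latin-1 character set by adding a high byte: when the high byte is 0, the low byte is the ISO Latin-1 character.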
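The relationship between Unicode and ISO Latin-1 can be checked directly. The sketch below assumes Python 3; the variable names are only for illustration, and big-endian UTF-16 is used so that the high byte comes first.

```python
# The first 256 Unicode code points are the ISO Latin-1 characters,
# so decoding a Latin-1 byte gives the code point with the same number.
latin1_text = bytes(range(256)).decode("latin-1")
same_numbers = all(ord(ch) == i for i, ch in enumerate(latin1_text))

# In big-endian UTF-16 the high byte of a Latin-1 character is 0.
e_acute = "\u00e9".encode("utf-16-be")

# A character outside the original 16-bit range (here U+20000, a CJK
# extension character) needs more than 2 bytes even in UTF-16.
beyond_16_bits = "\U00020000".encode("utf-16-be")

print(same_numbers, e_acute, len(beyond_16_bits))
```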

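Python was born before the Unicode standard was published, so the earliest versions of Python supported only ASCII, and an ordinary string such as 'ABC' was ASCII-encoded inside Python.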

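Python: created in 1989; the first public release came out in 1991
Unicode: development began in the late 1980s; version 1.0 of the standard was published in October 1991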

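In practice, using a 2-byte Unicode encoding for characters that ASCII can represent is inefficient: it takes twice the space of ASCII, and the zero high byte carries no information for ASCII text. To solve this, several intermediate formats appeared, known as Unicode Transformation Formats (UTF). Common UTF formats include UTF-7, UTF-7.5, UTF-8, UTF-16, and UTF-32.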
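To make the space trade-off concrete, here is a minimal sketch (Python 3 assumed; encoded_sizes is a hypothetical helper name) that counts how many bytes one character occupies in three of these formats.

```python
def encoded_sizes(text):
    """Return the number of bytes text occupies in three UTF encodings."""
    # The -le variants are used so that no byte order mark is added.
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }

sizes_ascii = encoded_sizes("A")   # a character that ASCII can represent
sizes_cjk = encoded_sizes("中")    # a Chinese character
print(sizes_ascii)
print(sizes_cjk)
```

UTF-8 keeps ASCII characters at one byte each, which is why it avoids the wasted high byte, at the cost of using three bytes for this Chinese character.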

1. Fixing garbled text in Python 2

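First, be clear that Python 2 represents text internally as unicode. When converting between encodings you therefore normally use unicode as the intermediate form: decode the string from its source encoding into unicode, then encode the unicode into the target encoding.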

decode converts a string in some other encoding into unicode; for example, str1.decode('gb2312') converts the GB2312-encoded string str1 into unicode.

encode converts unicode into a string in some other encoding; for example, str2.encode('gb2312') converts the unicode string str2 into GB2312.
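The decode-then-encode pattern can be sketched as a small helper. This is written for Python 3, where bytes plays the role of Python 2's str and str plays the role of Python 2's unicode; transcode is a hypothetical name, not a library function.

```python
def transcode(data, source_encoding, target_encoding):
    """Convert bytes from one encoding to another via Unicode text."""
    text = data.decode(source_encoding)   # bytes -> Unicode
    return text.encode(target_encoding)   # Unicode -> bytes

gb_bytes = "小明".encode("gb2312")
utf8_bytes = transcode(gb_bytes, "gb2312", "utf-8")
print(gb_bytes, utf8_bytes)
```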

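In some IDEs, string output always comes out garbled or even raises an error. This happens because the IDE's output console cannot display the string's encoding, not because of a problem in the program itself.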
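# -*- coding: utf-8 -*-

s = '中文'                     # byte string (str) in the file's encoding, utf8 here
print type(s)                  # check the type of s
print s

u = s.decode('utf8')           # decode the utf8 bytes into a unicode object
s.decode('gbk', "ignore")      # decode as gbk, skipping bytes that are invalid in gbk
s.decode('gbk', 'replace')     # decode as gbk, replacing bytes that are invalid in gbk
print type(u)
print u

g = u.encode('gb2312')         # encode the unicode object as gb2312
print type(g)
print g

The encoding of test.py must match the encoding passed to s.decode('utf8'); otherwise a decoding exception is raised. The exception can be avoided with s.decode("gbk", "ignore") or s.decode("gbk", "replace"), although the mismatched bytes are then dropped or replaced rather than decoded correctly.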

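In addition, data that contains some special or damaged bytes may raise an error when decoded directly. The corresponding error-handling argument can be used for such cases, for example:
s.decode("utf-8", "ignore") skips the bytes that cannot be decoded and keeps only the valid ones
s.decode("utf-8", "replace") replaces the bytes that cannot be decoded, which makes it comparatively easy to see at a glance which characters had an encoding problem.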
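The effect of the two error handlers can be seen on deliberately damaged input. The sketch below assumes Python 3, where the data being decoded is a bytes object; the damaged sample is constructed for illustration only.

```python
# utf-8 bytes for "小明" with one invalid byte (0xff) inserted in the middle.
damaged = b"\xe5\xb0\x8f" + b"\xff" + b"\xe6\x98\x8e"

ignored = damaged.decode("utf-8", "ignore")    # the invalid byte is dropped
replaced = damaged.decode("utf-8", "replace")  # the invalid byte becomes U+FFFD

try:
    damaged.decode("utf-8")                    # default handler is "strict"
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(len(ignored), len(replaced), strict_failed)
```

With "ignore" the damage disappears silently, while "replace" leaves a visible marker at the position of the bad byte, which is usually easier to debug.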

2. Fixing garbled text in Python 3

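In Python 2, the u prefix marks a unicode string, whose type is unicode; a literal without u is a byte string, whose type is str.
In Python 3, every string is a unicode string, and the u prefix no longer has any special meaning. Text is Unicode by default and is represented by the str type, while binary data is represented by the bytes type, so str and bytes are never mixed together.
The r prefix marks a raw string in both versions. It affects the escape rules for special characters and is mostly used in regular expressions.
In Python 2, r and u can be combined, for example ur"abc"; Python 3 does not accept the ur prefix.
Because a Python 3 str is already Unicode, the str type no longer has a decode method. For example:

s = "python"
s.decode('gbk', "ignore")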

AttributeError                            Traceback (most recent call last)
<ipython-input-14-9e309b35bad3> in <module>()
1 s = "python"
----> 2 s.decode('gbk', "ignore")

AttributeError: 'str' object has no attribute 'decode'

We can also use dir(s) to check whether the method exists.
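A quick way to apply that check, sketched for Python 3 with illustrative variable names, is to test for the method names in the result of dir():

```python
s = "python"
b = s.encode("utf-8")

# dir() lists attribute names, so membership tests show which methods exist.
str_has_decode = "decode" in dir(s)
str_has_encode = "encode" in dir(s)
bytes_has_decode = "decode" in dir(b)
bytes_has_encode = "encode" in dir(b)

print(str_has_decode, str_has_encode, bytes_has_decode, bytes_has_encode)
# prints: False True True False
```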
A few points to note: 1: a string is converted to bytes by encoding, and bytes are converted back to a string by decoding: str--->(encode)--->bytes, bytes--->(decode)--->str

import sys
import chardet  # third-party package
print('Current default encoding:', sys.getdefaultencoding())
name = '小明'
print(type(name))  # print the type of name before converting; it is a str, so it can be encoded with encode
name1 = name.encode('utf-8')
print(chardet.detect(name1))
print(name1)
print(type(name1))

Current default encoding: utf-8
<class 'str'>
{'encoding': 'utf-8', 'language': '', 'confidence': 0.7525}
b'\xe5\xb0\x8f\xe6\x98\x8e'
<class 'bytes'>

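As you can see, name, of type str, was converted to the bytes type by encode('utf-8'). Going from str to bytes can be seen as encoding a piece of text into a binary byte stream, which is what the encode method is for. Going the other way uses decode:

name2 = name1.decode('utf-8')
print(type(name2))
print(name2)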

<class 'str'>
小明

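A note on why decode() is given utf-8 here rather than gbk: to decode bytes, you have to tell the decoder which encoding they are actually in. If the bytes were encoded as utf-8 but are passed off as gbk, the result is garbled text. The example below always decodes with the same encoding that was used to encode, first utf-8 and then gbk: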

name = "小明"
name1=name.encode('utf-8')
name2=name1.decode('utf-8')
name3=name2.encode('gbk')
name4=name3.decode('gbk')
print("name:")
print(type(name))
print(name)
print("name1:")
print(type(name1))
print(name1)
print("name2:")
print(type(name2))
print(name2)
print("name3:")
print(type(name3))
print(name3)
print("name4:")
print(type(name4))
print(name4)

name:
<class 'str'>
小明
name1:
<class 'bytes'>
b'\xe5\xb0\x8f\xe6\x98\x8e'
name2:
<class 'str'>
小明
name3:
<class 'bytes'>
b'\xd0\xa1\xc3\xf7'
name4:
<class 'str'>
小明

So it is easy to see that conversion between utf-8 and gbk always goes through unicode (str) as the intermediate step.
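For contrast, here is a minimal sketch of the mismatched case, where utf-8 bytes are decoded as gbk. Python 3 is assumed and the variable names are illustrative; errors="replace" is passed only so that the sketch cannot raise if a byte pair happens not to be valid gbk.

```python
name = "小明"
utf8_bytes = name.encode("utf-8")

# Decoding with the wrong codec yields different characters (garbled text).
garbled = utf8_bytes.decode("gbk", errors="replace")
is_garbled = garbled != name

# Decoding with the codec that was actually used recovers the text.
restored = utf8_bytes.decode("utf-8")

print(is_garbled, restored == name)
```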

3. Fixing garbled text in Python 3 web crawlers

import urllib.request
res=urllib.request.urlopen('http://www.baidu.com')
htmlBytes=res.read()
print(type(htmlBytes))

htmlStr = htmlBytes.decode('utf-8')

<class 'bytes'>

print(htmlBytes)

b'<!DOCTYPE html>\n<!--STATUS OK-->\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\t\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\t\r\n        \r\n\t\t\t        \r\n\t\r\n\t\t\t        \r\n\t\r\n\t\t\t        \r\n\t\r\n\t\t\t        \r\n\t\t\t    \r\n\r\n\t\r\n

htmlBytes is of type bytes, so it can be decoded into a str with the decode method
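Hard-coding 'utf-8' works for this page, but other sites may serve gbk or another encoding. One possible approach, sketched here under the assumption that the server states the charset in its Content-Type header, is to read the charset from that header and fall back to utf-8. The helper names are hypothetical; with urllib the header value would typically come from res.headers.get('Content-Type').

```python
from email.message import Message

def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset named in a Content-Type header value, or default."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset() or default

def decode_body(body, content_type):
    """Decode response bytes using the charset the server declared."""
    charset = charset_from_content_type(content_type)
    # errors="replace" keeps one bad byte from aborting the whole page.
    return body.decode(charset, errors="replace")

# Simulated response: gbk bytes with a matching header, no network needed.
page = decode_body("小明".encode("gbk"), "text/html; charset=GBK")
print(charset_from_content_type("text/html; charset=GBK"))
```

If the header carries no charset, the page's own meta tag or a detector such as chardet (used earlier) are other options; this sketch does not cover them.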

htmlStr

'<!DOCTYPE html>\n<!--STATUS OK-->\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\t\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\t\r\n        \r\n\t\t\t        \r\n\t\r\n\t\t\t        \r\n\t\r\n\t\t\t        \r\n\t\r\n\t\t\t        \r\n\t\t\t    \r\n\r\n\t\r\n        \r\n\t\t\t        \r\n\t\r\n\t\t\t        \r\n\t\r\n\t\t\t        \r\n\t\r\n\t\t\t        \r\n\t\t\t

print(htmlStr)

<!DOCTYPE html>
<!--STATUS OK-->
<html>
<head>

<meta http-equiv="content-type" content="text/html;charset=utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=Edge">
<meta content="always" name="referrer">
<meta name="theme-color" content="#2932e1">
<link rel="shortcut icon" href="/favicon.ico" type="image/x-icon" />
<link rel="search" type="application/opensearchdescription+xml" href="/content-search.xml" title="百度搜索" />
<link rel="icon" sizes="any" mask href="//www.baidu.com/img/baidu_85beaf5496f291521eb75ba38eacbd87.svg">

<link rel="dns-prefetch" href="//s1.bdstatic.com"/>
<link rel="dns-prefetch" href="//t1.baidu.com"/>
<link rel="dns-prefetch" href="//t2.baidu.com"/>

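As you can see, evaluating htmlStr directly and calling print(htmlStr) produce different output: the interactive shell shows the string's repr(), with escapes such as \r\n left visible, while print() writes the characters themselves.
print() does have a limitation of its own: it cannot always print every unicode character.
After finding the cause, I googled for a solution. The limitation is really that of the output encoding: print() can only write characters that the encoding of sys.stdout can represent, and raises UnicodeEncodeError otherwise.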
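Both effects can be reproduced without a real console. The sketch below assumes Python 3 and stands in for sys.stdout with an in-memory stream whose encoding we choose; print_to_stream is a hypothetical helper written for this illustration.

```python
import io

def print_to_stream(value, encoding, errors="strict"):
    """print() value to an in-memory stream with the given encoding; return the bytes written."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding, errors=errors, newline="")
    print(value, end="", file=stream)
    stream.flush()
    return raw.getvalue()

text = "小明"

echoed = repr("a\nb")   # what the interactive shell shows: the escape stays visible
utf8_out = print_to_stream(text, "utf-8")

try:
    print_to_stream(text, "ascii")   # the stream cannot represent these characters
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# An error handler on the stream turns the failure into escaped output.
escaped_out = print_to_stream(text, "ascii", errors="backslashreplace")
print(echoed, utf8_out, ascii_ok, escaped_out)
```

The same text prints fine to a utf-8 stream and fails on an ascii one, which suggests looking at sys.stdout.encoding first when print() raises on a particular console.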

help(print)

Help on built-in function print in module builtins:

print(...)
print(value, ..., sep=' ', end='\n', file=sys.stdout, flush=False)

Prints the values to a stream, or to sys.stdout by default.
Optional keyword arguments:
file:  a file-like object (stream); defaults to the current sys.stdout.
sep:   string inserted between values, default a space.
end:   string appended after the last value, default a newline.
flush: whether to forcibly flush the stream.

type(htmlStr)

str

To be continued.....

4. Fixing garbled text when reading and writing files

To be continued.....

References:
https://www.douban.com/note/347617467/
http://blog.csdn.net/qq_29053519/article/details/79170519

http://www.cnblogs.com/linjiqin/p/3674825.html
