python - encoding
2015-08-25 17:19
Why does Python print unicode characters when the default encoding is ASCII?
Terminologies
What’s the difference between an “encoding,” a “character set,” and a “code page”?
Character sets, maps and code pages
Character set
A term that is best avoided.[1] A “character set” is just what it says: a properly-specified list of distinct characters.
A “character set” in HTTP (and MIME) parlance is the same as a character encoding (but not the same as a coded character set, CCS).
Encoding
An ‘encoding’ is a mapping between a character set (typically Unicode today) and a (usually byte-based) technical representation of the characters. UTF-8 is an encoding, but not a character set. It is an encoding of the Unicode character set.
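To make the distinction concrete, here is a minimal Python 3 sketch. The sample character is my own choice, not something from the quoted sources. The code point is a property of the Unicode character set; the bytes depend on which encoding is applied to it.

```python
# One character from the Unicode character set: LATIN SMALL LETTER E WITH ACUTE.
ch = "\u00e9"

# The code point belongs to the character set, not to any encoding.
code_point = ord(ch)

# Three encodings map the same code point to three different byte sequences.
as_utf8 = ch.encode("utf-8")
as_utf16 = ch.encode("utf-16-le")
as_latin1 = ch.encode("latin-1")

print(hex(code_point))  # 0xe9
print(as_utf8)          # b'\xc3\xa9'
print(as_utf16)         # b'\xe9\x00'
print(as_latin1)        # b'\xe9'
```

Decoding with the same encoding that produced the bytes gives the original character back, for example `as_utf8.decode("utf-8") == ch`.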
Code page
A code page is a table of values that describes the character set used for encoding a particular set of glyphs.[2] “Code page” is another name for character encoding. It consists of a table of values that describes the character set for a particular language.[3]
Windows code pages are sets of characters or code pages (known as character encodings in other operating systems) used in Microsoft Windows systems from the 1980s and 1990s.
In the ANSI standard, everybody agreed on what to do below 128, which was pretty much the same as ASCII, but there were lots of different ways to handle the characters from 128 and on up, depending on where you lived. These different systems were called code pages.
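A rough way to see this in Python 3: decode the same two bytes with several code pages. The byte values and the three code pages below are my own picks for illustration. The byte below 128 comes out the same everywhere, while the byte above 127 depends on the code page.

```python
# 0x41 is below 128, 0xE9 is above 127.
raw = bytes([0x41, 0xE9])

# Decode the same bytes with three different code pages
# (Western European, original IBM PC, Cyrillic).
decoded = {}
for codec in ("cp1252", "cp437", "cp1251"):
    decoded[codec] = raw.decode(codec)

for codec, text in decoded.items():
    # The first character is always "A"; the second one varies.
    print(codec, [hex(ord(c)) for c in text])
```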
ANSI
I had been misunderstanding the ANSI encoding.
The name “ANSI” is a misnomer, since it doesn’t correspond to any actual ANSI standard, but the name has stuck.[4]
There’s no one fixed ANSI encoding - there are lots of them. Usually when people say “ANSI” they mean “the default locale/codepage for my system”, which in .NET is obtained via Encoding.Default, and is often Windows-1252 but can be other locales.[5]
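The quote above refers to .NET. As a rough Python 3 counterpart (my own addition, not from the quoted answer), the standard `locale` module can report the encoding the system locale prefers. On Windows this is typically the ANSI code page; on Linux and macOS it is usually UTF-8, so the result differs from machine to machine.

```python
import codecs
import locale

# Ask the locale which text encoding it prefers, without changing locale settings.
ansi_name = locale.getpreferredencoding(False)

# Resolve that name to the codec Python would actually use for it.
ansi_codec = codecs.lookup(ansi_name)

print(ansi_name, "->", ansi_codec.name)
```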
UTF-8
The intuition behind UTF-8’s coding scheme.[6] The basic rules are these:
1. If a byte starts with a 0 bit, it’s a single byte value less than 128.
2. If it starts with 11, it’s the first byte of a multi-byte sequence and the number of 1 bits at the start indicates how many bytes there are in total (110xxxxx has two bytes, 1110xxxx has three and 11110xxx has four).
3. If it starts with 10, it’s a continuation byte.
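The three rules can be sketched as bit-mask tests in Python 3. The function name `classify_utf8_byte` and its return labels are hypothetical, chosen only for this illustration.

```python
def classify_utf8_byte(b: int) -> str:
    """Classify one byte (0-255) by its leading bits, following the rules above."""
    if (b & 0b1000_0000) == 0b0000_0000:
        return "single"        # 0xxxxxxx: a value below 128
    if (b & 0b1100_0000) == 0b1000_0000:
        return "continuation"  # 10xxxxxx
    if (b & 0b1110_0000) == 0b1100_0000:
        return "lead-2"        # 110xxxxx: first of two bytes
    if (b & 0b1111_0000) == 0b1110_0000:
        return "lead-3"        # 1110xxxx: first of three bytes
    if (b & 0b1111_1000) == 0b1111_0000:
        return "lead-4"        # 11110xxx: first of four bytes
    return "invalid"           # 11111xxx never appears in UTF-8


# "a" is one byte; the euro sign U+20AC takes three bytes (E2 82 AC).
data = "a\u20ac".encode("utf-8")
kinds = [classify_utf8_byte(b) for b in data]
print(kinds)  # ['single', 'lead-3', 'continuation', 'continuation']
```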
This distinction allows quite handy processing such as being able to back up from any byte in a sequence to find the first byte of that code point. Just search backwards until you find one not beginning with the 10 bits.
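A minimal sketch of that backward search in Python 3, assuming the input is valid UTF-8. The helper name `code_point_start` is my own.

```python
def code_point_start(data: bytes, i: int) -> int:
    """Return the index of the first byte of the code point that contains data[i]."""
    # Step back while the current byte has the form 10xxxxxx (a continuation byte).
    while i > 0 and (data[i] & 0xC0) == 0x80:
        i -= 1
    return i


# Bytes: 61 | E2 82 AC | 62  ("a", euro sign, "b")
data = "a\u20acb".encode("utf-8")
print(code_point_start(data, 3))  # 1: index 3 is the last byte of the euro sign
print(code_point_start(data, 4))  # 4: "b" is a single-byte code point
```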
Similarly, it can also be used for a UTF-8 strlen by only counting non-10xxxxxx bytes.
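The same mask gives a simple code-point counter. This is a sketch in Python 3 with a hypothetical name, `utf8_len`; it assumes valid UTF-8 and counts code points, not user-perceived characters.

```python
def utf8_len(data: bytes) -> int:
    """Count code points by counting every byte that is not 10xxxxxx."""
    return sum(1 for b in data if (b & 0xC0) != 0x80)


text = "na\u00efve \u20ac"      # 7 code points
data = text.encode("utf-8")     # 10 bytes: the i-diaeresis takes 2, the euro sign 3
print(len(data), utf8_len(data), len(text))  # 10 7 7
```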
Links
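Character Encoding Notes: ASCII, Unicode and UTF-8 (字符编码笔记:ASCII,Unicode和UTF-8)
On Unicode Encoding: A Brief Explanation of UCS, UTF, BMP, BOM and Related Terms (谈谈Unicode编码,简要解释UCS、UTF、BMP、BOM等名词)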
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets
Unicode In Python, Completely Demystified
UTF-8 Everywhere
Programming with Unicode
[1] https://en.wikipedia.org/wiki/Category:Character_sets
[2] https://en.wikipedia.org/wiki/Code_page
[3] https://en.wikipedia.org/wiki/Code_page
[4] What is ANSI format?
[5] Unicode, UTF, ASCII, ANSI format differences
[6] UTF-8 Continuation bytes