
python - encoding

2015-08-25 17:19
Why does Python print unicode characters when the default encoding is ASCII?

Terminology

What’s the difference between an “encoding,” a “character set,” and a “code page”?

Character sets, maps and code pages

Character set

A term that should not be used.[1]

A “character set” is just what it says: a properly-specified list of distinct characters.

A “character set” in HTTP (and MIME) parlance is the same as a character encoding (but not the same as a coded character set, CCS).

Encoding

An ‘encoding’ is a mapping between a character set (typically Unicode today) and a (usually byte-based) technical representation of the characters.

UTF-8 is an encoding, not a character set: it is one encoding of the Unicode character set.
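
To make the distinction concrete, here is a minimal sketch, assuming Python 3 (where `str` holds Unicode code points and `bytes` holds encoded data). The sample character is my own choice, not from the sources quoted above. One code point from the Unicode character set yields different byte sequences under different encodings:

```python
# One abstract character (a code point) and several byte representations of it.
ch = "\u00e9"                          # LATIN SMALL LETTER E WITH ACUTE
code_point = ord(ch)                   # position in the Unicode character set

utf8_bytes = ch.encode("utf-8")        # two bytes in UTF-8
utf16_bytes = ch.encode("utf-16-le")   # two bytes in UTF-16 (little endian, no BOM)
latin1_bytes = ch.encode("latin-1")    # one byte in ISO-8859-1

print(hex(code_point), utf8_bytes, utf16_bytes, latin1_bytes)

# Decoding with the matching encoding recovers the same character.
assert utf8_bytes.decode("utf-8") == ch
```

The character set answers "which characters exist and what are their numbers"; the encoding answers "which bytes go on the wire for that number".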

Code page

A code page is a table of values that describes the character set used for encoding a particular set of glyphs.[2]

Code page is another name for character encoding. It consists of a table of values that describes the character set for a particular language.[3]

Windows code pages are sets of characters or code pages (known as character encodings in other operating systems) used in Microsoft Windows systems from the 1980s and 1990s.

In the ANSI standard, everybody agreed on what to do below 128, which was pretty much the same as ASCII, but there were lots of different ways to handle the characters from 128 and on up, depending on where you lived. These different systems were called code pages.
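
A small sketch of what that means in practice, assuming Python 3 and its bundled `cp1252`, `cp437` and `cp1251` codecs. The sample bytes are hypothetical. The bytes below 128 decode the same way everywhere, while the byte 0xE9 becomes a different character under each code page:

```python
# Hypothetical sample: three ASCII bytes followed by one byte above 127.
raw = b"caf\xe9"

code_pages = ("cp1252", "cp437", "cp1251")
decoded = {cp: raw.decode(cp) for cp in code_pages}

for cp, text in decoded.items():
    # The "caf" prefix is identical; only the last character changes.
    print(cp, text, hex(ord(text[-1])))
```

This is why a file of bytes is not interpretable as text until you know which code page it was written in.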

ANSI

I have been misunderstanding the “ANSI” encoding.

The name “ANSI” is a misnomer, since it doesn’t correspond to any actual ANSI standard, but the name has stuck.[4]

There’s no one fixed ANSI encoding; there are lots of them. Usually when people say “ANSI” they mean “the default locale/code page for my system”, which in .NET is obtained via Encoding.Default. It is often Windows-1252 but can be another code page.[5]
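
In Python, a rough counterpart to that lookup is `locale.getpreferredencoding`. Treating it as the equivalent is my assumption, not something the quoted answer states. A minimal sketch, assuming Python 3, that reports what the current system considers its default:

```python
import locale
import sys

# The encoding the platform's locale settings prefer for text data.
# Passing False avoids changing the process-wide locale as a side effect.
preferred = locale.getpreferredencoding(False)

print("locale preferred encoding:", preferred)
# The encoding Python uses when printing to the terminal (may differ).
print("stdout encoding:", sys.stdout.encoding)
```

The result depends entirely on the machine: a Western European Windows install typically reports `cp1252`, while most Linux and macOS systems report `UTF-8`.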

UTF-8

The intuition behind UTF-8’s coding scheme.[6]

The basic rules are these:

1. If a byte starts with a 0 bit, it’s a single byte value less than 128.

2. If it starts with 11, it’s the first byte of a multi-byte sequence, and the number of 1 bits at the start indicates how many bytes there are in total (110xxxxx has two bytes, 1110xxxx has three and 11110xxx has four).

3. If it starts with 10, it’s a continuation byte.
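
The three rules can be sketched as bit-mask checks. This is an illustrative helper of my own (the name `classify_utf8_byte` is hypothetical), assuming Python 3. It looks only at the leading bits and does not validate a whole sequence:

```python
def classify_utf8_byte(b: int) -> str:
    """Classify a single byte value (0-255) by its leading bits."""
    if (b & 0b1000_0000) == 0b0000_0000:
        return "single"          # 0xxxxxxx: value below 128
    if (b & 0b1100_0000) == 0b1000_0000:
        return "continuation"    # 10xxxxxx
    if (b & 0b1110_0000) == 0b1100_0000:
        return "lead-2"          # 110xxxxx: first of two bytes
    if (b & 0b1111_0000) == 0b1110_0000:
        return "lead-3"          # 1110xxxx: first of three bytes
    if (b & 0b1111_1000) == 0b1111_0000:
        return "lead-4"          # 11110xxx: first of four bytes
    return "invalid"             # 11111xxx never appears in UTF-8


# The euro sign U+20AC encodes to three bytes: one lead, two continuations.
euro_kinds = [classify_utf8_byte(b) for b in "\u20ac".encode("utf-8")]
print(euro_kinds)
```

A real decoder also has to reject overlong forms and out-of-range values; this sketch deliberately skips those checks.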



This distinction allows quite handy processing such as being able to back up from any byte in a sequence to find the first byte of that code point. Just search backwards until you find one not beginning with the 10 bits.
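
A minimal sketch of that backward search, assuming Python 3 and input that is already valid UTF-8 (the function name `start_of_code_point` is hypothetical):

```python
def start_of_code_point(data: bytes, index: int) -> int:
    """Return the index of the first byte of the code point covering data[index]."""
    # Step backwards while the current byte looks like 10xxxxxx.
    while index > 0 and (data[index] & 0b1100_0000) == 0b1000_0000:
        index -= 1
    return index


# "a", then the three-byte euro sign, then "b": bytes 61 E2 82 AC 62.
sample = "a\u20acb".encode("utf-8")
print([start_of_code_point(sample, i) for i in range(len(sample))])
```

Every index inside the euro sign maps back to index 1, its lead byte, without decoding anything before it.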


Similarly, it can also be used for a UTF-8 strlen by only counting non-10xxxxxx bytes.
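
The same mask gives a code point count. This is a sketch assuming Python 3 and valid UTF-8 input (`utf8_len` is a hypothetical name; in everyday Python you would simply decode and call `len`):

```python
def utf8_len(data: bytes) -> int:
    """Count code points by skipping continuation bytes (10xxxxxx)."""
    return sum(1 for b in data if (b & 0b1100_0000) != 0b1000_0000)


text = "na\u00efve \u20ac"          # 7 code points
encoded = text.encode("utf-8")      # more bytes than code points
print(len(encoded), utf8_len(encoded), len(text))
```

The byte length and the code point count diverge as soon as any non-ASCII character appears, which is the point of counting only non-continuation bytes.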

Links

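字符编码笔记:ASCII,Unicode和UTF-8 (Character encoding notes: ASCII, Unicode and UTF-8)

谈谈Unicode编码,简要解释UCS、UTF、BMP、BOM等名词 (On Unicode encoding, with a brief explanation of terms such as UCS, UTF, BMP and BOM)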

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets

Unicode In Python, Completely Demystified

UTF-8 Everywhere

Programming with Unicode

[1] https://en.wikipedia.org/wiki/Category:Character_sets
[2] https://en.wikipedia.org/wiki/Code_page
[3] https://en.wikipedia.org/wiki/Code_page
[4] What is ANSI format?
[5] Unicode, UTF, ASCII, ANSI format differences
[6] UTF-8 Continuation bytes