Should UTF-8 use a BOM? Problems caused by DirectVobSub not supporting UTF-8 without a BOM

2010-01-08 17:33

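I use DirectVobSub as the subtitle plugin for my media player.

When I convert a subtitle file to UTF-8 without a BOM, the subtitles are garbled during playback.

When I convert the same file to UTF-8 with a BOM, the subtitles display correctly.

So it appears that DirectVobSub does not support UTF-8 without a BOM.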
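The conversion described above can be sketched in a few lines of Python. This is only an illustration, not the tool I actually used: the helper names `ensure_utf8_bom` and `add_bom_to_file` are made up for this example, and the sketch assumes the subtitle file is already valid UTF-8.

```python
import codecs
from pathlib import Path


def ensure_utf8_bom(data: bytes) -> bytes:
    """Return data prefixed with the UTF-8 BOM (EF BB BF) unless it already has one."""
    if data.startswith(codecs.BOM_UTF8):
        return data
    return codecs.BOM_UTF8 + data


def add_bom_to_file(path: str) -> None:
    """Rewrite a file in place so that it starts with the UTF-8 BOM.

    Hypothetical helper: it does not check that the content really is UTF-8.
    """
    p = Path(path)
    p.write_bytes(ensure_utf8_bom(p.read_bytes()))
```

Calling `add_bom_to_file("movie.srt")` (the file name is just an example) would then produce the "UTF-8 with BOM" variant of the file.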

Official site of DirectVobSub (vsfilter): http://sourceforge.net/projects/guliverkli2/files/DirectShow%20Filters/

So should UTF-8 use a BOM or not? What does the Unicode standard say?

I looked it up; here is what I found, for reference:

http://zh.wikipedia.org/zh-cn/UTF-8#UTF-8.E7.9A.84.E8.A1.8D.E7.94.9F.E7.89.A9

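Wikipedia says:

Although it is not part of the standard, many Windows programs (including Windows Notepad) add the byte sequence EF BB BF at the start of UTF-8 encoded files. This is the UTF-8 encoding of the byte order mark U+FEFF. Text editors and browsers that do not expect UTF-8 will display it as the ISO-8859-1 string "ï»¿".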
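Both claims in this passage are easy to check. The following is a small Python check of my own, not something taken from Wikipedia:

```python
# U+FEFF is the byte order mark; encode it as UTF-8.
bom = "\ufeff".encode("utf-8")
print(bom.hex())  # efbbbf

# Misread those three bytes as ISO-8859-1, as a program that does not
# expect UTF-8 would: each byte becomes one Latin-1 character.
mojibake = bom.decode("iso-8859-1")
print(mojibake)  # ï»¿
```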

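Judging from Wikipedia's wording, it looks as though a BOM should not be used.

Going by my prejudice that "Microsoft is reliable when sows climb trees", and since Windows Notepad's "Save As" UTF-8 produces a BOM while gedit produces UTF-8 without one, I assumed that UTF-8 should not use a BOM.

Then I found http://unicode.org/faq/utf_bom.html#bom1

unicode.org says: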

Q: How I should deal with BOMs?

A: Here are some guidelines to follow:

1. A particular protocol (e.g. Microsoft conventions for .txt files) may require use of the BOM on certain Unicode data streams, such as files. When you need to conform to such a protocol, use a BOM.

2. Some protocols allow optional BOMs in the case of untagged text. In those cases,

   a. Where a text data stream is known to be plain text, but of unknown encoding, BOM can be used as a signature. If there is no BOM, the encoding could be anything.

   b. Where a text data stream is known to be plain Unicode text (but not which endian), then BOM can be used as a signature. If there is no BOM, the text should be interpreted as big-endian.

3. Some byte oriented protocols expect ASCII characters at the beginning of a file. If UTF-8 is used with these protocols, use of the BOM as encoding form signature should be avoided.

4. Where the precise type of the data stream is known (e.g. Unicode big-endian or Unicode little-endian), the BOM should not be used. In particular, whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE a BOM must not be used. See also [Q: What is the difference between UCS-2 and UTF-16?] [AF] & [MD]
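To make "BOM as a signature" concrete, here is a minimal Python sketch of BOM sniffing. The function name `sniff_bom` and the table `_BOMS` are my own illustration, not part of the FAQ. The sketch only recognizes a signature; when none is found it returns `None` and leaves the fallback policy to the caller.

```python
import codecs

# Longer signatures come first: the UTF-32 LE BOM (FF FE 00 00) begins with
# the UTF-16 LE BOM (FF FE), so checking UTF-16 first would misdetect it.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) when no BOM is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def decode_with_bom(data: bytes, fallback: str = "utf-8") -> str:
    """Decode using the BOM if there is one, otherwise the caller's fallback."""
    encoding, skip = sniff_bom(data)
    return data[skip:].decode(encoding or fallback)
```

The `fallback` parameter reflects guideline 2a: without a BOM the encoding "could be anything", so the guess has to come from somewhere else.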

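Now I get it. My understanding is:

When a plain text file carries no encoding declaration, use a BOM. Without a BOM, the encoding is hard to determine.

When the data declares its encoding, as with data stored in a database (where the database declares the encoding), XML (declared with encoding="utf-8"), or HTML (declared with charset=utf-8), the BOM should not be used.

It follows that using a UTF-8 BOM in plain text files is acceptable.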
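On the reading side, a program that already knows its input is UTF-8 can accept both variants. As a sketch, Python's built-in `utf-8-sig` codec drops a leading BOM when decoding and otherwise behaves like plain `utf-8`. The function name `decode_utf8_any` is made up for this example, and I am not claiming that any particular player works this way.

```python
def decode_utf8_any(data: bytes) -> str:
    """Decode UTF-8 bytes, silently dropping a leading BOM if there is one."""
    return data.decode("utf-8-sig")


with_bom = b"\xef\xbb\xbf" + "字幕".encode("utf-8")
without_bom = "字幕".encode("utf-8")

# Both variants decode to the same text.
same = decode_utf8_any(with_bom) == decode_utf8_any(without_bom)

# The plain "utf-8" codec keeps the BOM as a leading U+FEFF character instead.
kept = with_bom.decode("utf-8")
```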

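It suddenly occurred to me that the players I used on Linux (mplayer, smplayer) all showed garbled text for UTF-8 subtitles. Could that be because text editors on Linux (gedit and others) produce UTF-8 without a BOM?

References:

A discussion of Unicode encoding, with brief explanations of UCS, UTF, BMP, BOM and related terms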

http://blog.csdn.net/fmddlmyy/archive/2005/05/04/372148.aspx

How I should deal with BOMs?

http://unicode.org/faq/utf_bom.html#bom1

Byte order mark (BOM)

http://zh.wikipedia.org/zh-cn/%E4%BD%8D%E5%85%83%E7%B5%84%E9%A0%86%E5%BA%8F%E8%A8%98%E8%99%9F
