
Why htmlparser garbles Traditional Chinese text, and how to fix it

2008-04-05 01:48
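I recently noticed that when htmlparser parses certain web pages, Traditional Chinese characters come out garbled. The cause turned out to be this: when you use StringBean, htmlparser picks the decoding charset itself from the page's meta tag. Some sites declare gb2312 as the charset in the meta tag, yet the page bytes are actually GBK. GB2312 cannot represent Traditional Chinese characters, so those characters are decoded as garbage. The fix is simple. GBK is backward compatible with GB2312, so every valid GB2312 page decodes identically under GBK. In htmlparser's Page.java, add one check to getCharset(): if ret is gb2312, set it to GBK. That resolves the problem.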
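The mismatch described above can be reproduced without htmlparser. The following is a minimal sketch, assuming a JDK that ships both the GB2312 and GBK charsets (standard full JDK builds do); the class and method names are made up for illustration and are not part of htmlparser.

```java
import java.nio.charset.Charset;

// Hypothetical demo class, not part of htmlparser.
class Gb2312VersusGbk {

    // True if the named charset has a code point for the given character.
    static boolean canEncode(String charsetName, char c) {
        return Charset.forName(charsetName).newEncoder().canEncode(c);
    }

    // Decode raw bytes with the named charset; undecodable input
    // is replaced rather than reported as an error.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // "Traditional Chinese" written in traditional characters,
        // given as Unicode escapes so the source file stays ASCII.
        String traditional = "\u7e41\u9ad4\u4e2d\u6587";

        // Simulate a page whose bytes are really GBK.
        byte[] pageBytes = traditional.getBytes(Charset.forName("GBK"));

        // Decoding with the charset the meta tag claims (gb2312) loses text;
        // decoding with GBK restores it.
        System.out.println("as GB2312 intact: "
            + decode(pageBytes, "GB2312").equals(traditional));
        System.out.println("as GBK intact:    "
            + decode(pageBytes, "GBK").equals(traditional));

        // U+9AD4 is a traditional-only character.
        System.out.println("GB2312 has U+9AD4: " + canEncode("GB2312", '\u9ad4'));
        System.out.println("GBK has U+9AD4:    " + canEncode("GBK", '\u9ad4'));
    }
}
```

Text that uses only characters present in GB2312 produces the same bytes under both charsets, which is why treating a gb2312 label as GBK should be safe for pages that are correctly labeled.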

The modified code in Page.java (/lexer/Page.java) is as follows:

public String getCharset (String content)
{
    final String CHARSET_STRING = "charset";
    int index;
    String ret;

    if (null == mSource)
        ret = DEFAULT_CHARSET;
    else
        // use existing (possibly supplied) character set:
        // bug #1322686 when illegal charset specified
        ret = mSource.getEncoding ();
    if (null != content)
    {
        index = content.indexOf (CHARSET_STRING);

        if (index != -1)
        {
            content = content.substring (index +
                CHARSET_STRING.length ()).trim ();
            if (content.startsWith ("="))
            {
                content = content.substring (1).trim ();
                index = content.indexOf (";");
                if (index != -1)
                    content = content.substring (0, index);

                //remove any double quotes from around charset string
                if (content.startsWith ("\"") && content.endsWith ("\"")
                    && (1 < content.length ()))
                    content = content.substring (1, content.length () - 1);

                //remove any single quote from around charset string
                if (content.startsWith ("'") && content.endsWith ("'")
                    && (1 < content.length ()))
                    content = content.substring (1, content.length () - 1);

                ret = findCharset (content, ret);

                // Charset names are not case-sensitive;
                // that is, case is always ignored when comparing
                // charset names.
                // if (!ret.equalsIgnoreCase (content))
                // {
                //     System.out.println (
                //         "detected charset \""
                //         + content
                //         + "\", using \""
                //         + ret
                //         + "\"");
                // }
            }
        }
    }
    if(ret.equalsIgnoreCase("gb2312"))ret="GBK"; //to avoid decode problem
    //edited by linyunfan
    return (ret);
}

The only change is this line, added just before the return:

if(ret.equalsIgnoreCase("gb2312"))ret="GBK";
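If patching Page.java is not an option, the same promotion can be applied to a charset label before it reaches the decoder. The sketch below is a standalone illustration, not htmlparser code: CharsetPromotion, extractCharset, and promote are hypothetical names, and the parsing only loosely mirrors what getCharset() does with a Content-Type value.

```java
import java.util.Locale;

// Hypothetical helper class, not part of htmlparser.
class CharsetPromotion {

    // Pull the charset label out of a Content-Type value such as
    // "text/html; charset=gb2312". Returns fallback if none is found.
    static String extractCharset(String contentType, String fallback) {
        if (contentType == null)
            return fallback;
        int index = contentType.toLowerCase(Locale.ROOT).indexOf("charset");
        if (index == -1)
            return fallback;
        String rest = contentType.substring(index + "charset".length()).trim();
        if (!rest.startsWith("="))
            return fallback;
        rest = rest.substring(1).trim();
        int semi = rest.indexOf(';');
        if (semi != -1)
            rest = rest.substring(0, semi).trim();
        // Strip one pair of surrounding double or single quotes.
        boolean doubleQuoted = rest.startsWith("\"") && rest.endsWith("\"");
        boolean singleQuoted = rest.startsWith("'") && rest.endsWith("'");
        if (rest.length() > 1 && (doubleQuoted || singleQuoted))
            rest = rest.substring(1, rest.length() - 1);
        return rest.isEmpty() ? fallback : rest;
    }

    // Treat a declared gb2312 as GBK, its superset; leave other labels alone.
    static String promote(String charset) {
        return "gb2312".equalsIgnoreCase(charset) ? "GBK" : charset;
    }

    public static void main(String[] args) {
        String declared = extractCharset("text/html; charset=gb2312", "ISO-8859-1");
        System.out.println(declared + " -> " + promote(declared));
    }
}
```

Keeping the promotion in its own method leaves a single place to add other label-to-superset mappings if similar mislabeled pages turn up.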