
Fixing garbled Chinese pages when scraping with HtmlAgilityPack

2010-08-10 13:39

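HtmlAgilityPack is an open-source HTML parser written in C#. Some parts of its design are less than polished, though: fetching a Chinese web page the standard way often produces garbled text. Take the Xinhuanet home page (http://xinhua.org) as an example. Following the HtmlAgilityPack samples, the crawl code looks like this: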

HtmlWeb hw = new HtmlWeb();
string url = @"http://xinhua.org";
HtmlDocument doc = hw.Load(url);
doc.Save("output.html");

Opened in IE, the saved page is garbled.

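After working through the maze of HtmlAgilityPack's source code, I finally traced the problem to the Get(Uri uri, string method, string path, HtmlDocument doc) method of the HtmlWeb class. That method contains the following code: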

HttpWebResponse resp;

try
{
    resp = req.GetResponse() as HttpWebResponse;
}
……
if ((resp.ContentEncoding != null) && (resp.ContentEncoding.Length > 0))
{
    respenc = System.Text.Encoding.GetEncoding(resp.ContentEncoding);
}
else
{
    respenc = null;
}
……
Stream s = resp.GetResponseStream();
if (s != null)
{
    if (UsingCache)
    {
        // NOTE: LastModified does not contain milliseconds, so we remove them to the file
        SaveStream(s, cachePath, RemoveMilliseconds(resp.LastModified), _streamBufferSize);

        // save headers
        SaveCacheHeaders(req.RequestUri, resp);

        if (path != null)
        {
            // copy and touch the file
            IOLibrary.CopyAlways(cachePath, path);
            File.SetLastWriteTime(path, File.GetLastWriteTime(cachePath));
        }
    }
    else
    {
        // try to work in-memory
        if ((doc != null) && (html))
        {
            if (respenc != null)
            {
                doc.Load(s, respenc);
            }
            else
            {
                doc.Load(s, true);
            }
        }
    }
    resp.Close();
}

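Here resp is the response to the HTTP request. Setting a breakpoint shows that resp.ContentEncoding is empty. That property reflects the HTTP Content-Encoding header, which names a compression scheme such as gzip rather than a character set, so respenc stays null. The final load therefore becomes doc.Load(s, true); that Load overload may itself be at fault, and what comes out in the end is garbled text.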
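To check what the server actually sends, a small probe along the following lines can print the Content-Encoding and Content-Type headers side by side. This is only a sketch: HeaderProbe and Summarize are names made up for illustration, and the code uses nothing beyond the standard ContentEncoding and ContentType properties of HttpWebResponse.

```csharp
using System;
using System.Net;

static class HeaderProbe
{
    // Hypothetical helper: label the two headers that are easy to confuse.
    public static string Summarize(string contentEncoding, string contentType)
    {
        string compression = string.IsNullOrEmpty(contentEncoding) ? "(none)" : contentEncoding;
        string type = string.IsNullOrEmpty(contentType) ? "(none)" : contentType;
        return "Content-Encoding (compression): " + compression + Environment.NewLine
             + "Content-Type (media type and charset): " + type;
    }

    public static void Main(string[] args)
    {
        // The URL is the one used in this article; pass another as the first argument.
        string url = args.Length > 0 ? args[0] : "http://xinhua.org";
        var req = (HttpWebRequest)WebRequest.Create(url);
        using (var resp = (HttpWebResponse)req.GetResponse())
        {
            // ContentEncoding is the Content-Encoding header (e.g. gzip), not the charset.
            Console.WriteLine(Summarize(resp.ContentEncoding, resp.ContentType));
        }
    }
}
```

If the first line prints "(none)" while the second carries a charset parameter, the server does declare a character set, just not in the header that HtmlWeb.Get inspects.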

Solution:

Don't use HtmlWeb; the class is immature. Issue the HTTP request yourself, as in the following code:

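// Requires: using System; using System.IO; using System.Net; using HtmlAgilityPack;
string url = @"http://xinhua.org";
HttpWebRequest req = (HttpWebRequest)WebRequest.Create(new Uri(url));
req.Method = "GET";
try
{
    // Dispose the response and its stream once the document has been loaded.
    using (WebResponse rs = req.GetResponse())
    using (Stream rss = rs.GetResponseStream())
    {
        HtmlDocument doc = new HtmlDocument();
        // No encoding is passed here, so the document's default stream encoding is used.
        doc.Load(rss);
        doc.Save("output.html");
    }
}
catch (Exception e)
{
    Console.WriteLine(e.Message);
    Console.WriteLine(e.StackTrace);
}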

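In the code above, doc.Load(…) uses System.Text.Encoding.Default, which on my machine is gb2312.
HtmlDocument can also load a stream with an explicitly specified encoding. There are two ways to obtain that encoding:
(1) the charset the server declares for the page can be read from the HttpWebResponse object (its CharacterSet property, which comes from the Content-Type header);
(2) for HTML pages that provide no charset, HtmlDocument offers an encoding auto-detection method, DetectEncoding(…). I have not tested this method, so I cannot say how reliable it is.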
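The two options can be combined roughly as follows. This is an untested sketch rather than a definitive implementation: CharsetAwareLoader, ExtractCharset and ResolveEncoding are hypothetical names, and the code assumes .NET Framework 4 or later with HtmlAgilityPack referenced. It parses the raw Content-Type header instead of reading CharacterSet because, as far as I know, CharacterSet can report ISO-8859-1 for text pages that declare no charset, which would hide the "no charset" case. What DetectEncoding returns when it finds nothing may differ between library versions, so a null result is handled explicitly.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;
using HtmlAgilityPack;

static class CharsetAwareLoader
{
    // Hypothetical helper: pull the charset parameter out of a Content-Type value.
    // Returns null when the header is missing or carries no charset.
    public static string ExtractCharset(string contentType)
    {
        if (string.IsNullOrEmpty(contentType)) return null;
        foreach (string part in contentType.Split(';'))
        {
            string p = part.Trim();
            if (p.StartsWith("charset=", StringComparison.OrdinalIgnoreCase))
            {
                string value = p.Substring("charset=".Length).Trim().Trim('"', '\'');
                return value.Length > 0 ? value : null;
            }
        }
        return null;
    }

    // Hypothetical helper: map a charset name to an Encoding, or null if unknown.
    // On .NET Core and later, gb2312 is only known after the code pages provider is registered.
    public static Encoding ResolveEncoding(string charset)
    {
        if (string.IsNullOrEmpty(charset)) return null;
        try { return Encoding.GetEncoding(charset); }
        catch (ArgumentException) { return null; }
        catch (NotSupportedException) { return null; }
    }

    public static HtmlDocument Load(string url)
    {
        byte[] bytes;
        string contentType;
        var req = (HttpWebRequest)WebRequest.Create(url);
        using (var resp = (HttpWebResponse)req.GetResponse())
        using (var body = resp.GetResponseStream())
        using (var buffer = new MemoryStream())
        {
            // Buffer the body so it can be read once for detection and once for parsing.
            body.CopyTo(buffer);
            bytes = buffer.ToArray();
            contentType = resp.ContentType;
        }

        // Option (1): the charset declared in the HTTP response.
        Encoding enc = ResolveEncoding(ExtractCharset(contentType));

        // Option (2): let HtmlAgilityPack look for a declared encoding in the markup.
        if (enc == null)
        {
            enc = new HtmlDocument().DetectEncoding(new MemoryStream(bytes));
        }

        var doc = new HtmlDocument();
        // Last resort: the system default, as in the hand-written request above.
        doc.Load(new MemoryStream(bytes), enc ?? Encoding.Default);
        return doc;
    }
}
```

With this in place, CharsetAwareLoader.Load("http://xinhua.org").Save("output.html"); would stand in for the HtmlWeb call from the first example.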

Source: http://community.icburner.com/blogs/vs2010tests/archive/2009/07/09/better-html-parsing-and-validation-with-htmlagilitypack.aspx
