Get HTML Content Using HttpDownloader
2011-11-27 19:13
http://stackoverflow.com/questions/2700638/characters-in-string-changed-after-downloading-html-from-the-internet
Issue
Using the following code, I can download the HTML of a file from the internet:
    WebClient wc = new WebClient();
    // ....
    string downloadedFile = wc.DownloadString("http://www.myurl.com/");
However, sometimes the file contains "interesting" characters, and they come back garbled: é turns into Ã©, ← into â†, and フシギダネ into ãƒ•ã‚·ã‚®ãƒ€ãƒ.
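Garbling of this kind typically appears when UTF-8 bytes are decoded with a single-byte encoding. The following is a minimal sketch of that effect, assuming the bytes on the wire are UTF-8 and the decoder assumes ISO-8859-1. The class and method names (MojibakeDemo, DecodeUtf8AsLatin1, Repair) are hypothetical and not part of the original post.

```csharp
using System;
using System.Text;

// Hypothetical demo class, for illustration only.
public static class MojibakeDemo
{
    // Decode UTF-8 bytes with the wrong (ISO-8859-1) encoding,
    // the way a download with a mismatched charset would.
    public static string DecodeUtf8AsLatin1(string text)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        return Encoding.GetEncoding("ISO-8859-1").GetString(utf8Bytes);
    }

    // Undo the damage: recover the raw bytes, then decode them as UTF-8.
    public static string Repair(string garbled)
    {
        byte[] rawBytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(garbled);
        return Encoding.UTF8.GetString(rawBytes);
    }

    public static void Main()
    {
        string garbled = DecodeUtf8AsLatin1("\u00E9");            // "é"
        Console.WriteLine(garbled == "\u00C3\u00A9");             // True: the string is now "Ã©"
        Console.WriteLine(Repair(garbled) == "\u00E9");           // True: back to "é"
    }
}
```

The round trip only works because ISO-8859-1 maps every byte to exactly one character. It is shown here to explain the symptom; decoding with the right encoding in the first place is the better fix.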
Resolution
Here's a download wrapper class that supports gzip and deflate compression and checks both the Content-Type response header and the page's meta tags to choose the encoding, so that the content is decoded correctly.
Instantiate the class and call GetPage():
    using System;
    using System.IO;
    using System.IO.Compression;
    using System.Net;
    using System.Text;
    using System.Text.RegularExpressions;

    public class HttpDownloader
    {
        private readonly string _referer;
        private readonly string _userAgent;

        public Encoding Encoding { get; set; }
        public WebHeaderCollection Headers { get; set; }
        public Uri Url { get; set; }

        public HttpDownloader(string url, string referer, string userAgent)
        {
            // Fallback used when neither the header nor a meta tag names a charset.
            Encoding = Encoding.GetEncoding("ISO-8859-1");
            Url = new Uri(url); // throws if the URL is malformed
            _userAgent = userAgent;
            _referer = referer;
        }

        public string GetPage()
        {
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(Url);
            if (!string.IsNullOrEmpty(_referer))
                request.Referer = _referer;
            if (!string.IsNullOrEmpty(_userAgent))
                request.UserAgent = _userAgent;

            request.Headers.Add(HttpRequestHeader.AcceptEncoding, "gzip,deflate");

            using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
            {
                Headers = response.Headers;
                Url = response.ResponseUri; // final URL after any redirects
                return ProcessContent(response);
            }
        }

        private string ProcessContent(HttpWebResponse response)
        {
            SetEncodingFromHeader(response);

            // Wrap the body in a decompressing stream if the server compressed it.
            Stream s = response.GetResponseStream();
            string contentEncoding = (response.ContentEncoding ?? string.Empty).ToLowerInvariant();
            if (contentEncoding.Contains("gzip"))
                s = new GZipStream(s, CompressionMode.Decompress);
            else if (contentEncoding.Contains("deflate"))
                s = new DeflateStream(s, CompressionMode.Decompress);

            // Buffer the raw bytes so they can be decoded a second time if needed.
            MemoryStream memStream = new MemoryStream();
            using (s)
            {
                s.CopyTo(memStream);
            }

            memStream.Position = 0;
            using (StreamReader r = new StreamReader(memStream, Encoding))
            {
                string html = r.ReadToEnd().Trim();
                return CheckMetaCharSetAndReEncode(memStream, html);
            }
        }

        private void SetEncodingFromHeader(HttpWebResponse response)
        {
            string charset = null;
            if (string.IsNullOrEmpty(response.CharacterSet))
            {
                // Fall back to parsing the Content-Type header by hand.
                Match m = Regex.Match(response.ContentType ?? string.Empty,
                    @";\s*charset\s*=\s*(?<charset>.*)", RegexOptions.IgnoreCase);
                if (m.Success)
                    charset = m.Groups["charset"].Value.Trim(new[] { '\'', '"' });
            }
            else
            {
                charset = response.CharacterSet;
            }

            if (!string.IsNullOrEmpty(charset))
            {
                try
                {
                    Encoding = Encoding.GetEncoding(charset);
                }
                catch (ArgumentException)
                {
                    // Unknown charset name: keep the current encoding.
                }
            }
        }

        private string CheckMetaCharSetAndReEncode(Stream memStream, string html)
        {
            // The optional quote also matches the HTML5 form <meta charset="utf-8">.
            Match m = new Regex(@"<meta\s+.*?charset\s*=\s*[""']?(?<charset>[A-Za-z0-9_-]+)",
                RegexOptions.Singleline | RegexOptions.IgnoreCase).Match(html);
            if (m.Success)
            {
                string charset = m.Groups["charset"].Value.ToLowerInvariant();
                // The tag was readable with a single-byte decoder, so the page
                // cannot really be UTF-16; treat it as UTF-8 instead.
                if ((charset == "unicode") || (charset == "utf-16"))
                    charset = "utf-8";

                try
                {
                    Encoding metaEncoding = Encoding.GetEncoding(charset);
                    if (!Encoding.Equals(metaEncoding))
                    {
                        // Decode the buffered bytes again with the declared charset.
                        memStream.Position = 0L;
                        using (StreamReader recodeReader = new StreamReader(memStream, metaEncoding))
                        {
                            html = recodeReader.ReadToEnd().Trim();
                        }
                    }
                }
                catch (ArgumentException)
                {
                    // Unknown charset name: keep the text decoded on the first pass.
                }
            }
            return html;
        }
    }
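The meta tag check can be tried in isolation, without a network call. The sketch below decodes an in-memory byte array with ISO-8859-1 first, looks for a charset declaration in a meta tag, and decodes again if one is found. MetaCharsetDemo, FindMetaCharset, and Decode are hypothetical names used only for this illustration, and the regular expression is a simplified assumption, not a full HTML parser.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical demo class, for illustration only.
public static class MetaCharsetDemo
{
    // Returns the charset named in a <meta> tag, or "" if none is declared.
    public static string FindMetaCharset(string html)
    {
        Match m = Regex.Match(html,
            @"<meta\s+[^>]*?charset\s*=\s*[""']?(?<charset>[A-Za-z0-9_-]+)",
            RegexOptions.Singleline | RegexOptions.IgnoreCase);
        return m.Success ? m.Groups["charset"].Value : "";
    }

    // First pass with ISO-8859-1 (every byte maps to one character, so the
    // ASCII markup stays readable), then a second pass with the declared charset.
    public static string Decode(byte[] body)
    {
        string firstPass = Encoding.GetEncoding("ISO-8859-1").GetString(body);
        string charset = FindMetaCharset(firstPass);
        if (charset.Length == 0)
            return firstPass;

        try
        {
            return Encoding.GetEncoding(charset).GetString(body);
        }
        catch (ArgumentException)
        {
            // Unknown or unsupported charset name: keep the first-pass text.
            return firstPass;
        }
    }

    public static void Main()
    {
        byte[] body = Encoding.UTF8.GetBytes(
            "<html><head><meta charset=\"utf-8\"></head><body>caf\u00E9</body></html>");
        Console.WriteLine(Decode(body).Contains("caf\u00E9")); // True
    }
}
```

Decoding twice costs an extra pass over the buffer, but it avoids guessing the encoding before the markup has been read.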