Fetching a Web Page's HTML Source with HttpWebRequest (with Automatic Encoding Detection)
2012-08-14 15:20
The code also recognizes GZIP-compressed responses.
The following content is reposted from /article/5653806.html
    using System;
    using System.IO;
    using System.IO.Compression;
    using System.Net;
    using System.Text;
    using System.Text.RegularExpressions;

    namespace WikiPageCreater.Common
    {
        public class PageHelper
        {
            /// <summary>
            /// Gets the character encoding of the page at the given url.
            /// </summary>
            /// <param name="url">Address of the page to inspect.</param>
            /// <returns>The charset name, or the system default encoding's name if none is found.</returns>
            public static string GetEncoding(string url)
            {
                HttpWebResponse response = null;
                StreamReader reader = null;
                try
                {
                    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
                    request.Timeout = 20000;            // milliseconds
                    request.AllowAutoRedirect = false;  // a redirect reply is not followed and yields the default
                    response = (HttpWebResponse)request.GetResponse();
                    // ContentLength is -1 when the header is missing, so such responses pass the 1 MB check.
                    if (response.StatusCode == HttpStatusCode.OK && response.ContentLength < 1024 * 1024)
                    {
                        Stream body = response.GetResponseStream();
                        if ("gzip".Equals(response.ContentEncoding, StringComparison.OrdinalIgnoreCase))
                            body = new GZipStream(body, CompressionMode.Decompress);
                        // Charset names are plain ASCII, so ASCII decoding is enough for this scan.
                        reader = new StreamReader(body, Encoding.ASCII);
                        string html = reader.ReadToEnd();

                        // Matches content="text/html; charset=utf-8" as well as <meta charset="utf-8">.
                        Match match = Regex.Match(html, @"charset\s*=\s*[""']?(?<charset>[\w-]+)", RegexOptions.IgnoreCase);
                        if (match.Success)
                            return match.Groups["charset"].Value;
                        if (!string.IsNullOrEmpty(response.CharacterSet))
                            return response.CharacterSet;
                    }
                }
                catch
                {
                    // Best effort: any failure falls through to the default below.
                }
                finally
                {
                    if (reader != null) reader.Close();
                    if (response != null) response.Close();
                }
                return Encoding.Default.BodyName;
            }

            /// <summary>
            /// Gets the html source of the page at the given url, decoded with the given encoding.
            /// </summary>
            /// <param name="url">Address of the page to download.</param>
            /// <param name="encoding">Encoding used to decode the response body.</param>
            /// <returns>The html source, or an empty string on failure.</returns>
            public static string GetHtml(string url, Encoding encoding)
            {
                HttpWebResponse response = null;
                StreamReader reader = null;
                try
                {
                    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
                    request.Timeout = 20000;
                    request.AllowAutoRedirect = false;
                    response = (HttpWebResponse)request.GetResponse();
                    if (response.StatusCode == HttpStatusCode.OK && response.ContentLength < 1024 * 1024)
                    {
                        Stream body = response.GetResponseStream();
                        if ("gzip".Equals(response.ContentEncoding, StringComparison.OrdinalIgnoreCase))
                            body = new GZipStream(body, CompressionMode.Decompress);
                        reader = new StreamReader(body, encoding);
                        return reader.ReadToEnd();
                    }
                }
                catch
                {
                    // Best effort: any failure yields an empty string.
                }
                finally
                {
                    if (reader != null) reader.Close();
                    if (response != null) response.Close();
                }
                return string.Empty;
            }
        }
    }

==============================================================
For more sample code, visit Microsoft's CodePlex site at http://1code.codeplex.com and download Microsoft's All-In-One Code Framework.
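GetEncoding returns a charset name as a string, while GetHtml expects an Encoding object, so something has to bridge the two. The following is a minimal sketch of one way to do that. The class and method names (`EncodingResolver`, `Resolve`) are hypothetical and not part of the original post, and the choice of fallback is an assumption left to the caller.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of the original PageHelper class.
public static class EncodingResolver
{
    // Maps a charset name such as "utf-8" to an Encoding.
    // Returns the fallback when the name is empty or not supported on this runtime.
    public static Encoding Resolve(string charsetName, Encoding fallback)
    {
        if (string.IsNullOrWhiteSpace(charsetName))
            return fallback;
        try
        {
            return Encoding.GetEncoding(charsetName.Trim());
        }
        catch (ArgumentException)
        {
            // Encoding.GetEncoding throws ArgumentException for unknown names.
            return fallback;
        }
    }
}
```

With this in place, a call might look like `PageHelper.GetHtml(url, EncodingResolver.Resolve(PageHelper.GetEncoding(url), Encoding.UTF8))`. Note that this downloads the page twice, once to detect the encoding and once to decode it.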
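When detecting the HTML encoding, you can read only part of the response at a time and return as soon as the encoding is found; there is no need to read the whole body. For example, a program that scrapes Google does not need to add a gzip header to its requests. It is best to set a User-Agent header that imitates a browser when scraping.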
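The partial-read idea can be sketched roughly as follows. This is an illustration under assumptions, not the original author's code: the names `CharsetSniffer`, `DetectCharset` and `CreateRequest`, the 8 KB limit, the 1 KB chunk size and the User-Agent string are all hypothetical choices.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original PageHelper class.
public static class CharsetSniffer
{
    private static readonly Regex CharsetPattern = new Regex(
        @"charset\s*=\s*[""']?(?<charset>[\w-]+)", RegexOptions.IgnoreCase);

    // Reads the stream in 1 KB chunks and returns the declared charset as soon
    // as it is seen, or null if none appears within the first maxBytes bytes.
    public static string DetectCharset(Stream body, int maxBytes = 8192)
    {
        var seen = new StringBuilder();
        var buffer = new byte[1024];
        int total = 0;
        while (total < maxBytes)
        {
            int read = body.Read(buffer, 0, Math.Min(buffer.Length, maxBytes - total));
            if (read == 0) break; // end of stream
            total += read;
            // Charset names are plain ASCII, so ASCII decoding is enough here.
            seen.Append(Encoding.ASCII.GetString(buffer, 0, read));

            Match match = CharsetPattern.Match(seen.ToString());
            // Accept the match only if more text follows it; otherwise the name
            // may have been cut off at the chunk boundary, so keep reading.
            if (match.Success && match.Index + match.Length < seen.Length)
                return match.Groups["charset"].Value;
        }
        // Final check covers a charset that ends exactly where reading stopped.
        Match last = CharsetPattern.Match(seen.ToString());
        return last.Success ? last.Groups["charset"].Value : null;
    }

    // Creates a request that identifies itself with a browser-like User-Agent.
    // The User-Agent value is an example only.
    public static HttpWebRequest CreateRequest(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Timeout = 20000;
        request.UserAgent = "Mozilla/5.0 (Windows NT 6.1) ExampleFetcher/1.0";
        return request;
    }
}
```

The stream passed to `DetectCharset` would typically be the response stream (wrapped in a GZipStream when the response is gzip-compressed). Because reading stops at the first complete match, a page that declares its charset near the top of the head costs only one or two chunks.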