
Asynchronously Fetching HTML Source, Automatic Page-Encoding Detection, and Optimizing an XPath-Based Smart Extraction Engine

2009-05-01 00:19
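Looking back today, it has been almost two years since I first conceived the Xiaoxuanfeng (小旋风) vertical search platform. It started out as a C++-based platform, and I was even writing my own full-text indexing system similar to Lucene. It showed some early results, but I later found that its behavior was unstable and its efficiency was below Lucene's, so I abandoned it and adopted the Lucene core instead.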

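The biggest problem with the C++ platform is that, for an individual or a small team, building a decent-looking user interface is too complicated. I later switched to C#. Apart from the virtual machine (runtime) issue, I think C# is the obvious choice for a small team. I also believe that before long Microsoft will integrate the .NET Framework into the operating system. (Vista probably already includes it. I have not verified this, so please tell me if you know.)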

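To record the details of the development process and my thinking along the way, I am starting this blog: first, to log the problems I have encountered, solved, and still need to solve during development; second, to share some small lessons that I hope will help those who come after.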

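The Xiaoxuanfeng (小旋风) vertical search platform comprises a crawler module, a data extraction module, a data storage module, and a full-text indexing module. It is a complete vertical search engine system.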

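The goal is to create vertical search platform software that anyone can use easily, so that people can build vertical search engines quickly.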

Because HTML is so permissive, some problems that you set out to simplify end up bringing great complexity.

Today I mainly want to share two things:

The first is automatic detection of HTML encoding:

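You may have tried many approaches, as I have, including reading the charset from the HTTP header and analyzing the byte patterns of the stream. But you will find that, for a general-purpose platform, none of these approaches works reliably.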
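To make it concrete why these shortcuts fall short, here is a minimal sketch of two of them: reading the charset from a Content-Type header value, and checking for a Unicode byte order mark. The class and method names are hypothetical and chosen only for illustration. Both checks return null whenever the server sends no charset or the page has no byte order mark, which is common on real pages, and a charset declared in the header may simply be wrong.

```csharp
using System;

// Hypothetical helper for illustration only; not part of the platform code.
public static class NaiveCharsetHints
{
    // Returns the charset named in a Content-Type value such as
    // "text/html; charset=gb2312", or null when none is given.
    public static string FromContentType(string contentType)
    {
        if (string.IsNullOrEmpty(contentType))
        {
            return null;
        }
        foreach (string part in contentType.Split(';'))
        {
            string p = part.Trim();
            if (p.StartsWith("charset=", StringComparison.OrdinalIgnoreCase))
            {
                string value = p.Substring("charset=".Length).Trim().Trim('"');
                return value.Length > 0 ? value : null;
            }
        }
        return null;
    }

    // Returns a charset name when the data starts with a byte order mark, else null.
    // FF FE can also begin a UTF-32 little-endian mark; this sketch ignores that case.
    public static string FromByteOrderMark(byte[] data)
    {
        if (data == null)
        {
            return null;
        }
        if (data.Length >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        {
            return "utf-8";
        }
        if (data.Length >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        {
            return "utf-16BE";
        }
        if (data.Length >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        {
            return "utf-16";
        }
        return null;
    }
}
```

When both hints come back null, the only thing left to go on is the content itself, which is where a statistical detector comes in.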

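After days of experimenting and searching on Baidu, Google, and so on, the answer I arrived at is that there is currently no method guaranteed to be error-free, but there is one solution that largely solves the problem: the encoding detection module used by Mozilla. I found its .NET port: NUniversalCharDet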

using Mozilla.NUniversalCharDet;

 // Returns the charset name detected for the given bytes,
 // or "utf-8" when the detector reaches no conclusion.
 public static string DetectEncoding_Bytes(byte[] DetectBuff)
        {
            UniversalDetector Det = new UniversalDetector(null);
            // Feed the whole buffer to the detector in one pass
            Det.HandleData(DetectBuff, 0, DetectBuff.Length);
            Det.DataEnd();

            string charset = Det.GetDetectedCharset();
            return charset != null ? charset : "utf-8";
        }

 

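That is all the code it takes. The library is open source; if you are interested, you can look at how it works internally, which is fairly complex.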
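One practical detail when using the result: the detector returns a charset name as a string, and Encoding.GetEncoding throws an ArgumentException when the runtime does not know that name. The following is a minimal sketch of a defensive lookup, assuming UTF-8 is an acceptable fallback. The class and method names are hypothetical, not part of NUniversalCharDet.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration only.
public static class CharsetNames
{
    // Maps a detected charset name to an Encoding, falling back to UTF-8
    // when the name is empty or not supported by the runtime.
    public static Encoding Resolve(string charsetName)
    {
        if (string.IsNullOrEmpty(charsetName))
        {
            return Encoding.UTF8;
        }
        try
        {
            return Encoding.GetEncoding(charsetName);
        }
        catch (ArgumentException)
        {
            // Unknown or unsupported charset name
            return Encoding.UTF8;
        }
    }
}
```

A caller could then write CharsetNames.Resolve(DetectEncoding_Bytes(bytes)).GetString(bytes) without worrying about an unsupported name aborting the fetch.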

 

The other topic is fetching HTML asynchronously. I'll go straight to the code:

using System;
using System.Collections.Generic;
using System.Text;
using System.Collections;
using System.Net;
using System.IO;
using System.Threading;
using System.Diagnostics;

namespace eLive.Common
{
    public class RequestState
    {
        // Holds the state of one request
        const int BUFFER_SIZE = 1024;
        public StringBuilder requestData;
        public byte[] BufferRead;
        public HttpWebRequest request;
        public HttpWebResponse response;
        public Stream streamResponse;
        public RequestState()
        {
            BufferRead = new byte[BUFFER_SIZE];
            requestData = new StringBuilder("");
            request = null;
            streamResponse = null;
        }
    }

    public interface IAsyncHttpSink
    {
        void OnReadComplete(string strHtml);
        void OnReadTimeOut();
        void OnReadError();
    }

    public class AsynHttpRequest
    {
        public AsynHttpRequest(IAsyncHttpSink pSink)
        {
            _pSink = pSink;
            _strEncoding = "";
        }

        public void StartRequest(string strUrl)
        {
            try
            {
                _strEncoding = "";
                HttpWebRequest myHttpWebRequest = (HttpWebRequest)WebRequest.Create(strUrl);

                RequestState myRequestState = new RequestState();
                myRequestState.request = myHttpWebRequest;

                IAsyncResult result =
                    (IAsyncResult)myHttpWebRequest.BeginGetResponse(new AsyncCallback(RespCallback), myRequestState);

                // Handle request timeout
                ThreadPool.RegisterWaitForSingleObject(result.AsyncWaitHandle, new WaitOrTimerCallback(TimeoutCallback), myHttpWebRequest, DefaultTimeout, true);

            }
            catch (WebException e)
            {
                InfoTransferHandler.Instance.ShowMsgBox(e.Message);
                // Let the sink know the request could not be started
                if (_pSink != null)
                {
                    _pSink.OnReadError();
                }
            }
        }

        // On timeout, abort the request
        private void TimeoutCallback(object state, bool bTimeOut)
        {
            if (bTimeOut)
            {
                HttpWebRequest request = state as HttpWebRequest;
                if (request != null)
                {
                    request.Abort();
                }

                if(_pSink != null)
                {
                    _pSink.OnReadTimeOut();
                }
            }
        }

        private void RespCallback(IAsyncResult asynchronousResult)
        {
            try
            {
                // Recover the request state passed to BeginGetResponse
                RequestState myRequestState = (RequestState)asynchronousResult.AsyncState;
                HttpWebRequest myHttpWebRequest = myRequestState.request;
                myRequestState.response = (HttpWebResponse)myHttpWebRequest.EndGetResponse(asynchronousResult);

                // Get the response stream
                Stream responseStream = myRequestState.response.GetResponseStream();
                myRequestState.streamResponse = responseStream;

                // Start reading the HTML source asynchronously
                responseStream.BeginRead(myRequestState.BufferRead, 0, BUFFER_SIZE, new AsyncCallback(ReadCallBack), myRequestState);
            }
            catch (WebException e)
            {
                InfoTransferHandler.Instance.ShowMsgBox("Asynchronous read failed in RespCallback: " + e.Message);
                // An aborted request (timeout) is already reported by TimeoutCallback
                if (_pSink != null && e.Status != WebExceptionStatus.RequestCanceled)
                {
                    _pSink.OnReadError();
                }
            }
        }

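        private void ReadCallBack(IAsyncResult asyncResult)
        {
            RequestState myRequestState = (RequestState)asyncResult.AsyncState;
            Stream responseStream = myRequestState.streamResponse;
            try
            {
                // EndRead can throw (for example after Abort), so keep it inside the try block
                int read = responseStream.EndRead(asyncResult);
                if (read > 0)
                {
                    if (_strEncoding == "")
                    {
                        // Detect the encoding from the bytes actually read.
                        // Note: only the first chunk is examined.
                        byte[] firstChunk = new byte[read];
                        Array.Copy(myRequestState.BufferRead, firstChunk, read);
                        string detected = UtilCoding.DetectEncoding_Bytes(firstChunk);
                        _decoder = Encoding.GetEncoding(detected).GetDecoder();
                        _strEncoding = detected;
                    }

                    // Decode only the bytes read; the decoder carries partial
                    // multi-byte characters over to the next chunk
                    char[] chars = new char[_decoder.GetCharCount(myRequestState.BufferRead, 0, read)];
                    int charCount = _decoder.GetChars(myRequestState.BufferRead, 0, read, chars, 0);
                    myRequestState.requestData.Append(chars, 0, charCount);
                    responseStream.BeginRead(myRequestState.BufferRead, 0, BUFFER_SIZE, new AsyncCallback(ReadCallBack), myRequestState);
                    return;
                }

                // Reading finished
                responseStream.Close();
                if (_pSink != null)
                {
                    _pSink.OnReadComplete(myRequestState.requestData.ToString());
                }
            }
            catch (Exception e)
            {
                InfoTransferHandler.Instance.ShowCommonMsg("Asynchronous read failed in ReadCallBack: " + e.Message);
                responseStream.Close();
                // Hand over whatever was read before the failure
                if (_pSink != null)
                {
                    _pSink.OnReadComplete(myRequestState.requestData.ToString());
                }
            }
        }

        private IAsyncHttpSink _pSink;
        const int BUFFER_SIZE = 1024;
        const int DefaultTimeout = 2 * 60 * 1000; // 2-minute timeout
        private string _strEncoding;
        private Decoder _decoder; // keeps decoding state across chunks
    }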

   
}
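One limitation of detecting the encoding inside the read callback is that the detector only sees the first 1024-byte chunk, which on many pages is mostly ASCII markup. A possible alternative, sketched below under my own assumptions, is to collect the raw bytes and run detection and decoding once over the whole page. RawPageBuffer is a hypothetical name, and the detectCharset delegate stands in for a detector such as DetectEncoding_Bytes.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper for illustration only: collects raw response bytes so
// that charset detection and decoding happen once, over the whole page.
public class RawPageBuffer
{
    private readonly MemoryStream _bytes = new MemoryStream();

    // Call from the read callback with the shared buffer and the count returned by EndRead.
    public void Append(byte[] buffer, int count)
    {
        _bytes.Write(buffer, 0, count);
    }

    // detectCharset receives every byte collected so far and returns a charset name.
    // Encoding.GetEncoding throws ArgumentException if the runtime does not know the name.
    public string Decode(Func<byte[], string> detectCharset)
    {
        byte[] all = _bytes.ToArray();
        Encoding encoding = Encoding.GetEncoding(detectCharset(all));
        return encoding.GetString(all);
    }
}
```

The trade-off is memory: the whole page is held as bytes before it is turned into a string, which is usually acceptable for HTML documents but worth keeping in mind for a crawler that fetches many pages in parallel.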