Extracting Structured Data from Web Pages
2008-05-14 11:33
Keywords: Automatic Data Extraction

Many web sites contain large sets of pages generated using a common template or layout. For example, Amazon lays out the author, title, comments, etc. in the same way in all its book pages. The values used to generate the pages (e.g., the author, title, ...) typically come from a database. In this paper, we study the problem of automatically extracting the database values from the web pages without any learning examples or other similar human input. We formally define the notion of a template, and propose a model that describes how values are encoded into pages using a template. We present an extraction algorithm that constructs the template from sets of words that have similar occurrence patterns across the input pages. The constructed template is then used to extract values from the pages. We show experimentally that the extracted values make semantic sense in most cases.

For more information, please visit our website: http://www.knowlesys.com
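The abstract's core idea, that words occurring in the same pattern across pages are likely template text, can be illustrated with a rough sketch. This is not the paper's algorithm: it assumes "similar occurrence pattern" means "the same count on every page", and the names `tokenize`, `occurrence_classes`, `extract_values`, and the three sample pages are all hypothetical.

```python
import re
from collections import Counter, defaultdict


def tokenize(page):
    # Crude tokenizer (an assumption): HTML tags and whitespace-separated words.
    return re.findall(r"<[^>]+>|[^\s<>]+", page)


def occurrence_classes(pages, min_size=2):
    """Group tokens whose per-page occurrence counts are identical."""
    counts = [Counter(tokenize(p)) for p in pages]
    vocabulary = set().union(*counts)
    groups = defaultdict(list)
    for token in vocabulary:
        vector = tuple(c[token] for c in counts)
        if min(vector) > 0:  # keep only tokens present in every page
            groups[vector].append(token)
    # Small groups are more likely to be coincidences than template text.
    return {v: sorted(t) for v, t in groups.items() if len(t) >= min_size}


def extract_values(page, template_tokens):
    """Return runs of non-template tokens, the candidate data values."""
    values, current = [], []
    for token in tokenize(page):
        if token in template_tokens:
            if current:
                values.append(" ".join(current))
                current = []
        else:
            current.append(token)
    if current:
        values.append(" ".join(current))
    return values


# Hypothetical pages generated from one template with different values.
pages = [
    "<html><b>Title:</b> Dune <b>Author:</b> Frank Herbert</html>",
    "<html><b>Title:</b> Emma <b>Author:</b> Jane Austen</html>",
    "<html><b>Title:</b> Ulysses <b>Author:</b> James Joyce</html>",
]

classes = occurrence_classes(pages)
template = set().union(*classes.values())
rows = [extract_values(p, template) for p in pages]

for row in rows:
    print(row)
```

On these toy pages the tags and the labels `Title:` and `Author:` end up in the template, and each page yields one title and one author. Real pages would need far more care: a data word that happens to appear on every page would be mistaken for template text, and the same token can play different roles in different positions.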