Monday, April 19, 2004

Learning Page-Independent Heuristics for Extracting Data from Web Pages
Abstract:

One bottleneck in implementing a system that intelligently queries the Web is developing "wrappers": programs that extract data from Web pages. Here we describe a method for learning general, page-independent heuristics for extracting data from HTML documents. The input to our learning system is a set of working wrapper programs, paired with HTML pages they correctly wrap. The output is a general procedure for extracting data that works for many formats and many pages. In experiments with a collection of 84 constrained but realistic extraction problems, we demonstrate that 30% of the problems can be handled perfectly by learned extraction heuristics, and around 50% can be handled acceptably. We also demonstrate that learned page-independent extraction heuristics can substantially improve the performance of methods for learning page-specific wrappers.
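To make "page-independent heuristic" concrete, here is a minimal sketch in Python. It is not the learned procedure the abstract describes: it is a single hand-written rule, assumed purely for illustration, that returns the text found under whichever tag path repeats most often on a page. The names TagPathCollector and extract_repeated_items are hypothetical. The point is only that one rule, written with no knowledge of a specific page, can be applied unchanged to pages with different layouts.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Tags that never take a closing tag; they are kept out of the path stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TagPathCollector(HTMLParser):
    """Record each non-empty text fragment together with its tag path."""

    def __init__(self):
        super().__init__()
        self.stack = []                    # currently open tags
        self.by_path = defaultdict(list)   # tag path -> text fragments

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the innermost matching open tag, tolerating unclosed tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.by_path["/".join(self.stack)].append(text)


def extract_repeated_items(html):
    """Illustrative heuristic (an assumption, not the paper's learned rule):
    return the text fragments under the most frequently occurring tag path."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    if not collector.by_path:
        return []
    return max(collector.by_path.values(), key=len)


# The same function handles a list-formatted page and a table-formatted page.
list_page = "<html><body><h1>Staff</h1><ul><li>Ann</li><li>Bob</li><li>Cy</li></ul></body></html>"
table_page = "<table><tr><td>X</td></tr><tr><td>Y</td></tr></table><p>footer</p>"
print(extract_repeated_items(list_page))
print(extract_repeated_items(table_page))
```

A fixed rule like this fails on any page whose data is not the most repeated structure, which is one reason to learn such heuristics from existing wrapper and page pairs rather than write them by hand.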


Keywords: information integration, machine learning, extraction.
