I have extensive experience in the semi-automatic and automatic extraction of data from documents, including web sites and print-oriented documents such as PDFs.
Systems for scraping data from web sites typically use the structure of the underlying HTML code to locate the instances of the data to be extracted. For PDF documents, this task is much more difficult, as the file structure does not represent the document’s logical structure. Text is positioned using physical operators, and (unless the PDF is suitably tagged) there is no machine-readable information about the tabular structure in the document.
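To illustrate the HTML case, here is a minimal sketch that reads table cells straight from the markup using Python's standard `html.parser` module. The class name `TableCellParser` and the sample markup are made up for illustration; real scrapers are usually more robust, and nothing comparable is possible for an untagged PDF, where no such tags exist.

```python
from html.parser import HTMLParser


class TableCellParser(HTMLParser):
    """Collects the text of <td>/<th> cells, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # list of rows, each a list of cell strings
        self._in_cell = False   # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        # Text only counts when it appears inside a cell.
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


# Hypothetical sample markup: the row and cell boundaries are explicit.
html = (
    "<table>"
    "<tr><th>Item</th><th>Price</th></tr>"
    "<tr><td>Apple</td><td>1.20</td></tr>"
    "</table>"
)
parser = TableCellParser()
parser.feed(html)
print(parser.rows)  # [['Item', 'Price'], ['Apple', '1.20']]
```

The logical structure comes for free here because the tags encode it. In a PDF the same table would arrive as positioned text fragments with no row or cell markers.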
Detecting tables in PDF documents is a complex recognition task that requires reasoning over the visual cues, such as alignment and ruling lines, that convey the tabular structure to a human reader. Expert-system and machine-learning approaches can be used to perform this reasoning but, as with all AI approaches, no algorithm is guaranteed to give perfect results.
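As a rough illustration of how one visual cue can be exploited, the following sketch groups text fragments by the alignment of their left edges to find candidate columns. It is a deliberately simple heuristic and not the method used in the systems described here. The `Fragment` type, the `tolerance` and `min_support` parameters and the sample coordinates are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    """A piece of text with its bounding box in page coordinates."""
    text: str
    x0: float  # left edge
    y0: float  # top edge
    x1: float  # right edge
    y1: float  # bottom edge


def left_aligned_columns(fragments, tolerance=2.0, min_support=2):
    """Group fragments whose left edges are nearly aligned.

    Fragments are sorted by left edge; a fragment joins the current group
    if its left edge is within `tolerance` of the previous fragment's.
    Groups with fewer than `min_support` members are discarded.
    """
    groups = []
    for frag in sorted(fragments, key=lambda f: f.x0):
        if groups and frag.x0 - groups[-1][-1].x0 <= tolerance:
            groups[-1].append(frag)
        else:
            groups.append([frag])
    return [g for g in groups if len(g) >= min_support]


# Hypothetical fragments: a two-column table plus one unrelated note.
fragments = [
    Fragment("Item", 50.0, 100, 80, 110),
    Fragment("Price", 200.0, 100, 230, 110),
    Fragment("Apple", 50.5, 120, 85, 130),
    Fragment("1.20", 201.0, 120, 225, 130),
    Fragment("Note", 120.0, 300, 150, 310),
]
columns = left_aligned_columns(fragments)
print([[f.text for f in col] for col in columns])
# [['Item', 'Apple'], ['Price', '1.20']]
```

A single cue like this produces many false positives on real pages (for example, ordinary left-aligned paragraphs), which is why it would normally be combined with other evidence such as ruling lines and right or centre alignment.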
The process of table recognition can be split up into three tasks:
- Table location, i.e. locating tabular regions in the document
- Table structure recognition, i.e. detecting the cell structure: rows, columns and spanning cells
- Table interpretation, i.e. determining the access cells (headers) and data cells and the relationships between them
By modelling the results of each of these tasks, it is possible to standardize the data extraction process, and make changes to each of the steps independently. This work led to the ICDAR 2013 Table Competition, in which several systems from academia and industry were compared against each other for the first two of these sub-tasks. The resulting data models and ground-truthed datasets are available.
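One way to picture how the results of the three tasks could be modelled separately is sketched below. This is an illustrative data model only, not the schema used for the competition datasets. All class, field and function names (`TableRegion`, `Cell`, `TableStructure`, `access_cells_for`) are hypothetical, and the interpretation rule is a naive assumption.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class TableRegion:
    """Result of table location: where the table is."""
    page: int
    bbox: Tuple[float, float, float, float]  # x0, y0, x1, y1


@dataclass
class Cell:
    """Result of structure recognition: one cell and its grid span."""
    text: str
    start_row: int
    start_col: int
    end_row: int
    end_col: int
    is_access: bool = False  # filled in by the interpretation step


@dataclass
class TableStructure:
    region: TableRegion
    cells: List[Cell] = field(default_factory=list)

    def shape(self) -> Tuple[int, int]:
        """Number of grid rows and columns covered by the cells."""
        if not self.cells:
            return (0, 0)
        return (max(c.end_row for c in self.cells) + 1,
                max(c.end_col for c in self.cells) + 1)


def access_cells_for(structure: TableStructure, data_cell: Cell) -> List[Cell]:
    """Naive interpretation: access cells above the data cell in an
    overlapping column, or to its left in an overlapping row."""
    found = []
    for c in structure.cells:
        if not c.is_access or c is data_cell:
            continue
        col_overlap = c.start_col <= data_cell.end_col and data_cell.start_col <= c.end_col
        row_overlap = c.start_row <= data_cell.end_row and data_cell.start_row <= c.end_row
        above = c.end_row < data_cell.start_row
        left = c.end_col < data_cell.start_col
        if (col_overlap and above) or (row_overlap and left):
            found.append(c)
    return found


# Hypothetical 3x3 table with a header spanning two columns.
value = Cell("12", 2, 2, 2, 2)
table = TableStructure(
    region=TableRegion(page=1, bbox=(40.0, 90.0, 300.0, 160.0)),
    cells=[
        Cell("Revenue", 0, 1, 0, 2, is_access=True),  # spans columns 1-2
        Cell("Region", 1, 0, 1, 0, is_access=True),
        Cell("2012", 1, 1, 1, 1, is_access=True),
        Cell("2013", 1, 2, 1, 2, is_access=True),
        Cell("Europe", 2, 0, 2, 0, is_access=True),
        Cell("10", 2, 1, 2, 1),
        value,
    ],
)
print(table.shape())                                     # (3, 3)
print([c.text for c in access_cells_for(table, value)])  # ['Revenue', '2013', 'Europe']
```

Keeping the region, the cell grid and the access/data labelling in separate layers means that a different location or structure algorithm can be swapped in without touching the later steps, provided it produces the same intermediate representation.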
Irregular structures
For print-oriented documents with a less regular, but repetitive structure, I developed the GraphWrap system, which uses graph-based extraction patterns as part of a semi-automatic (i.e. the user provides examples) approach to locate the data.
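The general idea of a graph-based extraction pattern can be sketched as follows. This is not GraphWrap's actual algorithm, only a small backtracking matcher written for illustration. The document is assumed to be a graph of text fragments connected by labelled spatial relations, and the pattern is written by hand here rather than derived from user-provided examples. The function `match_pattern`, the relation names and the sample data are all hypothetical.

```python
def match_pattern(doc_nodes, doc_edges, pat_nodes, pat_edges):
    """Find every way of mapping pattern nodes onto document nodes.

    doc_nodes: dict of node id -> text
    doc_edges: iterable of (source id, relation, target id)
    pat_nodes: dict of pattern node name -> predicate on text
    pat_edges: iterable of (pattern name, relation, pattern name)
    """
    edge_set = set(doc_edges)
    pat_ids = list(pat_nodes)
    results = []

    def extend(assignment):
        if len(assignment) == len(pat_ids):
            results.append(dict(assignment))
            return
        pid = pat_ids[len(assignment)]
        for did, text in doc_nodes.items():
            # Each document node is used at most once per match.
            if did in assignment.values() or not pat_nodes[pid](text):
                continue
            assignment[pid] = did
            # Check every pattern edge whose endpoints are both assigned.
            consistent = all(
                (assignment[a], rel, assignment[b]) in edge_set
                for a, rel, b in pat_edges
                if a in assignment and b in assignment
            )
            if consistent:
                extend(assignment)
            del assignment[pid]

    extend({})
    return results


def is_number(text):
    try:
        float(text)
        return True
    except ValueError:
        return False


# Hypothetical document graph: two repeated "Price:" records.
doc_nodes = {"n1": "Price:", "n2": "9.99", "n3": "Price:", "n4": "14.50"}
doc_edges = [
    ("n1", "left_of", "n2"),
    ("n3", "left_of", "n4"),
    ("n1", "above", "n3"),
]

# Pattern: a "Price:" label with a number directly to its right.
pat_nodes = {"label": lambda t: t == "Price:", "value": is_number}
pat_edges = [("label", "left_of", "value")]

matches = match_pattern(doc_nodes, doc_edges, pat_nodes, pat_edges)
print([doc_nodes[m["value"]] for m in matches])  # ['9.99', '14.50']
```

Because the pattern constrains relations between fragments rather than absolute positions, the same pattern matches each repetition of the record wherever it appears on the page.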
References
- Göbel, M., Hassan, T., Oro, E., Orsi, G.: ICDAR 2013 Table Competition, ICDAR 2013, Washington, DC.
- Göbel, M., Hassan, T., Oro, E., Orsi, G.: A Methodology for Evaluating Algorithms for Table Understanding in PDF Documents, DocEng 2012, Paris.
- Hassan, T.: Towards a Common Evaluation Strategy for Table Structure Recognition Algorithms, DocEng 2010, Manchester.
- Hassan, T.: GraphWrap: A System for Interactive Wrapping of PDF Documents Using Graph Matching Techniques, DocEng 2009, Munich.
- Hassan, T.: User-Guided Wrapping of PDF Documents using Graph Matching Techniques, ICDAR 2009, Barcelona.
- Hassan, T., Baumgartner, R.: Table Recognition and Understanding from PDF Files, ICDAR 2007, Curitiba, Brazil.
- Carme, J., Ceresna, M., Frölich, O., Gottlob, G., Hassan, T., Herzog, M., Holzinger, W., Krüpl, B.: The Lixto Project: Exploring New Frontiers of Web Data Extraction, BNCOD 2006, Belfast.
- Hassan, T., Baumgartner, R.: Using Graph Matching Techniques to Wrap Data from PDF Documents, WWW 2006 (Poster track), Edinburgh. You can find the poster here (PDF).