Tabular Data Extraction

I have extensive experience in semi-automatic and automatic extraction of data from documents, including print-oriented documents such as PDFs and web sites.

Systems for scraping data from web sites typically use the structure of the underlying HTML code to locate the instances of the data to be extracted. For PDF documents, this task is much more difficult, as the file structure does not represent the document’s logical structure. Text positioned using physical operators and (unless the PDF is suitably tagged) there is no machine-readable information about the tabular structure in the document.

Detecting these tables is a complex recognition task, which performs reasoning based on various visual cues, such as alignment and ruling lines, which would otherwise be informing a human reader about the tabular structure. Expert system and machine learning approaches can be used to perform this reasoning, but as with all AI approaches, no algorithm is guaranteed to give perfect results.

The process of table recognition can be split up into three tasks:

  • Table location, i.e. locating tabular regions in the document
  • Table structure recognition, i.e. detecting the cell structure; rows, columns and spanning cells
  • Table interpretation: determining the access and data cells and the relationships between them

By modelling the results of each of these tasks, it is possible to standardize the data extraction process, and make changes to each of the steps independently. This work led to the ICDAR 2013 Table Competition, in which several systems from academia and industry were compared against each other for the first two of these sub-tasks. The resulting data models and ground-truthed datasets are available.


Irregular structures

Example of irregular tabular structure

For print-oriented documents with a less regular, but repetitive structure, I developed the GraphWrap system, which uses graph-based extraction patterns as part of a semi-automatic (i.e. the user provides examples) approach to locate the data.