Table Recognition Dataset Format

This page is archived from the competition website as of 2013. The competition datasets, complete with ground truth, are available here.

This page describes the format to which entries for both sub-competitions, table location and table structure recognition, must adhere, and how each entry will be numerically compared to the ground truth. This format is also described in [1]. Entrants can choose to take part in either or both sub-competitions. The format of the ground truth in the example dataset is a special case of this format. Please note that the example dataset also includes a functional model; however, functional analysis does not form part of the present competition.

We have made some tools for automatic visualization and comparison to ground truth available. We hope that this will make it easier for participants to prepare their results for submission in our format.

Table location sub-competition: Region model

Table regions are defined as rectangular areas of a given page by their coordinates. Since a table can span more than one page, several regions can belong to the same table. For example, the ground truth file for a document with a table spanning from the first to the second page may look as follows:

<?xml version="1.0" encoding="UTF-8"?>
<document filename='filename.pdf'>
<table id='0'>
<region id='0' page='1'>
<instruction instr-id='83'/>
<instruction instr-id='90' subinstr-id='0'/>
...
<instruction instr-id='169'/>
<bounding-box x1='87' y1='117' x2='551' y2='220'/>
</region>
<region id='1' page='2'>
<instruction instr-id='202'/>
<instruction instr-id='209' subinstr-id='2'/>
...
<bounding-box x1='87' y1='261' x2='551' y2='364'/>
</region>
</table>
<table id='1'>
...
</table>
...
</document>

For each tabular region that is found, entrants are only required to return its rectangular bounding-box in PDF coordinates. The <instruction> tags, whose purpose is described below, need not be included. Note that the page-numbering is 1-based (all document excerpts in the example begin with page 1). The region and table ID numbering is 0-based; tables within a document and regions within a table can be output in any order.
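As an illustration only, a minimal result file containing just the required elements could be produced along the following lines. This is a sketch using Python's standard xml.etree.ElementTree module; the function name build_region_result, the tuple layout of its input and the output file name result.xml are assumptions made for this example and are not part of the competition tools.

```python
import xml.etree.ElementTree as ET


def build_region_result(filename, tables):
    """Build a region-model <document> element.

    tables: list of tables, each a list of (page, x1, y1, x2, y2) tuples
    (hypothetical input layout chosen for this sketch).
    """
    document = ET.Element("document", filename=filename)
    # Table and region IDs are 0-based, so enumerate() can be used directly.
    for table_id, regions in enumerate(tables):
        table = ET.SubElement(document, "table", id=str(table_id))
        for region_id, (page, x1, y1, x2, y2) in enumerate(regions):
            # Page numbers are 1-based and are passed through unchanged.
            region = ET.SubElement(
                table, "region", id=str(region_id), page=str(page)
            )
            # Only the bounding box is required; <instruction> tags are omitted.
            ET.SubElement(
                region, "bounding-box",
                x1=str(x1), y1=str(y1), x2=str(x2), y2=str(y2),
            )
    return document


# One table spanning pages 1 and 2, using the coordinates from the example above.
doc = build_region_result(
    "filename.pdf",
    [[(1, 87, 117, 551, 220), (2, 87, 261, 551, 364)]],
)
# To write it out (result.xml is a hypothetical file name):
# ET.ElementTree(doc).write("result.xml", encoding="UTF-8", xml_declaration=True)
```

ElementTree writes attribute values in double quotes rather than the single quotes used in the example above; both forms are equivalent in XML.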

The XML schema definition for the region model can be downloaded here.

Comparison to ground truth

All regions in the ground truth dataset are specified by the smallest bounding box that encloses all the character elements* within the region. Lines and other graphic elements are discounted. For the entries, the bounding boxes need not be minimal; our evaluation procedure will simply take all character elements whose centres fall within the specified area. These elements are then compared with the ground truth to calculate the completeness and purity of the result [2].

*A character element is defined as each character drawn by the <TJ> or <Tj> text-drawing operators. Textual content (e.g. in logos) drawn by vector graphics operators or by placing bitmaps is ignored. The coordinates of a character element are defined as follows:

  • x0: starting horizontal position
  • x1: starting horizontal position + character width
  • y0: baseline
  • y1: baseline + font size

Enclosure is determined by whether the character's midpoint falls within the bounding box. Please note that the coordinates defined above might differ slightly from the actual space taken up by the character; as only the centre coordinates are taken into account, this should not cause a problem in practice.
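The enclosure test described above can be sketched as follows. The class name CharElement and the function chars_in_box are hypothetical names introduced for this example, and treating a centre that lies exactly on the box edge as enclosed is an assumption; the text above does not specify how boundary cases are handled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CharElement:
    x0: float  # starting horizontal position
    x1: float  # starting horizontal position + character width
    y0: float  # baseline
    y1: float  # baseline + font size

    def centre(self):
        """Midpoint of the character element."""
        return ((self.x0 + self.x1) / 2.0, (self.y0 + self.y1) / 2.0)


def chars_in_box(chars, box):
    """Return the character elements whose centre falls within box.

    box is (x1, y1, x2, y2) as in the <bounding-box> tag. The corners are
    sorted first so the result does not depend on their order.
    Assumption: points exactly on the edge count as inside.
    """
    bx1, by1, bx2, by2 = box
    left, right = min(bx1, bx2), max(bx1, bx2)
    bottom, top = min(by1, by2), max(by1, by2)
    enclosed = []
    for char in chars:
        cx, cy = char.centre()
        if left <= cx <= right and bottom <= cy <= top:
            enclosed.append(char)
    return enclosed
```

Because only the centre is tested, a character that slightly overhangs the box edge is still counted, which is why the bounding boxes in an entry need not be minimal.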

In order to determine which tables have been detected completely and/or purely, it is necessary to map each GT table to its corresponding result table. In most cases, this mapping is obvious. However, the following special cases can arise:

  • Two regions in GT merged into a single result region (as illustrated above): The GT region with the greater overlap** is allocated to the result region; the other GT region is classified as undetected.
  • One region in GT split into two result regions: The result region with the greater overlap** is allocated to the GT region; the other result region is classified as a false positive.

**Greater overlap is defined as follows: First, for each region, a list of the character elements whose centre coordinates fall within this region is constructed. The pair of regions with the greatest overlap is the one with the largest number of common character elements. Any other character elements in the result region make this region impure and are termed foreign objects. If two pairs of regions have exactly the same number of common character elements, the pairing with the fewest foreign objects is chosen.
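The overlap rule above can be sketched with sets of character element identifiers. All function names here are hypothetical. The definition of purity follows the text above (no foreign objects); the definition of completeness (every GT character element is present in the result region) is an assumption based on the usual reading of [2], not a statement of how the evaluation tool is implemented.

```python
def overlap(gt_chars, result_chars):
    """Return (common, foreign) counts for one GT/result region pair."""
    common = len(gt_chars & result_chars)
    # Foreign objects: elements in the result region that are not in the GT region.
    foreign = len(result_chars - gt_chars)
    return common, foreign


def best_pairing(pairs):
    """Choose the pairing with the greatest overlap.

    pairs maps a label to a (gt_chars, result_chars) tuple of sets.
    Most common elements wins; ties go to the fewest foreign objects.
    """
    def rank(label):
        common, foreign = overlap(*pairs[label])
        return (-common, foreign)

    return min(pairs, key=rank)


def is_pure(gt_chars, result_chars):
    """A result region is pure if it contains no foreign objects."""
    return not (result_chars - gt_chars)


def is_complete(gt_chars, result_chars):
    """Assumption: complete means no GT character element is missing."""
    return gt_chars <= result_chars
```

For the split case, the pairs would be the single GT region combined with each of the two result regions; for the merge case, each of the two GT regions combined with the single result region.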

Please note that for this sub-competition it is irrelevant whether several regions are grouped into a (logical) table or not. Only the <region> and not the <table> tags are used.

Table structure recognition sub-competition: Cell structure model

The aim of the table structure recognition sub-competition is to compare methods for determining the cell structure of tables given correct information about their location. It is therefore permissible to participate only in this sub-competition, and not in the table location sub-competition. We strongly recommend that entrants use manually generated or corrected input regarding the table locations when generating their results, in order to avoid being unnecessarily penalized.

The cell structure of a table is defined as a matrix of cells. Cells are defined by their textual content and their start and end column and row positions. Blank cells are not represented in this format. An example looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<document filename='filename.pdf'>
<table id='0'>
<region id='0' page='3' col-increment='0' row-increment='0'>
<cell id='0' start-row='0' start-col='0'>
<bounding-box x1='70' y1='79' x2='131' y2='91'/>
<content>COUNTRY</content>
<instruction instr-id='65' subinstr-id='0'/>
</cell>
<cell id='1' start-row='0' start-col='1' end-col='2'>
<bounding-box x1='165' y1='79' x2='201' y2='91'/>
<content>3 years</content>
<instruction instr-id='65' subinstr-id='2'/>
</cell>
<cell id='2' start-row='0' start-col='3'>
<bounding-box x1='234' y1='79' x2='271' y2='91'/>
<content>4 years</content>
<instruction instr-id='65' subinstr-id='4'/>
</cell>
...
</region>
...
</table>
...
</document>

In the ground truth for the example dataset, the table numbers correspond to those in the relevant region model files. In this competition, this need not be the case, as entries for the table structure recognition sub-competition will be evaluated independently of the table location competition.

In contrast to the region model, for the cell structure model, entrants are required to return the textual content (<content> tag) for each cell; the tags <bounding-box> and <instruction>, which are present in the ground truth, are not required and will be ignored.

The cell numbering begins at (0,0) for the top-left cell. The attributes end-col and end-row are optional; if they are omitted, the column and/or row span is assumed to be 1. If all cells are shifted, or an entire row or column is returned as spanning the same number of cells, this will not make any difference to the final result, as explained below.
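When reading a cell structure file, the optional end-col and end-row attributes can be defaulted to the start position, which corresponds to a span of 1. The following is a sketch using Python's standard xml.etree.ElementTree module; the function name read_cells, the dictionary keys and the shortened sample region are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Shortened version of the example above, without the optional tags.
SAMPLE_REGION = """
<region id='0' page='3' col-increment='0' row-increment='0'>
  <cell id='0' start-row='0' start-col='0'><content>COUNTRY</content></cell>
  <cell id='1' start-row='0' start-col='1' end-col='2'><content>3 years</content></cell>
  <cell id='2' start-row='0' start-col='3'><content>4 years</content></cell>
</region>
"""


def read_cells(region):
    """Return one dict per <cell>, filling in omitted end positions."""
    cells = []
    for cell in region.findall("cell"):
        start_row = int(cell.get("start-row"))
        start_col = int(cell.get("start-col"))
        cells.append({
            "start_row": start_row,
            "start_col": start_col,
            # An omitted end-row/end-col means the cell spans a single row/column.
            "end_row": int(cell.get("end-row", start_row)),
            "end_col": int(cell.get("end-col", start_col)),
            "content": cell.findtext("content", default=""),
        })
    return cells


cells = read_cells(ET.fromstring(SAMPLE_REGION.strip()))
```

Here the second cell keeps its explicit end-col of 2, while the other cells receive end positions equal to their start positions.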

The XML schema definition for the cell structure model can be downloaded here.

Comparison to ground truth

First, the content will be stripped of all spaces and special characters so that errors in e.g. detecting spacing do not affect the result. For comparing two cell structures, we use a method inspired by Hurst’s proto-links [3]: for each table region we generate a list of adjacency relations between each content cell and its nearest neighbour in horizontal and vertical directions. No adjacency relations are generated between two blank cells or between a blank cell and a content cell.
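One possible way to generate such adjacency relations is sketched below. It is an interpretation for illustration, not the organizers' implementation, and it rests on several assumptions: "special characters" is read as every non-alphanumeric character, blank grid positions are skipped when looking for the nearest neighbour to the right and below, and a spanning cell is checked along each row or column it covers. The function names and the dictionary layout of a cell are hypothetical.

```python
def normalise(text):
    """Strip spaces and special characters (assumed: keep alphanumerics only)."""
    return "".join(ch for ch in text if ch.isalnum())


def adjacency_relations(cells):
    """cells: dicts with start_row, start_col, end_row, end_col and content."""
    # Blank cells are not represented, so only cells with content are kept.
    content_cells = [c for c in cells if normalise(c["content"])]
    grid = {}
    for index, cell in enumerate(content_cells):
        for row in range(cell["start_row"], cell["end_row"] + 1):
            for col in range(cell["start_col"], cell["end_col"] + 1):
                grid[(row, col)] = index
    if not grid:
        return []
    max_row = max(row for row, _ in grid)
    max_col = max(col for _, col in grid)
    relations, seen = [], set()

    def add(index, neighbour, direction):
        # Record each (cell, neighbour, direction) triple only once.
        if (index, neighbour, direction) not in seen:
            seen.add((index, neighbour, direction))
            relations.append((
                normalise(content_cells[index]["content"]),
                normalise(content_cells[neighbour]["content"]),
                direction,
            ))

    for index, cell in enumerate(content_cells):
        # Nearest content cell to the right, for every row the cell covers.
        for row in range(cell["start_row"], cell["end_row"] + 1):
            for col in range(cell["end_col"] + 1, max_col + 1):
                if (row, col) in grid:
                    add(index, grid[(row, col)], "horizontal")
                    break
        # Nearest content cell below, for every column the cell covers.
        for col in range(cell["start_col"], cell["end_col"] + 1):
            for row in range(cell["end_row"] + 1, max_row + 1):
                if (row, col) in grid:
                    add(index, grid[(row, col)], "vertical")
                    break
    return relations
```

Because each relation stores only the two normalised contents and a direction, shifting every cell by the same offset yields the same list.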

Figure: Table structure error example

This 1-D list of adjacency relations is then compared to the ground truth by using precision and recall measures, as illustrated in the figure. If both cells of a relation are identical to those in the ground truth and the direction matches, the relation is marked as correctly retrieved; otherwise it is marked as incorrect. Using neighbourhoods makes the comparison invariant to the absolute position of the table (e.g. if everything is shifted by one cell) and also avoids ambiguities that arise when dealing with different types of errors (merged/split cells, inserted empty column, etc.).
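Given two such lists of relations, precision and recall can be computed as in the following sketch. Each relation is represented as a (content, neighbour content, direction) tuple, which is a representation assumed for this example. Matching the lists as multisets, so that a repeated relation must also be repeated in the other list to count twice, is likewise an assumption; the function name precision_recall is hypothetical.

```python
from collections import Counter


def precision_recall(gt_relations, result_relations):
    """Compare two lists of (content_a, content_b, direction) tuples."""
    gt = Counter(gt_relations)
    result = Counter(result_relations)
    # Multiset intersection: a relation is correct if both cells and the
    # direction match a relation in the ground truth.
    correct = sum((gt & result).values())
    # Empty lists give 0.0 here by convention (an assumption of this sketch).
    precision = correct / sum(result.values()) if result else 0.0
    recall = correct / sum(gt.values()) if gt else 0.0
    return precision, recall
```

For example, if the ground truth contains four relations and a result contains three, two of which match, the precision is 2/3 and the recall is 2/4.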

References:
[1] Göbel, M., Hassan, T., Oro, E., Orsi, G.: A Methodology for Evaluating Algorithms for Table Understanding in PDF Documents, DocEng 2012
[2] Silva, A.C.: Metrics for evaluating performance in document analysis: application to tables, IJDAR 14(1):101-109, 2011
[3] Hurst, M.: A constraint-based approach to table structure derivation, ICDAR 2003

Competition organizers: Max Göbel, Tamir Hassan, Ermelinda Oro and Giorgio Orsi.

If you have any further questions, please feel free to get in touch.