We have created two ground-truthed datasets of natively-digital PDF documents containing tables. These documents have been collected systematically from European Union and US Government websites, and we therefore expect them to have public domain status. Each PDF document is accompanied by three XML (or CSV) files containing its ground truth in the following models:
- table regions (for evaluating table location)
- cell structures (for evaluating table structure recognition)
- functional representation (for evaluating table interpretation)
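As an illustration of how the table-region ground truth might be used to evaluate table location, here is a minimal sketch that scores a detected region against a ground-truth region by intersection-over-union. The `Region` class, its field names and the coordinate convention are assumptions made for this example; they are not the schema of the ground-truth files, and this is not necessarily the measure used by our comparison tools.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Hypothetical table region: a page number plus an axis-aligned box."""
    page: int
    x1: float
    y1: float
    x2: float
    y2: float

    def area(self) -> float:
        # Degenerate boxes (x2 <= x1 or y2 <= y1) count as zero area.
        return max(0.0, self.x2 - self.x1) * max(0.0, self.y2 - self.y1)


def iou(a: Region, b: Region) -> float:
    """Intersection-over-union of two regions; 0.0 if on different pages."""
    if a.page != b.page:
        return 0.0
    width = min(a.x2, b.x2) - max(a.x1, b.x1)
    height = min(a.y2, b.y2) - max(a.y1, b.y1)
    if width <= 0 or height <= 0:
        return 0.0  # the boxes do not overlap
    intersection = width * height
    union = a.area() + b.area() - intersection
    return intersection / union if union > 0 else 0.0


if __name__ == "__main__":
    truth = Region(page=1, x1=0, y1=0, x2=10, y2=10)
    detected = Region(page=1, x1=5, y1=0, x2=15, y2=10)
    print(f"IoU = {iou(truth, detected):.3f}")
```

A detection could then be counted as correct when its overlap with some ground-truth region exceeds a chosen threshold; the threshold is a design choice and is left open here.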
This work was carried out as a collaboration between Giorgio Orsi, Linda Oro, Max Göbel and myself. We currently have over 50 excerpts, taken from larger PDF documents, and are appealing to the document engineering community to help us increase this number to several hundred or more.
We organized the competition on PDF table detection and structure recognition at ICDAR 2013. The datasets here were made available to all participants for practice. The competition dataset included a further collection of EU and US documents, and has now been made available with ground truth. However, it has no ground truth for the functional representation, as the competition covered only table location and cell structure recognition.
The datasets can be downloaded from the Downloads page.
Tools for comparing an algorithm’s results against the ground truth, as well as a beta tool to aid ground-truth generation, are available here.
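To give a rough idea of what such a comparison can look like for cell structure, the sketch below derives neighbour relations between cells from a simple grid and reports precision and recall against the ground truth. It is an illustrative assumption, not a description of the tools themselves: the grid-of-strings input, the function names and the handling of cells are all hypothetical, and spanning or empty cells are ignored for brevity.

```python
def adjacency_relations(grid):
    """Return the set of (cell, neighbour, direction) relations in a grid.

    `grid` is a list of rows, each a list of cell strings (hypothetical
    input format). Only right-hand and lower neighbours are recorded.
    """
    relations = set()
    for r, row in enumerate(grid):
        for c, cell in enumerate(row):
            if c + 1 < len(row):
                relations.add((cell, row[c + 1], "horizontal"))
            if r + 1 < len(grid) and c < len(grid[r + 1]):
                relations.add((cell, grid[r + 1][c], "vertical"))
    return relations


def precision_recall(predicted, truth):
    """Precision and recall of predicted relations against the ground truth."""
    correct = len(predicted & truth)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(truth) if truth else 0.0
    return precision, recall


if __name__ == "__main__":
    truth_grid = [["a", "b"], ["c", "d"]]
    # A result that wrongly merged the two cells of the second row.
    predicted_grid = [["a", "b"], ["c d"]]
    p, r = precision_recall(
        adjacency_relations(predicted_grid),
        adjacency_relations(truth_grid),
    )
    print(f"precision = {p:.2f}, recall = {r:.2f}")
```

Comparing relations rather than absolute row and column indices means that a single extra or missing row does not invalidate every cell below it.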
Please contact me if you would like to join our collaborative effort in improving this dataset.