On my GitHub page you will find the following open-source projects:
- pint-publisher: Formatting engine for creating editable PDFs and for piloting the structured workflow for the DocEng 2018 Workshop Proceedings
- dataset-tools: Command-line tools for comparing PDF table recognition results with the respective ground-truth files
- pdfxtk: PDF Extraction Toolkit based on the PDFBox library (not currently being developed)
The ground-truthed datasets of PDF tables can be downloaded here:
- Practice dataset of EU documents
- Practice dataset of US government documents
- Competition dataset of EU and US documents
And finally, you can find my doctoral thesis here:
- User-Guided Information Extraction from Print-Oriented Documents, TU Wien, May 2010