On my GitHub page you will find the following open-source projects:

  • pint-publisher: Formatting engine for creating editable PDFs, used to pilot the structured workflow for the DocEng 2018 Workshop Proceedings
  • dataset-tools: Command-line tools for comparing PDF table recognition results with the respective ground-truth files
  • pdfxtk: PDF Extraction Toolkit based on the PDFBox library (not currently being developed)

The ground-truthed datasets of PDF tables can be downloaded here:

And finally, you can find my doctoral thesis here: