Automated Publishing

Whereas data extraction typically requires that structures be detected in a document, automated publishing is the reverse transformation: content that starts in a structured form, such as HTML or XML, needs to be transformed into a visually appealing web- or print-oriented document that is also free of design errors. In its simplest form, this transformation is described by a style sheet, which is a codified form of the design rules adopted by the publisher. In more complex cases, the design rules themselves leave a significant amount of freedom, and the placement of content such as articles and figures, as well as column and page breaks for paged media, needs to be chosen carefully in order to achieve a professional, aesthetically pleasing result. This process can be fully automated by casting it as an optimization problem.
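
To make the optimization view concrete, here is a minimal sketch of page breaking as a dynamic program. It is only an illustration under simplifying assumptions, not the algorithm used in any of the systems mentioned on this page: content is a sequence of unbreakable blocks with known heights, every page has the same height, and the cost of a layout is the sum of squared leftover space on each page except the last. The function name `optimal_page_breaks` and the cost function are hypothetical choices made for this example.

```python
from typing import List, Tuple


def optimal_page_breaks(
    heights: List[float], page_height: float
) -> Tuple[float, List[List[int]]]:
    """Split blocks into pages, minimizing total squared leftover space.

    Illustrative assumptions: blocks are unbreakable and stay in order,
    all pages share one height, and the last page is not penalized.
    Returns the total cost and, per page, the indices of its blocks.
    """
    n = len(heights)
    inf = float("inf")
    # best[i] is the minimal cost of laying out blocks i..n-1.
    best = [inf] * (n + 1)
    # next_break[i] is the index of the first block on the following page.
    next_break = [n] * (n + 1)
    best[n] = 0.0

    for i in range(n - 1, -1, -1):
        used = 0.0
        for j in range(i, n):
            used += heights[j]
            if used > page_height:
                break  # blocks i..j no longer fit on one page
            slack = page_height - used
            page_cost = 0.0 if j == n - 1 else slack ** 2
            total = page_cost + best[j + 1]
            if total < best[i]:
                best[i] = total
                next_break[i] = j + 1

    if best[0] == inf:
        raise ValueError("at least one block is taller than the page")

    # Follow the recorded break points to recover the pages.
    pages = []
    i = 0
    while i < n:
        j = next_break[i]
        pages.append(list(range(i, j)))
        i = j
    return best[0], pages


cost, pages = optimal_page_breaks([5, 2, 2, 6, 6], page_height=10)
print(cost, pages)
```

For these heights, a greedy first-fit strategy would fill the first page with three blocks and leave the second page nearly half empty. The dynamic program instead considers all feasible break points and chooses a more balanced layout. A realistic cost function would add further terms, for example penalties for widows and orphans or for figures placed far from their references.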

During my time at HP, I developed algorithms for automatic layout optimization and micro-typography, which were implemented both in Java and using the Web stack (HTML, CSS, JS). I have also worked on automated publishing and PDF generation using PDFBox (see the Pint formatting engine) and other formatting tools such as LaTeX.

Currently, I am involved in a pilot project in collaboration with DocEng 2018 to trial a new, XML-based workflow for the Workshop Proceedings, with the aim of making recommendations for the main conference proceedings.

When logical structure recognition and automated publishing techniques are combined, it is possible to “round-trip” documents and make edits to documents in formats such as PDF, which are not normally editable. My work on the Editable PDF Initiative aims to take the guesswork out of editing PDFs by embedding this structure at the point of creation and providing a framework and specification for making repeated edits to the document, without needing to perform the (error-prone) recognition step every time.