How can I extract all PDF Tags related to content with Python?

Stack Overflow Asked by Martin Thoma on December 11, 2020

I’ve indirectly read about tagged PDFs in "TabbyPDF: Web-Based System for PDF Table Extraction", where it sounds as if one could get semantic information about a PDF's content: not only author / title / number of pages, but maybe something like sections or where a title is.

Is this possible?

Here are some example PDFs to show, in case it is:

What I’ve tried

I might be going in completely the wrong direction, but the only kind of information I get is metadata about the document, not about its content or the content's structure. I was hoping for something like semantic HTML elements, so that I would know that there are two sections, one table, and three paragraphs. Maybe even that the table has a caption, 42 rows, and 123 columns.

PyPDF2

from PyPDF2 import PdfFileReader


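def get_info(path):
    """Return the Document Information Dictionary plus the page count."""
    # Written against PyPDF2 1.x. PyPDF2 3.x removed these names in favour of
    # PdfReader, reader.metadata and len(reader.pages).
    with open(path, "rb") as f:
        pdf = PdfFileReader(f)
        info = pdf.getDocumentInfo()
        nb_pages = pdf.getNumPages()
    # getDocumentInfo() returns None when the PDF has no /Info dictionary.
    info = dict(info or {})
    info['nb_pages'] = nb_pages
    return info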


if __name__ == "__main__":
    path = "PDF-export-example.pdf"
    info = get_info(path)
    for key, value in sorted(info.items()):
        print(f"{key:<15}: {value}")

Lorem Ipsum Table Test:

/Author        : Martin Thoma
/CreationDate  : D:20200730020133-07'00'
/Creator       : Microsoft Word
/ModDate       : D:20200730020133-07'00'
nb_pages       : 1

Camelot Edge TOL:

/Producer      : PyPDF2
nb_pages       : 1 

pdfminer

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument


def get_info(path):
    """Return pdfminer's list of info dictionaries (values are raw bytes)."""
    with open(path, "rb") as f:
        parser = PDFParser(f)
        doc = PDFDocument(parser)
    return doc.info


if __name__ == "__main__":
    path = "edge_tol.pdf"
    info = get_info(path)
    for el in info:
        for key, value in el.items():
            print(f"{key:<15}: {value}")

Lorem Ipsum Table Test:

Author         : b'Martin Thoma'
Creator        : b'Microsoft Word'
CreationDate   : b"D:20200730020133-07'00'"
ModDate        : b"D:20200730020133-07'00'"

Camelot Edge TOL:

Producer       : b'PyPDF2'

2 Answers

I don't know the tools you mention, but I can answer the theory behind this and that might point you in the correct direction.

What you're doing gets you metadata, and only a small portion of it: more precisely, the part that comes out of the Document Information Dictionary in the PDF. While this still contains some information, it has largely been superseded by XMP metadata (basically "simple" XML) embedded in the PDF. However, this too is irrelevant when looking for structure information.
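To make the XMP part concrete: the packet is often stored as plain, uncompressed XML, so a raw byte search can find it in many files. Below is a minimal sketch using only the standard library, assuming the packet is neither compressed nor encrypted. The names `find_xmp_packet` and `dublin_core_values` and the `SAMPLE` bytes are made up for illustration; they are not part of PyPDF2 or pdfminer.

```python
import re
import xml.etree.ElementTree as ET

# An XMP packet sits between <?xpacket begin=...?> and <?xpacket end=...?>.
XMP_PACKET = re.compile(
    rb"<\?xpacket begin=.*?\?>(.*?)<\?xpacket end=.*?\?>", re.DOTALL
)

# XML namespaces used by Dublin Core properties inside XMP.
DC = "{http://purl.org/dc/elements/1.1/}"
RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"


def find_xmp_packet(pdf_bytes):
    """Return the body of the first XMP packet in the raw bytes, or None."""
    match = XMP_PACKET.search(pdf_bytes)
    return match.group(1) if match else None


def dublin_core_values(xmp_xml, field):
    """Collect the text of every rdf:li entry under a dc:<field> element."""
    root = ET.fromstring(xmp_xml)
    values = []
    for element in root.iter(DC + field):
        values.extend(li.text for li in element.iter(RDF + "li") if li.text)
    return values


# Hypothetical, hand-written fragment standing in for a real PDF file.
SAMPLE = (
    b"%PDF-1.7\n1 0 obj\n<< /Type /Metadata /Subtype /XML >>\nstream\n"
    b'<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>'
    b'<x:xmpmeta xmlns:x="adobe:ns:meta/">'
    b'<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">'
    b'<rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/">'
    b"<dc:creator><rdf:Seq><rdf:li>Martin Thoma</rdf:li></rdf:Seq></dc:creator>"
    b"</rdf:Description></rdf:RDF></x:xmpmeta>"
    b'<?xpacket end="w"?>\nendstream\nendobj\n'
)

if __name__ == "__main__":
    packet = find_xmp_packet(SAMPLE)
    print(dublin_core_values(packet, "creator"))
```

Like the Document Information Dictionary, this only yields document-level metadata; it says nothing about sections, tables or paragraphs.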

First of all, PDF files don't have to contain structure information as you describe it. It's an optional feature, and the vast majority of PDF documents lack it. The use of structure in PDF is only mandated in certain cases:

  • When the PDF is compliant with the ISO standard for long-term archival (PDF/A), and then only if it claims one of the more stringent conformance levels of this standard (PDF/A-1a, PDF/A-2a or PDF/A-3a).
  • When the PDF is compliant with the ISO standard for universal accessibility (PDF/UA).
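Because structure is optional, it can be worth checking whether a file even claims to be tagged before digging further. A tagged PDF has a /StructTreeRoot entry in its document catalog and a /MarkInfo dictionary with /Marked set to true. The following is a rough heuristic sketch that scans the raw bytes for those entries. It assumes the catalog is stored uncompressed with /MarkInfo written inline, which is not true of every file (the catalog may live in a compressed object stream), so a negative result is not conclusive. `looks_tagged` and the two sample byte strings are hypothetical.

```python
import re

# /MarkInfo << ... /Marked true ... >> written inline in the catalog.
MARKED_TRUE = re.compile(rb"/MarkInfo\s*<<[^>]*/Marked\s+true")
# /StructTreeRoot followed by an indirect reference such as "5 0 R".
STRUCT_ROOT = re.compile(rb"/StructTreeRoot\s+\d+\s+\d+\s+R")


def looks_tagged(pdf_bytes):
    """Heuristic: report which tagging markers appear in the raw bytes."""
    return {
        "has_struct_tree_root": STRUCT_ROOT.search(pdf_bytes) is not None,
        "marked_true": MARKED_TRUE.search(pdf_bytes) is not None,
    }


# Hand-written catalog fragments, not taken from the example PDFs.
TAGGED = (
    b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R "
    b"/MarkInfo << /Marked true >> /StructTreeRoot 5 0 R >>\nendobj\n"
)
UNTAGGED = b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"

if __name__ == "__main__":
    print(looks_tagged(TAGGED))
    print(looks_tagged(UNTAGGED))
```

To try it on a real file, pass the result of `open(path, "rb").read()`. A proper PDF parser that resolves the catalog would be more reliable than this byte scan.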

In those cases, the information you're interested in is used to identify the structure of the page content. This usually consists of:

  • defining the order of elements on the page (a PDF file can contain text in a completely illogical order). The structure information would help you figure out which text comes first and which bits follow.
  • defining the nature of elements (is this an image, a title, a paragraph, an artefact, a table, a footnote, etc.).
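Both points are expressed through the structure tree: each structure element is a dictionary with a structure type under /S (for example Sect, P, Table, TR, TD) and its children under /K, in logical reading order. As a sketch of how one might count element types once a PDF library has resolved that tree, the code below walks plain Python dicts and lists that mimic it. The function `count_structure_types` and `EXAMPLE_TREE` are invented for illustration, and getting from a real file to such nested dicts is assumed, not shown.

```python
from collections import Counter


def count_structure_types(node, counts=None):
    """Recursively count the /S (structure type) of every structure element."""
    if counts is None:
        counts = Counter()
    if isinstance(node, list):
        # /K may be an array of kids, listed in logical order.
        for kid in node:
            count_structure_types(kid, counts)
    elif isinstance(node, dict):
        # The root and marked-content references carry no /S entry.
        if "S" in node:
            counts[node["S"]] += 1
        count_structure_types(node.get("K"), counts)
    # Integers are marked-content IDs (leaves); None means "no kids".
    return counts


# Hypothetical tree: two sections, one heading, two paragraphs, a 2x2 table.
EXAMPLE_TREE = {
    "Type": "StructTreeRoot",
    "K": {
        "S": "Document",
        "K": [
            {"S": "Sect", "K": [{"S": "H1", "K": 0}, {"S": "P", "K": 1}]},
            {"S": "Sect", "K": [
                {"S": "P", "K": [2, 3]},
                {"S": "Table", "K": [
                    {"S": "TR", "K": [{"S": "TH", "K": 4}, {"S": "TH", "K": 5}]},
                    {"S": "TR", "K": [{"S": "TD", "K": 6}, {"S": "TD", "K": 7}]},
                ]},
            ]},
        ],
    },
}

if __name__ == "__main__":
    for tag, number in sorted(count_structure_types(EXAMPLE_TREE).items()):
        print(f"{tag:<10}: {number}")
```

Real documents may use custom structure type names that are mapped to the standard ones through a role map, which this sketch ignores.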

If you want to extract this, I would encourage you to read the PDF specification on the Adobe web site, and specifically the chapters on Marked Content (14.6), Logical Structure (14.7) and Tagged PDF (14.8). The way the information is encoded in PDF is far from trivial, and like I said, most PDF files will probably not have the information.
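To give a feel for the marked-content part: inside a page's content stream, tagged content is wrapped between a BDC operator (with a tag and a property list containing an /MCID number) and an EMC operator, and the structure tree refers back to those MCIDs. The sketch below pulls such sequences out of an already decompressed content stream given as text. It is deliberately naive: it assumes no nested marked content, inline property lists, and simple literal strings shown with Tj and no escaped parentheses. `marked_content_by_mcid` and `SAMPLE_STREAM` are hypothetical.

```python
import re

# "/Tag << ... /MCID n ... >> BDC  ...content...  EMC", without nesting.
MARKED_SEQUENCE = re.compile(
    r"/(\w+)\s*<<[^>]*?/MCID\s+(\d+)[^>]*>>\s*BDC(.*?)EMC", re.DOTALL
)
# A literal string followed by the Tj (show text) operator.
SHOWN_TEXT = re.compile(r"\(([^)]*)\)\s*Tj")


def marked_content_by_mcid(content_stream):
    """Map each MCID to (tag, text shown inside that marked-content sequence)."""
    result = {}
    for tag, mcid, body in MARKED_SEQUENCE.findall(content_stream):
        result[int(mcid)] = (tag, "".join(SHOWN_TEXT.findall(body)))
    return result


# Hand-written content stream; the last sequence is an untagged artifact.
SAMPLE_STREAM = """
/H1 <</MCID 0>> BDC
BT /F1 18 Tf 72 720 Td (Results) Tj ET
EMC
/P <</MCID 1>> BDC
BT /F1 11 Tf 72 690 Td (First paragraph.) Tj ET
EMC
/Artifact BMC
BT /F1 9 Tf 72 30 Td (Page 1) Tj ET
EMC
"""

if __name__ == "__main__":
    for mcid, (tag, text) in sorted(marked_content_by_mcid(SAMPLE_STREAM).items()):
        print(mcid, tag, repr(text))
```

A real extractor needs a proper content-stream tokenizer and font decoding; this only illustrates how tags and MCIDs bracket the page content.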

In my experience, the only PDF files that have this in a completely proper way are those generated by organisations that are legally obligated to support accessibility (governments, etc.) or that use some of these features in their electronic archives. Some OCR tools can automatically generate "some" of this information, though the quality in that case might be sub-par.

Answered by David van Driessche on December 11, 2020

Your PyPDF2 example calls the getDocumentInfo() method, which only retrieves the document's metadata and none of the content.

In order to get to the content/structure of the PDF, you need to access the pages by retrieving the Page object.

Once you have retrieved the Page object, you can try to extract the text by calling extractText() on it.

How well that works will depend on your specific PDF: how it was created, whether the text comes from OCR, and so on.

Once you have the text itself, you will need to re-assemble it to format it into a table.
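As a rough illustration of that last step, here is a minimal sketch that turns extracted text into rows and cells. It assumes the extracted text keeps one table row per line and separates columns by at least two spaces, which is not guaranteed for any given extractor or PDF. `rows_from_text` and `EXTRACTED` are made-up names and data.

```python
import re


def rows_from_text(text):
    """Split each non-empty line into cells on runs of two or more spaces."""
    return [
        re.split(r"\s{2,}", line.strip())
        for line in text.splitlines()
        if line.strip()
    ]


# Hypothetical extracted text; single spaces inside a cell are preserved.
EXTRACTED = """
Name        Qty   Unit price
Apple       3     0.50
Pear        12    1.20
"""

if __name__ == "__main__":
    for row in rows_from_text(EXTRACTED):
        print(row)
```

If the extractor collapses whitespace or emits cells in a different order, this approach breaks down and you would need positional information instead.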

Answered by tomanizer on December 11, 2020
