# How can I extract all PDF Tags related to content with Python?

Stack Overflow Asked by Martin Thoma on December 11, 2020

I’ve read about tagged PDFs indirectly, in "TabbyPDF: Web-Based System for PDF Table Extraction", where it sounds as if one could get semantic information about a PDF’s content. So not only author / title / number of pages, but maybe something like sections or where a title is.

Is this possible?

Here are some example PDFs to show, in case it is:

## What I’ve tried

I might be going in completely the wrong direction, but the only information I get is metadata of the document, not anything about its content or the content's structure. I was hoping for something like semantic HTML elements, so that I would know that there are two sections, one table, three paragraphs. Maybe even that the table has a caption and 42 rows and 123 columns.

### PyPDF2

    from PyPDF2 import PdfFileReader


    def get_info(path):
        # Read the Document Information Dictionary and the page count
        with open(path, "rb") as f:
            pdf = PdfFileReader(f)
            info = pdf.getDocumentInfo()
            nb_pages = pdf.getNumPages()
        info = dict(info)
        info['nb_pages'] = nb_pages
        return info


    if __name__ == "__main__":
        path = "PDF-export-example.pdf"
        info = get_info(path)
        for key, value in sorted(info.items()):
            print(f"{key:<15}: {value}")


Lorem Ipsum Table Test:

/Author        : Martin Thoma
/CreationDate  : D:20200730020133-07'00'
/Creator       : Microsoft Word
/ModDate       : D:20200730020133-07'00'
nb_pages       : 1


Camelot Edge TOL:

/Producer      : PyPDF2
nb_pages       : 1


### pdfminer

    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdfdocument import PDFDocument


    def get_info(path):
        # doc.info is a list of Document Information Dictionaries
        with open(path, "rb") as f:
            parser = PDFParser(f)
            doc = PDFDocument(parser)
            return doc.info


    if __name__ == "__main__":
        path = "edge_tol.pdf"
        info = get_info(path)
        for el in info:
            for key, value in el.items():
                print(f"{key:<15}: {value}")


Lorem Ipsum Table Test:

Author         : b'Martin Thoma'
Creator        : b'Microsoft Word'
CreationDate   : b"D:20200730020133-07'00'"
ModDate        : b"D:20200730020133-07'00'"


Camelot Edge TOL:

Producer       : b'PyPDF2'


I don't know the tools you mention, but I can answer the theory behind this and that might point you in the correct direction.

What you're doing gets metadata, and only a small portion of it: more precisely, the part that comes out of the Document Information Dictionary in the PDF. While this still contains some information, it has largely been superseded by XMP metadata (basically "simple" XML information) embedded in the PDF. However, this too is irrelevant when you are looking for structure information.

First of all, PDF files don't have to contain structure information as you describe it. It's an optional feature, and the vast majority of PDF documents lack it. The use of structure in PDF is only mandated in certain cases:

• When the PDF is compliant with the ISO standard for long-term archival (PDF/A), and then only if the PDF wants to be compliant with the more stringent forms of this standard (PDF/A-1a, PDF/A-2a or PDF/A-3a).
• When the PDF is compliant with the ISO standard for universal accessibility (PDF/UA).

The information you're interested in is used in those cases to identify structure of page content. This usually consists of:

• defining the order of elements on the page (a PDF file can contain text in a completely illogical order). The structure information would help you figure out which text comes first, followed by what other bits.
• defining the nature of elements (is this an image, a title, a paragraph, an artefact, a table, a footnote, etc.).

If you want to extract this, I would encourage you to read the PDF specification on the Adobe web site, and specifically the chapters on Marked Content (14.6), Logical Structure (14.7) and Tagged PDF (14.8). The way the information is encoded in PDF is far from trivial, and like I said, most PDF files will probably not have the information.
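As a rough illustration of what the Logical Structure chapter describes: the document catalog may hold a `/StructTreeRoot` dictionary, whose `/K` entry lists child structure elements, each carrying a structure type in `/S` (for example `/P`, `/H1`, `/Table`) and its own `/K`. The sketch below walks such a tree and counts the types. The names `_resolve`, `count_structure_types` and `example_root` are made up for this example, and the tree is a hand-built stand-in made of plain dicts and lists. With PyPDF2 the real root could probably be reached via something like `reader.trailer["/Root"]["/StructTreeRoot"]`, but treat that as an assumption to verify against your library version.

```python
from collections import Counter


def _resolve(obj):
    """Follow an indirect reference if the object offers a PyPDF2-style getter."""
    for name in ("get_object", "getObject"):
        getter = getattr(obj, name, None)
        if callable(getter):
            return getter()
    return obj  # plain dicts, lists and integers pass through unchanged


def count_structure_types(node, counts=None):
    """Count /S (structure type) entries reachable through /K (kids)."""
    if counts is None:
        counts = Counter()
    node = _resolve(node)
    if isinstance(node, list):
        for kid in node:
            count_structure_types(kid, counts)
    elif isinstance(node, dict):
        tag = _resolve(node.get("/S"))
        if tag is not None:
            counts[str(tag)] += 1
        if "/K" in node:
            count_structure_types(node["/K"], counts)
    # Integers (marked-content IDs) carry no type and are skipped.
    return counts


# Hand-built stand-in for a structure tree (not read from a real file).
example_root = {
    "/Type": "/StructTreeRoot",
    "/K": [{
        "/S": "/Document",
        "/K": [
            {"/S": "/H1", "/K": 0},
            {"/S": "/P", "/K": 1},
            {"/S": "/P", "/K": [2, 3]},
            {"/S": "/Table", "/K": [{"/S": "/TR", "/K": [{"/S": "/TD", "/K": 4}]}]},
        ],
    }],
}

if __name__ == "__main__":
    for tag, n in sorted(count_structure_types(example_root).items()):
        print(f"{tag:<10}: {n}")
```

This only answers "how many elements of each type are there"; mapping an element back to the text it covers means following the marked-content IDs into the page content streams, which is the non-trivial part covered in the chapters mentioned above.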

In my experience, the only PDF files that have this in a completely proper way are those generated by organisations that are legally obligated to support accessibility (governments etc.) or that use some of these features in their electronic archives. Some OCR tools can automatically generate "some" of this information, though the quality in that case might be sub-par.
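Since most files will not carry the information, it may be worth checking for it before doing anything else. According to the Tagged PDF chapter, a tagged file declares itself through a `/MarkInfo` dictionary in the document catalog with `/Marked` set to true, alongside the `/StructTreeRoot` entry. A minimal sketch of that check follows; `looks_tagged` and `_resolve` are hypothetical helper names, and the catalog is assumed to behave like a dict (with PyPDF2, `reader.trailer["/Root"]` should give such an object, but verify this for your version).

```python
def _resolve(obj):
    """Follow an indirect reference if the object offers a PyPDF2-style getter."""
    for name in ("get_object", "getObject"):
        getter = getattr(obj, name, None)
        if callable(getter):
            return getter()
    return obj


def looks_tagged(catalog):
    """Return True if the catalog claims to be tagged and has a structure tree."""
    mark_info = _resolve(catalog.get("/MarkInfo"))
    if not isinstance(mark_info, dict):
        return False
    marked = _resolve(mark_info.get("/Marked", False))
    # Assumption: some libraries wrap PDF booleans in an object with `.value`.
    marked = getattr(marked, "value", marked)
    return bool(marked) and "/StructTreeRoot" in catalog


if __name__ == "__main__":
    # Plain-dict stand-ins for two catalogs, not read from real files.
    print(looks_tagged({"/MarkInfo": {"/Marked": True}, "/StructTreeRoot": {}}))
    print(looks_tagged({"/Pages": {}}))
```

A positive result only says the file claims to be tagged; it says nothing about how complete or accurate the tags are.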

Answered by David van Driessche on December 11, 2020

Your PyPDF2 example calls the getDocumentInfo() method, which only retrieves the document's metadata and none of the content.

To get to the content of the PDF, you need to access its pages by retrieving the Page object for each one.

Once you have retrieved a Page object, you can try to extract the text by calling extractText() on it.

How well that works will depend on your specific PDF: how it was created, whether the text comes from OCR, etc.

Once you have the text itself, you will need to re-assemble it yourself to format it into a table.
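The page-by-page extraction described above could be sketched as follows. It uses the PyPDF2 1.x method names that appear in the question (`getNumPages`, `getPage`, `extractText`); newer PyPDF2/pypdf releases renamed these (for example `extract_text()`), so adjust to your installed version. `extract_page_texts` is a name made up for this example.

```python
def extract_page_texts(reader):
    """Return one text string per page, using PyPDF2 1.x method names."""
    return [
        reader.getPage(i).extractText()
        for i in range(reader.getNumPages())
    ]


# Possible usage with a real file (assumes PyPDF2 1.x is installed):
#
#     from PyPDF2 import PdfFileReader
#
#     with open("PDF-export-example.pdf", "rb") as f:
#         texts = extract_page_texts(PdfFileReader(f))
#     for number, text in enumerate(texts, start=1):
#         print(number, text[:80])
```

Note that this yields plain text only; it does not use any tags, so headings, paragraphs and table cells are not distinguished.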

Answered by tomanizer on December 11, 2020
