AnswerBun.com

Python function to loop through PDFs in a folder, and find keywords

Thank you so much for taking the time. Please see the code below. The code works, but instead of searching for one word, I need to search for several words. I've tried:

search_word = ['python', 'aws', 'sql']

but this doesn’t work. Any ideas on how to make this work?

Any suggestions to improve the code are all welcome!

Code:

import os
import PyPDF2

directory = r"/Users/resumes_for_testing/"

# define keywords
search_word = 'python'

# Loop through all PDFs in specified directory:
for filename in os.listdir(directory):
    if filename.endswith(".pdf"):
        # open the pdf file
        f = open(os.path.join(directory, filename), 'rb')
        object = PyPDF2.PdfFileReader(f)
        
        # search for keywords
        for i in range(object.numPages):
            page = object.getPage(i)
            text = page.extractText()
            search_text = text.lower().split()
            for word in search_text:
                if search_word in word:
                    print("The word '{}' was found in '{}'".format(search_word,filename))

Stack Overflow Asked by Michael H on January 1, 2021

2 Answers

Try pdfreader to extract the text:

import os
from pdfreader import SimplePDFViewer, PageDoesNotExist

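def search_in_file(fname, search_words):
    # the with-block closes the file once every page has been read
    with open(fname, "rb") as fd:
        viewer = SimplePDFViewer(fd)
        try:
            while True:
                viewer.render()
                # lowercase the page text so the lowercase keywords also match "Python" or "SQL"
                text = "".join(viewer.canvas.strings).lower()
                for word in search_words:
                    if word in text:
                        print("The word '{}' was found in '{}' on page {}".format(word, fname, viewer.current_page_number))
                viewer.next()
        except PageDoesNotExist:
            # no more pages in this file
            pass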

# define keywords
search_words = ['python', 'aws', 'sql']

# define directory
directory = "./"

# Loop through all PDFs in specified directory:
for fname in os.listdir(directory):
    if fname.endswith(".pdf"):
        # join with directory so this also works when directory is not "./"
        search_in_file(os.path.join(directory, fname), search_words)
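
One caveat with the substring test `word in text`: it also fires inside longer words, so 'sql' is reported for 'mysql' and 'aws' for 'laws'. If whole-word matches are wanted, a regular expression with word boundaries is one option. The following is only a minimal sketch and not part of the original answer: the helper name find_keywords and the sample text are made up for illustration, and it assumes the extracted page text still has spaces or punctuation between words.

```python
import re

def find_keywords(text, keywords):
    """Return the keywords that occur in text as whole words, ignoring case."""
    found = []
    for kw in keywords:
        # \b marks a word boundary; re.escape keeps characters such as "." literal
        pattern = r"\b" + re.escape(kw) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(kw)
    return found

page_text = "Built MySQL dashboards; knows Python and AWS."
print(find_keywords(page_text, ["python", "aws", "sql"]))  # ['python', 'aws']
```

Here 'sql' is not reported because it only appears inside 'MySQL'. Whether that is the desired behaviour depends on the keywords being searched for.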

Answered by Maksym Polshcha on January 1, 2021

You could try a small change in approach: instead of looping over search_text, loop over your list of search_words and use an if statement to check whether each word is in search_text.

For example:

# define keywords
search_words = ['python', 'aws', 'sql']

# Loop through all PDFs in specified directory:
for filename in os.listdir(directory):
    if filename.endswith(".pdf"):
        # open the pdf file (joined with directory so it is found from any working directory)
        with open(os.path.join(directory, filename), 'rb') as f:
            reader = PyPDF2.PdfFileReader(f)

            # search for keywords
            for i in range(reader.numPages):
                page = reader.getPage(i)
                text = page.extractText()
                search_text = text.lower().split()

                for word in search_words:
                    if word in search_text:
                        print("The word '{}' was found in '{}'".format(word, filename))
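
As for why the original attempt failed: with search_word bound to a list, `search_word in word` asks whether a list is a substring of a string, which Python rejects with a TypeError. Note also that `word in search_text` in the loop above tests for an exact token, so 'python,' with a trailing comma would not count, whereas the question's code matched substrings. A small sketch that keeps the substring behaviour for several keywords follows; the function name keywords_in_text and the sample text are made up for illustration.

```python
def keywords_in_text(text, keywords):
    """Return each keyword that appears as a substring of any token in text."""
    tokens = text.lower().split()
    return [kw for kw in keywords if any(kw in token for token in tokens)]

sample = "Experienced in Python, SQL and cloud tooling"

# Exact-token test: "python," (with the comma) is not equal to "python"
print("python" in sample.lower().split())  # False

# Substring test per token, for a whole list of keywords
print(keywords_in_text(sample, ["python", "aws", "sql"]))  # ['python', 'sql']
```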

Answered by Matthew King on January 1, 2021
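
The names PdfFileReader, numPages, getPage and extractText used above were deprecated in later PyPDF2 releases and removed in PyPDF2 3.0; the project continues as pypdf. The following is a hedged sketch of the same loop against the newer names (PdfReader, reader.pages, page.extract_text()), assuming the pypdf package is installed. The helper names scan_pages, pdf_page_texts and search_directory are hypothetical and do not come from either answer.

```python
import os

def scan_pages(page_texts, keywords):
    """Yield (page_number, keyword) for each keyword found on each page (1-based)."""
    for number, text in enumerate(page_texts, start=1):
        lowered = (text or "").lower()  # treat missing text as empty
        for kw in keywords:
            if kw in lowered:
                yield number, kw

def pdf_page_texts(path):
    """Yield the extracted text of each page; assumes pypdf is installed."""
    from pypdf import PdfReader  # imported here so scan_pages works without pypdf
    reader = PdfReader(path)
    for page in reader.pages:
        yield page.extract_text()

def search_directory(directory, keywords):
    """Print every keyword hit for every PDF in directory."""
    for filename in sorted(os.listdir(directory)):
        if filename.lower().endswith(".pdf"):
            path = os.path.join(directory, filename)
            for number, kw in scan_pages(pdf_page_texts(path), keywords):
                print("The word '{}' was found in '{}' on page {}".format(kw, filename, number))
```

With the question's folder this would be called as search_directory("/Users/resumes_for_testing/", ["python", "aws", "sql"]). Keeping the matching in scan_pages separate from the PDF reading means the matching can be tried on plain strings without any PDF at hand.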
