Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


How to extract multiple instances of a word from PDF files in Python?

I'm writing a Python script that reads a PDF file and records, for every instance of the word "time", both the string that follows it and the page number it appears on.

I have gotten it to recognize when a page contains the string "time" and print the page number. However, if a page contains "time" more than once, it does not tell me. I assume this is because the check is already satisfied by the first occurrence of "time", so the script moves on to the next page.

How would I go about finding multiple instances of the word "time"?

This is my code:

import PyPDF2

def pdf_read():
    pdfFile = "recordsdocument.pdf"
    
    pdf = PyPDF2.PdfFileReader(pdfFile)
    pageCount = pdf.getNumPages()
    
    for pageNumber in range(pageCount):
        page = pdf.getPage(pageNumber)
        pageContent = page.extractText()   
        if "Time" in pageContent or "time" in pageContent:
            print(pageNumber)

Also, as a side note, this PDF is a scanned document, so when I read the text in Python (or copy and paste it into Word), many words come out with random symbols and characters even though the page itself is perfectly legible. Is this a limitation of ordinary programming, meaning that reading the file accurately would require more complex techniques such as machine learning?

question from:https://stackoverflow.com/questions/65851174/how-to-extract-multiple-instances-of-a-word-from-pdf-files-on-python


1 Answer


One solution is to split pageContent into a list of words and count how often 'time' occurs in that list. This also makes it easy to get the word that follows 'time': simply take the next item in the list:

import PyPDF2
import string

pdfFile = "recordsdocument.pdf"

pdf = PyPDF2.PdfFileReader(pdfFile)
pageCount = pdf.getNumPages()

for pageNumber in range(pageCount):
    page = pdf.getPage(pageNumber)
    pageContent = page.extractText()   
    pageContent = pageContent.split() # words to list (splits on any whitespace, including line breaks)
    pageContent = ["".join(j.lower() for j in i if j not in string.punctuation) for i in pageContent] # lowercase and remove punctuation

    print(pageNumber, pageContent.count('time')) # page number and occurrences of time (words are already lowercase)
    print([(j, pageContent[i+1] if i+1 < len(pageContent) else '') for i, j in enumerate(pageContent) if j == 'time']) # list time and following word

Note that this example also lowercases every word and strips the ASCII punctuation characters in string.punctuation from it. This may tidy up some of the bad OCR output, but it cannot repair characters that were misrecognized.
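If the punctuation needs to be preserved (for example a value such as 10:30 after "Time:"), a regular expression is one possible alternative to splitting into words. The following is only a sketch under assumptions: the helper name find_keyword_followers and its pages parameter are made up for illustration, and pages is assumed to be a list holding the extracted text of each page.

```python
import re

def find_keyword_followers(pages, keyword="time"):
    """Return (page_number, following_text) for every occurrence of keyword.

    pages is a hypothetical list of page texts, one string per page.
    Page numbers are 0-based, like the loop index in the code above.
    """
    # \b stops "time" from matching inside words such as "sometimes".
    # The lookahead captures the next run of non-space characters without
    # consuming it, so "time time again" still yields two matches.
    pattern = re.compile(
        r"\b" + re.escape(keyword) + r"\b(?=\W*(\S*))", re.IGNORECASE
    )
    hits = []
    for page_number, text in enumerate(pages):
        for match in pattern.finditer(text):
            hits.append((page_number, match.group(1)))
    return hits
```

With the reader from the code above, the list could be built as pages = [pdf.getPage(n).extractText() for n in range(pdf.getNumPages())]. The captured text may carry trailing punctuation, and the following text is an empty string when "time" is the last thing on the page.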

