I am trying to use PDFMiner (or any other PDF extraction tool) to extract text from a PDF.
What I want to achieve: I search for the keyword 'bank' and get back the bank name, or the whole table row that contains it.
The rows in the PDF look like this:
bank,(sth,sth)|is|A ,(B , C)
Here is what I tried:
import io

from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfpage import PDFPage


def extract_text_by_page(pdf_path):
    # Yield the extracted text of each page, one page at a time
    with open(pdf_path, 'rb') as fh:
        for page in PDFPage.get_pages(fh,
                                      caching=True,
                                      check_extractable=True):
            resource_manager = PDFResourceManager()
            fake_file_handle = io.StringIO()
            converter = TextConverter(resource_manager, fake_file_handle)
            page_interpreter = PDFPageInterpreter(resource_manager, converter)
            page_interpreter.process_page(page)
            text = fake_file_handle.getvalue()
            # close open handles before handing the text back
            converter.close()
            fake_file_handle.close()
            yield text


def extract_text(pdf_path):
    # Print every page followed by a blank line
    for page in extract_text_by_page(pdf_path):
        print(page)
        print()


if __name__ == '__main__':
    # extract_text() prints the pages itself and returns None,
    # so its result is not wrapped in print()
    extract_text('test.pdf')
It currently returns the whole tables, but I just want the exact row containing 'bank' (bank: A).
Any help will be appreciated!
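To make the goal concrete, here is a rough sketch of the filtering step I have in mind. It is not working PDFMiner code: it assumes the extracted page text arrives with one table row per line and cells separated by `|`, which may not match what the converter actually produces. The helper names `find_rows`, `split_cells` and `rows_to_csv`, and the sample text, are made up for illustration.

```python
import csv
import io


def find_rows(page_text, keyword):
    """Return the lines of page_text that contain keyword (case-insensitive)."""
    needle = keyword.lower()
    return [line.strip() for line in page_text.splitlines()
            if needle in line.lower()]


def split_cells(row, sep='|'):
    """Split one row on sep and trim whitespace around each cell."""
    return [cell.strip() for cell in row.split(sep)]


def rows_to_csv(rows):
    """Serialize the matching rows as CSV text, one row per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    for row in rows:
        writer.writerow(split_cells(row))
    return buffer.getvalue()


# Hypothetical page text standing in for one page of extracted output
sample_page = "name,(x,y)|is|Z\nbank,(sth,sth)|is|A ,(B , C)\n"

matches = find_rows(sample_page, 'bank')
print(matches)
print(rows_to_csv(matches))
```

In the real script, `sample_page` would be replaced by each page yielded from `extract_text_by_page`, provided the extracted text really does keep one row per line.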
question from:
https://stackoverflow.com/questions/66061165/python-pdfminer-to-search-for-keyword-the-return-texts-to-csv