Block 46: PDF Reading with pdfplumber
Extract text and tables from PDF files programmatically.
Concepts
- pip install pdfplumber
- pdfplumber.open() context manager
- page.extract_text() and page.extract_tables()
- Handling multi-page PDFs
Code Examples
See exercise below.
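As a starting point, here is a minimal sketch of reading text from every page of a PDF. It uses pdfplumber.open() as a context manager, iterates over pdf.pages, and calls page.extract_text() on each page. The helper names join_page_texts and read_pdf_text are illustrative, not part of pdfplumber, and the sketch assumes that a page with no text layer (for example a scanned image) may yield None or an empty string.

```python
from typing import Iterable, Optional


def join_page_texts(texts: Iterable[Optional[str]]) -> str:
    """Join per-page text, treating pages with no extractable text as empty."""
    # Assumption: extract_text() may give None or "" for a page without a
    # text layer, so normalise both to "" before joining.
    return "\n".join(text or "" for text in texts)


def read_pdf_text(path: str) -> str:
    """Return the text of every page in the PDF at `path`, one page per chunk."""
    # Imported here so join_page_texts() stays usable without pdfplumber.
    import pdfplumber

    # The context manager closes the underlying file when the block exits.
    with pdfplumber.open(path) as pdf:
        return join_page_texts(page.extract_text() for page in pdf.pages)
```

For a single page, pdf.pages[0].extract_text() inside the same with block returns just the first page, and slicing the result with [:500] gives its first 500 characters.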
Exercise
Open a PDF, extract text from page 1, and print the first 500 characters. Extract a table from a PDF and convert it to a pandas DataFrame.
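For the table part, one possible approach is sketched below. page.extract_tables() returns a list of tables, each a list of rows, each row a list of cell strings or None. The sketch assumes the first row of the table is the header, which is not true of every PDF, and the helper names split_header_and_rows and first_table_as_dataframe are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Row = Sequence[Optional[str]]


def split_header_and_rows(table: Sequence[Row]) -> Tuple[List[str], List[List[str]]]:
    """Clean the cells and split a table into (header, data rows)."""
    # Empty cells come back as None; replace them with "" and trim whitespace.
    cleaned = [[(cell or "").strip() for cell in row] for row in table]
    if not cleaned:
        return [], []
    # Assumption: the first row holds the column names.
    return cleaned[0], cleaned[1:]


def first_table_as_dataframe(path: str, page_index: int = 0):
    """Return the first table on the given page as a DataFrame, or None."""
    # Imported here so split_header_and_rows() works without these packages.
    import pandas as pd
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        tables = pdf.pages[page_index].extract_tables()
    if not tables:
        return None  # no table detected on this page
    header, rows = split_header_and_rows(tables[0])
    return pd.DataFrame(rows, columns=header)
```

All values arrive as strings, so numeric columns usually need converting afterwards, for example with pandas.to_numeric.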
Homework
What are the challenges of PDF parsing? Why do tables in PDFs cause problems?
Thursday