Week 5 • Wednesday

Block 46: PDF Reading with pdfplumber

Extract text and tables from PDF files programmatically.

Concepts

Code Examples

See exercise below.
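
As a starting point for the text part of the exercise, here is a minimal sketch. It assumes pdfplumber is installed (pip install pdfplumber) and that the PDF has a text layer; the helper names preview and first_page_preview and the file name sample.pdf are made up for illustration. extract_text() may return an empty result on scanned or image-only pages, so the helper treats a missing result as an empty string.

```python
def preview(text, limit=500):
    """Return the first `limit` characters of text, treating None as empty."""
    return (text or "")[:limit]


def first_page_preview(pdf_path, limit=500):
    """Open a PDF and return the start of the text on its first page."""
    # Imported here so preview() can be used without pdfplumber installed.
    import pdfplumber  # third-party: pip install pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        first_page = pdf.pages[0]  # pdf.pages is a zero-indexed list
        return preview(first_page.extract_text(), limit)


# Example call (sample.pdf is a hypothetical file):
# print(first_page_preview("sample.pdf"))
```

The with block closes the file when it exits, which is why the text is extracted inside the block rather than after it.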

Exercise

Open a PDF, extract text from page 1, and print the first 500 characters. Extract a table from a PDF and convert it to a pandas DataFrame.
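
One possible approach to the table part, as a rough sketch rather than a model answer. It assumes pdfplumber and pandas are installed, that the table sits on the chosen page, and that its first row is the header; none of this is guaranteed for a real PDF. The helper names split_header and first_table_as_dataframe are made up for illustration. extract_table() returns a list of rows (each a list of cell strings, with None for empty cells), or None when no table is detected.

```python
def split_header(rows):
    """Split extracted rows into (header, body), replacing None cells with ""."""
    cleaned = [[cell if cell is not None else "" for cell in row] for row in rows]
    if not cleaned:
        return [], []
    return cleaned[0], cleaned[1:]


def first_table_as_dataframe(pdf_path, page_number=0):
    """Return the first table found on a page as a pandas DataFrame."""
    # Imported here so split_header() can be used without these packages.
    import pandas as pd
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        rows = pdf.pages[page_number].extract_table()  # None if no table found

    if not rows:
        return pd.DataFrame()  # empty frame signals "nothing extracted"

    # Assumption: the first extracted row holds the column names.
    header, body = split_header(rows)
    return pd.DataFrame(body, columns=header)


# Example call (report.pdf is a hypothetical file):
# df = first_table_as_dataframe("report.pdf")
# print(df.head())
```

All cells arrive as strings, so numeric columns usually need converting afterwards, for example with pd.to_numeric.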

Homework

What are the challenges of PDF parsing? Why do tables in PDFs cause problems?