Block 56: Scraping Tables & Building DataFrames
Turn HTML tables into pandas DataFrames.
Concepts
- Locating <table> tags in HTML
- pd.read_html() for quick extraction
- Manual row/cell extraction for complex tables
- Cleaning scraped data: strip, replace, convert types
Code Examples
See exercise below.
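As a warm-up for the exercise, here is a minimal sketch of the quick-extraction and cleaning steps listed under Concepts. The markup in SAMPLE_HTML, the city names, and the helper strip_footnotes are made up for illustration; a real page would be downloaded first. It assumes pandas plus an HTML parser that pd.read_html() can use (for example lxml) are installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample markup standing in for a downloaded page.
SAMPLE_HTML = """
<table>
  <tr><th>City</th><th>Population</th><th>Area</th></tr>
  <tr><td>Springfield</td><td>1,200,000[a]</td><td>310.5 km2</td></tr>
  <tr><td>Shelbyville[1]</td><td>850,000</td><td>95 km2</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per <table> it finds.
# Wrapping the string in StringIO avoids the deprecation warning that
# newer pandas versions raise for literal HTML strings.
tables = pd.read_html(StringIO(SAMPLE_HTML))
raw = tables[0]


def strip_footnotes(column: pd.Series) -> pd.Series:
    """Remove bracketed footnote markers such as [1] and trim whitespace."""
    text = column.astype(str)
    return text.str.replace(r"\[[^\]]*\]", "", regex=True).str.strip()


# Clean each column: strip markers, replace separators and units, convert types.
clean = pd.DataFrame({
    "City": strip_footnotes(raw["City"]),
    "Population": pd.to_numeric(
        strip_footnotes(raw["Population"]).str.replace(",", "", regex=False)
    ),
    "Area": pd.to_numeric(
        raw["Area"].astype(str).str.replace("km2", "", regex=False).str.strip()
    ),
})

print(clean.dtypes)
```

The cleaning step converts every column to text before stripping, so it does not depend on which types pd.read_html() happened to infer for the raw table.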
Exercise
Use pd.read_html() to extract a table from a Wikipedia page. Then parse a table manually, row by row, with BeautifulSoup and build a DataFrame from the extracted cells.
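One possible starting point for the manual half of the exercise is sketched below. It is not a reference solution: the inline SAMPLE_HTML, the table id "cities", and the extra Link column are assumptions chosen for illustration, and a Wikipedia page would need to be fetched first and will have messier markup. It uses Python's built-in "html.parser" backend so that only bs4 and pandas are required.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical sample markup; a real page would be downloaded first.
SAMPLE_HTML = """
<table id="cities">
  <tr><th>City</th><th>Population</th></tr>
  <tr><td><a href="/wiki/Springfield">Springfield</a></td><td>1,200,000</td></tr>
  <tr><td><a href="/wiki/Shelbyville">Shelbyville</a></td><td>850,000</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Locate the <table> tag; the id used here is an assumption of this sample.
table = soup.find("table", id="cities")
rows = table.find_all("tr")

# The first row holds the header cells.
headers = [th.get_text(strip=True) for th in rows[0].find_all("th")]

records = []
for tr in rows[1:]:
    cells = tr.find_all("td")
    if len(cells) != len(headers):
        continue  # skip rows that do not match the header layout
    record = {name: cell.get_text(strip=True) for name, cell in zip(headers, cells)}
    # Keep the link target of the first cell alongside its visible text.
    link = cells[0].find("a")
    record["Link"] = link["href"] if link else None
    records.append(record)

manual_df = pd.DataFrame(records)

# Convert the population strings to integers.
manual_df["Population"] = (
    manual_df["Population"].str.replace(",", "", regex=False).astype(int)
)

print(manual_df)
```

Walking the rows by hand takes more code than pd.read_html(), but it gives control over each cell, for example keeping the href attribute in addition to the cell text.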
Homework
pd.read_html() is convenient but sometimes wrong. When would you prefer manual parsing?
Thursday