Block 126: Dask DataFrames for Large Data
Process larger-than-memory datasets with Dask.
Concepts
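- pip install "dask[dataframe]" (quotes keep shells such as zsh from expanding the brackets)
- dask.dataframe.read_csv() returns a lazy Dask DataFrame split into partitions
- Most operations mirror the pandas API but are lazy: they build a task graph instead of running immediately
- .compute() triggers execution and returns a pandas object
- dask.visualize(), or .visualize() on a collection, renders the task graph (needs the optional graphviz dependency)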
Code Examples
See exercise below.
Exercise
Load a large CSV with Dask. Compute a groupby mean and compare its speed and behavior against pandas. Then use Dask to filter and compute statistics on a file too large to fit in memory (simulate this with a large generated file).
Homework
When should you use Dask instead of pandas? List 3 scenarios. Thursday