Python Mastery — Complete Course Book


🐍
PYTHON MASTERY
Major Libraries Training — Complete Course Book



Data Science · Machine Learning · Scientific Computing
For Absolute Beginners → Production-Ready Practitioners


Online Delivery | Sandboxed Coding Environment | Cohort-Based

Table of Contents

About This Course
How to Use This Book
PART I – FOUNDATIONS (Weeks 1–4)
  Week 1: Python Foundations & Standard Library
  Week 2: NumPy – Arrays & Vectorized Computing
  Week 3: Pandas – DataFrames & Data Handling
  Week 4: Visualization – Matplotlib & Seaborn
PART II – WORKING WITH DATA & WEB (Weeks 5–8)
  Week 5: File Formats & Data Pipelines
  Week 6: HTTP, APIs & Web Scraping
  Week 7: Web Frameworks – Flask & FastAPI
  Week 8: Machine Learning with scikit-learn
PART III – SCIENCE, NLP & VISION (Weeks 9–12)
  Week 9: Scientific Python – SciPy & SymPy
  Week 10: Practical Statistics for Data Science
  Week 11: NLP – NLTK, spaCy & Transformers
  Week 12: Computer Vision with OpenCV
PART IV – ENGINEERING & AUTOMATION (Weeks 13–14)
  Week 13: Databases & Data Engineering
  Week 14: Automation, Scripting & Testing
PART V – CAPSTONE (Weeks 15–16)
  Week 15: Capstone – Design & Implementation
  Week 16: Capstone – Polish, Package & Present
Pricing & Delivery Structure
Appendix: Quick Reference & Cheat Sheets

About This Course

Python Mastery is a rigorous 16-week, 160-block online program designed to take absolute beginners to confident, production-ready Python practitioners in the areas of Data Science, Machine Learning, and Scientific Computing.
Every 30-minute block follows the same proven pedagogical loop:


Who This Course Is For
Complete beginners with no prior programming experience
Analysts, scientists, or engineers who want to automate their work
Anyone targeting a career in Data Science or ML Engineering
Developers from other languages who need Python proficiency quickly

Learning Outcomes
By the end of this course, you will be able to:
Write clean, idiomatic Python across all major standard library modules
Manipulate and analyze datasets with NumPy and Pandas
Visualize data with Matplotlib and Seaborn
Build and evaluate ML models with scikit-learn
Consume REST APIs and scrape websites responsibly
Build web services with Flask and FastAPI
Apply scientific computing (SciPy, SymPy) and statistics
Work with NLP (NLTK, spaCy, Transformers) and Computer Vision (OpenCV)
Design and maintain production data pipelines and databases
Automate tasks, write tests, and package a complete Python project

Technologies Covered

Core & Scientific
Python 3.11+
NumPy
Pandas
Matplotlib
Seaborn
SciPy
SymPy
NLTK
spaCy
Transformers (HuggingFace)
OpenCV

Web, ML & Engineering
requests
BeautifulSoup
Flask
FastAPI
scikit-learn
SQLAlchemy
Dask
pytest
Selenium

How to Use This Book

This book is both a course companion and a standalone reference. Each week is self-contained and follows a consistent structure.
Chapter Structure
Each week chapter contains:
Week Overview – learning goals, prerequisites, and deliverables
Day-by-Day Breakdown – two 30-minute blocks per day, Mon–Fri (10 blocks/week)
Concept Sections – clear explanations with diagrams and tables
Illustrated Code Examples – annotated, copy-ready snippets
Sandboxed Exercises – hands-on challenges to complete in the live environment
Homework Assignments – deeper tasks to complete between sessions
Week Review & Quick Quiz – consolidate your learning

💻 Sandbox Icon
Wherever you see this icon, open your sandboxed coding environment and type the code yourself. Reading code is not enough — you must run it.

⚠️ Common Pitfalls
Highlighted sections warn you of errors beginners frequently make, saving hours of debugging time.

📌 Key Concept
Blue boxes mark the single most important idea of each section. Memorise these.

🔗 Real-World Connection
Teal boxes connect the topic to real-world applications in industry.

Conventions Used
Code is shown in monospaced dark blocks
File/folder names use inline code style
Bold text highlights key terms on first use
>>> prompt indicates interactive Python shell; no prompt means a .py file

Environment Setup
All hands-on work is done in the provided online sandboxed environment. Nothing needs to be installed locally. The sandbox provides:
A full JupyterLab interface accessible in any browser
Python 3.11 with all course libraries pre-installed
Persistent storage for your code across sessions
Auto-graded exercise feedback
Instructor-visible progress dashboard

WEEK 1: PYTHON FOUNDATIONS & STANDARD LIBRARY
PART I — FOUNDATIONS | Environment setup, syntax, core data structures, functions

Week Overview
Learning Goals
Understand Python's place in the ecosystem
Set up and navigate the sandbox
Write expressions, variables, and control flow
Define and call functions
Use built-in types: list, dict, tuple, set
Explore the standard library essentials

Prerequisites
No prior programming knowledge
A web browser
Curiosity!

🎯 Week Deliverable
A working Python script that reads a data file, processes it with core data structures, and prints a formatted summary report.

Day 1
Block 1: What is Python & Your First Program
Python is a high-level, interpreted, dynamically-typed language known for its readable syntax. It is the #1 language in Data Science and ML. We start by understanding the Python execution model and writing our first lines of code.
Example:
# Your very first Python program
print("Hello, Python World!")

# Python as a calculator
>>> 2 + 3 * 4      # 14  (PEMDAS applies)
>>> 10 / 3         # 3.333... (true division)
>>> 10 // 3        # 3   (floor division)
>>> 10 % 3         # 1   (modulo / remainder)
>>> 2 ** 8         # 256 (exponentiation)
📌 Key Concept
Python uses indentation (4 spaces) to define code blocks, not curly braces. Getting this right is non-negotiable.
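A minimal sketch of indentation-defined blocks (the temperature example is illustrative, not from the course material):

```python
temperature = 30

if temperature > 25:
    # These two lines are indented 4 spaces, so they belong to the if-block
    print("It's hot!")
    print("Drink water.")

print("Done.")  # Not indented — runs regardless of the condition
```

Removing the indentation from either line inside the if-block (or mixing tabs and spaces) raises an IndentationError or silently changes which statements the condition controls.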

Block 2: Variables, Types & String Formatting
Variables are named references to values. Python infers the type automatically. The four primitive types are int, float, str, and bool. F-strings (Python 3.6+) are the modern way to embed values in strings.
Example:
# Variable assignment
name     = 'Alice'          # str
age      = 30               # int
height   = 5.7              # float
is_adult = True             # bool

# Check the type
print(type(age))            # <class 'int'>

# F-string formatting (modern, preferred)
print(f"{name} is {age} years old and {height:.1f} ft tall.")
# Alice is 30 years old and 5.7 ft tall.

# Type conversion
age_str = str(age)          # "30"
pi      = float('3.14159')  # 3.14159

Day 2
Block 3: Control Flow: if / elif / else
Control flow lets your program make decisions. The if/elif/else chain evaluates conditions in order and executes the first matching block. Conditions use comparison operators (==, !=, <, >, <=, >=) and logical operators (and, or, not).
Example:
score = 82

if score >= 90:
    grade = "A"
elif score >= 80:
    grade = "B"
elif score >= 70:
    grade = "C"
else:
    grade = "F"

print(f"Score: {score} → Grade: {grade}")  # Score: 82 → Grade: B

# Ternary (one-liner)
result = "pass" if score >= 50 else "fail"

Block 4: Loops: for & while
Iteration is fundamental in data processing. The for loop iterates over any iterable (list, range, string, dict…). The while loop repeats as long as a condition is True. Use break to exit, continue to skip.
Example:
# for loop with range
for i in range(5):          # 0, 1, 2, 3, 4
    print(i, end=' ')       # 0 1 2 3 4

# iterating a list
fruits = ['apple', 'banana', 'cherry']
for fruit in fruits:
    print(f"  - {fruit}")

# enumerate gives index + value
for idx, fruit in enumerate(fruits, start=1):
    print(f"{idx}. {fruit}")

# while loop
count = 0
while count < 3:
    print(count)
    count += 1

Day 3
Block 5: Lists & Tuples
A list is an ordered, mutable sequence of items. A tuple is ordered but immutable. Lists are the workhorse data structure for sequential data. Master slicing and list comprehensions β€” they appear in every real Python codebase.
Example:
# List creation & indexing
nums = [10, 20, 30, 40, 50]
print(nums[0])              # 10  (first)
print(nums[-1])             # 50  (last)
print(nums[1:4])            # [20, 30, 40]  (slice)
print(nums[::2])            # [10, 30, 50]  (step)

# Mutating a list
nums.append(60)             # add to end
nums.insert(0, 0)           # insert at index
nums.remove(30)             # remove by value
popped = nums.pop()         # remove & return last

# List comprehension — the Pythonic way
squares = [x**2 for x in range(1, 6)]   # [1, 4, 9, 16, 25]
evens   = [x for x in range(10) if x % 2 == 0]

# Tuples (immutable)
point = (3, 4)
x, y  = point               # unpacking

Block 6: Dictionaries & Sets
A dictionary stores key-value pairs; lookups are O(1). A set stores unique items with fast membership testing. Both are essential for data aggregation, counting, and deduplication.
Example:
# Dictionary
student = {'name': 'Alice', 'age': 20, 'grade': 'A'}
print(student['name'])      # Alice
student['age'] = 21         # update
student['gpa'] = 3.9        # add new key

# Safe access with .get()
print(student.get('score', 0))  # 0 (default if missing)

# Iterating a dict
for key, value in student.items():
    print(f"  {key}: {value}")

# Dict comprehension
sq_map = {x: x**2 for x in range(5)}

# Sets
a = {1, 2, 3, 4}
b = {3, 4, 5, 6}
print(a | b)  # union     {1, 2, 3, 4, 5, 6}
print(a & b)  # intersect {3, 4}
print(a - b)  # diff      {1, 2}

Day 4
Block 7: Functions: Definition, Arguments & Scope
Functions are reusable blocks of code. Python supports positional args, keyword args, default values, *args (variable positional), and **kwargs (variable keyword). Understanding scope (LEGB rule) prevents subtle bugs.
Example:
# Basic function
def greet(name, greeting='Hello'):     # default arg
    return f"{greeting}, {name}!"

print(greet("Alice"))              # Hello, Alice!
print(greet("Bob", greeting="Hi")) # Hi, Bob!

# *args collects extra positional args as a tuple
def total(*nums):
    return sum(nums)

print(total(1, 2, 3, 4))           # 10

# **kwargs collects extra keyword args as a dict
def profile(**info):
    for k, v in info.items():
        print(f"  {k}: {v}")

profile(name="Alice", age=20, city="NYC")

# Lambda (anonymous function)
square = lambda x: x ** 2
print(square(5))                   # 25
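The LEGB rule mentioned above (Local → Enclosing → Global → Built-in) describes the order in which Python searches for a name. A small sketch, with illustrative names not taken from the course code:

```python
x = "global"                 # Global scope

def outer():
    x = "enclosing"          # Enclosing scope

    def inner():
        x = "local"          # Local scope — found first
        return x

    return inner(), x

print(outer())   # ('local', 'enclosing')
print(x)         # global  — module-level x is untouched
print(len)       # Built-in scope: len was never shadowed, so Python finds the builtin
```

Each assignment to x creates a new name in the innermost scope rather than modifying the outer one; keywords like global and nonlocal exist precisely to override this default.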

Block 8: Standard Library: os, datetime, collections
Python ships with a vast standard library — 'batteries included'. Three of the most useful modules for data work are os/pathlib (file system), datetime (dates/times), and collections (specialized data structures).
Example:
import os
from pathlib import Path
from datetime import datetime, timedelta
from collections import Counter, defaultdict

# File system
cwd = Path.cwd()                     # current directory
files = list(cwd.glob('*.py'))       # list .py files

# Datetime
now = datetime.now()
print(now.strftime("%Y-%m-%d %H:%M"))  # 2025-06-01 09:30
deadline = now + timedelta(days=7)

# Counter — instant frequency map
words = ['cat', 'dog', 'cat', 'bird', 'dog', 'cat']
freq  = Counter(words)
print(freq.most_common(2))          # [('cat', 3), ('dog', 2)]

# defaultdict — no KeyError on first access
groups = defaultdict(list)
groups['A'].append('Alice')         # no key check needed
🔗 Real-World Connection
Counter is used daily in text analytics to count word frequencies. defaultdict simplifies grouping operations that would otherwise need if-key-exists checks.

Day 5
Block 9: File I/O & Exception Handling
Reading and writing files is fundamental. Always use context managers (with open(...) as f) to ensure files are closed. Exceptions are Python's error signalling mechanism; wrapping risky code in try/except prevents program crashes.
Example:
# Writing a file
with open('data.txt', 'w') as f:
    f.write("Line 1\n")
    f.write("Line 2\n")

# Reading a file
with open('data.txt', 'r') as f:
    lines = f.readlines()           # list of lines

# Reading line by line (memory efficient)
with open('data.txt') as f:
    for line in f:
        print(line.strip())

# Exception handling
try:
    result = 10 / 0
except ZeroDivisionError as e:
    print(f"Error: {e}")            # Error: division by zero
except (ValueError, TypeError) as e:
    print(f"Value/Type error: {e}")
finally:
    print("Always runs")

Block 10: Week 1 Review & Mini-Project
We consolidate the week's learning with a complete mini-script. This script reads a CSV-like text file, aggregates data with a dictionary, and prints a sorted frequency report β€” combining everything from Week 1.
Example:
# Week 1 Mini-Project: Word Frequency Analyzer
from collections import Counter
from pathlib import Path

def analyze_text(filepath):
    """Read a text file and print word frequency stats."""
    path = Path(filepath)
    if not path.exists():
        raise FileNotFoundError(f"No such file: {filepath}")
    with open(path) as f:
        text = f.read().lower()
    # Clean and split
    words = [w.strip('.,!?;:"') for w in text.split()]
    words = [w for w in words if len(w) > 2]  # skip short words
    freq = Counter(words)
    print(f"Total words : {sum(freq.values())}")
    print(f"Unique words: {len(freq)}")
    print("\nTop 10 words:")
    for word, count in freq.most_common(10):
        bar = "#" * count
        print(f"  {word:<15} {count:>4}  {bar}")

analyze_text("sample.txt")

Week 1 Quick Quiz
What is the difference between // and / in Python?
How do you iterate over both keys and values in a dictionary?
What does enumerate() return?
Write a list comprehension that produces the cubes of all odd numbers from 1 to 20.
What is the LEGB rule in Python scope resolution?


WEEK 2: NUMPY – ARRAYS & VECTORIZED COMPUTING
PART I — FOUNDATIONS | Numerical arrays, vectorization, broadcasting, linear algebra

Week Overview
Learning Goals
Create and manipulate ndarray objects
Understand dtypes, shapes, and axes
Apply vectorized operations (no Python loops!)
Use broadcasting for shape-mismatched arrays
Perform linear algebra operations
Generate random data for simulations

Prerequisites
Week 1 complete
Comfortable with Python lists and functions

🎯 Week Deliverable
A NumPy-based simulation: generate 10,000 random stock price paths using vectorized operations and compute risk statistics.

Why NumPy?
Python lists are flexible but slow for numerical computation — they store object references, not contiguous numeric data. NumPy's ndarray stores typed numeric data in contiguous memory, enabling C-speed computation via vectorization (replacing slow Python loops with a single array operation).

📌 Key Concept
Vectorization: instead of looping over elements in Python, pass the entire array to a NumPy function. This is 10–100x faster and the code is shorter.
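A small sketch of what "pass the entire array" means in practice; the array values are arbitrary, and both versions compute the same result:

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0])

# Loop version — one Python-level operation per element
slow = np.empty_like(data)
for i in range(len(data)):
    slow[i] = data[i] ** 2 + 1

# Vectorized version — one expression, the loop runs in compiled C code
fast = data ** 2 + 1

print(np.array_equal(slow, fast))  # True
print(fast)                        # [ 2.  5. 10. 17.]
```

The vectorized form is shorter, harder to get wrong, and the speed gap grows with array size.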

Day 1
Block 1: Creating Arrays & Understanding Shape
The fundamental NumPy object is the ndarray. You can create arrays from Python lists, from built-in generators, or by loading files. The shape, ndim, dtype, and size attributes describe the array's structure.
Example:
import numpy as np

# From Python list
a = np.array([1, 2, 3, 4, 5])
print(a.shape)   # (5,)   — 1-D, 5 elements
print(a.dtype)   # int64

# 2-D array (matrix)
m = np.array([[1, 2, 3],
              [4, 5, 6]])
print(m.shape)   # (2, 3)  — 2 rows, 3 cols
print(m.ndim)    # 2

# Built-in constructors
np.zeros((3, 4))          # 3x4 matrix of 0.0
np.ones((2, 2))           # 2x2 matrix of 1.0
np.eye(4)                 # 4x4 identity matrix
np.arange(0, 10, 2)       # [0, 2, 4, 6, 8]
np.linspace(0, 1, 5)      # [0, .25, .5, .75, 1]
np.full((2, 3), 7)        # 2x3 filled with 7

Block 2: Indexing, Slicing & Reshaping
NumPy indexing mirrors Python list indexing but extends to multiple dimensions. Boolean (mask) indexing is especially powerful for filtering data without explicit loops.
Example:
a = np.array([10, 20, 30, 40, 50])

# Basic indexing
print(a[0])      # 10
print(a[-1])     # 50
print(a[1:4])    # [20 30 40]

# 2-D indexing [row, col]
m = np.arange(12).reshape(3, 4)   # [[0..3], [4..7], [8..11]]
print(m[1, 2])   # 6   (row 1, col 2)
print(m[:, 1])   # [1 5 9]    (all rows, col 1)
print(m[0, :])   # [0 1 2 3]  (row 0, all cols)

# Boolean indexing — filter without a loop
data = np.array([3, -1, 4, -1, 5, -9, 2, 6])
positives = data[data > 0]   # [3 4 5 2 6]
data[data < 0] = 0           # replace negatives with 0

# Reshape
flat = np.arange(24)
cube = flat.reshape(2, 3, 4)   # 3-D: 2 layers, 3 rows, 4 cols

Day 2
Block 3: Vectorized Operations & Universal Functions
NumPy operations apply element-wise to entire arrays without Python loops. Universal Functions (ufuncs) are pre-compiled C functions that operate element-wise: np.sqrt, np.exp, np.log, np.sin, etc.
Example:
a = np.array([1.0, 4.0, 9.0, 16.0])
b = np.array([2.0, 2.0, 3.0, 4.0])

# Arithmetic — element-wise
print(a + b)         # [ 3.  6. 12. 20.]
print(a * b)         # [ 2.  8. 27. 64.]
print(a / b)         # [0.5  2.  3.  4.]

# Scalar broadcast
print(a * 2)         # [ 2.  8. 18. 32.]
print(a > 4)         # [F  F  T  T]

# Universal functions (ufuncs)
print(np.sqrt(a))    # [1. 2. 3. 4.]
print(np.log(a))     # [0.   1.386 2.197 2.773]
print(np.exp([0, 1, 2]))  # [1.    2.718  7.389]

# Aggregation along axes
m = np.arange(12).reshape(3, 4)
print(m.sum())       # 66   (total)
print(m.sum(axis=0)) # [12 15 18 21]  (column sums)
print(m.sum(axis=1)) # [ 6 22 38]     (row sums)
print(m.mean(), m.std(), m.max())

Block 4: Broadcasting Rules
Broadcasting allows NumPy to operate on arrays of different shapes by virtually expanding smaller arrays — without copying data. Master the 3 rules: (1) prepend 1s to shorter shape, (2) dimensions of size 1 stretch to match, (3) shapes must be compatible.
Example:
# Rule illustration
a = np.array([[1, 2, 3],    # shape (2, 3)
              [4, 5, 6]])
b = np.array([10, 20, 30])  # shape (3,) → broadcasts to (2, 3)
print(a + b)
# [[11 22 33]
#  [14 25 36]]

# Column vector broadcast
col = np.array([[100],      # shape (2, 1)
                [200]])
print(a + col)
# [[101 102 103]
#  [204 205 206]]

# Practical: center each feature column
data = np.random.randn(100, 5)   # 100 samples, 5 features
means = data.mean(axis=0)        # shape (5,)
stds  = data.std(axis=0)
normalized = (data - means) / stds   # z-score normalization
🔗 Real-World Connection
Z-score normalization (subtracting mean, dividing by std) is a critical preprocessing step before training ML models. NumPy broadcasting makes it a one-liner.

Day 3
Block 5: Linear Algebra: np.linalg
NumPy provides a full suite of linear algebra operations via np.linalg. These underpin virtually all ML algorithms — matrix multiplication (@), eigendecomposition, singular value decomposition, and solving linear systems.
Example:
A = np.array([[2, 1], [5, 3]])
B = np.array([[1, 2], [3, 4]])

# Matrix multiplication
C = A @ B                    # preferred over np.dot
print(C)  # [[5, 8], [14, 22]]

# Transpose
print(A.T)   # [[2, 5], [1, 3]]

# Determinant
print(np.linalg.det(A))      # 1.0

# Inverse
A_inv = np.linalg.inv(A)
print(A @ A_inv)             # identity matrix

# Solve linear system  Ax = b
b = np.array([3, 7])
x = np.linalg.solve(A, b)    # [2, -1]

# Eigenvalues & eigenvectors
vals, vecs = np.linalg.eig(A)
print(vals)   # eigenvalues

Block 6: Random Number Generation: np.random
NumPy's random module generates arrays of random numbers from many distributions. Always set a random seed for reproducible results.
Example:
rng = np.random.default_rng(seed=42)   # reproducible

# Uniform [0, 1)
u = rng.random(size=(3, 3))

# Standard normal (mean=0, std=1)
n = rng.standard_normal(size=1000)

# Normal distribution
prices = rng.normal(loc=100, scale=15, size=500)

# Integers
dice = rng.integers(1, 7, size=10)  # 10 dice rolls

# Choice / sampling
items = ['A', 'B', 'C', 'D']
sample = rng.choice(items, size=5, replace=True)

# Monte Carlo: estimate pi
N = 1_000_000
x, y = rng.random(N), rng.random(N)
inside = (x**2 + y**2) < 1.0
pi_est = 4 * inside.mean()
print(f"Pi ≈ {pi_est:.4f}")  # Pi ≈ 3.1416

Day 4
Block 7: Structured Arrays & Record Arrays
NumPy can store heterogeneous data (mixed dtypes) using structured arrays — a lightweight alternative to Pandas for simple cases. Each field is accessed by name.
Example:
# Structured array dtype
dt = np.dtype([('name', 'U20'), ('age', 'i4'), ('score', 'f8')])
records = np.array([
    ('Alice', 25, 92.5),
    ('Bob',   30, 87.0),
    ('Carol', 22, 95.3)
], dtype=dt)

print(records['name'])         # ['Alice' 'Bob' 'Carol']
print(records['score'].mean()) # 91.6

# Sort by score (descending)
sorted_r = np.sort(records, order='score')[::-1]
for r in sorted_r:
    print(f"{r['name']:<8} age={r['age']}  score={r['score']}")

Block 8: Performance: NumPy vs Pure Python
One of the most important things to understand is HOW MUCH faster NumPy is. This block benchmarks equivalent operations to build intuition for when vectorization matters.
Example:
import time
import numpy as np

N = 1_000_000
data = list(range(N))
arr  = np.arange(N, dtype=float)

# Python loop
t0 = time.perf_counter()
result = [x**2 for x in data]
t1 = time.perf_counter()
print(f"Python loop:  {(t1-t0)*1000:.1f} ms")

# NumPy vectorized
t0 = time.perf_counter()
result = arr ** 2
t1 = time.perf_counter()
print(f"NumPy vector: {(t1-t0)*1000:.1f} ms")

# Typical output:
# Python loop:  320.4 ms
# NumPy vector:   1.2 ms   ← 267x faster
⚠️ Common Pitfall
Never loop over NumPy arrays with Python for loops — this destroys performance. Always vectorize. If you cannot, consider numba or Cython.

Day 5
Block 9: Practical Patterns: Masking, Stacking & Saving
Real-world NumPy usage involves combining arrays (stack, concatenate, hstack, vstack), using masks to filter and replace data, and saving/loading arrays efficiently.
Example:
# Masking pattern
sensor_data = np.array([1.2, -999.0, 3.5, -999.0, 4.1])
mask = sensor_data == -999.0          # flag missing values
clean = sensor_data.copy()
clean[mask] = np.nanmedian(sensor_data[~mask])  # impute

# Stack arrays
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
print(np.vstack([a, b]))   # [[1, 2, 3], [4, 5, 6]]
print(np.hstack([a, b]))   # [1, 2, 3, 4, 5, 6]

# Save & load
np.save('array.npy', clean)
loaded = np.load('array.npy')

# Save multiple arrays
np.savez('arrays.npz', x=a, y=b)
d = np.load('arrays.npz')
print(d['x'], d['y'])

# CSV
np.savetxt('data.csv', np.random.rand(5, 3),
           delimiter=",", header="A,B,C", comments="")

Block 10: Week 2 Review: Stock Simulation Project
We build the week's capstone: a Monte Carlo stock price simulation using Geometric Brownian Motion — a model used in quantitative finance. All computation is vectorized.
Example:
# Monte Carlo Stock Price Simulation (GBM)
import numpy as np

rng = np.random.default_rng(42)

S0    = 100.0    # initial price
mu    = 0.10     # annual return (10%)
sigma = 0.20     # annual volatility (20%)
T     = 1.0      # 1 year
N     = 252      # trading days
paths = 10_000   # number of simulations
dt = T / N

# Random shocks: shape (N, paths)
Z = rng.standard_normal((N, paths))

# Daily returns via GBM formula
daily_ret = np.exp((mu - 0.5*sigma**2)*dt + sigma*np.sqrt(dt)*Z)

# Price paths (cumulative product, prepend S0)
price_paths = S0 * np.cumprod(daily_ret, axis=0)
price_paths = np.vstack([np.full(paths, S0), price_paths])

# Statistics at maturity
final_prices = price_paths[-1, :]
print(f"Mean final price : ${final_prices.mean():.2f}")
print(f"Prob(profit>20%) : {(final_prices > 120).mean():.2%}")
print(f"5th percentile   : ${np.percentile(final_prices, 5):.2f}")

Week 2 Quick Quiz
What is the difference between shape (5,) and shape (5,1)?
Why is a @ b preferred over np.dot(a, b) for matrix multiplication?
Explain the 3 broadcasting rules with an example.
Write a one-line NumPy expression to compute the column-wise standard deviation of a 2D array.
What does np.linalg.solve() do and why is it better than computing the inverse?


WEEK 3: PANDAS – DATAFRAMES & DATA HANDLING
PART I — FOUNDATIONS | Series, DataFrame, loading, cleaning, grouping, merging

Week Overview
Learning Goals
Create and navigate Series and DataFrame
Load data from CSV, Excel, JSON
Clean messy data (nulls, dtypes, duplicates)
Filter, sort, and transform data
Group and aggregate with groupby()
Merge and join multiple datasets

Prerequisites
Week 1 complete
Basic NumPy familiarity

🎯 Week Deliverable
A complete data-cleaning and analysis pipeline on a real-world dataset (e.g., NYC taxi rides or sales records) producing a grouped summary and exported report.

Pandas in Context
Pandas is built on top of NumPy and provides two core data structures: Series (1-D labeled array) and DataFrame (2-D labeled table). Think of a DataFrame as a programmable spreadsheet with SQL-like operations. It is the central tool for all tabular data work in Python.

Day 1
Block 1: Series & DataFrame Basics
A Series is a one-dimensional array with an index (labels). A DataFrame is a collection of Series sharing the same index. Understanding the index is fundamental — it is what makes Pandas different from NumPy.
Example:
import pandas as pd
import numpy as np

# Series
s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
print(s['b'])       # 20  — label-based access
print(s.iloc[1])    # 20  — positional access
print(s[s > 15])    # b:20, c:30, d:40

# DataFrame from dict
df = pd.DataFrame({
    'name':   ['Alice', 'Bob', 'Carol', 'Dave'],
    'age':    [25, 30, 22, 35],
    'salary': [70000, 85000, 62000, 90000],
    'dept':   ['Eng', 'Mkt', 'Eng', 'HR']
})
print(df.head(2))      # first 2 rows
df.info()              # dtypes & null counts (prints directly)
print(df.describe())   # statistical summary
print(df.shape)        # (4, 4)

Block 2: Selecting Data: loc, iloc, and Boolean Masks
Pandas offers three main ways to select data: attribute access (df.col), loc (label-based), and iloc (integer-based). Boolean masks work exactly like NumPy — a boolean Series filters rows.
Example:
# Column selection
df['name']              # Series
df[['name', 'salary']]  # DataFrame (list of cols)

# .loc[rows, cols] — label-based
df.loc[0, 'name']       # 'Alice'
df.loc[0:2, ['name', 'age']]  # rows 0, 1, 2 (inclusive!)

# .iloc[rows, cols] — integer-based (exclusive end)
df.iloc[0, 0]           # 'Alice'
df.iloc[0:2, 0:2]       # rows 0, 1  cols 0, 1

# Boolean masking
eng = df[df['dept'] == 'Eng']       # filter rows
senior = df[(df['age'] > 25) & (df['salary'] > 80000)]

# isin — filter by list
selected = df[df['dept'].isin(['Eng', 'HR'])]

# query() — SQL-like string syntax
result = df.query('age > 25 and salary > 70000')

Day 2
Block 3: Loading Data from Files
Pandas can load almost any file format. The most common are CSV, Excel, and JSON. Always inspect the loaded DataFrame with .head(), .info(), and .describe() immediately after loading.
Example:
# CSV — most common
df = pd.read_csv('data.csv')
df = pd.read_csv(
    'data.csv',
    parse_dates=['order_date'],  # auto-parse dates
    index_col='id',              # use id as index
    dtype={'price': float},
    na_values=['N/A', '-', '']   # custom null markers
)

# Excel
df = pd.read_excel('report.xlsx', sheet_name='Sheet1')

# JSON (records format)
df = pd.read_json('api_response.json')

# From URL
url = 'https://example.com/data.csv'
df  = pd.read_csv(url)

# Quick inspection
print(df.shape)          # rows, cols
print(df.dtypes)         # column data types
print(df.isnull().sum()) # nulls per column
print(df.duplicated().sum())  # duplicate rows

Block 4: Data Cleaning: Nulls, Dtypes & Duplicates
Real data is messy. Data cleaning is typically 60–80% of a data scientist's time. Core operations: handle missing values, fix incorrect data types, remove or resolve duplicate records.
Example:
# Inspect missing data
missing_pct = df.isnull().mean() * 100
print(missing_pct[missing_pct > 0])

# Drop rows/columns with too many nulls
df.dropna(subset=['price', 'date'], inplace=True)     # must have these
df.dropna(axis=1, thresh=len(df)*0.5, inplace=True)   # drop cols >50% null

# Fill missing values (assign back — chained inplace fills are deprecated)
df['age'] = df['age'].fillna(df['age'].median())
df['category'] = df['category'].fillna('Unknown')
df['price'] = df['price'].ffill()   # forward fill

# Fix dtypes
df['price'] = pd.to_numeric(df['price'], errors='coerce')
df['date']  = pd.to_datetime(df['date'], errors='coerce')
df['zip']   = df['zip'].astype(str).str.zfill(5)

# Remove duplicates
df.drop_duplicates(subset=['customer_id', 'date'], keep='first', inplace=True)

# Rename columns (snake_case standard)
df.columns = df.columns.str.lower().str.replace(' ', '_')

Day 3
Block 5: Transforming Data: apply, map & String Operations
Pandas provides vectorized string methods (Series.str.*) and .apply()/.map() for custom transformations. Avoid explicit Python loops on DataFrames — use these instead.
Example:
# Vectorized string operations via the .str accessor
df['name'] = df['name'].str.strip().str.title()
df['email'] = df['email'].str.lower()
df['domain'] = df['email'].str.split('@').str[1]

# Extract with regex
df['year'] = df['date_str'].str.extract(r'(\d{4})')
df['valid'] = df['phone'].str.match(r'^\+?[0-9\-]{10,15}$')

# .map() — element-wise dict or function
dept_code = {'Eng': 1, 'Mkt': 2, 'HR': 3}
df['dept_id'] = df['dept'].map(dept_code)

# .apply() — row or column function
df['bonus'] = df['salary'].apply(lambda s: s * 0.1 if s > 80000 else s * 0.05)

# Row-wise apply (axis=1)
def categorize(row):
    if row['age'] < 30 and row['salary'] > 75000:
        return 'High-potential'
    return 'Standard'

df['category'] = df.apply(categorize, axis=1)

Block 6: Sorting & Ranking
Sorting in Pandas returns a new DataFrame by default. You can sort by one or multiple columns, ascending or descending, and rank values using a variety of methods.
Example:
# Sort by single column
df_sorted = df.sort_values('salary', ascending=False)

# Sort by multiple columns
df_sorted = df.sort_values(['dept', 'salary'],
                           ascending=[True, False])

# Rank
df['salary_rank'] = df['salary'].rank(ascending=False, method='dense')

# nlargest / nsmallest
top5 = df.nlargest(5, 'salary')[['name', 'salary']]
bottom5 = df.nsmallest(5, 'age')[['name', 'age']]

# Reset index after filtering/sorting
df.reset_index(drop=True, inplace=True)

Day 4
Block 7: GroupBy: Split-Apply-Combine
GroupBy is one of Pandas' most powerful features. The pattern: split the DataFrame into groups based on a column, apply an aggregation or transformation to each group, then combine the results.
Example:
# Basic groupby + aggregation
dept_stats = df.groupby('dept')['salary'].agg(['mean', 'min', 'max', 'count'])
print(dept_stats)

# Multiple named aggregations
summary = df.groupby(['dept', 'year']).agg(
    avg_salary=('salary', 'mean'),
    headcount=('name', 'count'),
    total_payroll=('salary', 'sum')
).reset_index()

# Transform — returns same-shape result (for feature engineering)
df['dept_avg'] = df.groupby('dept')['salary'].transform('mean')
df['above_avg'] = df['salary'] > df['dept_avg']

# Filter groups by size
large_depts = df.groupby('dept').filter(lambda g: len(g) >= 3)

# Apply custom function to each group
def top2(group):
    return group.nlargest(2, 'salary')

top2_per_dept = df.groupby('dept').apply(top2).reset_index(drop=True)

Block 8: Merging & Joining DataFrames
Real datasets usually come from multiple sources that need to be combined. pd.merge() implements all SQL join types. pd.concat() stacks DataFrames vertically or horizontally.
Example:
orders = pd.DataFrame({
    'order_id':    [1, 2, 3],
    'customer_id': [10, 11, 10],
    'amount':      [50, 75, 30]
})
customers = pd.DataFrame({
    'customer_id': [10, 11, 12],
    'name': ['Alice', 'Bob', 'Carol'],
    'city': ['NY', 'LA', 'NY']
})

# Inner join — only matching rows
result = pd.merge(orders, customers, on='customer_id', how='inner')

# Left join — all orders, customer info where available
result = pd.merge(orders, customers, on='customer_id', how='left')

# Right / outer join
result = pd.merge(orders, customers, on='customer_id', how='outer')

# Join on different column names (shown for a hypothetical customers
# table whose key is named 'id' rather than 'customer_id'):
# result = pd.merge(orders, customers,
#                   left_on='customer_id', right_on='id')

# Concat — stack DataFrames
combined = pd.concat([df_2023, df_2024], ignore_index=True)
side_by_side = pd.concat([df_features, df_labels], axis=1)

Day 5
Block 9: Pivot Tables & Time Series Basics
Pivot tables provide multi-dimensional aggregation in a single call β€” essentially a grouped table with column headers. Pandas has excellent time series support: resampling, rolling windows, and date-based filtering.
Example:
# Pivot table
pivot = df.pivot_table(
    values='sales',
    index='region',
    columns='quarter',
    aggfunc='sum',
    fill_value=0)
# Time series β€” set datetime index
ts = df.set_index('date').sort_index()
# Resample to monthly
monthly = ts['sales'].resample('ME').sum()
# Rolling 7-day window
ts['rolling_avg'] = ts['sales'].rolling(window=7, min_periods=1).mean()
# Shift (lag features)
ts['sales_lag1'] = ts['sales'].shift(1)
# Date-based slicing
q1 = ts.loc['2024-01':'2024-03']

Block 10: Week 3 Capstone: Sales Analysis Pipeline
We build a complete end-to-end analysis: load β†’ clean β†’ enrich β†’ aggregate β†’ export.
Example:
import pandas as pd
# 1. Load
df = pd.read_csv('sales.csv', parse_dates=['order_date'])
# 2. Clean
df.dropna(subset=['product', 'quantity', 'price'], inplace=True)
df['quantity'] = pd.to_numeric(df['quantity'], errors='coerce')
df['price']    = pd.to_numeric(df['price'],    errors='coerce')
df.dropna(inplace=True)
# 3. Enrich
df['revenue']  = df['quantity'] * df['price']
df['year']     = df['order_date'].dt.year
df['month']    = df['order_date'].dt.month
# 4. Aggregate
monthly = df.groupby(['year','month','product']).agg(
    units_sold=('quantity','sum'),
    total_revenue=('revenue','sum')
).reset_index()
monthly['cum_rev'] = monthly.groupby('product')['total_revenue'].cumsum()
# 5. Export
monthly.to_csv('monthly_summary.csv', index=False)
print(monthly.head(10).to_string())

Week 3 Quick Quiz
What is the difference between .loc and .iloc?
When would you use .apply(axis=1) vs a vectorized operation?
Explain the Split-Apply-Combine pattern in your own words.
What is the difference between pd.merge() and pd.concat()?
How do you resample a time series from daily to weekly frequency?


WEEK 4: VISUALIZATION: MATPLOTLIB & SEABORN
PART I β€” FOUNDATIONS | Line, bar, scatter, histogram, boxplot, subplots, visual EDA

Week Overview
Learning Goals
Create publication-quality figures with Matplotlib
Use Seaborn for statistical visualization
Choose the right chart for the data type
Customize colors, fonts, labels, and layouts
Build multi-panel dashboards with subplots
Perform visual exploratory data analysis (EDA)

Prerequisites
Weeks 1–3 complete

🎯 Week Deliverable
An EDA report: 6-panel visualization dashboard for a real dataset, covering distribution, correlation, trend, and comparison β€” exported as a PNG.

Day 1
Block 1: Matplotlib Architecture & Basic Plots
Matplotlib has two interfaces: pyplot (quick/procedural) and OO (Figure/Axes, recommended for control). A Figure is the canvas; Axes is a single plot panel. Always use the OO interface for anything beyond a quick sketch.
Example:
import matplotlib.pyplot as plt
import numpy as np
# OO interface (recommended)
fig, ax = plt.subplots(figsize=(8, 5))
x = np.linspace(0, 2*np.pi, 300)
ax.plot(x, np.sin(x), label='sin(x)', color='steelblue', linewidth=2)
ax.plot(x, np.cos(x), label='cos(x)', color='coral',
        linewidth=2, linestyle='--')
ax.set_title('Trigonometric Functions', fontsize=14, fontweight='bold')
ax.set_xlabel('x (radians)')
ax.set_ylabel('Amplitude')
ax.legend(framealpha=0.9)
ax.grid(True, alpha=0.3)
ax.set_xlim(0, 2*np.pi)
plt.tight_layout()
plt.savefig('trig.png', dpi=150, bbox_inches='tight')
plt.show()

Block 2: Bar Charts & Histograms
Bar charts compare discrete categories. Histograms show the distribution of a continuous variable. Both are among the most commonly used chart types in data analysis.
Example:
import pandas as pd
# Bar chart β€” category comparison
categories = ['Q1', 'Q2', 'Q3', 'Q4']
values     = [120, 150, 135, 180]
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
axes[0].bar(categories, values,
            color=['#3B82F6','#10B981','#F59E0B','#EF4444'],
            edgecolor='white', linewidth=0.8)
axes[0].set_title('Quarterly Revenue')
axes[0].set_ylabel('Revenue ($k)')
for i, v in enumerate(values):
    axes[0].text(i, v+2, str(v), ha='center', fontweight='bold')
# Histogram β€” distribution
data = np.random.normal(100, 15, 1000)
axes[1].hist(data, bins=30, color='steelblue', edgecolor='white',
             alpha=0.85, density=True)
axes[1].set_title('Score Distribution')
axes[1].set_xlabel('Score')
axes[1].set_ylabel('Density')
plt.tight_layout()
plt.show()

Day 2
Block 3: Scatter Plots & Bubble Charts
Scatter plots reveal relationships between two continuous variables. Adding a third variable as point size (bubble chart) or color encodes additional information in a single chart.
Example:
rng = np.random.default_rng(42)
n = 100
age    = rng.integers(22, 65, n)
salary = 30000 + age * 1200 + rng.normal(0, 8000, n)
exp    = age - 22 + rng.integers(0, 3, n)
dept   = rng.choice(['Eng','Mkt','HR','Sales'], n)
fig, ax = plt.subplots(figsize=(9, 6))
colors = {'Eng':'#3B82F6','Mkt':'#10B981','HR':'#F59E0B','Sales':'#EF4444'}
for d in ['Eng','Mkt','HR','Sales']:
    mask = dept == d
    ax.scatter(age[mask], salary[mask],
               s=exp[mask]*10,               # size = experience
               c=colors[d], label=d,
               alpha=0.7, edgecolors='white', linewidth=0.5)
ax.set_xlabel('Age (years)')
ax.set_ylabel('Salary ($)')
ax.set_title('Salary vs Age by Department\n(bubble size = years experience)')
ax.legend(title='Dept')
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

Block 4: Seaborn: Statistical Visualization
Seaborn is built on Matplotlib and adds statistical chart types with minimal code: heatmaps, violin plots, pair plots, and regression plots. It also integrates directly with Pandas DataFrames.
Example:
import seaborn as sns
sns.set_theme(style='whitegrid', palette='muted', font_scale=1.1)
# Load built-in dataset
tips = sns.load_dataset('tips')
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
# Distribution with KDE
sns.histplot(data=tips, x='total_bill', hue='sex', kde=True,
             ax=axes[0,0])
axes[0,0].set_title('Bill Distribution by Sex')
# Box + strip plot
sns.boxplot(data=tips, x='day', y='total_bill', palette='Set2', ax=axes[0,1])
axes[0,1].set_title('Bills by Day')
# Violin plot
sns.violinplot(data=tips, x='day', y='tip', hue='sex',
               split=True, ax=axes[1,0])
axes[1,0].set_title('Tip Distribution')
# Scatter with regression line
sns.regplot(data=tips, x='total_bill', y='tip', ax=axes[1,1],
            scatter_kws={'alpha':0.5}, line_kws={'color':'red'})
axes[1,1].set_title('Bill vs Tip (with regression)')
plt.suptitle('Tips Dataset EDA', fontsize=16, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

Day 3
Block 5: Correlation Heatmaps & Pair Plots
Heatmaps display a matrix of values using color, most commonly used for correlation matrices. Pair plots show every pairwise relationship in a dataset β€” essential for initial EDA on multidimensional data.
Example:
# Correlation heatmap
iris = sns.load_dataset('iris')
corr = iris.drop('species', axis=1).corr()
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='RdBu_r',
            center=0, vmin=-1, vmax=1, square=True,
            ax=axes[0], cbar_kws={'shrink':0.8})
axes[0].set_title('Iris Feature Correlations')
# Clustermap β€” reorders rows/cols by similarity
# sns.clustermap(corr, annot=True, cmap='RdBu_r')
# Pair plot (all pairwise scatter plots)
pair_fig = sns.pairplot(iris, hue='species', diag_kind='kde',
                        plot_kws={'alpha':0.6})
pair_fig.fig.suptitle('Iris Pair Plot', y=1.02)
pair_fig.savefig('pairplot.png', dpi=120, bbox_inches='tight')
plt.show()

Block 6: Customization: Themes, Colors & Annotations
Professional charts require careful attention to color palettes, typography, and annotations. Matplotlib's rcParams and Seaborn's themes let you define a house style. Annotations draw attention to key data points.
Example:
# Set global style
plt.rcParams.update({
    'font.family':     'sans-serif',
    'font.size':       11,
    'axes.titlesize':  13,
    'axes.titleweight':'bold',
    'axes.spines.top':   False,
    'axes.spines.right': False,
})
x = np.array([2018, 2019, 2020, 2021, 2022, 2023, 2024])
y = np.array([120, 145, 138, 170, 195, 220, 248])
fig, ax = plt.subplots(figsize=(9, 5))
ax.plot(x, y, marker='o', color='#2563EB', linewidth=2.5, markersize=8)
ax.fill_between(x, y, alpha=0.1, color='#2563EB')
# Annotate a specific point
ax.annotate('COVID dip', xy=(2020, 138), xytext=(2020.3, 125),
            arrowprops=dict(arrowstyle='->', color='gray'),
            color='gray', fontsize=10)
ax.set_title('Annual Revenue Growth')
ax.set_ylabel('Revenue ($M)')
ax.yaxis.set_major_formatter(plt.FuncFormatter(lambda v, _: f'${v:.0f}M'))
plt.tight_layout()

Day 4
Block 7: Subplots & Multi-Panel Dashboards
Real reports require multiple panels. plt.subplots() creates a grid; GridSpec enables irregular layouts. This skill is essential for building EDA dashboards.
Example:
from matplotlib.gridspec import GridSpec
data = np.random.normal(0, 1, 1000)
fig = plt.figure(figsize=(14, 8))
gs  = GridSpec(2, 3, figure=fig, hspace=0.4, wspace=0.35)
ax1 = fig.add_subplot(gs[0, :2])  # top-left, 2 cols wide
ax2 = fig.add_subplot(gs[0, 2])   # top-right
ax3 = fig.add_subplot(gs[1, 0])   # bottom-left
ax4 = fig.add_subplot(gs[1, 1])   # bottom-middle
ax5 = fig.add_subplot(gs[1, 2])   # bottom-right
ax1.plot(data[:200], color='steelblue', alpha=0.7)
ax1.set_title('Time Series (first 200 pts)')
ax2.hist(data, bins=30, color='coral', edgecolor='white')
ax2.set_title('Distribution')
ax3.boxplot(data, vert=False, patch_artist=True)
ax3.set_title('Boxplot')
ax4.scatter(range(len(data[:100])), data[:100], s=10, alpha=0.5)
ax4.set_title('Scatter')
ax5.bar(['Min','Mean','Max'], [data.min(), data.mean(), data.max()],
        color=['#EF4444','#3B82F6','#10B981'])
ax5.set_title('Stats')
fig.suptitle('EDA Dashboard', fontsize=16, fontweight='bold')
plt.savefig('dashboard.png', dpi=150, bbox_inches='tight')

Block 8: Choosing the Right Chart
Visualization is communication. The wrong chart misleads even with correct data. This block covers the chart selection framework: distribution, comparison, relationship, composition, and change-over-time.
Purpose | Chart Types | Question Answered | Example
Distribution | Histogram, KDE, Violin, Boxplot | How are values spread? | np.random.normal() data
Comparison | Bar, Grouped Bar, Dot plot | How do categories differ? | Sales by department
Relationship | Scatter, Bubble, Heatmap | How do variables co-vary? | Age vs. salary
Composition | Pie, Stacked Bar, Treemap | What are the parts of a whole? | Market share
Trend over Time | Line, Area, Step | How does a value change? | Stock price series
⚠️ Common Pitfall
Never use pie charts for more than 5 categories or when values are similar in size β€” readers cannot accurately compare slice angles. Use a bar chart instead.
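This pitfall can be demonstrated directly. A minimal sketch with made-up share data: eight similar-sized categories, exactly the case where slice angles become unreadable and a sorted horizontal bar chart works:

```python
import matplotlib.pyplot as plt

# Hypothetical market-share data: 8 similar-sized categories
labels = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']
shares = [16, 15, 14, 13, 12, 11, 10, 9]

fig, ax = plt.subplots(figsize=(7, 4))
# Reverse so the largest bar sits on top
ax.barh(labels[::-1], shares[::-1], color='steelblue')
ax.set_xlabel('Market share (%)')
ax.set_title('Sorted bars: differences readable at a glance')
plt.tight_layout()
plt.savefig('share_bar.png', dpi=120)
```

A pie of the same data would force readers to compare 15% vs 14% slices by eye; the bars make the ordering and gaps explicit.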

Day 5
Block 9: Interactive Plots with Plotly Express
For exploratory work and web dashboards, interactive charts allow zoom, hover, and filter. Plotly Express creates interactive charts with one-liners and works seamlessly in Jupyter.
Example:
import plotly.express as px
# Interactive scatter
df = px.data.gapminder().query('year == 2007')
fig = px.scatter(df, x='gdpPercap', y='lifeExp',
                 size='pop', color='continent',
                 hover_name='country', log_x=True,
                 title='GDP vs Life Expectancy (2007)')
fig.show()
# Animated time series
fig2 = px.scatter(px.data.gapminder(),
                  x='gdpPercap', y='lifeExp',
                  animation_frame='year',
                  animation_group='country',
                  size='pop', color='continent',
                  hover_name='country', log_x=True)
fig2.show()
# Interactive bar
fig3 = px.bar(df.nlargest(10, 'pop'), x='country', y='pop',
              color='continent', title='Top 10 Countries by Population')
fig3.show()

Block 10: Week 4 Capstone: EDA Dashboard
Produce a 6-panel EDA dashboard on the tips dataset using both Matplotlib and Seaborn, with annotations, a consistent color palette, and saved to high-resolution PNG.
Example:
import seaborn as sns, matplotlib.pyplot as plt, numpy as np
sns.set_theme(style='whitegrid', palette='muted')
tips = sns.load_dataset('tips')
fig, axes = plt.subplots(2, 3, figsize=(15, 9))
fig.suptitle('Tips Dataset β€” Exploratory Analysis',
             fontsize=16, fontweight='bold', y=0.98)
# 1. Distribution of total bill
sns.histplot(tips['total_bill'], kde=True, ax=axes[0,0], color='steelblue')
axes[0,0].set_title('Bill Distribution')
# 2. Tip % by day
tips['tip_pct'] = tips['tip'] / tips['total_bill'] * 100
sns.boxplot(data=tips, x='day', y='tip_pct', ax=axes[0,1], palette='Set2')
axes[0,1].set_title('Tip % by Day')
# 3. Heatmap: corr
corr = tips[['total_bill','tip','size']].corr()
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm',
            ax=axes[0,2], square=True)
axes[0,2].set_title('Correlation')
# 4. Scatter bill vs tip
sns.scatterplot(data=tips, x='total_bill', y='tip', hue='sex',
                style='smoker', ax=axes[1,0])
axes[1,0].set_title('Bill vs Tip')
# 5. Count by day & time
ct = tips.groupby(['day','time'])['tip'].count().reset_index()
ct = ct.pivot(index='day', columns='time', values='tip')
ct.plot(kind='bar', ax=axes[1,1], colormap='tab10')
axes[1,1].set_title('Visits by Day & Time')
# 6. Average tip by party size
tips.groupby('size')['tip'].mean().plot(ax=axes[1,2],
    marker='o', color='coral', linewidth=2)
axes[1,2].set_title('Avg Tip by Party Size')
plt.tight_layout()
plt.savefig('eda_dashboard.png', dpi=150, bbox_inches='tight')
plt.show()



WEEK 5: FILE FORMATS & DATA PIPELINES
PART II β€” WORKING WITH DATA & WEB | CSV, JSON, XML, Excel, text parsing, regex, cleaning pipelines

Day 1: CSV & TSV Deep Dive
Beyond pd.read_csv: handle encoding issues, malformed rows, chunking large files, and custom delimiters using the csv module for more control.
Key Example:
import csv, pandas as pd
# Read in chunks for large files
chunks = pd.read_csv('big.csv', chunksize=10_000)
result = pd.concat([chunk[chunk['status'] == 'active'] for chunk in chunks])
# Custom CSV with csv module
with open('out.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['id','name','score'])
    writer.writeheader()
    writer.writerows([{'id':1, 'name':'Alice', 'score':95}])

Day 2: JSON β€” Nested & Semi-structured Data
JSON APIs often return nested structures. Use the json module for raw parsing, Pandas' json_normalize() to flatten nested objects, and jmespath for deep queries.
Key Example:
import json, pandas as pd
from pandas import json_normalize

with open('api.json') as f:
    data = json.load(f)
# Flatten nested JSON
df = json_normalize(data['results'],
    record_path=['orders'],
    meta=['customer_id','customer_name'],
    errors='ignore')
print(df.head())

Day 3: XML & HTML Parsing
XML is still prevalent in legacy enterprise systems and government data. Use xml.etree.ElementTree for lightweight parsing, and lxml for XPath queries and large documents.
Key Example:
import xml.etree.ElementTree as ET
tree = ET.parse('catalog.xml')
root = tree.getroot()
records = []
for item in root.findall('book'):
    records.append({'title': item.find('title').text,
                    'price': float(item.find('price').text)})
import pandas as pd
df = pd.DataFrame(records)

Day 4: Regular Expressions for Data Cleaning
Regex is indispensable for pattern matching in text data. The re module provides compile, search, match, findall, sub, and split. Always compile patterns you reuse.
Key Example:
import re
# Extract phone numbers
pattern = re.compile(r'\b(\+?\d[\d\s\-\.]{8,14}\d)\b')
phones = pattern.findall(raw_text)
# Clean HTML tags
clean = re.sub(r'<[^>]+>', '', html_str)
# Validate email
is_valid = bool(re.match(r'^[\w.+-]+@[\w-]+\.[\w.]+$', email))
# Named groups
m = re.match(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})', '2025-06-01')
print(m.group('year'), m.group('month'))

Day 5: Building a Cleaning Pipeline with Functions
A cleaning pipeline chains transformation steps, making the process reproducible and testable. Use a list of functions applied in sequence β€” the functional pipeline pattern.
Key Example:
def clean_pipeline(df, steps):
    for step in steps:
        df = step(df)
    return df

def drop_nulls(df):
    return df.dropna(subset=['id','date'])

def fix_dtypes(df):
    df['amount'] = pd.to_numeric(df['amount'], errors='coerce')
    df['date']   = pd.to_datetime(df['date'], errors='coerce')
    return df

def remove_dupes(df):
    return df.drop_duplicates(subset=['id'])

clean_df = clean_pipeline(raw_df, [drop_nulls, fix_dtypes, remove_dupes])

Week Quick Quiz
How do you read a 10 GB CSV file without running out of memory?
What does json_normalize() do for deeply nested JSON?
Write a regex that extracts all valid email addresses from a string.
Explain the difference between re.search() and re.match().
What is the functional pipeline pattern and why is it useful for data cleaning?

WEEK 6: HTTP, APIS & WEB SCRAPING
PART II β€” WORKING WITH DATA & WEB | requests, REST APIs, JSON handling, BeautifulSoup, ethics

Day 1: HTTP Basics & the requests Library
Every web API uses HTTP. Understand GET, POST, PUT, DELETE methods, status codes (200, 201, 400, 401, 403, 404, 500), headers, and query parameters. The requests library is the de facto Python HTTP client.
Key Example:
import requests
# GET request with query params
resp = requests.get('https://api.github.com/search/repositories',
    params={'q': 'python data science', 'sort': 'stars'},
    headers={'Accept': 'application/vnd.github.v3+json'},
    timeout=10)
resp.raise_for_status()  # raises HTTPError on 4xx/5xx
data = resp.json()
for repo in data['items'][:3]:
    print(repo['full_name'], repo['stargazers_count'])

Day 2: REST API Authentication
Most production APIs require authentication. Common mechanisms: API keys in headers, Bearer tokens (OAuth2), and Basic Auth. Never hard-code secrets β€” use environment variables.
Key Example:
import os, requests
# API key from environment variable
API_KEY = os.environ['MY_API_KEY']
headers = {'Authorization': f'Bearer {API_KEY}'}
resp = requests.get('https://api.example.com/data', headers=headers)
# Session β€” reuse connection and headers
with requests.Session() as session:
    session.headers.update({'Authorization': f'Bearer {API_KEY}'})
    for page in range(1, 5):
        r = session.get('https://api.example.com/items', params={'page': page})
        items = r.json()['items']

Day 3: Pagination & Rate Limiting
Real APIs paginate large datasets. Patterns: page/limit, cursor-based, and link headers. Always respect rate limits β€” add sleep intervals and handle 429 Too Many Requests errors.
Key Example:
import time, requests

def fetch_all_pages(base_url, params, page_key='page', delay=0.5):
    all_items, page = [], 1
    while True:
        resp = requests.get(base_url, params={**params, page_key: page})
        data = resp.json()
        items = data.get('results', [])
        if not items:
            break
        all_items.extend(items)
        page += 1
        time.sleep(delay)  # be polite
    return all_items

Day 4: Web Scraping with BeautifulSoup
When no API is available, HTML scraping extracts data from web pages. BeautifulSoup parses HTML into a tree and provides find/find_all/select (CSS selector) methods. Always check robots.txt first.
Key Example:
import requests
from bs4 import BeautifulSoup
url  = 'https://books.toscrape.com'
resp = requests.get(url, timeout=10)
soup = BeautifulSoup(resp.text, 'lxml')
# CSS selector
books = []
for article in soup.select('article.product_pod'):
    title  = article.select_one('h3 a')['title']
    price  = float(article.select_one('.price_color').text[1:])
    rating = article.select_one('p.star-rating')['class'][1]
    books.append({'title':title, 'price':price, 'rating':rating})
import pandas as pd
df = pd.DataFrame(books)

Day 5: Ethical Scraping & Caching
Web scraping has legal and ethical dimensions. Always: read robots.txt, add delays, identify your bot in User-Agent, cache responses to avoid re-fetching, and never overwhelm a server.
Key Example:
# Cache responses to disk
import requests
import requests_cache
requests_cache.install_cache('scrape_cache', expire_after=3600)
# Now all requests are cached for 1 hour
resp = requests.get('https://example.com')
print(resp.from_cache)   # True on 2nd call
⚠️ Ethics & Legality
Check robots.txt, respect Crawl-delay directives, do not scrape personal data without permission, and review the site's Terms of Service. Some jurisdictions have explicit computer access laws.
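The robots.txt check above can be automated with the standard library's urllib.robotparser. A minimal sketch (the robots.txt content and bot name here are made up for illustration; in practice you would call rp.set_url(...) and rp.read() against the live site):

```python
from urllib.robotparser import RobotFileParser

# Inline robots.txt for a self-contained demo
robots_txt = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""
rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

bot = 'MyCourseBot/1.0'
# Check whether specific URLs may be fetched
print(rp.can_fetch(bot, 'https://example.com/catalogue/'))    # True
print(rp.can_fetch(bot, 'https://example.com/private/data'))  # False
# Honor the Crawl-delay directive (seconds), if present
print(rp.crawl_delay(bot))                                    # 5
```

Calling can_fetch() before every request, and sleeping for crawl_delay() seconds between requests, satisfies the first two ethics rules mechanically.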

Week Quick Quiz
What HTTP status codes indicate client errors vs server errors?
How do you pass query parameters with requests.get()?
Why should you never hard-code API keys in your source code?
What is the difference between BeautifulSoup's .find() and .select()?
List three ethical guidelines for web scraping.

WEEK 7: WEB FRAMEWORKS: FLASK & FASTAPI
PART II β€” WORKING WITH DATA & WEB | Routing, templates, REST APIs, Pydantic, async basics

Day 1: Flask: Routes, Methods & Templates
Flask is a micro web framework: minimal, flexible, and approachable. Define routes with @app.route(), render HTML templates with Jinja2, and handle GET/POST requests.
Key Example:
from flask import Flask, request, render_template, jsonify
app = Flask(__name__)

@app.route('/')
def home():
    return '<h1>Hello Flask!</h1>'

@app.route('/greet/<name>')
def greet(name):
    return f'Hello, {name}!'

@app.route('/api/square', methods=['GET'])
def square():
    n = int(request.args.get('n', 0))
    return jsonify({'input': n, 'result': n**2})

if __name__ == '__main__':
    app.run(debug=True)

Day 2: Flask: Forms, Validation & Blueprints
Flask-WTF handles form validation. Blueprints split large applications into modules. This pattern is essential for any real Flask application beyond a toy example.
Key Example:
# auth.py
from flask import Blueprint
auth_bp = Blueprint('auth', __name__, url_prefix='/auth')

@auth_bp.route('/login', methods=['POST'])
def login(): ...

# app.py
from auth import auth_bp
app.register_blueprint(auth_bp)

Day 3: FastAPI: Typed APIs & Pydantic
FastAPI is modern, async-first, and auto-generates OpenAPI documentation. Pydantic models provide request/response validation with Python type hints, catching bugs at the API boundary.
Key Example:
from fastapi import FastAPI
from pydantic import BaseModel, Field
app = FastAPI()

class Item(BaseModel):
    name: str
    price: float = Field(..., gt=0)
    in_stock: bool = True

@app.post('/items/', response_model=Item)
async def create_item(item: Item):
    return item

@app.get('/items/{item_id}')
async def get_item(item_id: int, q: str | None = None):
    return {'id': item_id, 'q': q}

# Run: uvicorn main:app --reload

Day 4: FastAPI: Dependencies, Auth & Background Tasks
FastAPI's dependency injection system (Depends) is a clean way to handle authentication, database sessions, and shared logic. Background tasks let you offload slow work after returning a response.
Key Example:
from fastapi import Depends, HTTPException, BackgroundTasks
from fastapi.security import OAuth2PasswordBearer
oauth2 = OAuth2PasswordBearer(tokenUrl='token')

async def get_current_user(token: str = Depends(oauth2)):
    if token != 'valid':
        raise HTTPException(401, 'Invalid token')
    return {'user': 'alice'}

@app.get('/protected')
async def protected(user=Depends(get_current_user)):
    return {'message': f'Hello {user["user"]}'}

def send_email(email: str):
    print(f'Sending to {email}')

@app.post('/register')
async def register(bg: BackgroundTasks, email: str):
    bg.add_task(send_email, email)
    return {'status': 'registered'}

Day 5: Deploying Flask & FastAPI Apps
Both frameworks can be deployed on cloud platforms. Containerization with Docker is the standard. Key concepts: WSGI vs ASGI, gunicorn/uvicorn workers, environment variables for config.
Key Example:
# Dockerfile for FastAPI
# FROM python:3.11-slim
# WORKDIR /app
# COPY requirements.txt .
# RUN pip install -r requirements.txt
# COPY . .
# CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "80"]

# gunicorn for Flask production
# gunicorn -w 4 -b 0.0.0.0:8000 app:app

# Test your API with httpx
from fastapi.testclient import TestClient
client = TestClient(app)
resp = client.get('/items/1?q=test')
assert resp.status_code == 200

Week Quick Quiz
What is the difference between Flask and FastAPI?
What does Pydantic's Field(gt=0) constraint do?
Explain FastAPI's dependency injection with an example.
What is the difference between WSGI and ASGI?
How do you auto-document a FastAPI app?

WEEK 8: MACHINE LEARNING WITH SCIKIT-LEARN
PART II β€” WORKING WITH DATA & WEB | ML workflow, preprocessing, classification, regression, pipelines, evaluation

Day 1: The ML Workflow
Machine learning follows a repeatable workflow: problem framing β†’ data collection β†’ EDA β†’ preprocessing β†’ model selection β†’ training β†’ evaluation β†’ iteration β†’ deployment. scikit-learn provides a consistent API across all steps.
Key Example:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd
df = pd.read_csv('dataset.csv')
X = df.drop('target', axis=1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)  # fit on train only!
X_test_s  = scaler.transform(X_test)       # transform test
⚠️ Data Leakage
Always fit preprocessors (scalers, encoders) on training data ONLY. Applying fit_transform to the entire dataset leaks information about the test set into training β€” a fundamental error that inflates metrics.

Day 2: Classification: Logistic Regression & Decision Trees
Classification predicts a discrete label. Logistic Regression is interpretable and fast. Decision Trees are non-linear and intuitive. Both follow sklearn's fit/predict/score interface.
Key Example:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix
# Logistic Regression
lr = LogisticRegression(max_iter=1000, random_state=42)
lr.fit(X_train_s, y_train)
y_pred = lr.predict(X_test_s)
print(classification_report(y_test, y_pred))
# Decision Tree
dt = DecisionTreeClassifier(max_depth=5, random_state=42)
dt.fit(X_train, y_train)
print(dt.score(X_test, y_test))  # accuracy

Day 3: Regression & Evaluation Metrics
Regression predicts a continuous value. Key metrics: MAE, MSE, RMSE, RΒ². Ridge and Lasso add regularization to prevent overfitting.
Key Example:
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_absolute_error, r2_score, mean_squared_error
import numpy as np
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
mae  = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2   = r2_score(y_test, y_pred)
print(f'MAE: {mae:.2f}  RMSE: {rmse:.2f}  RΒ²: {r2:.3f}')
# Ridge (L2 regularization)
ridge = Ridge(alpha=1.0).fit(X_train, y_train)
print('Ridge RΒ²:', ridge.score(X_test, y_test))

Day 4: Pipelines & Cross-Validation
sklearn Pipelines chain preprocessing and modeling into a single object, preventing leakage and simplifying code. Cross-validation gives a robust estimate of model performance.
Key Example:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
num_pipe = Pipeline([('scaler', StandardScaler())])
cat_pipe = Pipeline([('ohe', OneHotEncoder(handle_unknown='ignore'))])
preprocessor = ColumnTransformer([
    ('num', num_pipe, num_cols),
    ('cat', cat_pipe, cat_cols)
])
full_pipe = Pipeline([
    ('prep', preprocessor),
    ('model', RandomForestClassifier(n_estimators=100, random_state=42))
])
scores = cross_val_score(full_pipe, X, y, cv=5, scoring='accuracy')
print(f'CV Accuracy: {scores.mean():.3f} Β± {scores.std():.3f}')

Day 5: Hyperparameter Tuning & Feature Importance
Finding optimal model parameters is key to good performance. GridSearchCV and RandomizedSearchCV automate this. Feature importance reveals which variables drive predictions.
Key Example:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint
param_dist = {'model__n_estimators': randint(50, 300),
              'model__max_depth': [None, 5, 10, 20]}
search = RandomizedSearchCV(full_pipe, param_dist,
    n_iter=20, cv=5, scoring='accuracy', n_jobs=-1, random_state=42)
search.fit(X_train, y_train)
print('Best params:', search.best_params_)
print(f'Best CV score: {search.best_score_:.3f}')
best_model = search.best_estimator_
# Feature importances
rf = best_model.named_steps['model']
importances = pd.Series(rf.feature_importances_, index=feature_names)
importances.nlargest(10).plot(kind='barh')

Week Quick Quiz
What is data leakage and how do sklearn Pipelines prevent it?
Explain the difference between precision and recall.
What is cross-validation and why is it more reliable than a single train/test split?
When would you choose Ridge over Lasso regularization?
What does a confusion matrix tell you that accuracy alone cannot?

WEEK 9: SCIENTIFIC PYTHON: SCIPY & SYMPY
PART III β€” SCIENCE, NLP & VISION | Optimization, integration, statistics, symbolic math, equation solving

Day 1: SciPy Overview & Optimization
SciPy extends NumPy with scientific algorithms organized in submodules: scipy.optimize (minimization, root-finding), scipy.integrate (numerical integration), scipy.stats (distributions, tests), scipy.interpolate, scipy.signal, scipy.linalg.
Key Example:
from scipy.optimize import minimize, brentq
import numpy as np
# Minimize a function
def rosenbrock(x):
    return (1-x[0])**2 + 100*(x[1]-x[0]**2)**2
result = minimize(rosenbrock, x0=[0,0], method='BFGS')
print(result.x)   # [1.0, 1.0]  β€” true minimum
# Find root of f(x)=0
f = lambda x: x**3 - 2*x - 5
root = brentq(f, 2, 3)         # bracket [2,3]
print(f'Root: {root:.6f}')     # 2.094551

Day 2: Numerical Integration & Interpolation
scipy.integrate provides quad() for definite integrals and solve_ivp() for ordinary differential equations. scipy.interpolate creates smooth curves through sparse data points.
Key Example:
from scipy.integrate import quad, solve_ivp
from scipy.interpolate import interp1d, CubicSpline
import numpy as np
# Definite integral of sin(x) from 0 to pi
result, error = quad(np.sin, 0, np.pi)
print(f'∫sin(x)dx = {result:.6f}')  # 2.0
# ODE: dy/dt = -2y  (exponential decay)
sol = solve_ivp(lambda t, y: -2*y, t_span=(0,5), y0=[1.0],
                t_eval=np.linspace(0,5,100))
# Cubic spline interpolation
x_sparse = np.array([0, 1, 2, 3, 4])
y_sparse = np.array([0, 1, 0.5, 2, 1.5])
cs = CubicSpline(x_sparse, y_sparse)
x_fine = np.linspace(0, 4, 100)
y_fine = cs(x_fine)  # smooth curve

Day 3: SymPy: Symbolic Mathematics
SymPy performs exact symbolic algebra: differentiation, integration, simplification, solving equations, and matrix algebra β€” without floating-point rounding errors.
Key Example:
import sympy as sp
x, y, t = sp.symbols('x y t')
# Symbolic differentiation
f = sp.sin(x) * sp.exp(-x**2)
df = sp.diff(f, x)
print(sp.simplify(df))
# Symbolic integration
integral = sp.integrate(sp.exp(-x**2), (x, -sp.oo, sp.oo))
print(integral)   # sqrt(pi)
# Solve equation
solutions = sp.solve(x**3 - 6*x + 2, x)
print(solutions)
# System of equations
eqs = [x + y - 5, 2*x - y - 4]
print(sp.solve(eqs, [x, y]))  # {x: 3, y: 2}
# Expand / factor / simplify
expr = (x + 1)**6
print(sp.expand(expr))

Day 4: Signal Processing with scipy.signal
scipy.signal provides filtering, Fourier analysis, and convolution. Essential for time-series preprocessing, audio processing, and sensor data analysis.
Key Example:
from scipy import signal
import numpy as np
t = np.linspace(0, 1, 500)
# Composite signal: 5 Hz + 50 Hz noise
x = np.sin(2*np.pi*5*t) + 0.5*np.sin(2*np.pi*50*t)
# Low-pass Butterworth filter
b, a = signal.butter(4, Wn=0.1, btype='low')
filtered = signal.filtfilt(b, a, x)
# Power spectral density (Welch method)
freqs, psd = signal.welch(x, fs=500)

Day 5: Combining NumPy, SciPy & SymPy: Physics Simulation
We build a pendulum simulation combining: SymPy (derive equations of motion), SciPy (solve ODE), NumPy (vectorize), Matplotlib (visualize).
Key Example:
# Pendulum: d²θ/dt² + (g/L)sin(θ) = 0
from scipy.integrate import solve_ivp
import numpy as np

g, L = 9.81, 1.0

def pendulum(t, y):
    return [y[1], -(g/L)*np.sin(y[0])]

# Initial: 45 degrees, at rest
theta0 = np.pi/4
sol = solve_ivp(pendulum, t_span=(0, 20),
                y0=[theta0, 0],
                t_eval=np.linspace(0, 20, 2000),
                method='RK45')

import matplotlib.pyplot as plt
plt.plot(sol.t, np.degrees(sol.y[0]))
plt.xlabel('Time (s)'); plt.ylabel('Angle (°)')
plt.title('Pendulum Motion'); plt.grid(True); plt.show()

Week Quick Quiz
What is the difference between scipy.optimize.minimize and brentq?
How does symbolic integration differ from numerical integration?
Solve the equation x² - 5x + 6 = 0 using SymPy.
What does a Butterworth filter do?
What is the Runge-Kutta (RK45) method used for?

WEEK 10: PRACTICAL STATISTICS FOR DATA SCIENCE
PART III — SCIENCE, NLP & VISION | Descriptive stats, distributions, correlation, hypothesis testing

Day 1: Descriptive Statistics & Distributions
Descriptive statistics summarize data: measures of central tendency (mean, median, mode), dispersion (variance, std, IQR), and shape (skewness, kurtosis). Understanding distributions is foundational for choosing statistical tests.
Key Example:
import scipy.stats as stats, numpy as np, pandas as pd

data = np.array([23, 45, 67, 34, 89, 55, 42, 61, 78, 49])
print(f'Mean:  {data.mean():.2f}')
print(f'Median:{np.median(data):.2f}')
print(f'Std:   {data.std():.2f}')
print(f'IQR:   {stats.iqr(data):.2f}')
print(f'Skew:  {stats.skew(data):.3f}')

# Fit a distribution
mu, std = stats.norm.fit(data)
print(f'Best-fit Normal: μ={mu:.2f}  σ={std:.2f}')

Day 2: Probability Distributions in scipy.stats
scipy.stats provides over 100 continuous and discrete distributions. Key ones: normal, t, chi2, f, binomial, poisson. Each has pdf/pmf, cdf, ppf (inverse CDF), and rvs (random samples).
Key Example:
from scipy import stats

# Normal distribution
norm = stats.norm(loc=170, scale=10)  # height in cm
print(norm.pdf(175))    # probability density at 175
print(norm.cdf(180))    # P(X ≤ 180)
print(norm.ppf(0.95))   # 95th percentile
samples = norm.rvs(size=1000)

# Binomial: P(X=k) for n=20 trials, p=0.3
binom = stats.binom(n=20, p=0.3)
print(binom.pmf(6))     # P(exactly 6 successes)
print(binom.cdf(8))     # P(at most 8 successes)

Day 3: Hypothesis Testing: t-tests & ANOVA
Hypothesis testing formalizes the question 'is this difference real or due to chance?' A t-test compares two means. ANOVA compares multiple means. Always check assumptions before applying a test.
Key Example:
from scipy import stats

# One-sample t-test: is mean != 100?
data = [102, 98, 107, 95, 110, 103]
t_stat, p_val = stats.ttest_1samp(data, popmean=100)
print(f't={t_stat:.3f}  p={p_val:.4f}',
      '→ reject H0' if p_val < 0.05 else '→ fail to reject H0')

# Two-sample t-test
groupA = [85, 90, 92, 88, 95]
groupB = [78, 82, 80, 85, 88]
t, p = stats.ttest_ind(groupA, groupB, equal_var=False)  # Welch
print(f'Two-sample: t={t:.3f}  p={p:.4f}')

# One-way ANOVA
f, p = stats.f_oneway(groupA, groupB, [70, 75, 72, 68, 74])
print(f'ANOVA: F={f:.3f}  p={p:.4f}')
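The "check assumptions" advice above can be made concrete. A minimal sketch (not part of the original example) using two checks from scipy.stats: Shapiro-Wilk for normality and Levene's test for equal variances:

```python
from scipy import stats

groupA = [85, 90, 92, 88, 95]
groupB = [78, 82, 80, 85, 88]

# Normality check: Shapiro-Wilk (H0: the sample is normally distributed)
for name, g in [('A', groupA), ('B', groupB)]:
    w, p = stats.shapiro(g)
    print(f'Group {name}: W={w:.3f}  p={p:.3f}')

# Equal-variance check: Levene's test (H0: the groups have equal variances)
stat, p = stats.levene(groupA, groupB)
print(f'Levene: stat={stat:.3f}  p={p:.3f}')
# A small Levene p-value argues for Welch's t-test (equal_var=False)
```

Failing these checks does not forbid a t-test outright, but it is the usual cue to switch to Welch's variant or to a non-parametric test.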

Day 4: Correlation, Chi-Square & Non-Parametric Tests
Pearson correlation measures linear association. Chi-square tests association between categorical variables. Non-parametric tests (Mann-Whitney U, Kruskal-Wallis) work when normality cannot be assumed.
Key Example:
from scipy import stats
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1])

# Pearson correlation
r, p = stats.pearsonr(x, y)
print(f'r={r:.3f}  p={p:.4f}')

# Spearman (non-parametric)
rho, p = stats.spearmanr(x, y)

# Chi-square test of independence
contingency = [[25, 15], [10, 50]]
chi2, p, dof, expected = stats.chi2_contingency(contingency)
print(f'chi2={chi2:.2f}  p={p:.4f}  dof={dof}')

# Mann-Whitney U (non-parametric t-test)
u, p = stats.mannwhitneyu([1, 2, 3, 4, 5], [3, 4, 5, 6, 7])
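Kruskal-Wallis, named above as the non-parametric counterpart of one-way ANOVA, follows the same calling convention. A short sketch with made-up samples (invented values, for illustration only):

```python
from scipy import stats

# Three illustrative samples; H0: all come from the same distribution
g1 = [12, 15, 14, 10, 13]
g2 = [22, 25, 19, 24, 21]   # clearly shifted upward
g3 = [11, 14, 12, 13, 15]

h, p = stats.kruskal(g1, g2, g3)
print(f'Kruskal-Wallis: H={h:.3f}  p={p:.4f}')  # small p: distributions differ
```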

Day 5: Confidence Intervals & Bootstrap
A confidence interval quantifies uncertainty in an estimate. The bootstrap method resamples the data to estimate any statistic's distribution without distributional assumptions.
Key Example:
import numpy as np
from scipy import stats

data = np.array([52, 58, 63, 49, 71, 55, 60, 48, 67, 54])

# 95% CI for the mean (using t-distribution)
n, se = len(data), stats.sem(data)
ci = stats.t.interval(0.95, df=n-1, loc=data.mean(), scale=se)
print(f'Mean: {data.mean():.1f}  95% CI: ({ci[0]:.1f}, {ci[1]:.1f})')

# Bootstrap CI (no distributional assumption)
rng = np.random.default_rng(42)
boot_means = [rng.choice(data, len(data)).mean() for _ in range(10000)]
boot_ci = np.percentile(boot_means, [2.5, 97.5])
print(f'Bootstrap 95% CI: ({boot_ci[0]:.1f}, {boot_ci[1]:.1f})')

Week Quick Quiz
What is the null hypothesis in a t-test?
When should you use a non-parametric test instead of a t-test?
Explain the p-value in plain English.
What does a chi-square test measure?
What is the bootstrap method and why is it useful?

WEEK 11: NLP: NLTK, SPACY & TRANSFORMERS
PART III — SCIENCE, NLP & VISION | Tokenization, stemming, NER, POS tagging, text classification, Hugging Face

Day 1: Text Preprocessing with NLTK
Natural Language Processing converts raw text into structured features. NLTK provides tokenization, stop word removal, stemming, and lemmatization — the standard preprocessing pipeline.
Key Example:
import nltk
nltk.download('punkt'); nltk.download('stopwords')
nltk.download('wordnet')   # required by WordNetLemmatizer
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

text = 'Natural language processing transforms raw text into insights.'

# Tokenize
words = word_tokenize(text.lower())

# Remove stop words and punctuation
stops = set(stopwords.words('english'))
clean = [w for w in words if w.isalnum() and w not in stops]

# Stemming vs Lemmatization
ps = PorterStemmer(); lm = WordNetLemmatizer()
print([ps.stem(w) for w in clean])        # aggressive, lossy
print([lm.lemmatize(w) for w in clean])   # dictionary form

Day 2: spaCy: Industrial-Strength NLP
spaCy processes text at scale with pre-trained models. In one pass it provides: tokenization, POS tagging, dependency parsing, Named Entity Recognition (NER), and sentence segmentation.
Key Example:
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('Apple Inc. is looking to buy a UK startup for $1 billion.')

# Named Entity Recognition
for ent in doc.ents:
    print(f'{ent.text:<20} {ent.label_:<10} {spacy.explain(ent.label_)}')

# POS tagging
for token in doc[:5]:
    print(f'{token.text:<10} {token.pos_:<8} {token.dep_}')

# Noun chunks
for chunk in doc.noun_chunks:
    print(chunk.text, '->', chunk.root.head.text)

# Similarity (requires larger model)
# doc1.similarity(doc2)  → float 0-1

Day 3: TF-IDF & Text Classification
TF-IDF converts text into numerical features by weighting words by their frequency in a document vs their rarity across documents. Combined with sklearn classifiers, it enables fast text classification.
Key Example:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.datasets import fetch_20newsgroups
from sklearn.metrics import classification_report

categories = ['sci.space', 'rec.sport.baseball', 'talk.politics.guns']
train = fetch_20newsgroups(subset='train', categories=categories)
test  = fetch_20newsgroups(subset='test',  categories=categories)

pipe = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=10000, ngram_range=(1, 2))),
    ('clf',   MultinomialNB())
])
pipe.fit(train.data, train.target)
preds = pipe.predict(test.data)
print(classification_report(test.target, preds, target_names=categories))

Day 4: Hugging Face Transformers
Transformer models (BERT, GPT, etc.) achieve state-of-the-art results on virtually all NLP tasks. Hugging Face's transformers library makes them accessible via a high-level pipeline() API.
Key Example:
from transformers import pipeline

# Sentiment analysis
sentiment = pipeline('sentiment-analysis')
results = sentiment(['I love this product!', 'Terrible experience.'])
for r in results:
    print(r['label'], f"{r['score']:.3f}")

# Named entity recognition
ner = pipeline('ner', grouped_entities=True)
print(ner('Elon Musk founded SpaceX in Hawthorne, California.'))

# Text generation
gen = pipeline('text-generation', model='gpt2')
out = gen('Once upon a time in', max_new_tokens=50, num_return_sequences=1)
print(out[0]['generated_text'])

# Question answering
qa = pipeline('question-answering')
print(qa(question='Who founded SpaceX?',
         context='Elon Musk founded SpaceX in 2002.'))

Day 5: Sentence Embeddings & Semantic Search
Modern NLP uses dense vector embeddings to capture semantic meaning. Sentence Transformers encode sentences into fixed-size vectors; cosine similarity finds semantically related texts.
Key Example:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')
sentences = [
    'A dog is running in the park.',
    'The canine is jogging outdoors.',
    'Python is a programming language.',
    'Snakes are reptiles.',
]
embeddings = model.encode(sentences)   # shape (4, 384)
sim_matrix = cosine_similarity(embeddings)

print('Similar pairs:')
for i in range(len(sentences)):
    for j in range(i+1, len(sentences)):
        if sim_matrix[i, j] > 0.5:
            print(f'  ({i},{j}): {sim_matrix[i, j]:.3f}')

Week Quick Quiz
What is the difference between stemming and lemmatization?
What does TF-IDF stand for and how is it calculated?
Name three types of Named Entities that spaCy recognizes.
What is the attention mechanism in Transformer models?
How do sentence embeddings enable semantic search?

WEEK 12: COMPUTER VISION WITH OPENCV
PART III — SCIENCE, NLP & VISION | Image representation, transformations, filters, edge detection, contours

Day 1: Image Basics & OpenCV Fundamentals
OpenCV reads images as NumPy arrays. A color image is a (H, W, 3) array in BGR format (not RGB!). Grayscale is (H, W). Understanding this is foundational — every CV operation is a NumPy operation.
Key Example:
import cv2, numpy as np

img = cv2.imread('photo.jpg')          # (H, W, 3) BGR
print(img.shape, img.dtype)            # (480,640,3) uint8
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)  # grayscale
rgb  = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)   # fix for matplotlib

# Crop (just a NumPy slice)
roi = img[100:200, 150:300]

# Resize
resized = cv2.resize(img, (320, 240))

# Rotate
M = cv2.getRotationMatrix2D((img.shape[1]//2, img.shape[0]//2), 45, 1.0)
rotated = cv2.warpAffine(img, M, (img.shape[1], img.shape[0]))

import matplotlib.pyplot as plt
plt.imshow(rgb); plt.axis('off'); plt.show()

Day 2: Image Filtering & Morphological Operations
Filtering (convolution) modifies each pixel based on its neighborhood. Blurring removes noise; sharpening enhances detail. Morphological operations (erosion, dilation) work on binary images.
Key Example:
# Blur (reduce noise)
gaussian = cv2.GaussianBlur(img, (7, 7), sigmaX=0)
median   = cv2.medianBlur(img, 5)    # good for salt-pepper noise

# Sharpen
kernel = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]])
sharp  = cv2.filter2D(img, -1, kernel)

# Morphological
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
_, binary = cv2.threshold(gray, 127, 255, cv2.THRESH_BINARY)
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (5, 5))
eroded  = cv2.erode(binary, kernel, iterations=1)
dilated = cv2.dilate(binary, kernel, iterations=1)
opened  = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)

Day 3: Edge Detection & Contours
Edge detection identifies boundaries in images. Canny is the industry standard. Contours are continuous curves of constant intensity — used for shape analysis, object detection, and measurement.
Key Example:
# Canny edge detection
gray    = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
blurred = cv2.GaussianBlur(gray, (5, 5), 0)
edges   = cv2.Canny(blurred, threshold1=50, threshold2=150)

# Find contours
contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL,
                               cv2.CHAIN_APPROX_SIMPLE)
print(f'Found {len(contours)} contours')

# Draw contours
output = img.copy()
cv2.drawContours(output, contours, -1, (0, 255, 0), 2)

# Bounding box of largest contour
largest = max(contours, key=cv2.contourArea)
x, y, w, h = cv2.boundingRect(largest)
cv2.rectangle(output, (x, y), (x+w, y+h), (255, 0, 0), 2)
area = cv2.contourArea(largest)

Day 4: Feature Detection & Matching
Feature detectors (ORB, SIFT) find distinctive keypoints that remain stable under rotation and scale changes. Feature matching enables image stitching, object recognition, and 3D reconstruction.
Key Example:
# ORB feature detection (free to use)
orb = cv2.ORB_create(nfeatures=500)
kp1, des1 = orb.detectAndCompute(gray1, None)
kp2, des2 = orb.detectAndCompute(gray2, None)
img_kp = cv2.drawKeypoints(gray1, kp1, None,
    flags=cv2.DRAW_MATCHES_FLAGS_DRAW_RICH_KEYPOINTS)

# Brute-force matching
bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = bf.match(des1, des2)
matches = sorted(matches, key=lambda m: m.distance)
good_matches = matches[:50]
result = cv2.drawMatches(img1, kp1, img2, kp2, good_matches, None,
    flags=cv2.DrawMatchesFlags_NOT_DRAW_SINGLE_POINTS)

Day 5: Face Detection & Video Processing
Haar cascades and DNN-based detectors find faces and objects in real time. Video processing reads frames in a loop — the same image operations apply frame by frame.
Key Example:
# Face detection with Haar cascade
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
faces = face_cascade.detectMultiScale(gray, 1.3, 5)
for (x, y, w, h) in faces:
    cv2.rectangle(img, (x, y), (x+w, y+h), (255, 0, 0), 2)

# Video capture
cap = cv2.VideoCapture(0)   # webcam
while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150)
    cv2.imshow('Edges', edges)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break
cap.release(); cv2.destroyAllWindows()

Week Quick Quiz
Why does OpenCV use BGR instead of RGB?
What is the difference between cv2.THRESH_BINARY and cv2.THRESH_OTSU?
Explain what Canny edge detection does in three steps.
What are contours and how are they used in object detection?
What is the difference between erosion and dilation in morphological operations?

WEEK 13: DATABASES & DATA ENGINEERING
PART IV — ENGINEERING & AUTOMATION | SQLite, SQLAlchemy ORM, ETL pipelines, Dask for large data

Day 1: SQLite with sqlite3 & SQL Fundamentals
SQLite is an embedded database — no server required. sqlite3 is in the standard library. Master SQL fundamentals (SELECT, JOIN, GROUP BY, subqueries) as they transfer to PostgreSQL, MySQL, etc.
Key Example:
import sqlite3, pandas as pd

conn = sqlite3.connect('school.db')   # creates file
cursor = conn.cursor()
cursor.execute('''CREATE TABLE IF NOT EXISTS students
    (id INTEGER PRIMARY KEY, name TEXT, grade REAL)''')
cursor.executemany('INSERT INTO students VALUES (?,?,?)',
    [(1, 'Alice', 3.9), (2, 'Bob', 3.5), (3, 'Carol', 3.7)])
conn.commit()

# Query → Pandas
df = pd.read_sql('SELECT * FROM students WHERE grade > 3.6', conn)
print(df)

# Update / delete
cursor.execute('UPDATE students SET grade=4.0 WHERE name=?', ('Alice',))
conn.commit(); conn.close()
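The JOIN and GROUP BY fundamentals promised above deserve a sketch of their own. The enrolments table and the in-memory database here are illustrative additions, not part of the school.db example:

```python
import sqlite3

conn = sqlite3.connect(':memory:')   # throwaway in-memory DB
cur = conn.cursor()
cur.executescript('''
    CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE enrolments (student_id INTEGER, course TEXT, score REAL);
''')
cur.executemany('INSERT INTO students VALUES (?,?)',
                [(1, 'Alice'), (2, 'Bob'), (3, 'Carol')])
cur.executemany('INSERT INTO enrolments VALUES (?,?,?)',
                [(1, 'math', 91), (1, 'cs', 88), (2, 'math', 75)])

# INNER JOIN + GROUP BY: average score per student with enrolments
cur.execute('''
    SELECT s.name, AVG(e.score) AS avg_score
    FROM students s
    JOIN enrolments e ON e.student_id = s.id
    GROUP BY s.name
    ORDER BY avg_score DESC
''')
print(cur.fetchall())   # [('Alice', 89.5), ('Bob', 75.0)]

# LEFT JOIN keeps students with no enrolments (NULL average for Carol)
cur.execute('''
    SELECT s.name, AVG(e.score)
    FROM students s
    LEFT JOIN enrolments e ON e.student_id = s.id
    GROUP BY s.name
''')
print(cur.fetchall())
conn.close()
```

The INNER JOIN drops Carol (no enrolments); the LEFT JOIN keeps her with a NULL average, which is exactly the LEFT vs INNER distinction the Week Quick Quiz asks about.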

Day 2: SQLAlchemy ORM
SQLAlchemy's ORM maps Python classes to database tables. Write Python instead of SQL strings — your code is database-agnostic, safer (no SQL injection), and more maintainable.
Key Example:
from sqlalchemy import create_engine, Column, Integer, String, Float
from sqlalchemy.orm import declarative_base, Session

Base = declarative_base()

class Product(Base):
    __tablename__ = 'products'
    id    = Column(Integer, primary_key=True)
    name  = Column(String(100))
    price = Column(Float)

engine = create_engine('sqlite:///shop.db')
Base.metadata.create_all(engine)

with Session(engine) as session:
    p = Product(name='Widget', price=9.99)
    session.add(p)
    session.commit()
    products = session.query(Product).filter(Product.price < 50).all()
    for p in products:
        print(p.name, p.price)

Day 3: Building ETL Pipelines
ETL (Extract, Transform, Load) is the backbone of data engineering. Extract from source(s), transform (clean, join, aggregate), load to destination (DB, data warehouse, file). We build a modular ETL with logging.
Key Example:
import logging, pandas as pd, sqlite3

logging.basicConfig(level=logging.INFO,
    format='%(asctime)s %(levelname)s %(message)s')

def extract(filepath):
    logging.info(f'Extracting {filepath}')
    return pd.read_csv(filepath)

def transform(df):
    logging.info('Transforming...')
    df.dropna(subset=['id', 'amount'], inplace=True)
    df['amount'] = pd.to_numeric(df['amount'], errors='coerce')
    # Period dtype is not storable in SQLite; cast to str
    df['month']  = pd.to_datetime(df['date']).dt.to_period('M').astype(str)
    return df.groupby(['month', 'product'])['amount'].sum().reset_index()

def load(df, db_path, table):
    conn = sqlite3.connect(db_path)
    df.to_sql(table, conn, if_exists='replace', index=False)
    conn.close()
    logging.info(f'Loaded {len(df)} rows → {table}')

raw = extract('sales.csv')
clean = transform(raw)
load(clean, 'warehouse.db', 'monthly_sales')

Day 4: Dask: Parallel Computing for Large Data
Dask scales Pandas and NumPy to datasets larger than memory. It uses lazy evaluation — builds a task graph and executes only when .compute() is called. Works seamlessly on a laptop or a cluster.
Key Example:
import dask.dataframe as dd

# Read a massive CSV (lazy — no data loaded yet)
ddf = dd.read_csv('huge_dataset_*.csv')   # glob pattern
print(ddf.dtypes)   # fast — reads only metadata

# Operations (lazy)
result = (ddf
    .query('amount > 0')
    .groupby('category')['amount']
    .sum()
    .nlargest(10))

# Execute (data flows now)
answer = result.compute()
print(answer)

# Dask arrays (like NumPy but chunked)
import dask.array as da
x = da.random.random((100_000, 100_000), chunks=(1000, 1000))
print(x.mean().compute())   # works even if array doesn't fit in RAM

Day 5: Database Performance & Indexing
Slow queries are one of the most common production issues. Indexes dramatically speed up SELECT with WHERE. Explain plans reveal how the database executes queries.
Key Example:
import sqlite3

conn = sqlite3.connect('analytics.db')
cursor = conn.cursor()

# Create indexes on frequently queried columns
cursor.execute('CREATE INDEX IF NOT EXISTS idx_date ON orders(order_date)')
cursor.execute('CREATE INDEX IF NOT EXISTS idx_cust ON orders(customer_id)')

# EXPLAIN QUERY PLAN
cursor.execute('EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer_id=?', (42,))
for row in cursor.fetchall():
    print(row)

# Bulk insert with executemany (much faster than a loop)
data = [(i, f'Item {i}', i*1.5) for i in range(100000)]
cursor.executemany('INSERT INTO products VALUES (?,?,?)', data)
conn.commit()

# Use parameterized queries — NEVER string concatenation (SQL injection!)
# CORRECT: cursor.execute('SELECT * FROM users WHERE id=?', (user_id,))
# WRONG:   cursor.execute(f'SELECT * FROM users WHERE id={user_id}')
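How much an index actually buys can be measured rather than asserted. A self-contained sketch on a throwaway in-memory table (the table and sizes are invented for illustration; timings vary by machine):

```python
import sqlite3, time

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE orders (id INTEGER, customer_id INTEGER, amount REAL)')
cur.executemany('INSERT INTO orders VALUES (?,?,?)',
                [(i, i % 5000, i * 0.1) for i in range(200_000)])
conn.commit()

def timed_lookup():
    # 100 point lookups by customer_id
    t0 = time.perf_counter()
    for cid in range(0, 5000, 50):
        cur.execute('SELECT COUNT(*) FROM orders WHERE customer_id=?', (cid,))
        cur.fetchone()
    return time.perf_counter() - t0

before = timed_lookup()                    # full table scans
cur.execute('CREATE INDEX idx_cust ON orders(customer_id)')
after = timed_lookup()                     # index lookups
print(f'no index: {before:.3f}s   with index: {after:.3f}s')
```

On typical hardware the indexed lookups are orders of magnitude faster, because each query becomes a B-tree seek instead of a 200,000-row scan.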

Week Quick Quiz
What is the difference between SQL's LEFT JOIN and INNER JOIN?
Why use SQLAlchemy ORM instead of raw SQL strings?
What does ETL stand for and what does each step involve?
Why does Dask use lazy evaluation?
What is an index and when does it improve query performance?

WEEK 14: AUTOMATION, SCRIPTING & TESTING
PART IV — ENGINEERING & AUTOMATION | os/pathlib, subprocess, selenium, logging, pytest

Day 1: File System Automation with os & pathlib
Automating file and directory operations — copying, moving, renaming, searching — saves hours of manual work. pathlib is the modern, object-oriented way; os.walk() handles recursive traversal.
Key Example:
from pathlib import Path
import shutil

# Iterate all Python files recursively
for f in Path('.').rglob('*.py'):
    print(f.stem, f.suffix, f.stat().st_size)

# Create directory tree
Path('output/reports/2025').mkdir(parents=True, exist_ok=True)

# Copy / move / rename
shutil.copy('source.txt', 'backup/source.txt')
shutil.move('old_name.csv', 'new_name.csv')
Path('tmp.log').rename('archive/tmp.log')

# Pattern matching
csvs = list(Path('data').glob('sales_202?.csv'))

# Delete
Path('tmp.txt').unlink(missing_ok=True)
shutil.rmtree('tmp_dir', ignore_errors=True)

Day 2: subprocess & System Commands
subprocess runs shell commands from Python — compiling code, calling CLI tools, running scripts. The subprocess.run() function is the modern interface; always use check=True to catch errors.
Key Example:
import subprocess

# Run command, capture output
result = subprocess.run(['python', '--version'],
    capture_output=True, text=True, check=True)
print(result.stdout.strip())

# Shell pipeline (quote the glob so the shell passes it to find)
result = subprocess.run("find . -name '*.py' | wc -l",
    shell=True, capture_output=True, text=True)

# Compile a C file
subprocess.run(['gcc', '-O2', 'program.c', '-o', 'program'], check=True)

# Run a script with a timeout
try:
    subprocess.run(['python', 'long_running_script.py'], timeout=30, check=True)
except subprocess.TimeoutExpired:
    print('Script timed out!')

Day 3: Browser Automation with Selenium
Selenium controls a real web browser programmatically — filling forms, clicking buttons, scraping JavaScript-rendered pages. Use it for testing UIs and automating browser-based workflows.
Key Example:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)
try:
    driver.get('https://example.com')
    # Wait for element to appear
    el = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, 'main')))
    print(el.text[:100])
    # Fill a form
    driver.find_element(By.NAME, 'q').send_keys('Python')
    driver.find_element(By.CSS_SELECTOR, 'button[type=submit]').click()
finally:
    driver.quit()

Day 4: Logging & Structured Observability
Print statements do not scale. Python's logging module provides severity levels, formatters, handlers (file, stream, rotating), and contextual information. Good logging is the difference between debugging in 5 minutes vs 5 hours.
Key Example:
import logging
from logging.handlers import RotatingFileHandler

# Configure a named logger
logger = logging.getLogger('myapp')
logger.setLevel(logging.DEBUG)

# Console handler
ch = logging.StreamHandler()
ch.setLevel(logging.INFO)

# Rotating file (5 MB, keep 3 backups)
fh = RotatingFileHandler('app.log', maxBytes=5*1024*1024, backupCount=3)
fh.setLevel(logging.DEBUG)

fmt = logging.Formatter('%(asctime)s %(name)s %(levelname)s %(message)s')
ch.setFormatter(fmt); fh.setFormatter(fmt)
logger.addHandler(ch); logger.addHandler(fh)

logger.debug('Detailed debug info')
logger.info('Pipeline started')
logger.warning('Missing value in row 42')
logger.error('Database connection failed')

Day 5: Testing with pytest
pytest is the standard Python testing framework. Tests are plain functions starting with test_. Fixtures provide reusable test dependencies. Parametrize runs one test with multiple inputs.
Key Example:
# test_math.py
import pytest
from myapp.math_utils import add, divide

# Basic test
def test_add():
    assert add(2, 3) == 5
    assert add(-1, 1) == 0

# Test exception
def test_divide_by_zero():
    with pytest.raises(ZeroDivisionError):
        divide(10, 0)

# Parametrize — run the same test with different inputs
@pytest.mark.parametrize('a,b,expected', [
    (2, 3, 5), (0, 0, 0), (-1, 1, 0), (100, -50, 50)
])
def test_add_parametrized(a, b, expected):
    assert add(a, b) == expected

# Fixture — shared setup
@pytest.fixture
def sample_df():
    import pandas as pd
    return pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

def test_dataframe_shape(sample_df):
    assert sample_df.shape == (3, 2)

# Run: pytest test_math.py -v --tb=short

Week Quick Quiz
What is the difference between Path.glob() and Path.rglob()?
Why is shell=True in subprocess.run() potentially dangerous?
What is the difference between logging.warning() and logging.error()?
What is a pytest fixture?
How does @pytest.mark.parametrize reduce code duplication in tests?

WEEK 15: CAPSTONE PROJECT: DESIGN & IMPLEMENTATION
PART V — CAPSTONE | Project planning, data acquisition, preprocessing, modeling, integration

Overview
Weeks 15–16 are project-only. You will design, build, and present a complete Python data project that demonstrates mastery of at least 4 libraries from the course. This week covers planning through the first working version.

Capstone Project Options
| Track | Focus | Core Libraries | Deliverable |
|-------|-------|----------------|-------------|
| Track A | Data Analysis | Pandas + Seaborn + Statistics | Interactive EDA report with hypothesis tests |
| Track B | ML Application | sklearn + FastAPI + Pandas | Train a model, expose it via REST API |
| Track C | NLP Pipeline | spaCy + Transformers + Flask | Named entity extractor web app |
| Track D | Computer Vision | OpenCV + sklearn + Flask | Image classifier web application |
| Track E | Data Engineering | Dask + SQLAlchemy + Airflow | Automated ETL pipeline with scheduling |

Day-by-Day Schedule
| Day | Task |
|-----|------|
| Day 1: Problem Definition & Data Sourcing | Define problem clearly (one sentence) |
| Day 2: EDA & Data Cleaning | Source dataset (API, Kaggle, scraping) |
| Day 3: Core Feature Engineering | Clean and validate all features |
| Day 4: Model/Analysis Building & Iteration | Train/evaluate with cross-validation |
| Day 5: API or Interface Integration | Wire up Flask/FastAPI front-end or report |

🎯 Day 1 Deliverable
A project brief (1 page): problem statement, dataset source, libraries to use, success metrics, and a rough timeline. Submit for instructor feedback before proceeding.

Project Brief Template
## Project Brief Template

**Project Title:** [Your Title]
**Track:** [A/B/C/D/E]

**Problem Statement (one sentence):**
We want to [action] on [data] in order to [benefit].

**Dataset:**
  - Source: [URL / API / uploaded file]
  - Size: ~[N] rows × [M] columns
  - Time period: [if applicable]

**Core Libraries:** [list at least 4]

**Success Metrics:**
  - Primary: [e.g. F1 ≥ 0.85 on test set]
  - Secondary: [e.g. API responds in < 200ms]

**Risk / Uncertainties:**
  - [e.g. dataset may have significant missing values]

WEEK 16: CAPSTONE PROJECT: POLISH, PACKAGE & PRESENT
PART V — CAPSTONE | Documentation, packaging, README, demo, retrospective

Week 16 Goals
Polish code: refactor, add docstrings, fix edge cases
Write comprehensive README.md
Package with pyproject.toml or setup.py
Write at least 10 meaningful pytest tests
Deploy or demo the project
Retrospective: what would you do differently?

Professional README Template
# Project Name

One-paragraph description of what this project does and why it matters.

## Features
- Bullet of key feature 1
- Bullet of key feature 2

## Quick Start
```bash
pip install -r requirements.txt
python main.py --input data/sample.csv
```

## Project Structure
```
project/
├── data/         # raw and processed data
├── notebooks/    # exploratory notebooks
├── src/          # production source code
├── tests/        # pytest test suite
├── README.md
└── requirements.txt
```

## Results
| Metric | Value |
|--------|-------|
| Accuracy | 0.923 |
| F1 Score | 0.908 |

## License
MIT

Presentation Structure (10 minutes)
| Slide | Section | Content | Time |
|-------|---------|---------|------|
| 1 | Problem & Motivation | Why does this matter? | 2 min |
| 2 | Data & Approach | What data, what method? | 2 min |
| 3 | Live Demo | Show it working! | 3 min |
| 4 | Results & Learnings | What did you achieve? | 2 min |
| 5 | Q&A | Peer and instructor questions | 1 min |

PRICING & DELIVERY STRUCTURE
Online | Sandboxed Environment | Cohort-Based

Market Positioning
Pricing is benchmarked against comparable online Python/data science bootcamps and courses (DataCamp, Coursera, Udacity, Springboard, General Assembly, TripleTen, and instructor-led corporate training). This is a structured, cohort-based program with live instruction, sandboxed environment, and project mentorship — commanding a premium over self-paced video courses.

Delivery Tiers

| | Self-Paced | Cohort Live | Private / Corporate |
|---|---|---|---|
| Format | Recorded videos + sandbox exercises | Live online classes, 2 × 30 min/day, Mon–Fri | Private cohort, custom schedule |
| Cohort Size | Unlimited (async) | 12–20 learners | 4–15 (min 4 guaranteed) |
| Instructor Access | Forum only | Live Q&A + Slack channel | Dedicated instructor + Slack |
| Mentoring | None | Weekly 1:1 office hours (30 min/wk) | Unlimited 1:1 sessions |
| Sandbox Env. | Included (JupyterLab) | Included + auto-grading | Included + custom datasets |
| Certificate | Completion only | Completion + graded | Completion + custom branding |
| Duration | 16 weeks (self-paced, 6 months) | 16 weeks, fixed schedule | Flexible (compressed available) |
| Price (USD) | $497 / $399 early bird | $1,997 / $1,597 early bird | From $4,500 per learner (min 4) |
| Payment | One-time or $49/mo × 12 | Full or 3 × $699 | Invoice net 30, volume discounts |

⭐ Recommended
The Cohort Live tier ($1,997) is the recommended offering. It provides structure, peer accountability, live instruction, and mentorship — consistently producing the highest completion rates and learner satisfaction.

Volume & Early-Bird Discounts
| Discount Type | Amount |
|---|---|
| Early-bird (>30 days before start) | 20% off list price |
| Student / recent grad (<2 years) | 25% off (with proof) |
| Returning learner (completed another course) | 15% off |
| Referral discount (per referred enrolment) | $100 credit |
| Non-profit / NGO | 40% off (approved orgs) |

Corporate Pricing
| Team Size | Price/Learner | Includes | Duration |
|---|---|---|---|
| 4–6 learners | $4,500 / learner | Dedicated session | 16 weeks standard |
| 7–14 learners | $3,800 / learner | Dedicated session + custom projects | 16 or 8-week compressed |
| 15–24 learners | $3,200 / learner | Full program + LMS integration | Flexible scheduling |
| 25+ learners | Custom quote | Enterprise SLA + custom content | On-demand or cohort |

Add-On Services
| Service | Price |
|---|---|
| 1:1 Coaching Session (60 min) | $120 per session |
| Resume & Portfolio Review | $149 one-time |
| Code Review Sprint (3 files) | $199 per sprint |
| Custom Module Development | From $2,500 per module |
| Post-course Support (3 months) | $297 one-time |
| Recorded session archive (lifetime) | $97 (Cohort Live learners) |

Cohort Schedule & Batch Size Rationale
The recommended cohort size of 12–20 learners is based on educational research and practitioner experience:
Below 12: insufficient peer diversity for project feedback; cohort cancellation risk
12–20: optimal for live Q&A (everyone can speak), study groups, pair programming
Above 20: instructor cannot give individual attention during exercises; breakout rooms required
Cohorts run on a two-month intake cycle: Jan, Mar, May, Jul, Sep, Nov. This gives 6 starts per year and enough lead time for marketing and enrolment.

Technology Platform
| Component | Platform |
|---|---|
| Video Conferencing | Zoom / Google Meet (with breakout rooms for pair exercises) |
| Sandboxed IDE | JupyterHub (cloud-hosted) or Google Colab Pro — all libraries pre-installed, persistent storage |
| Course LMS | Teachable / Thinkific / Canvas for recordings, quizzes, assignments |
| Messaging / Community | Dedicated Slack workspace (channels per week, alumni channel) |
| Code Repository | GitHub Classroom — assignments auto-checked via GitHub Actions |
| Progress Tracking | Instructor dashboard: exercise completion, attendance, quiz scores |
| Auto-grading | nbgrader or custom pytest-based graders inside JupyterHub |
| Live Coding | VS Code Live Share for instructor screen-share and pair programming |


Appendix: Quick Reference & Cheat Sheets


Python Built-ins Quick Reference
Function
Description
Function
Description
len(x)
Length of sequence/dict
range(n)
Generate integers 0..n-1
type(x)
Type of object
isinstance(x, T)
Check if x is type T
print(*args)
Print to stdout
input(prompt)
Read from stdin
int/float/str/bool(x)
Type conversion
list/tuple/set/dict(x)
Container conversion
sorted(x, key=f)
Return sorted list
reversed(x)
Return reversed iterator
enumerate(x)
Index + value pairs
zip(a, b)
Pair up iterables
map(f, x)
Apply f to each element
filter(f, x)
Keep elements where f is True
any(x)
True if any element is True
all(x)
True if all elements are True
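A few of the built-ins above combined in a short runnable sketch (the names and scores are invented sample data):

```python
names = ['Ada', 'Grace', 'Alan']
scores = [95, 88, 72]

# enumerate: index + value pairs
indexed = list(enumerate(names))      # [(0, 'Ada'), (1, 'Grace'), (2, 'Alan')]

# zip: pair up iterables
paired = dict(zip(names, scores))     # {'Ada': 95, 'Grace': 88, 'Alan': 72}

# sorted with a key: order names by their score, highest first
ranked = sorted(names, key=lambda n: paired[n], reverse=True)

# map / filter: transform and select elements
doubled = list(map(lambda s: s * 2, scores))
passing = list(filter(lambda s: s >= 80, scores))

# any / all over a generator expression
assert any(s > 90 for s in scores)
assert not all(s > 90 for s in scores)
```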

NumPy Cheat Sheet
import numpy as np
np.array([1,2,3])             # create from list
np.zeros((m,n))               # m×n zeros
np.arange(start,stop,step)    # range as array
np.linspace(a,b,n)            # n evenly spaced
arr.reshape(m,n)              # reshape (total must match)
arr.T                         # transpose
arr[arr > 5]                  # boolean index
np.concatenate([a,b], axis=0) # stack vertically
np.vstack([a,b])              # vertical stack
np.hstack([a,b])              # horizontal stack
arr.mean()/std()/sum()        # statistics (add axis=0/1)
np.linalg.inv(A)              # matrix inverse
A @ B                         # matrix multiply
np.linalg.solve(A,b)          # solve Ax=b
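Several of the entries above exercised together in one runnable sketch (matrix A, vector b, and the 3×4 array are invented sample data):

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])

# solve Ax = b directly (preferred over np.linalg.inv(A) @ b)
x = np.linalg.solve(A, b)

arr = np.arange(12).reshape(3, 4)  # 0..11 as a 3x4 matrix
big = arr[arr > 5]                 # boolean index -> 1-D array of matches
col_means = arr.mean(axis=0)       # per-column statistics
```

`np.linalg.solve` is both faster and numerically safer than forming the inverse explicitly.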

Pandas Cheat Sheet
import pandas as pd
pd.read_csv('f.csv', parse_dates=['d'], index_col='id')
df.head(n) / .tail(n) / .info() / .describe()
df['col'] / df[['c1','c2']]        # column selection
df.loc[rows, cols]                 # label-based
df.iloc[rows, cols]                # integer-based
df[df['x'] > 0]                    # boolean filter
df.query('x > 0 and y < 10')       # SQL-like filter
df.dropna() / df.fillna(v)         # handle nulls
df.astype({'col': float})          # change dtype
df.groupby('g')['v'].agg([...])    # split-apply-combine
pd.merge(df1, df2, on='key', how='left')  # join
pd.concat([df1, df2], ignore_index=True)  # stack
df.pivot_table(values, index, columns, aggfunc)
df.to_csv('out.csv', index=False)
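A runnable sketch combining the filter, merge, and groupby patterns above (the staff and rate tables are invented sample data):

```python
import pandas as pd

df = pd.DataFrame({
    'dept':  ['eng', 'eng', 'sales', 'sales'],
    'name':  ['Ada', 'Alan', 'Grace', 'Linus'],
    'hours': [38, 42, 35, 40],
})
rates = pd.DataFrame({'dept': ['eng', 'sales'], 'rate': [50, 40]})

busy = df[df['hours'] > 36]                       # boolean filter
joined = pd.merge(df, rates, on='dept', how='left')  # join a lookup table
joined['pay'] = joined['hours'] * joined['rate']
totals = joined.groupby('dept')['pay'].sum()      # split-apply-combine
```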

sklearn Pipeline Template
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

num_features = ['age', 'salary']
cat_features = ['department', 'level']

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), num_features),
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_features)
])
model = Pipeline([
    ('prep', preprocessor),
    ('clf',  RandomForestClassifier(n_estimators=200, random_state=42))
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

cv_scores = cross_val_score(model, X_train, y_train, cv=5)
print(f'CV: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}')
model.fit(X_train, y_train)
print('Test score:', model.score(X_test, y_test))
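The template assumes `X` (a DataFrame with the listed columns) and `y` already exist. A toy construction to run it end to end, with invented data and a smaller forest for speed:

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Invented sample data matching the template's column names
rng = np.random.default_rng(42)
n = 40
X = pd.DataFrame({
    'age':        rng.integers(22, 65, n),
    'salary':     rng.normal(60000, 15000, n),
    'department': rng.choice(['eng', 'sales'], n),
    'level':      rng.choice(['junior', 'senior'], n),
})
y = (X['salary'] > 60000).astype(int)  # toy binary target

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), ['age', 'salary']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['department', 'level']),
])
model = Pipeline([
    ('prep', preprocessor),
    ('clf', RandomForestClassifier(n_estimators=50, random_state=42)),
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
model.fit(X_train, y_train)
print('Test accuracy:', model.score(X_test, y_test))
```

Because all preprocessing lives inside the Pipeline, `fit` learns the scaler and encoder on the training split only, avoiding leakage into the test score.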

Git Workflow Cheat Sheet
git init                          # initialise repo
git clone <url>                   # clone remote
git add .                         # stage all changes
git commit -m "message"           # commit
git push origin main              # push to remote
git pull                          # fetch + merge
git checkout -b feature/name      # create branch
git merge feature/name            # merge branch
git log --oneline --graph         # pretty log
git stash / git stash pop         # save/restore uncommitted
.gitignore: __pycache__/, *.pyc, .env, data/, models/

Course Completion Checklist
Before receiving your certificate, verify you have:
Completed all 160 block exercises (auto-graded)
Submitted all 16 weekly homework assignments
Passed at least 12 of 16 weekly quizzes (score ≥ 70%)
Completed and submitted the Capstone project
Delivered the 10-minute Capstone presentation (live or recorded)
Peer-reviewed at least 2 other learner projects


Python Mastery — End of Course Book