cook county housing

Project for UC Berkeley's DATA 100 class (Fall 2024).

Timeline: 1 week

Tools: Python, Pandas, scikit-learn

Skills: exploratory data analysis, feature engineering, linear regression modeling

the context

In 2017, a lawsuit accused Illinois' Cook County Assessor's Office of systematically overvaluing inexpensive homes and undervaluing expensive ones, shifting the tax burden onto working-class, disproportionately non-white homeowners.

Using Cook County's housing dataset of over 500,000 records, this project tests whether a "neutral" statistical approach (no race, no demographics, just home features) can escape that bias, or whether the bias is embedded in the historical data itself.

exploring & cleaning

Because raw sale prices were heavily skewed, I log-transformed the sale price to get a clearer picture of the distribution.
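A minimal sketch of the log transform, assuming a hypothetical `sale_price` column and synthetic lognormal prices standing in for the real data. It only illustrates how taking the log pulls in a long right tail; the numbers it prints say nothing about the actual Cook County records.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the sale-price column (assumption: the real data is
# not reproduced here, and "sale_price" is a hypothetical column name).
rng = np.random.default_rng(0)
sales = pd.DataFrame(
    {"sale_price": rng.lognormal(mean=12.0, sigma=0.8, size=5_000)}
)

# Log-transform to compress the long right tail. Prices here are strictly
# positive, so np.log is safe; real data may need zero/placeholder sales
# filtered out first.
sales["log_sale_price"] = np.log(sales["sale_price"])

# Sample skewness: far from 0 for raw prices, close to 0 after the log.
raw_skew = sales["sale_price"].skew()
log_skew = sales["log_sale_price"].skew()
print(f"skew of raw prices: {raw_skew:.2f}")
print(f"skew of log prices: {log_skew:.2f}")
```

Skewness near zero is a rough symmetry check, not a test of normality; a histogram or Q-Q plot is still worth a look.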

Breaking pricing out by neighborhood code showed what averages hid: similar median prices masked dozens of overlapping housing markets, each with its own price ceiling, density, and volume of sales. Models treating the whole county as one market would skew towards high-volume neighborhoods.
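One way to see this kind of imbalance is to put median price and sale count side by side per neighborhood. The sketch below uses a tiny made-up table and hypothetical column names (`neighborhood_code`, `sale_price`), so it shows the shape of the check rather than the project's actual numbers.

```python
import pandas as pd

# Tiny hypothetical sample: column names and values are made up for illustration.
sales = pd.DataFrame({
    "neighborhood_code": [10, 10, 10, 10, 20, 20, 30],
    "sale_price": [100_000, 120_000, 140_000, 160_000,
                   125_000, 135_000, 130_000],
})

# Median price and number of sales per neighborhood, busiest first.
by_neighborhood = (
    sales.groupby("neighborhood_code")["sale_price"]
    .agg(median_price="median", n_sales="count")
    .sort_values("n_sales", ascending=False)
)

# Share of all training rows that each neighborhood contributes.
by_neighborhood["share_of_sales"] = (
    by_neighborhood["n_sales"] / by_neighborhood["n_sales"].sum()
)
print(by_neighborhood)
```

In this toy table every neighborhood has the same median, yet one of them supplies more than half of the rows, which is the situation where a single county-wide fit leans on the busiest market.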

log sale price

log-transforming shows a distribution close to normal, but it also encodes assumptions about what counts as "typical".

price by bedroom count

more bedrooms generally means a higher price, but the large spread within each group hints that bedroom count alone is a weak predictor.

price & sale volume by neighborhood

median prices look similar across neighborhoods (top), but sale volume (bottom) varies by nearly 4x; models trained on this data learn the most about neighborhoods that sell the most.

modeling & validating

Using 15+ engineered features (bedrooms, building square footage, neighborhood, and repair condition, to name a few), I trained a linear regression model in scikit-learn to predict log sale price, then evaluated it with cross-validation and on an unseen test set.

01

cleaned and engineered features from 62 raw columns, handling missing values, encoding categorical variables, and constructing derived features like price-per-square-foot.

02

split the data into training, validation, and held-out test sets to evaluate generalization rather than memorization.

03

fit a linear regression model and iterated on feature selection to minimize root mean squared error (RMSE).

04

stress-tested the model by splitting predictions by price tier, checking not just overall accuracy but whether errors were evenly distributed or systematically skewed.
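The steps above can be sketched roughly as follows. This is an illustration on synthetic data, not the project's code: the column names, the three-feature set, the 60/20/20 split, and the 5-fold cross-validation are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic homes with hypothetical column names; not the Cook County data.
rng = np.random.default_rng(42)
n = 2_000
homes = pd.DataFrame({
    "bedrooms": rng.integers(1, 6, size=n),
    "building_sqft": rng.uniform(600, 4_000, size=n),
    "neighborhood_code": rng.choice(["A", "B", "C"], size=n),
})
nbhd_effect = homes["neighborhood_code"].map({"A": 0.0, "B": 0.3, "C": 0.6})
homes["log_sale_price"] = (
    11.0
    + 0.05 * homes["bedrooms"]
    + 0.0003 * homes["building_sqft"]
    + nbhd_effect
    + rng.normal(0.0, 0.2, size=n)  # noise the model cannot explain
)

# Step 01: one-hot encode the categorical neighborhood code.
X = pd.get_dummies(
    homes[["bedrooms", "building_sqft", "neighborhood_code"]],
    columns=["neighborhood_code"],
    drop_first=True,
    dtype=float,
)
y = homes["log_sale_price"]

# Step 02: hold out 20% as a test set, then carve 25% of the remainder
# off as a validation set (60/20/20 overall).
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)

def rmse(y_true, y_pred):
    """Root mean squared error, in log-price units."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

# Step 03: fit, then compare candidate feature sets on validation RMSE.
model = LinearRegression().fit(X_train, y_train)
val_rmse = rmse(y_val, model.predict(X_val))

# Cross-validated RMSE on the training portion as a second opinion.
cv_rmse = -cross_val_score(
    LinearRegression(), X_train, y_train,
    cv=5, scoring="neg_root_mean_squared_error",
).mean()

# The test set is touched once, after feature choices are frozen.
test_rmse = rmse(y_test, model.predict(X_test))
print(f"validation RMSE: {val_rmse:.3f}")
print(f"5-fold CV RMSE:  {cv_rmse:.3f}")
print(f"test RMSE:       {test_rmse:.3f}")
```

Because the target is log price, an RMSE of 0.2 corresponds to errors of very roughly 20% in price terms, which is why the breakdown by price tier in step 04 matters more than the single headline number.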

the finding

Bias wasn't injected; it was already there. The model didn't fix the bias; it simply reproduced it.

After fitting the model, I split predictions into above- and below-median prices and measured error separately for each group. A clear pattern emerged: prediction error was higher for cheaper homes than for expensive ones, and the model overestimated cheap homes far more often than expensive ones, all without a single demographic feature in the model.
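A group-wise check of this kind might look like the sketch below. The eight rows and the column names (`log_price`, `log_pred`) are invented for illustration, so the percentages it produces are not the project's results; only the method of splitting at the median and summarizing residuals per group is the point.

```python
import numpy as np
import pandas as pd

# Hypothetical actual vs. predicted log prices (made-up values).
results = pd.DataFrame({
    "log_price": [11.0, 11.2, 11.4, 11.6, 12.0, 12.2, 12.4, 12.6],
    "log_pred":  [11.3, 11.1, 11.7, 11.8, 12.0, 12.1, 12.5, 12.5],
})

# Positive residual = the model valued the home above its sale price.
results["residual"] = results["log_pred"] - results["log_price"]

# Split homes at the median actual price.
median_price = results["log_price"].median()
results["tier"] = np.where(
    results["log_price"] < median_price, "below_median", "above_median"
)

def rmse(residuals):
    """Root mean squared error of a group of residuals."""
    return float(np.sqrt(np.mean(residuals ** 2)))

def pct_overestimated(residuals):
    """Percentage of homes whose predicted value exceeds the sale price."""
    return float((residuals > 0).mean() * 100)

summary = results.groupby("tier")["residual"].agg(
    rmse=rmse, pct_overestimated=pct_overestimated
)
print(summary)
```

Reporting the direction of the error alongside its size is the key design choice: two groups can share a similar RMSE while one is mostly overvalued and the other mostly undervalued.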

56.3%

of below-median homes had their value overestimated by the model.

30.5%

of above-median homes had their value overestimated by the model — roughly half as often.

error by price tier

left: prediction error (RMSE) is highest for the cheapest homes and drops steadily as price rises; right: the share of overestimated homes follows the same slope.

takeaways

A model is only as fair as the history it's trained on; statistical neutrality is not the same as fairness.

Models with no demographic inputs can still encode demographic bias: historical sale patterns embedded the bias in the data, not in the model. Ostensibly objective systems like the CCAO's can let historical patterns of underinvestment and segregation get re-encoded as "data", so even a model rebuilt with care stays inside those patterns.

Responsible, ethical data work means asking not just "how accurate is this model?" but also "whose errors are these, and who carries the costs?"

say hi!