cook county housing
Project for UC Berkeley's DATA 100 class (Fall 2024).
Timeline: 1 week
Tools: Python, Pandas, scikit-learn
Skills: exploratory data analysis, feature engineering, linear regression modeling
the context
In 2017, a lawsuit accused Illinois' Cook County Assessor's Office of systematically overvaluing inexpensive homes and undervaluing expensive ones, shifting the tax burden onto working-class, disproportionately non-white homeowners.
Using Cook County's housing dataset of over 500,000 records, this project tests whether a "neutral" statistical approach (no race, no demographics, just home features) can escape that bias, or if bias is embedded in the historical data itself.
exploring & cleaning
Because raw sale prices were heavily skewed, I log-transformed them to get a clearer picture of the distribution.
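A minimal sketch of the log transform, assuming a pandas DataFrame with a hypothetical `sale_price` column. The prices here are synthetic stand-ins, not the Cook County records, and the skewness check is only one rough way to see the effect.

```python
import numpy as np
import pandas as pd

# Synthetic, skewed prices standing in for the real data (assumption).
rng = np.random.default_rng(0)
sales = pd.DataFrame(
    {"sale_price": rng.lognormal(mean=12.0, sigma=0.8, size=5_000)}
)

# The log is only defined for positive prices, so filter before transforming.
sales = sales[sales["sale_price"] > 0].copy()
sales["log_sale_price"] = np.log(sales["sale_price"])

# Sample skewness before and after: values near 0 indicate a more symmetric shape.
raw_skew = sales["sale_price"].skew()
log_skew = sales["log_sale_price"].skew()
print(f"skew of raw prices: {raw_skew:.2f}")
print(f"skew of log prices: {log_skew:.2f}")
```

Plotting a histogram of `log_sale_price` would be the natural next step for inspecting the distribution by eye.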
Breaking pricing out by neighborhood code showed what averages hid: similar median prices masked dozens of overlapping housing markets, each with its own price ceiling, density, and volume of sales. Models treating the whole county as one market would skew towards high-volume neighborhoods.
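One way this per-neighborhood breakdown might look in pandas is sketched below. The column names (`neighborhood_code`, `log_sale_price`) and the three neighborhoods are hypothetical, and the data is synthetic; the point is only to show how sales volume, median, and price ceiling can be compared side by side.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Hypothetical neighborhoods: (number of sales, mean log price).
specs = {"A": (3000, 12.0), "B": (300, 12.1), "C": (50, 11.0)}
frames = [
    pd.DataFrame(
        {
            "neighborhood_code": code,
            "log_sale_price": rng.normal(loc=mu, scale=0.5, size=n),
        }
    )
    for code, (n, mu) in specs.items()
]
sales = pd.concat(frames, ignore_index=True)

# Volume, median, and ceiling per neighborhood.
summary = sales.groupby("neighborhood_code")["log_sale_price"].agg(
    n_sales="count",
    median_log_price="median",
    max_log_price="max",
)

# Share of all sales contributed by each neighborhood: a county-wide fit
# weights every sale equally, so high-volume areas dominate the loss.
summary["share_of_sales"] = summary["n_sales"] / summary["n_sales"].sum()
print(summary)
```

In this toy setup, neighborhoods A and B have similar medians but very different volumes, which is the kind of difference a single county-wide average hides.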
modeling & validating
Using 15+ engineered features (bedrooms, building square footage, neighborhood, and repair condition, to name a few), I trained a linear regression model in scikit-learn to predict log sale price, then evaluated it with cross-validation and a held-out test set.
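The sketch below shows the general shape of such a workflow in scikit-learn: one-hot encoding a categorical feature, fitting a linear regression on log price, and scoring it with 5-fold cross-validation plus a held-out test set. It is not the project's actual pipeline. The three features, their names, the coefficients used to generate the data, and the 80/20 split are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(2)
n = 2000

# Synthetic features standing in for the engineered ones (assumption).
X = pd.DataFrame(
    {
        "bedrooms": rng.integers(1, 6, n),
        "building_sqft": rng.uniform(600, 4000, n),
        "neighborhood_code": rng.choice(["A", "B", "C"], n),
    }
)
hood_effect = X["neighborhood_code"].map({"A": 0.0, "B": 0.3, "C": -0.4})
y = (
    11.0
    + 0.0004 * X["building_sqft"]
    + 0.05 * X["bedrooms"]
    + hood_effect
    + rng.normal(0, 0.3, n)  # irreducible noise in log price
)

# One-hot encode the neighborhood; pass numeric columns through unchanged.
preprocess = ColumnTransformer(
    [("hood", OneHotEncoder(handle_unknown="ignore"), ["neighborhood_code"])],
    remainder="passthrough",
)
model = make_pipeline(preprocess, LinearRegression())

# Hold out a test set the model never sees during cross-validation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 5-fold cross-validated RMSE on the training portion (scores are negated).
cv_rmse = -cross_val_score(
    model, X_train, y_train, cv=5, scoring="neg_root_mean_squared_error"
)

# Final fit and RMSE on the held-out test set.
model.fit(X_train, y_train)
test_resid = model.predict(X_test) - y_test.to_numpy()
test_rmse = float(np.sqrt(np.mean(test_resid**2)))

print("CV RMSE per fold:", np.round(cv_rmse, 3))
print("test RMSE:", round(test_rmse, 3))
```

Wrapping the encoder and the regression in one pipeline keeps the encoding inside each cross-validation fold, so nothing from the validation fold leaks into the fit.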
the finding
Bias wasn't injected; it was already there. The model didn't fix the bias; it simply reproduced it.
After fitting the model, I split predictions into above- and below-median prices and measured error separately for each group. A clear pattern emerged: predictions were more accurate for expensive homes, errors were larger for cheaper homes, and cheap homes were overvalued far more often than expensive ones, all without a single demographic feature in the model.
56.3%
of below-median homes had their value overestimated by the model.
30.5%
of above-median homes had their value overestimated by the model, roughly half as often.
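The median split behind these two figures can be sketched as follows. This is an illustration on synthetic data, not a reproduction of the numbers above: the names `y_true`, `y_pred`, and `group_report` are hypothetical, and the "model" here is simply the true signal with noise left unexplained. Even so, homes that sold below the median end up overestimated more often, because a low sale price is partly the result of a negative shock the features cannot see.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 5000

# Synthetic setup (assumption): log price = explainable signal + noise,
# and the model recovers only the signal.
signal = rng.normal(12.0, 0.5, n)
results = pd.DataFrame(
    {
        "y_true": signal + rng.normal(0, 0.4, n),
        "y_pred": signal,
    }
)


def group_report(df: pd.DataFrame) -> dict:
    """RMSE and share of overestimated homes for one group of predictions."""
    resid = df["y_pred"] - df["y_true"]  # positive means overestimated
    return {
        "rmse": float(np.sqrt(np.mean(resid**2))),
        "share_overestimated": float(np.mean(resid > 0)),
    }


# Split on the median of the actual (log) sale price.
below = results["y_true"] < results["y_true"].median()
report = {
    "below_median": group_report(results[below]),
    "above_median": group_report(results[~below]),
}

for name, stats in report.items():
    print(name, stats)
```

In this toy setup the RMSE is similar in both groups, since the noise is symmetric by construction; the gap in accuracy reported above comes from the real data, not from this sketch.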

error by price tier
left: prediction error (RMSE) is highest for the cheapest homes and drops steadily as price rises; right: the share of overestimated homes follows the same slope.
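A table like the one behind this figure could be built by binning homes into price tiers and aggregating per tier. The sketch below assumes five equal-sized tiers via `pd.qcut` and reuses a synthetic signal-plus-noise setup; the tier count, column names, and data are all assumptions rather than the project's actual choices.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
n = 5000

# Synthetic predictions (assumption): the model captures only the signal.
signal = rng.normal(12.0, 0.5, n)
results = pd.DataFrame(
    {"y_true": signal + rng.normal(0, 0.4, n), "y_pred": signal}
)

# Tier 0 holds the cheapest fifth of homes, tier 4 the most expensive.
results["tier"] = pd.qcut(results["y_true"], q=5, labels=False)
results["sq_err"] = (results["y_pred"] - results["y_true"]) ** 2
results["overestimated"] = results["y_pred"] > results["y_true"]

# Mean squared error and overestimation share per tier, then RMSE.
by_tier = results.groupby("tier")[["sq_err", "overestimated"]].mean()
by_tier["rmse"] = np.sqrt(by_tier["sq_err"])
by_tier = by_tier.rename(columns={"overestimated": "share_overestimated"})

print(by_tier[["rmse", "share_overestimated"]])
```

Plotting `rmse` and `share_overestimated` against `tier` as two side-by-side panels would give the left and right views described in the caption.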
takeaways
A model is only as fair as the history it's trained on; statistical neutrality is not the same as fairness.
Models with no demographic inputs can still encode demographic bias: historical sale patterns embedded the bias in the data, not the model. Ostensibly objective systems like the CCAO's can re-encode historical patterns of underinvestment and segregation as "data", so a model rebuilt on that data, even with care, inherits the same bias.
Responsible, ethical data work means asking not just "how accurate is this model?" but also "whose errors are these, and who carries the costs?"


