Regression Analysis

In this section, we learn how to perform regression analysis using Python. Regression analysis is a statistical method for examining the relationship between a dependent variable and one or more independent variables. It is widely used in fields such as economics, finance, and the social sciences.

Importing the Dataset

We first import a dataset on housing prices and air pollution from Harrison & Rubinfeld (1978). The same dataset is used throughout Wooldridge's undergraduate econometrics textbook, Introductory Econometrics: A Modern Approach.

The data dictionary is as follows:

No.  Variable  Description
1    price     median housing price, $
2    crime     crimes committed per capita
3    nox       nitrous oxide, parts per 100 million
4    rooms     average number of rooms per house
5    dist      weighted distance to 5 employment centers
6    radial    accessibility index to radial highways
7    proptax   property tax per $1000
8    stratio   average student-teacher ratio
9    lowstat   % of people ‘lower status’
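As a minimal sketch of the import step, assuming the data is available as a CSV file whose column names match the data dictionary above: the helper `load_housing` is a hypothetical name introduced here, and the two rows in `sample_csv` are made-up placeholder values (not the real data) so that the snippet runs on its own. In practice you would pass the path of your own copy of the file.

```python
import io

import pandas as pd

# Column names taken from the data dictionary above.
EXPECTED_COLUMNS = ["price", "crime", "nox", "rooms", "dist",
                    "radial", "proptax", "stratio", "lowstat"]


def load_housing(source):
    """Read the housing data from a CSV path or file-like object and
    check that every column in the data dictionary is present."""
    data = pd.read_csv(source)
    missing = [c for c in EXPECTED_COLUMNS if c not in data.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return data[EXPECTED_COLUMNS]


# Placeholder rows with made-up values (NOT the real dataset),
# used only so the example is self-contained.
sample_csv = io.StringIO(
    "price,crime,nox,rooms,dist,radial,proptax,stratio,lowstat\n"
    "24000,0.01,5.4,6.6,4.1,1,29.6,15.3,5.0\n"
    "21600,0.03,4.7,6.4,5.0,2,24.2,17.8,9.1\n"
)

df = load_housing(sample_csv)
print(df.head())
```

Checking the column names right after loading catches a wrong or renamed file early, before any regression is run.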

Quick Inspection

Let’s focus on price, nox, rooms and stratio for this analysis, and quickly inspect these four variables. (The data exploration done here is by no means complete or thorough.)
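A quick inspection of the four variables could look like the sketch below. The DataFrame here is filled with synthetic stand-in values (an assumption made so the snippet runs without the real file), so the printed numbers say nothing about the actual housing data. With the real dataset loaded, only the last few lines would be needed.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (NOT the real dataset), for illustration only.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "price": rng.uniform(5000, 50000, n),
    "nox": rng.uniform(3.5, 9.0, n),
    "rooms": rng.normal(6.3, 0.7, n),
    "stratio": rng.uniform(12, 22, n),
})

cols = ["price", "nox", "rooms", "stratio"]

summary = df[cols].describe()    # count, mean, std, min, quartiles, max
n_missing = df[cols].isna().sum()  # number of missing values per column
corr = df[cols].corr()           # pairwise Pearson correlations

print(summary)
print(n_missing)
print(corr)
```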

Linear Regression

We can use the statsmodels package to perform linear regression analysis. The package provides the OLS() model (ordinary least squares is the standard method for estimating a linear regression). It also offers an easy way to write regression formulas, and produces a detailed regression report. This report is especially useful for causal analysis, where you care about statistical inference (e.g., confidence intervals or hypothesis tests for the estimated coefficients).
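To illustrate the formula interface and the inference output described above, here is a small sketch on synthetic data. The variables `x` and `y` and the true slope of 2 are assumptions made for the example and are unrelated to the housing data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data with a known slope of 2 (illustrative only).
rng = np.random.default_rng(1)
toy = pd.DataFrame({"x": rng.normal(size=300)})
toy["y"] = 1.0 + 2.0 * toy["x"] + rng.normal(scale=0.5, size=300)

# The formula "y ~ x" regresses y on x; an intercept is added automatically.
results = smf.ols("y ~ x", data=toy).fit()

print(results.summary())   # the full regression report

ci = results.conf_int()    # confidence intervals (95% by default)
pvals = results.pvalues    # p-values for H0: coefficient equals 0
print(ci)
print(pvals)
```

The fitted results object also exposes the individual pieces of the report, such as `results.params` and `results.rsquared`, which is convenient when only a few numbers are needed.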

Simple Linear Regression

We start by running a simple regression to investigate the effect of air pollution on housing price.

\(\log(price) = \beta_0 + \beta_1 \log(nox) + u\).
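The regression above can be sketched with the statsmodels formula interface as follows. The data is synthetic, generated with an assumed coefficient of -1.0 on log(nox), so the estimates are not results for the real housing data; only the formula and the calls carry over.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (NOT the real dataset). The coefficient -1.0
# on log(nox) is an assumption chosen for illustration.
rng = np.random.default_rng(2)
n = 1000
nox = rng.uniform(3.5, 9.0, n)
log_price = 11.0 - 1.0 * np.log(nox) + rng.normal(scale=0.3, size=n)
df = pd.DataFrame({"price": np.exp(log_price), "nox": nox})

# np.log() can be applied directly inside the formula.
simple = smf.ols("np.log(price) ~ np.log(nox)", data=df).fit()
print(simple.summary())

beta1 = simple.params["np.log(nox)"]
print(f"Estimated coefficient on log(nox): {beta1:.3f}")
```

Because both price and nox enter in logs, \(\beta_1\) is read as an elasticity: a 1% increase in nox is associated with approximately a \(\beta_1\)% change in price.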

Multiple Linear Regression

Let’s run a multiple regression to investigate the effect of air pollution on housing price, this time controlling for the number of rooms and the student-teacher ratio.

\(\log(price) = \beta_0 + \beta_1 \log(nox) + \beta_2 rooms + \beta_3 stratio + u\).
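A sketch of the multiple regression follows, again on synthetic stand-in data. The coefficients used to generate the data (-0.7, 0.3 and -0.05) and the negative link between rooms and nox are assumptions for illustration, not properties of the real dataset.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (NOT the real dataset); all coefficients below
# are assumptions chosen for illustration.
rng = np.random.default_rng(3)
n = 1000
nox = rng.uniform(3.5, 9.0, n)
rooms = 7.5 - 0.2 * nox + rng.normal(scale=0.5, size=n)  # rooms related to nox
stratio = rng.uniform(12, 22, n)
log_price = (9.0 - 0.7 * np.log(nox) + 0.3 * rooms
             - 0.05 * stratio + rng.normal(scale=0.25, size=n))
df = pd.DataFrame({"price": np.exp(log_price), "nox": nox,
                   "rooms": rooms, "stratio": stratio})

# Control variables are added to the formula with "+".
multiple = smf.ols("np.log(price) ~ np.log(nox) + rooms + stratio",
                   data=df).fit()
print(multiple.summary())

# For comparison: the same regression without the controls.
simple = smf.ols("np.log(price) ~ np.log(nox)", data=df).fit()
print("log(nox) coefficient without controls:", simple.params["np.log(nox)"])
print("log(nox) coefficient with controls:   ", multiple.params["np.log(nox)"])
```

Comparing the coefficient on log(nox) with and without the controls shows how much of the simple-regression estimate was picking up the influence of the omitted variables.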

Exercise ☕📝

Run a similar multiple linear regression analysis, but this time include the squared term for rooms.

\(\log(price) = \beta_0 + \beta_1 \log(nox) + \beta_2 rooms + \beta_3 rooms^2 + \beta_4 stratio + u\).
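As a hint, one possible way to express the squared term in a statsmodels formula is to wrap it in `I()`, which tells the formula parser to treat `**` as ordinary arithmetic rather than as a formula operator. The sketch below uses synthetic stand-in data with assumed coefficients, so it shows the syntax only, not the answer for the real dataset.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (NOT the real dataset); the coefficients are
# assumptions chosen for illustration.
rng = np.random.default_rng(4)
n = 2000
nox = rng.uniform(3.5, 9.0, n)
rooms = rng.normal(6.3, 0.7, n)
stratio = rng.uniform(12, 22, n)
log_price = (13.0 - 0.9 * np.log(nox) - 0.5 * rooms + 0.06 * rooms**2
             - 0.05 * stratio + rng.normal(scale=0.25, size=n))
df = pd.DataFrame({"price": np.exp(log_price), "nox": nox,
                   "rooms": rooms, "stratio": stratio})

# I(rooms**2) adds the squared term as its own regressor.
quadratic = smf.ols(
    "np.log(price) ~ np.log(nox) + rooms + I(rooms**2) + stratio",
    data=df,
).fit()
print(quadratic.summary())

# Look up the squared term by its "I(" prefix instead of hard-coding the label.
sq_name = [name for name in quadratic.params.index if name.startswith("I(")][0]
print(sq_name, quadratic.params[sq_name])
```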