Regression Analysis
In this section, we learn how to perform regression analysis using Python. Regression analysis is a statistical method for examining the relationship between a dependent variable and one or more independent variables. It is widely used in fields such as economics, finance, and the social sciences.
Importing the Dataset
We first import a dataset on housing prices and air pollution from Harrison & Rubinfeld (1978). The dataset is also used throughout Wooldridge's undergraduate econometrics textbook, Introductory Econometrics: A Modern Approach.
The data dictionary is as follows (source)
Variable | Description |
---|---|
1. price | median housing price, $ |
2. crime | crimes committed per capita |
3. nox | nitrous oxide, parts per 100 million |
4. rooms | average number of rooms per house |
5. dist | weighted distance to 5 employment centers |
6. radial | accessibility index to radial highways |
7. proptax | property tax per $1000 |
8. stratio | average student-teacher ratio |
9. lowstat | % of people ‘lower status’ |
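As a rough sketch of the import step, the snippet below reads a CSV with pandas and checks that the columns from the data dictionary are present. The helper name `load_housing` and the two sample rows are made up for illustration only; in practice you would pass the path or URL of the real data file, which is not specified here.

```python
import io

import pandas as pd

# Columns listed in the data dictionary above.
EXPECTED_COLUMNS = ["price", "crime", "nox", "rooms", "dist",
                    "radial", "proptax", "stratio", "lowstat"]


def load_housing(source):
    """Read the housing data from a CSV path or file-like object.

    Hypothetical helper: it only wraps pd.read_csv and validates column names.
    """
    df = pd.read_csv(source)
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return df[EXPECTED_COLUMNS]


# Two made-up rows (NOT real observations), used only to show the call.
# Replace sample_csv with the path or URL of the actual dataset.
sample_csv = io.StringIO(
    "price,crime,nox,rooms,dist,radial,proptax,stratio,lowstat\n"
    "24000,0.01,5.4,6.6,4.1,1,29.6,15.3,5.0\n"
    "21600,0.03,4.7,6.4,5.0,2,24.2,17.8,9.1\n"
)
df = load_housing(sample_csv)
print(df.head())
```

Validating the column names up front makes later formula errors (for example, a misspelled variable) easier to trace back to the data file.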
Quick Inspection
Let’s focus on `price`, `nox`, `rooms` and `stratio` for this analysis, and quickly inspect these four variables. (The data exploration done here is by no means complete or thorough.)
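A quick inspection along these lines might look like the sketch below. Because the real dataset is not loaded in this snippet, `df` is filled with synthetic stand-in values (an assumption made purely so the code runs on its own); with the imported data you would skip that part and apply the last few calls directly.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (NOT the real housing data), so the snippet is runnable.
rng = np.random.default_rng(0)
n = 200
nox = rng.uniform(3.5, 9.0, n)
rooms = rng.normal(6.3, 0.7, n)
stratio = rng.uniform(12.0, 22.0, n)
log_price = (9.0 - 0.7 * np.log(nox) + 0.3 * rooms
             - 0.05 * stratio + rng.normal(0, 0.2, n))
df = pd.DataFrame({"price": np.exp(log_price), "nox": nox,
                   "rooms": rooms, "stratio": stratio})

cols = ["price", "nox", "rooms", "stratio"]
summary = df[cols].describe()      # count, mean, std, min, quartiles, max
n_missing = df[cols].isna().sum()  # number of missing values per column
corr = df[cols].corr()             # pairwise Pearson correlations

print(summary)
print(n_missing)
print(corr)
```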
Linear Regression
We can use the `statsmodels` package to perform linear regression analysis. The package provides the `OLS()` model (ordinary least squares, the standard method for estimating a linear regression). It also offers an easy way to write regression formulas, and it produces a detailed regression report. This report is especially useful for causal analysis, where you care about statistical inference (e.g., confidence intervals or hypothesis tests for the estimated coefficients).
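To make the workflow concrete, here is a minimal sketch of the formula interface on toy data. The variable names `y` and `x` and the simulated values are assumptions for illustration; only `smf.ols`, `fit`, `summary`, `conf_int` and `pvalues` come from statsmodels itself.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Toy data (simulated, for illustration only): y = 1 + 2x + noise.
rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 1.0 + 2.0 * x + rng.normal(size=100)
toy = pd.DataFrame({"y": y, "x": x})

# "y ~ x" regresses y on x; an intercept is included by default.
results = smf.ols("y ~ x", data=toy).fit()

print(results.summary())           # the full regression report
ci = results.conf_int(alpha=0.05)  # 95% confidence intervals, one row per coefficient
print(ci)
print(results.pvalues)             # p-values for H0: coefficient = 0
```

The formula interface (`statsmodels.formula.api`) takes column names from the DataFrame, so there is no need to build the design matrix or add a constant by hand.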
Simple Linear Regression
We start by running a simple regression to investigate the effect of air pollution on housing prices.
\(\log(price) = \beta_0 + \beta_1 \log(nox) + u\).
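One way to estimate this model with the formula interface is sketched below. The data here are synthetic placeholders (not the real housing data), generated only so the snippet runs by itself; with the imported dataset you would pass that DataFrame as `data` instead.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (NOT the real housing data).
rng = np.random.default_rng(0)
n = 200
nox = rng.uniform(3.5, 9.0, n)
price = np.exp(11.0 - 1.0 * np.log(nox) + rng.normal(0, 0.3, n))
df = pd.DataFrame({"price": price, "nox": nox})

# Functions such as np.log can be applied directly inside the formula.
simple = smf.ols("np.log(price) ~ np.log(nox)", data=df).fit()
print(simple.summary())

# In a log-log model the slope is an elasticity: a 1% increase in nox
# is associated with roughly a beta1 % change in price.
beta1 = simple.params["np.log(nox)"]
print(beta1)
```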
Multiple Linear Regression
Let’s run a multiple regression to investigate the effect of air pollution on housing prices, this time controlling for the number of rooms and the student-teacher ratio.
\(\log(price) = \beta_0 + \beta_1 \log(nox) + \beta_2 rooms + \beta_3 stratio + u\).
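The multiple regression can be sketched the same way by adding the controls to the right-hand side of the formula. As before, the DataFrame below is a synthetic placeholder (an assumption so the code is self-contained), not the real dataset.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (NOT the real housing data).
rng = np.random.default_rng(0)
n = 200
nox = rng.uniform(3.5, 9.0, n)
rooms = rng.normal(6.3, 0.7, n)
stratio = rng.uniform(12.0, 22.0, n)
log_price = (9.0 - 0.7 * np.log(nox) + 0.3 * rooms
             - 0.05 * stratio + rng.normal(0, 0.2, n))
df = pd.DataFrame({"price": np.exp(log_price), "nox": nox,
                   "rooms": rooms, "stratio": stratio})

# Additional regressors are joined with "+".
multiple = smf.ols("np.log(price) ~ np.log(nox) + rooms + stratio", data=df).fit()
print(multiple.summary())

# Since the outcome is in logs, 100 * (coefficient on rooms) approximates the
# percentage change in price for one more room, holding nox and stratio fixed.
print(100 * multiple.params["rooms"])
```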
Exercise ☕📝
Run a similar multiple linear regression analysis, but this time include a squared term for rooms.
\(\log(price) = \beta_0 + \beta_1 \log(nox) + \beta_2 rooms + \beta_3 rooms^2 + \beta_4 stratio + u\).
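As a hint, one possible way to include a squared term in a formula is to wrap the arithmetic in `I()`, which tells the formula parser to evaluate the expression as ordinary Python. The sketch below uses synthetic placeholder data (not the real housing data); try it on the imported dataset to complete the exercise.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (NOT the real housing data).
rng = np.random.default_rng(0)
n = 200
nox = rng.uniform(3.5, 9.0, n)
rooms = rng.normal(6.3, 0.7, n)
stratio = rng.uniform(12.0, 22.0, n)
log_price = (9.0 - 0.7 * np.log(nox) + 0.3 * rooms
             - 0.05 * stratio + rng.normal(0, 0.2, n))
df = pd.DataFrame({"price": np.exp(log_price), "nox": nox,
                   "rooms": rooms, "stratio": stratio})

# I(rooms**2) adds the squared term as its own regressor.
quad = smf.ols(
    "np.log(price) ~ np.log(nox) + rooms + I(rooms**2) + stratio",
    data=df,
).fit()
print(quad.summary())
```

With the squared term included, the partial effect of rooms on \(\log(price)\) is \(\beta_2 + 2\beta_3 rooms\), so it varies with the number of rooms.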