Running the regression

Simply run the program. If it is run with an empty ignored list, the results will show up as a bunch of NaNs. Recall that earlier we did some correlation analysis to see which variables are strongly correlated with one another.

We'll start by adding those into our ignored list, and then run the regression. Once we have a score that is no longer NaN, we can start comparing models.
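As a rough sketch of how such an ignored list might be built from pairwise correlations, consider the snippet below. It is written in Python purely for illustration. The helper names, the 0.8 threshold, and the toy values are all assumptions of mine, not part of the program used in this chapter.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def columns_to_ignore(columns, threshold=0.8):
    """Return the names of columns that are highly correlated with a kept column.

    `columns` maps a column name to its list of values. The threshold is an
    arbitrary illustrative choice.
    """
    ignored = []
    for a, b in combinations(columns, 2):
        if a in ignored or b in ignored:
            continue
        if abs(pearson(columns[a], columns[b])) > threshold:
            ignored.append(b)  # keep the first column of the pair, drop the second
    return ignored


# Toy values made up for illustration; this is not the real dataset.
toy = {
    "GarageArea": [200.0, 400.0, 500.0, 650.0, 800.0],
    "GarageCars": [1.0, 2.0, 2.0, 3.0, 4.0],
    "MoSold": [9.0, 1.0, 12.0, 5.0, 7.0],
}
print(columns_to_ignore(toy))
```

Which column of a correlated pair to keep is a judgment call. Here the first one encountered simply wins, but in practice you would keep whichever is easier to explain.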

The final model I have prints the following output:

R^2: 0.871
Variable Coefficient StdErr t-stat p-value
Intercept: 12.38352 0.14768 83.85454 0.00000
MSSubClass_30: -0.06466 0.02135 -3.02913 0.00412
MSSubClass_40: -0.03771 0.08537 -0.44172 0.36175
MSSubClass_45: -0.12998 0.04942 -2.63027 0.01264
MSSubClass_50: -0.01901 0.01486 -1.27946 0.17590
MSSubClass_60: -0.06634 0.01061 -6.25069 0.00000
MSSubClass_70: 0.04089 0.02269 1.80156 0.07878
MSSubClass_75: 0.04604 0.03838 1.19960 0.19420
MSSubClass_80: -0.01971 0.02177 -0.90562 0.26462
MSSubClass_85: -0.02167 0.03838 -0.56458 0.34005
MSSubClass_90: -0.05748 0.02222 -2.58741 0.01413
MSSubClass_120: -0.06537 0.01763 -3.70858 0.00043
MSSubClass_160: -0.15650 0.02135 -7.33109 0.00000
MSSubClass_180: -0.01552 0.05599 -0.27726 0.38380
MSSubClass_190: -0.04344 0.02986 -1.45500 0.13840
LotFrontage: -0.00015 0.00265 -0.05811 0.39818
LotArea: 0.00799 0.00090 8.83264 0.00000
Neighborhood_Blueste: 0.02080 0.10451 0.19903 0.39102
Neighborhood_BrDale: -0.06919 0.04285 -1.61467 0.10835
Neighborhood_BrkSide: -0.06680 0.02177 -3.06894 0.00365
Neighborhood_ClearCr: -0.04217 0.03110 -1.35601 0.15904
Neighborhood_CollgCr: -0.06036 0.01403 -4.30270 0.00004
Neighborhood_Crawfor: 0.08813 0.02500 3.52515 0.00082
Neighborhood_Edwards: -0.18718 0.01820 -10.28179 0.00000
Neighborhood_Gilbert: -0.09673 0.01858 -5.20545 0.00000
Neighborhood_IDOTRR: -0.18867 0.02825 -6.67878 0.00000
Neighborhood_MeadowV: -0.24387 0.03971 -6.14163 0.00000
Neighborhood_Mitchel: -0.15112 0.02348 -6.43650 0.00000
Neighborhood_NAmes: -0.11880 0.01211 -9.81203 0.00000
Neighborhood_NPkVill: -0.05093 0.05599 -0.90968 0.26364
Neighborhood_NWAmes: -0.12200 0.01913 -6.37776 0.00000
Neighborhood_NoRidge: 0.13126 0.02688 4.88253 0.00000
Neighborhood_NridgHt: 0.16263 0.01899 8.56507 0.00000
Neighborhood_OldTown: -0.15781 0.01588 -9.93456 0.00000
Neighborhood_SWISU: -0.12722 0.03252 -3.91199 0.00020
Neighborhood_Sawyer: -0.17758 0.02040 -8.70518 0.00000
Neighborhood_SawyerW: -0.11027 0.02115 -5.21481 0.00000
Neighborhood_Somerst: 0.05793 0.01845 3.13903 0.00294
Neighborhood_StoneBr: 0.21206 0.03252 6.52102 0.00000
Neighborhood_Timber: -0.00449 0.02825 -0.15891 0.39384
Neighborhood_Veenker: 0.04530 0.04474 1.01249 0.23884
HouseStyle_1.5Unf: 0.16961 0.04474 3.79130 0.00031
HouseStyle_1Story: -0.03547 0.00864 -4.10428 0.00009
HouseStyle_2.5Fin: 0.16478 0.05599 2.94334 0.00531
HouseStyle_2.5Unf: 0.04816 0.04690 1.02676 0.23539
HouseStyle_2Story: 0.03271 0.00937 3.49038 0.00093
HouseStyle_SFoyer: 0.02498 0.02777 0.89968 0.26604
HouseStyle_SLvl: -0.02233 0.02076 -1.07547 0.22364
YearBuilt: 0.01403 0.00151 9.28853 0.00000
YearRemodAdd: 5.06512 0.41586 12.17991 0.00000
MasVnrArea: 0.00215 0.00164 1.30935 0.16923
Foundation_CBlock: -0.01183 0.00873 -1.35570 0.15910
Foundation_PConc: 0.01978 0.00869 2.27607 0.03003
Foundation_Slab: 0.01795 0.03416 0.52548 0.34738
Foundation_Stone: 0.03423 0.08537 0.40094 0.36802
Foundation_Wood: -0.08163 0.08537 -0.95620 0.25245
BsmtFinSF1: 0.01223 0.00145 8.44620 0.00000
BsmtFinSF2: -0.00148 0.00236 -0.62695 0.32764
BsmtUnfSF: -0.00737 0.00229 -3.21186 0.00234
TotalBsmtSF: 0.02759 0.00375 7.36536 0.00000
Heating_GasA: 0.02397 0.02825 0.84858 0.27820
Heating_GasW: 0.06687 0.03838 1.74239 0.08747
Heating_Grav: -0.15081 0.06044 -2.49506 0.01785
Heating_OthW: -0.00467 0.10451 -0.04465 0.39845
Heating_Wall: 0.06265 0.07397 0.84695 0.27858
CentralAir_Y: 0.10319 0.01752 5.89008 0.00000
1stFlrSF: 0.01854 0.00071 26.15440 0.00000
2ndFlrSF: 0.01769 0.00131 13.46733 0.00000
FullBath: 0.10586 0.01360 7.78368 0.00000
HalfBath: 0.09048 0.01271 7.11693 0.00000
Fireplaces: 0.07432 0.01096 6.77947 0.00000
GarageType_Attchd: -0.37539 0.00884 -42.44613 0.00000
GarageType_Basment: -0.47446 0.03718 -12.76278 0.00000
GarageType_BuiltIn: -0.33740 0.01899 -17.76959 0.00000
GarageType_CarPort: -0.60816 0.06044 -10.06143 0.00000
GarageType_Detchd: -0.39468 0.00983 -40.16266 0.00000
GarageType_2Types: -0.54960 0.06619 -8.30394 0.00000
GarageArea: 0.07987 0.00301 26.56053 0.00000
PavedDrive_P: 0.01773 0.03046 0.58214 0.33664
PavedDrive_Y: 0.02663 0.01637 1.62690 0.10623
WoodDeckSF: 0.00448 0.00166 2.69397 0.01068
OpenPorchSF: 0.00640 0.00201 3.18224 0.00257
PoolArea: -0.00075 0.00882 -0.08469 0.39742
MoSold: 0.00839 0.01020 0.82262 0.28430
YrSold: -4.27193 6.55001 -0.65220 0.32239
RMSE: 0.1428929042451045

The cross-validation results (an RMSE of 0.143) are decent: not the best, but not the worst either. This was achieved through careful elimination of variables. A seasoned econometrician may read these results and decide that further feature engineering could be done.
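For reference, an RMSE figure like the one above is the square root of the mean squared difference between actual and predicted values on held-out rows. The sketch below shows that calculation together with one simple way of splitting rows into folds. The function names and the contiguous-fold scheme are assumptions for illustration and may differ from how the program in this chapter does its cross-validation.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return sqrt(squared / len(actual))


def k_fold_indices(n, k):
    """Yield (train, test) index lists for k contiguous folds over n rows."""
    for fold in range(k):
        test = list(range(fold * n // k, (fold + 1) * n // k))
        train = [i for i in range(n) if i not in test]
        yield train, test


# Each row lands in exactly one test fold.
for train, test in k_fold_indices(5, 2):
    print("train:", train, "test:", test)
```

The model is fitted on each `train` split and scored with `rmse` on the matching `test` split. The per-fold scores are then averaged.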

Indeed, looking at these results, off the top of my head I can think of several other pieces of feature engineering that could be done. One is subtracting the year remodeled from the year sold, which captures the recency of remodeling or renovations. Another is to run a PCA-whitening process on the dataset.
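The remodeling-recency idea is simple enough to sketch. The snippet below assumes each record is a dictionary keyed by the column names YrSold and YearRemodAdd. The function name, the new RemodAge key, and the sample rows are hypothetical.

```python
def years_since_remodel(row):
    """Years between the last remodel and the sale (0 if remodeled in the sale year)."""
    return row["YrSold"] - row["YearRemodAdd"]


# Sample rows made up for illustration.
rows = [
    {"YrSold": 2008, "YearRemodAdd": 2003},
    {"YrSold": 2010, "YearRemodAdd": 1970},
    {"YrSold": 2006, "YearRemodAdd": 2006},
]

for row in rows:
    row["RemodAge"] = years_since_remodel(row)  # hypothetical new feature name

print([row["RemodAge"] for row in rows])
```

If a derived column like this goes into the model, the two columns it was built from would probably be candidates for the ignored list, since the three are linearly related.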

For linear regression models, I tend to stay away from complicated feature engineering. This is because the key benefit of a linear regression is that it's explainable in natural language.

For example, we can say this: for every unit increase in lot area, if everything else is held constant, we can expect a 0.00799 times increment in house price (the LotArea coefficient in the output above).
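To make "everything else held constant" concrete, here is a small sketch using the intercept and the LotArea and Fireplaces coefficients from the output above (0.00799 for LotArea). The `predict` helper and the feature values are made up for illustration, and only two of the model's variables are included.

```python
from math import exp


def predict(intercept, coefficients, features):
    """Linear prediction: intercept plus the sum of coefficient * feature value."""
    return intercept + sum(coefficients[name] * features[name] for name in coefficients)


intercept = 12.38352
coefficients = {"LotArea": 0.00799, "Fireplaces": 0.07432}

# Hypothetical feature values; only LotArea changes between the two houses.
house = {"LotArea": 3.0, "Fireplaces": 1.0}
bigger_lot = {"LotArea": 4.0, "Fireplaces": 1.0}

difference = predict(intercept, coefficients, bigger_lot) - predict(intercept, coefficients, house)
print(round(difference, 5))  # the LotArea coefficient, whatever the other values are

# Assumption: if the target were log(price), a unit increase would multiply
# the price by exp(coefficient), which is roughly 1 + coefficient when small.
print(exp(coefficients["LotArea"]))
```

The difference between the two predictions is the LotArea coefficient regardless of the values chosen for the other features. That property is what makes a linear model easy to put into words.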

A particularly counterintuitive result from this regression is the PoolArea variable. Interpreting the results, we would say: for every unit increase in pool area, we can expect a -0.00075 times increment in price, ceteris paribus. Granted, the p-value of the coefficient is 0.397, meaning that a coefficient of this size could easily have arisen by sheer random chance. Hence, we must be quite careful in saying that having a pool decreases the value of your property in Ames, Iowa.
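Another way to see why caution is warranted is to look at the standard error alongside the coefficient. The sketch below builds an approximate 95% interval as the coefficient plus or minus 1.96 standard errors. The normal approximation and the helper names are assumptions of mine, and this is not how the program above produces its p-value column. The numbers are taken from the PoolArea and LotArea rows of the output.

```python
def confidence_interval(coefficient, std_err, z=1.96):
    """Approximate interval: coefficient +/- z standard errors (normal approximation)."""
    return coefficient - z * std_err, coefficient + z * std_err


def excludes_zero(interval):
    """True if the whole interval lies on one side of zero."""
    low, high = interval
    return low > 0 or high < 0


pool_area = confidence_interval(-0.00075, 0.00882)  # PoolArea row
lot_area = confidence_interval(0.00799, 0.00090)    # LotArea row

print("PoolArea:", pool_area, "excludes zero:", excludes_zero(pool_area))
print("LotArea:", lot_area, "excludes zero:", excludes_zero(lot_area))
```

The PoolArea interval straddles zero, so the data cannot tell a small negative effect from a small positive one. The LotArea interval sits entirely above zero.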