House Pricing¶
Another classic problem for practicing your Exploratory Data Analysis is the house pricing competition from Kaggle. As with the Titanic one, which I went through in this entry: https://achefethings.blogspot.com/2021/04/titanic-pre-margin-0px-border-none.html, there are many tutorials and a lot of information to explore.
In this case the data set is very similar to the Titanic one:
- train.csv, the file you use to prepare your model
- test.csv, the file you put your model to work on
- data_description.txt, important for fathoming what each column means, since the datasets have a ton of attributes
- sample_submission.csv, the format in which you put your results before uploading them to Kaggle
As always, at the beginning we import all the external modules, packages and libraries that will be needed throughout the analysis.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import scipy.stats as st
from sklearn.linear_model import LinearRegression
Then define the filepaths where the files reside and open them as pandas dataframes.
train_filepath = r"C:\Users\Usuario\Documents\PythonML\Houses EDA\train.csv"
test_filepath = r"C:\Users\Usuario\Documents\PythonML\Houses EDA\test.csv"
train = pd.read_csv(train_filepath)
test = pd.read_csv(test_filepath)
Let's see if they were imported correctly
train.shape,test.shape
((1460, 81), (1459, 80))
train.head()
Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |
5 rows × 81 columns
test.head()
Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1461 | 20 | RH | 80.0 | 11622 | Pave | NaN | Reg | Lvl | AllPub | ... | 120 | 0 | NaN | MnPrv | NaN | 0 | 6 | 2010 | WD | Normal |
1 | 1462 | 20 | RL | 81.0 | 14267 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | 0 | NaN | NaN | Gar2 | 12500 | 6 | 2010 | WD | Normal |
2 | 1463 | 60 | RL | 74.0 | 13830 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | 0 | NaN | MnPrv | NaN | 0 | 3 | 2010 | WD | Normal |
3 | 1464 | 60 | RL | 78.0 | 9978 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | 0 | NaN | NaN | NaN | 0 | 6 | 2010 | WD | Normal |
4 | 1465 | 120 | RL | 43.0 | 5005 | Pave | NaN | IR1 | HLS | AllPub | ... | 144 | 0 | NaN | NaN | NaN | 0 | 1 | 2010 | WD | Normal |
5 rows × 80 columns
Quick summary¶
These datasets are quite complete: they include information about 1,460 houses described by some 80 attributes, covering things like the type of land the house is built on, the year of construction, access type, surface areas, the number of rooms and their layout inside the house, availability of utilities, the neighbourhood, the condition of the house, the construction materials of various parts, garden surface and so on. The full list of attributes is printed below, and their descriptions can be found in the .txt file.
train.columns.values
array(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual', 'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType', 'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual', 'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC', 'Fence', 'MiscFeature', 'MiscVal', 'MoSold', 'YrSold', 'SaleType', 'SaleCondition', 'SalePrice'], dtype=object)
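Before splitting them up, a quick look at how the column dtypes break down already hints at the two groups we will use (a small aside, not strictly needed):
# count how many columns there are of each dtype (int64, float64, object)
print(train.dtypes.value_counts())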
You can see that there is a wide variety of attributes, but we can split them into two groups:
- categorical
- quantitative
Since we are dealing with a dataset with a large number of columns, grouping them by hand would take a lot of time and be very inefficient, so we need a snippet that tells us which of these attributes are quantitative and which are categorical. This basically translates to discerning which columns hold numbers and which do not.
The best way to do this is to check the data type of each column.
In the snippet below I find the indices of the columns whose dtype is number-like (int64 or float64) and of those that are not (plain "object", or "O", in this case), and store the indices in two lists (quantitative and categorical).
train_type_series = train.dtypes.values.tolist()
quantitative = []
categorical = []
for idx, i in enumerate(train_type_series):
    if (i == "int64") or (i == "float64"):
        quantitative.append(idx)   # numeric column
    else:
        categorical.append(idx)    # object column
Now, find the column names that correspond to each index
quantitative_columns = []
categorical_columns = []
for idx in quantitative:
    quantitative_columns.append(train.columns.values.tolist()[idx])
for idx in categorical:
    categorical_columns.append(train.columns.values.tolist()[idx])
len(quantitative_columns),len(categorical_columns)
(38, 43)
There are 38 quantitative columns (36 features plus Id and SalePrice) and 43 categorical ones. They are printed below, paired up with zip just for display (note that zip stops at the shorter list, so the last few categorical columns do not appear).
print(set(zip(quantitative_columns,categorical_columns)))
{('EnclosedPorch', 'KitchenQual'), ('YearRemodAdd', 'LandSlope'), ('Fireplaces', 'BsmtFinType1'), ('LowQualFinSF', 'Exterior1st'), ('SalePrice', 'PavedDrive'), ('MasVnrArea', 'Neighborhood'), ('WoodDeckSF', 'CentralAir'), ('YearBuilt', 'LotConfig'), ('MSSubClass', 'Street'), ('GarageCars', 'Heating'), ('GrLivArea', 'Exterior2nd'), ('BsmtFinSF1', 'Condition1'), ('LotArea', 'LotShape'), ('GarageArea', 'HeatingQC'), ('3SsnPorch', 'Functional'), ('MoSold', 'GarageQual'), ('MiscVal', 'GarageFinish'), ('Id', 'MSZoning'), ('BsmtFinSF2', 'Condition2'), ('BsmtUnfSF', 'BldgType'), ('BedroomAbvGr', 'BsmtQual'), ('1stFlrSF', 'RoofStyle'), ('OverallCond', 'Utilities'), ('ScreenPorch', 'FireplaceQu'), ('TotRmsAbvGrd', 'BsmtExposure'), ('PoolArea', 'GarageType'), ('TotalBsmtSF', 'HouseStyle'), ('GarageYrBlt', 'BsmtFinType2'), ('YrSold', 'GarageCond'), ('FullBath', 'ExterCond'), ('HalfBath', 'Foundation'), ('OpenPorchSF', 'Electrical'), ('OverallQual', 'LandContour'), ('2ndFlrSF', 'RoofMatl'), ('BsmtHalfBath', 'ExterQual'), ('BsmtFullBath', 'MasVnrType'), ('KitchenAbvGr', 'BsmtCond'), ('LotFrontage', 'Alley')}
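As a side note, pandas can do this split much more concisely with select_dtypes; a minimal equivalent sketch:
# same split in two lines using pandas' dtype selection
quantitative_columns = train.select_dtypes(include=["number"]).columns.tolist()
categorical_columns = train.select_dtypes(include=["object"]).columns.tolist()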
Investigating the data set¶
Usually, datasets are not fully complete. For instance, if a dataset contains basketball player statistics and some columns are reserved for 3-pointers, those columns might well be empty for centers, and a column specifically meant for dunks will have a lot of missing values for point guards.
The same applies to this dataset: maybe some houses have only one floor, or maybe they don't have a kitchen, so many NaNs (Not a Number) will appear.
Let's count the number of NaNs per column to get an idea of where the data is most scarce and decide what to do with those missing values.
NaNs = train.isnull().sum()
NaNs = NaNs[NaNs>0].sort_values()
NaNs.plot.bar()
plt.show()
We can see that 19 of the roughly 80 attributes have missing values, and some of them refer to rather unusual features of the houses, like pools or alleys, since the number of NaNs in those columns is close to the total number of houses in the dataset.
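To put numbers on that claim, the share of missing values per column can be printed directly (a quick check, not strictly needed for what follows):
# percentage of missing values per column, largest first
missing_pct = train.isnull().mean().sort_values(ascending=False) * 100
print(missing_pct[missing_pct > 0].round(1))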
We are now at the point where we have to decide what to do with the missing data. To stay as close as possible to the original dataset, this is what I have come up with:
- Remove the PoolQC (Pool Quality) column, as it is missing almost everywhere.
- Remove the MiscFeature column, as it is missing almost everywhere too.
- According to this research (https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.448.3233&rep=rep1&type=pdf), paved access can increase the price of a house by up to 20%, so this is clearly an important feature to retain; the question is how to complete the missing data. Let's make it easy: the study looked at 587 paved properties versus 1624 unpaved ones, so I will take the percentage of each and assign paved/unpaved at random to the missing values in those proportions.
- For the fences I will process this indicator a bit. In this case a missing value does not mean that data is missing, but that there is no fence. The explanation is that in most places in the US it is frowned upon or even forbidden to have one, for the reasons given here (https://www.forumdaily.com/en/osobennosti-amerikanskoj-nedvizhimosti-i-pochemu-zdes-net-zaborov/): apparently the fact that a house has no fence can mean that the households around it have an income similar to that of the house in question, therefore neither raising nor lowering its price.
- The remaining columns need some further treatment, since pandas is reading as NaN some values that are not really missing data.
That is the case of the fireplace quality (FireplaceQu), where NA means that there is no fireplace, not that the information about the fireplace is missing.
train["FireplaceQu"] = train["FireplaceQu"].fillna("Nothing")
train["Fence"] = train["Fence"].fillna("Nothing")
train["FireplaceQu"], train["Fence"]
(0 Nothing 1 TA 2 TA 3 Gd 4 TA ... 1455 TA 1456 TA 1457 Gd 1458 Nothing 1459 Nothing Name: FireplaceQu, Length: 1460, dtype: object, 0 Nothing 1 Nothing 2 Nothing 3 Nothing 4 Nothing ... 1455 Nothing 1456 MnPrv 1457 GdPrv 1458 Nothing 1459 Nothing Name: Fence, Length: 1460, dtype: object)
Now, for the paved or unpaved access to the houses, I will start by dropping the columns I mentioned before, plus the Alley one, and then rebuild it myself.
train = train.drop(["PoolQC","MiscFeature","Alley"], axis=1)
train.shape
(1460, 78)
paved = 587      # paved properties in the cited study
unpaved = 1624   # unpaved properties in the cited study
total_p = paved + unpaved
paved_percent = paved/total_p
# build a 0/1 vector with the same paved/unpaved proportions as the study
vector_paved = np.ones(int(train.shape[0]*paved_percent))
vector_unpaved = np.zeros(1+int(train.shape[0]*(1-paved_percent)))
total = np.concatenate((vector_paved, vector_unpaved))
np.random.shuffle(total)           # shuffle so the assignment is random
train["Alley"] = total.tolist()    # 1 = paved access, 0 = unpaved
train.shape
(1460, 79)
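For what it's worth, the same random assignment could be written more compactly with np.random.choice, using the study's 587/1624 split as probabilities (each house is drawn independently, so the proportions only hold in expectation):
# equivalent sketch: draw 1 (paved) or 0 (unpaved) per house with the study's proportions
train["Alley"] = np.random.choice([1, 0], size=train.shape[0], p=[paved_percent, 1 - paved_percent])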
Now that we've more or less cleaned the data, the number of NaNs looks much more promising, with far fewer values that might distort the final result.
NaNs = train.isnull().sum()
NaNs = NaNs[NaNs>0].sort_values()
NaNs.plot.bar()
plt.show()
Some obvious relations¶
It is clear that some variables will drive others (such as the SalePrice). Let's think of the most obvious ones:
- TotalBsmtSF: the surface area of the basement; larger houses will fetch higher prices
- GrLivArea: the above-ground living area of the house
- OverallQual: the overall quality of the house's material and finish
Some that are not as obvious might be:
- YearBuilt: newer houses will in general be in a better state, but some people will opt for old houses for several reasons
- Number of rooms, garage, kitchens, the area where the house lies, etc...
figure, axs = plt.subplots(1,2, sharey=True)
sns.scatterplot(data=train, x="GrLivArea", y="SalePrice",ax=axs[0])
axs[0].set_xlabel("Ground Surface")
axs[0].set_ylabel("Sale Price")
sns.scatterplot(data=train, x="TotalBsmtSF", y="SalePrice",ax=axs[1])
axs[1].set_xlabel("Basement Surface")
plt.show()
sns.violinplot(x='OverallQual', y='SalePrice', data=train)
plt.title("Violin plot", loc="center")
plt.xlabel("Overall Quality")
plt.ylabel("Sale Price")
plt.show()
The variables explored above are clearly related to the sale price; some carry more weight than others, but the relation is there, so they should be considered for the prediction model that is going to be developed.
Further analysing the data¶
Up until now we've only considered the variables that we think are related, but others that escape our intuition might be related too. So let's try a more objective analysis.
For this, the correlation map will give us a very good idea of which parameters have the strongest relation with the final sale price.
In the graphic below we can see that some variables aren't correlated with each other, while others have a very strong correlation.
Let's now zoom in on the ones that have the strongest correlation with the sale price.
corrmat = train.corr()  # on recent pandas versions you may need train.corr(numeric_only=True)
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=.8, square=True,cmap="BuPu");
k = 10
cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index
cm = np.corrcoef(train[cols].values.T)
sns.set(font_scale=1.25)
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values,cmap="BuPu")
plt.show()
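For a plain-numbers view of the same thing, the strongest correlations with SalePrice can also be printed directly:
# the k variables most correlated with the sale price
print(corrmat["SalePrice"].sort_values(ascending=False).head(k))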
Let's try to draw some conclusions from what we are seeing, focusing on the top row (the correlations with SalePrice):
- Overall quality and living area are strongly correlated with the price
- GarageCars and GarageArea also correlate with the price of the house, but they are strongly correlated with each other as well. This is a clear case of multicollinearity: the two variables provide basically the same information, so let's keep the one that best complements the information in the other variables (GarageArea); the quick check after this list backs this up
- TotalBsmtSF (total basement square feet) and 1stFlrSF (first floor square feet) seem to suffer from the same problem, so let's just keep one of them
- FullBath and TotRmsAbvGrd (total rooms above ground) also make the list; I guess people just like having plenty of rooms in their house, which is normal
- Finally YearBuilt, as we anticipated
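As a quick sanity check on those multicollinearity claims (a small aside, not part of the original analysis), the pairwise correlations can be printed directly:
# correlation of the two suspected redundant pairs
print(train[["GarageCars", "GarageArea"]].corr())
print(train[["TotalBsmtSF", "1stFlrSF"]].corr())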
Weird data¶
Before continuing with the analysis, let's look at the dots that lie far apart from the others; something odd seems to be going on there, and this data, if not treated carefully, could negatively impact our study.
Let's bring back the two scatter plots we created previously.
As we can see, there are two points on the right that are too far out of the crowd and do not follow any trend, so let's just delete them.
We'll keep the ones high above since, even though they are well apart from the others, they do seem to follow the trend.
figure, axs = plt.subplots(1,2, sharey=True)
sns.scatterplot(data=train, x="GrLivArea", y="SalePrice",ax=axs[0])
axs[0].set_xlabel("Ground Surface")
axs[0].set_ylabel("Sale Price")
sns.scatterplot(data=train, x="TotalBsmtSF", y="SalePrice",ax=axs[1])
axs[1].set_xlabel("Basement Surface")
plt.show()
# the two houses with the largest above-ground living area are the suspicious points
train.sort_values(by=["GrLivArea"],ascending=False)[:2]
train = train.drop(train[train['Id'] == 1299].index)
train = train.drop(train[train['Id'] == 524].index)
And now let's just plot them again to see how they look. I would say that's much better.
figure, axs = plt.subplots(1,2, sharey=True)
sns.scatterplot(data=train, x="GrLivArea", y="SalePrice",ax=axs[0])
axs[0].set_xlabel("Ground Surface")
axs[0].set_ylabel("Sale Price")
sns.scatterplot(data=train, x="TotalBsmtSF", y="SalePrice",ax=axs[1])
axs[1].set_xlabel("Basement Surface")
plt.show()
Sale Price¶
Let's make an attempt to investigate further how SalePrice is really distributed.
# distplot is deprecated in recent seaborn versions (histplot/displot replace it),
# but its fit= argument conveniently overlays a fitted normal curve
sns.distplot(train['SalePrice'], fit=st.norm)
plt.show()
# Q-Q plot of SalePrice against a normal distribution
res = st.probplot(train['SalePrice'], plot=plt)
plt.show()
It is apparent that the distribution of prices is not normal: it shows positive skewness, meaning that most prices are concentrated at the lower end with a long tail of expensive houses, and it shows excess kurtosis, meaning that values are more tightly concentrated around the centre than in a normal distribution.
If the prices were normally distributed, the blue line would closely follow the black one in the first graph, and the dots would follow the red line in the second.
skew = st.skew(train["SalePrice"])
kur = st.kurtosis(train["SalePrice"])
print(f"The skewness is: {skew} and the kurtosis is: {kur}")
The skewness is: 1.8793604459195012 and the kurtosis is: 6.496606038358726
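A plot-free way to see the positive skew is to compare the mean and the median: with a long right tail, the few very expensive houses pull the mean well above the median.
# mean vs median of SalePrice; the gap is a symptom of the right tail
print(train["SalePrice"].mean(), train["SalePrice"].median())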
Model to predict¶
I'm just going to use the variables that seem most influential for predicting the SalePrice. Luckily, none of them are categorical, so I'll skip the step of creating dummy variables.
variables = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index.tolist()
variables.remove("SalePrice")
X = train[variables]
Y = train["SalePrice"]
lm = LinearRegression()
lm.fit(X,Y)
LinearRegression()
lm.intercept_
-846372.5520106719
lm.coef_
array([ 1.84955532e+04, 7.15586314e+01, 2.09291367e+03, 3.42979311e+01, 3.43006325e+01, 9.98294602e+00, -9.59088520e+03, -2.77513821e+03, 3.91644235e+02])
Below are the coefficients that go with each of the 9 variables we selected, together with the R² of the fit, which is about 0.82. Let's say that is fair enough, neither too good nor too bad.
all_together = list(zip(variables,lm.coef_))
print(all_together)
[('OverallQual', 18495.55322233605), ('GrLivArea', 71.55863144592139), ('GarageCars', 2092.9136747611205), ('GarageArea', 34.29793110578195), ('TotalBsmtSF', 34.30063246824612), ('1stFlrSF', 9.982946024307836), ('FullBath', -9590.885196575362), ('TotRmsAbvGrd', -2775.138210750693), ('YearBuilt', 391.64423490304813)]
lm.score(X,Y)
0.8174225864840142
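lm.score returns the coefficient of determination, R² = 1 - SS_res/SS_tot, so that 0.817 can be reproduced by hand to make explicit what it measures:
# recompute R^2 from the residuals to confirm what lm.score reports
pred = lm.predict(X)
ss_res = ((Y - pred) ** 2).sum()
ss_tot = ((Y - Y.mean()) ** 2).sum()
print(1 - ss_res / ss_tot)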
Let's now try to predict the sale price of the houses in the test file. Rather than retyping the (truncated) coefficients by hand, we can use the fitted model directly; note that the intercept has to be included, otherwise every prediction would be off by roughly 846,000.
# intercept + sum of coefficient * feature for each house; rows with missing values
# in any of the selected columns end up with a NaN prediction
test["SalePrice"] = lm.intercept_ + test[variables].dot(lm.coef_)
figure, axs = plt.subplots(1,2, sharey=True)
sns.scatterplot(data=test, x="GrLivArea", y="SalePrice",ax=axs[0])
axs[0].set_xlabel("Ground Surface")
axs[0].set_ylabel("Sale Price")
sns.scatterplot(data=test, x="TotalBsmtSF", y="SalePrice",ax=axs[1])
axs[1].set_xlabel("Basement Surface")
plt.show()
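The competition also expects a submission file with the same format as sample_submission.csv; a minimal sketch for producing one, assuming we crudely fill the few NaN predictions with the training median price, would be:
# hypothetical submission sketch; Kaggle expects exactly two columns: Id and SalePrice
submission = test[["Id", "SalePrice"]].copy()
submission["SalePrice"] = submission["SalePrice"].fillna(train["SalePrice"].median())
submission.to_csv("submission.csv", index=False)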
Results¶
I've gone through a lot of data exploration, visualization of what is available and, finally, a model to predict sale prices based on the train data; the results are shown in the two graphs above.
It's been a nice little exercise to predict house prices and to explore new tools for visualizing and treating data.