House Pricing¶
Another classic problem for practicing your Exploratory Data Analysis is the house pricing competition from Kaggle. As with the Titanic one, which I went through in this entry: https://achefethings.blogspot.com/2021/04/titanic-pre-margin-0px-border-none.html, there are many tutorials and a lot of information to explore.
In this case the data set is very similar to the Titanic one:
- train.csv, the file you use to prepare your model
- test.csv, the file you put your model to work on
- data_description.txt, important for fathoming what each column means, since the datasets have a ton of attributes
- sample_submission.csv, the format in which you put your results before uploading them to Kaggle
As always, at the beginning we import all the external modules, packages and libraries that will be needed throughout the analysis.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import scipy.stats as st
from sklearn.linear_model import LinearRegression
Then define the filepaths where the files reside and open them as pandas dataframes.
train_filepath = r"C:\Users\Usuario\Documents\PythonML\Houses EDA\train.csv"
test_filepath = r"C:\Users\Usuario\Documents\PythonML\Houses EDA\test.csv"
train = pd.read_csv(train_filepath)
test = pd.read_csv(test_filepath)
Let's see if they were imported correctly
train.shape,test.shape
((1460, 81), (1459, 80))
train.head()
Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |
5 rows × 81 columns
test.head()
Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1461 | 20 | RH | 80.0 | 11622 | Pave | NaN | Reg | Lvl | AllPub | ... | 120 | 0 | NaN | MnPrv | NaN | 0 | 6 | 2010 | WD | Normal |
1 | 1462 | 20 | RL | 81.0 | 14267 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | 0 | NaN | NaN | Gar2 | 12500 | 6 | 2010 | WD | Normal |
2 | 1463 | 60 | RL | 74.0 | 13830 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | 0 | NaN | MnPrv | NaN | 0 | 3 | 2010 | WD | Normal |
3 | 1464 | 60 | RL | 78.0 | 9978 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | 0 | NaN | NaN | NaN | 0 | 6 | 2010 | WD | Normal |
4 | 1465 | 120 | RL | 43.0 | 5005 | Pave | NaN | IR1 | HLS | AllPub | ... | 144 | 0 | NaN | NaN | NaN | 0 | 1 | 2010 | WD | Normal |
5 rows × 80 columns
Quick summary¶
These datasets are quite complete: they include information about 1,460 houses described by some 80 attributes, covering things like the type of land the house is built on, the year of construction, access type, surface areas, the number of rooms and their layout inside the house, availability of utilities, the neighbourhood, the condition of the house, the construction materials of various parts, garden surface and so on. The full list of attributes is printed below, and their descriptions can be found in the .txt file.
train.columns.values
array(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual', 'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType', 'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual', 'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC', 'Fence', 'MiscFeature', 'MiscVal', 'MoSold', 'YrSold', 'SaleType', 'SaleCondition', 'SalePrice'], dtype=object)
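Before splitting them up, a quick look at how the column dtypes break down already hints at the two groups we will use (a small aside, not strictly needed):
# count how many columns there are of each dtype (int64, float64, object)
print(train.dtypes.value_counts())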
You can see that there is a wide variety of attributes, but we can split them into two groups:
- categorical
- quantitative
Since we are dealing with a dataset with a large number of columns, grouping them by hand would take a lot of time and be very inefficient, so we need a snippet that tells us which of these attributes are quantitative and which are categorical. This basically translates to discerning which columns hold numbers and which do not.
The best way to do this is to check the data type of each column.
In the snippet below I find the indices of the columns whose dtype is number-like (int64 or float64) and of those that are not (plain "object", or "O", in this case), and store the indices in two lists (quantitative and categorical).
train_type_series = train.dtypes.values.tolist()
quantitative = []
categorical = []
for idx, i in enumerate(train_type_series):
    if (i == "int64") or (i == "float64"):
        quantitative.append(idx)   # numeric column
    else:
        categorical.append(idx)    # object column
Now, find the column names that correspond to each index
quantitative_columns = []
categorical_columns = []
for idx in quantitative:
    quantitative_columns.append(train.columns.values.tolist()[idx])
for idx in categorical:
    categorical_columns.append(train.columns.values.tolist()[idx])
len(quantitative_columns),len(categorical_columns)
(38, 43)
There are 38 quantitative columns (36 features plus Id and SalePrice) and 43 categorical ones. They are printed below, paired up with zip just for display (note that zip stops at the shorter list, so the last few categorical columns do not appear).
print(set(zip(quantitative_columns,categorical_columns)))
{('EnclosedPorch', 'KitchenQual'), ('YearRemodAdd', 'LandSlope'), ('Fireplaces', 'BsmtFinType1'), ('LowQualFinSF', 'Exterior1st'), ('SalePrice', 'PavedDrive'), ('MasVnrArea', 'Neighborhood'), ('WoodDeckSF', 'CentralAir'), ('YearBuilt', 'LotConfig'), ('MSSubClass', 'Street'), ('GarageCars', 'Heating'), ('GrLivArea', 'Exterior2nd'), ('BsmtFinSF1', 'Condition1'), ('LotArea', 'LotShape'), ('GarageArea', 'HeatingQC'), ('3SsnPorch', 'Functional'), ('MoSold', 'GarageQual'), ('MiscVal', 'GarageFinish'), ('Id', 'MSZoning'), ('BsmtFinSF2', 'Condition2'), ('BsmtUnfSF', 'BldgType'), ('BedroomAbvGr', 'BsmtQual'), ('1stFlrSF', 'RoofStyle'), ('OverallCond', 'Utilities'), ('ScreenPorch', 'FireplaceQu'), ('TotRmsAbvGrd', 'BsmtExposure'), ('PoolArea', 'GarageType'), ('TotalBsmtSF', 'HouseStyle'), ('GarageYrBlt', 'BsmtFinType2'), ('YrSold', 'GarageCond'), ('FullBath', 'ExterCond'), ('HalfBath', 'Foundation'), ('OpenPorchSF', 'Electrical'), ('OverallQual', 'LandContour'), ('2ndFlrSF', 'RoofMatl'), ('BsmtHalfBath', 'ExterQual'), ('BsmtFullBath', 'MasVnrType'), ('KitchenAbvGr', 'BsmtCond'), ('LotFrontage', 'Alley')}
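As a side note, pandas can do this split much more concisely with select_dtypes; a minimal equivalent sketch:
# same split in two lines using pandas' dtype selection
quantitative_columns = train.select_dtypes(include=["number"]).columns.tolist()
categorical_columns = train.select_dtypes(include=["object"]).columns.tolist()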
Investigating the data set¶
Usually, datasets are not fully complete. For instance, if a dataset contains basketball player statistics and some columns are reserved for 3-pointers, those columns might well be empty for centers, and a column specifically meant for dunks will have a lot of missing values for point guards.
The same applies to this dataset: maybe some houses have only one floor, or maybe they don't have a kitchen, so many NaNs (Not a Number) will appear.
Let's count the number of NaNs per column to get an idea of where the data is most scarce and decide what to do with those missing values.
NaNs = train.isnull().sum()
NaNs = NaNs[NaNs>0].sort_values()
NaNs.plot.bar()
plt.show()
We can see that 19 of the roughly 80 attributes have missing values, and some of them refer to rather unusual features of the houses, like pools or alleys, since the number of NaNs in those columns is close to the total number of houses in the dataset.
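To put numbers on that claim, the share of missing values per column can be printed directly (a quick check, not strictly needed for what follows):
# percentage of missing values per column, largest first
missing_pct = train.isnull().mean().sort_values(ascending=False) * 100
print(missing_pct[missing_pct > 0].round(1))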
We are now at the point where we have to decide what to do with the missing data. To stay as close as possible to the original dataset, this is what I have come up with:
- Remove the PoolQC (Pool Quality) column, as it is missing almost everywhere.
- Remove the MiscFeature column, as it is missing almost everywhere too.
- According to this research (https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.448.3233&rep=rep1&type=pdf), paved access can increase the price of a house by up to 20%, so this is clearly an important feature to retain; the question is how to complete the missing data. Let's make it easy: the study looked at 587 paved properties versus 1624 unpaved ones, so I will take the percentage of each and assign paved/unpaved at random to the missing values in those proportions.
- For the fences I will process this indicator a bit. In this case a missing value does not mean that data is missing, but that there is no fence. The explanation is that in most places in the US it is frowned upon or even forbidden to have one, for the reasons given here (https://www.forumdaily.com/en/osobennosti-amerikanskoj-nedvizhimosti-i-pochemu-zdes-net-zaborov/): apparently the fact that a house has no fence can mean that the households around it have an income similar to that of the house in question, therefore neither raising nor lowering its price.
- The remaining columns need some further treatment, since pandas is reading as NaN some values that are not really missing data.
That is the case of the fireplace quality (FireplaceQu), where NA means that there is no fireplace, not that the information about the fireplace is missing.
train["FireplaceQu"] = train["FireplaceQu"].fillna("Nothing")
train["Fence"] = train["Fence"].fillna("Nothing")
train["FireplaceQu"], train["Fence"]
(0 Nothing 1 TA 2 TA 3 Gd 4 TA ... 1455 TA 1456 TA 1457 Gd 1458 Nothing 1459 Nothing Name: FireplaceQu, Length: 1460, dtype: object, 0 Nothing 1 Nothing 2 Nothing 3 Nothing 4 Nothing ... 1455 Nothing 1456 MnPrv 1457 GdPrv 1458 Nothing 1459 Nothing Name: Fence, Length: 1460, dtype: object)
Now, for the paved or unpaved access to the houses, I will start by dropping the columns I mentioned before, plus the Alley one, and then rebuild it myself.
train = train.drop(["PoolQC","MiscFeature","Alley"], axis=1)
train.shape
(1460, 78)
paved = 587      # paved properties in the cited study
unpaved = 1624   # unpaved properties in the cited study
total_p = paved + unpaved
paved_percent = paved/total_p
# build a 0/1 vector with the same paved/unpaved proportions as the study
vector_paved = np.ones(int(train.shape[0]*paved_percent))
vector_unpaved = np.zeros(1+int(train.shape[0]*(1-paved_percent)))
total = np.concatenate((vector_paved, vector_unpaved))
np.random.shuffle(total)           # shuffle so the assignment is random
train["Alley"] = total.tolist()    # 1 = paved access, 0 = unpaved
train.shape
(1460, 79)
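For what it's worth, the same random assignment could be written more compactly with np.random.choice, using the study's 587/1624 split as probabilities (each house is drawn independently, so the proportions only hold in expectation):
# equivalent sketch: draw 1 (paved) or 0 (unpaved) per house with the study's proportions
train["Alley"] = np.random.choice([1, 0], size=train.shape[0], p=[paved_percent, 1 - paved_percent])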
Now that we've more or less cleaned the data, the number of NaNs looks much more promising, with far fewer values that might distort the final result.
NaNs = train.isnull().sum()
NaNs = NaNs[NaNs>0].sort_values()
NaNs.plot.bar()
plt.show()
Some obvious relations¶
It is clear that some variables will drive others (such as the SalePrice). Let's think of the most obvious ones:
- TotalBsmtSF: the surface area of the basement; larger houses will fetch higher prices
- GrLivArea: the above-ground living area of the house
- OverallQual: the overall quality of the house's material and finish
Some that are not as obvious might be:
- YearBuilt: newer houses will in general be in a better state, but some people will opt for old houses for several reasons
- Number of rooms, garage, kitchens, the area where the house lies, etc...
figure, axs = plt.subplots(1,2, sharey=True)
sns.scatterplot(data=train, x="GrLivArea", y="SalePrice",ax=axs[0])
axs[0].set_xlabel("Ground Surface")
axs[0].set_ylabel("Sale Price")
sns.scatterplot(data=train, x="TotalBsmtSF", y="SalePrice",ax=axs[1])
axs[1].set_xlabel("Basement Surface")
plt.show()
sns.violinplot(x='OverallQual', y='SalePrice', data=train)
plt.title("Violin plot", loc="center")
plt.xlabel("Overall Quality")
plt.ylabel("Sale Price")
plt.show()
The variables explored above are clearly related to the sale price; some carry more weight than others, but the relation is there, so they should be considered for the prediction model that is going to be developed.
Further analysing the data¶
Up until now we've only considered the variables that we think are related, but others that escape our intuition might be related too. So let's try a more objective analysis.
For this, the correlation map will give us a very good idea of which parameters have the strongest relation with the final sale price.
In the graphic below we can see that some variables aren't correlated with each other, while others have a very strong correlation.
Let's now zoom in on the ones that have the strongest correlation with the sale price.
corrmat = train.corr()  # on recent pandas versions you may need train.corr(numeric_only=True)
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=.8, square=True,cmap="BuPu");
k = 10
cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index
cm = np.corrcoef(train[cols].values.T)
sns.set(font_scale=1.25)
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values,cmap="BuPu")
plt.show()
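For a plain-numbers view of the same thing, the strongest correlations with SalePrice can also be printed directly:
# the k variables most correlated with the sale price
print(corrmat["SalePrice"].sort_values(ascending=False).head(k))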
Let's try to draw some conclusions from what we are seeing, focusing on the top row (the correlations with SalePrice):
- Overall quality and living area are strongly correlated with the price
- GarageCars and GarageArea also correlate with the price of the house, but they are strongly correlated with each other as well. This is a clear case of multicollinearity: the two variables provide basically the same information, so let's keep the one that best complements the information in the other variables (GarageArea); the quick check after this list backs this up
- TotalBsmtSF (total basement square feet) and 1stFlrSF (first floor square feet) seem to suffer from the same problem, so let's just keep one of them
- FullBath and TotRmsAbvGrd (total rooms above ground) also make the list; I guess people just like having plenty of rooms in their house, which is normal
- Finally YearBuilt, as we anticipated
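As a quick sanity check on those multicollinearity claims (a small aside, not part of the original analysis), the pairwise correlations can be printed directly:
# correlation of the two suspected redundant pairs
print(train[["GarageCars", "GarageArea"]].corr())
print(train[["TotalBsmtSF", "1stFlrSF"]].corr())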
Weird data¶
Before continuing with the analysis, let's look at the dots that lie far apart from the others; something odd seems to be going on there, and this data, if not treated carefully, could negatively impact our study.
Let's bring back the two scatter plots we created previously.
As we can see, there are two points on the right that are too far out of the crowd and do not follow any trend, so let's just delete them.
We'll keep the ones high above since, even though they are well apart from the others, they do seem to follow the trend.
figure, axs = plt.subplots(1,2, sharey=True)
sns.scatterplot(data=train, x="GrLivArea", y="SalePrice",ax=axs[0])
axs[0].set_xlabel("Ground Surface")
axs[0].set_ylabel("Sale Price")
sns.scatterplot(data=train, x="TotalBsmtSF", y="SalePrice",ax=axs[1])
axs[1].set_xlabel("Basement Surface")
plt.show()
# the two houses with the largest above-ground living area are the suspicious points
train.sort_values(by=["GrLivArea"],ascending=False)[:2]
train = train.drop(train[train['Id'] == 1299].index)
train = train.drop(train[train['Id'] == 524].index)
And now let's just plot them again to see how they look. I would say that's much better.
figure, axs = plt.subplots(1,2, sharey=True)
sns.scatterplot(data=train, x="GrLivArea", y="SalePrice",ax=axs[0])
axs[0].set_xlabel("Ground Surface")
axs[0].set_ylabel("Sale Price")
sns.scatterplot(data=train, x="TotalBsmtSF", y="SalePrice",ax=axs[1])
axs[1].set_xlabel("Basement Surface")
plt.show()
Sale Price¶
Let's make an attempt to investigate further how SalePrice is really distributed.
# distplot is deprecated in recent seaborn versions (histplot/displot replace it),
# but its fit= argument conveniently overlays a fitted normal curve
sns.distplot(train['SalePrice'], fit=st.norm)
plt.show()
# Q-Q plot of SalePrice against a normal distribution
res = st.probplot(train['SalePrice'], plot=plt)
plt.show()
It is apparent that the distribution of prices is not normal: it shows positive skewness, meaning that most prices are concentrated at the lower end with a long tail of expensive houses, and it shows excess kurtosis, meaning that values are more tightly concentrated around the centre than in a normal distribution.
If the prices were normally distributed, the blue line would closely follow the black one in the first graph, and the dots would follow the red line in the second.
skew = st.skew(train["SalePrice"])
kur = st.kurtosis(train["SalePrice"])
print(f"The skewness is: {skew} and the kurtosis is: {kur}")
The skewness is: 1.8793604459195012 and the kurtosis is: 6.496606038358726
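A plot-free way to see the positive skew is to compare the mean and the median: with a long right tail, the few very expensive houses pull the mean well above the median.
# mean vs median of SalePrice; the gap is a symptom of the right tail
print(train["SalePrice"].mean(), train["SalePrice"].median())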
Model to predict¶
I'm just going to use the variables that seem most influential for predicting the SalePrice. Luckily, none of them are categorical, so I'll skip the step of creating dummy variables.
variables = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index.tolist()
variables.remove("SalePrice")
X = train[variables]
Y = train["SalePrice"]
lm = LinearRegression()
lm.fit(X,Y)
LinearRegression()
lm.intercept_
-846372.5520106719
lm.coef_
array([ 1.84955532e+04, 7.15586314e+01, 2.09291367e+03, 3.42979311e+01, 3.43006325e+01, 9.98294602e+00, -9.59088520e+03, -2.77513821e+03, 3.91644235e+02])
Below are the coefficients that go with each of the 9 variables we selected, together with the R² of the fit, which is about 0.82. Let's say that is fair enough, neither too good nor too bad.
all_together = list(zip(variables,lm.coef_))
print(all_together)
[('OverallQual', 18495.55322233605), ('GrLivArea', 71.55863144592139), ('GarageCars', 2092.9136747611205), ('GarageArea', 34.29793110578195), ('TotalBsmtSF', 34.30063246824612), ('1stFlrSF', 9.982946024307836), ('FullBath', -9590.885196575362), ('TotRmsAbvGrd', -2775.138210750693), ('YearBuilt', 391.64423490304813)]
lm.score(X,Y)
0.8174225864840142
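lm.score returns the coefficient of determination, R² = 1 - SS_res/SS_tot, so that 0.817 can be reproduced by hand to make explicit what it measures:
# recompute R^2 from the residuals to confirm what lm.score reports
pred = lm.predict(X)
ss_res = ((Y - pred) ** 2).sum()
ss_tot = ((Y - Y.mean()) ** 2).sum()
print(1 - ss_res / ss_tot)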
Let's now try to predict the sale price of the houses in the test file. Rather than retyping the (truncated) coefficients by hand, we can use the fitted model directly; note that the intercept has to be included, otherwise every prediction would be off by roughly 846,000.
# intercept + sum of coefficient * feature for each house; rows with missing values
# in any of the selected columns end up with a NaN prediction
test["SalePrice"] = lm.intercept_ + test[variables].dot(lm.coef_)
figure, axs = plt.subplots(1,2, sharey=True)
sns.scatterplot(data=test, x="GrLivArea", y="SalePrice",ax=axs[0])
axs[0].set_xlabel("Ground Surface")
axs[0].set_ylabel("Sale Price")
sns.scatterplot(data=test, x="TotalBsmtSF", y="SalePrice",ax=axs[1])
axs[1].set_xlabel("Basement Surface")
plt.show()
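The competition also expects a submission file with the same format as sample_submission.csv; a minimal sketch for producing one, assuming we crudely fill the few NaN predictions with the training median price, would be:
# hypothetical submission sketch; Kaggle expects exactly two columns: Id and SalePrice
submission = test[["Id", "SalePrice"]].copy()
submission["SalePrice"] = submission["SalePrice"].fillna(train["SalePrice"].median())
submission.to_csv("submission.csv", index=False)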
Results¶
I've gone through a lot of data exploration, visualization of what is available and, finally, a model to predict sale prices based on the train data; the results are shown in the two graphs above.
It's been a nice little exercise to predict house prices and to explore new tools for visualizing and treating data.