%matplotlib inline
import os
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedShuffleSplit
from pandas.plotting import scatter_matrix
import seaborn as sns
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from scipy import stats
Titanic problem - 2nd try
This time I'm going to follow the Machine Learning Project Checklist that appears in "Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow" by Aurélien Géron, and I'm going to try to put into practice everything I have learnt so far that applies to this problem:
- Regression
- Support Vector Machines
- Decision trees
- Ensemble learning
1. Frame the problem and look at the big picture
The objective is to determine whether a passenger survived the sinking of the Titanic or not. This is clearly a classification problem, in which we have to decide whether someone belongs to one category (survived) or the other (did not survive). The problem will use supervised ML models; there is a single dataset that has already been produced and will not be updated anymore, so the training will be done in batch (offline).
Currently there are many solutions proposed to this problem, but the goal is not to get the best one; it is to work through the problem myself and learn as much as possible.
To measure the performance of my models I can use the scoring tools provided by Scikit-Learn.
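For example, a minimal sketch of what such scoring could look like for a classifier (toy data only; the real Titanic features and labels are prepared later in this notebook):
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

# Toy data just to illustrate the scoring API; `X_toy`/`y_toy` are
# placeholders, not the Titanic data prepared further below.
X_toy, y_toy = make_classification(n_samples=200, random_state=42)
scores = cross_val_score(LogisticRegression(), X_toy, y_toy, scoring="accuracy", cv=10)
print(scores.mean(), scores.std())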
2. Get the data
The data is obtained from "https://www.kaggle.com/c/titanic", and in this section I will load it and check for constraints due to its size, legal implications, the type of data provided, etc.
In the following lines of this section I import the data and check whether it was correctly imported or not.
There are two datasets, one for training the algorithm and a second one for testing. However, the second one has had the labels removed, so unless you submit it to the competition it cannot really be used. Therefore I will only use the training one, and shuffle it to use it as both train and test data.
2.1 Create a test and train set
Now we'll have a train_set and a test_set; we'll set the latter aside and not open it until the end. However, if we limited ourselves to a purely random split we could introduce sampling bias, especially since this dataset is not very large. So we are going to proceed with stratified sampling, so that there is the right proportion of instances from each stratum (in this case the passenger class, "Pclass"), mainly because the categories are not equally distributed: there are many more 3rd class passengers than passengers of the other classes.
In the table below it can be seen that StratifiedShuffleSplit matches the original distribution better than the random split, therefore introducing less bias into the sample.
FILEPATH = "C:/Users/Usuario/Documents/Kaggle/titanic"
DATA = FILEPATH + "/train.csv"
data = pd.read_csv(DATA)
data.head()
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
train_set, test_set = train_test_split(data, test_size = 0.3, random_state = 42)
split = StratifiedShuffleSplit(n_splits = 1, test_size = 0.3, random_state = 42)
for train_index, test_index in split.split(data, data["Pclass"]):
strat_train_set = data.loc[train_index]
strat_test_set = data.loc[test_index]
overall = data["Pclass"].value_counts() / len(data)
stratified = strat_train_set["Pclass"].value_counts() / len(strat_train_set)
random = train_set["Pclass"].value_counts() / len(train_set)
# Compare the class proportions of each split against the full dataset;
# keeping everything as Series aligns rows by Pclass value instead of
# relying on the ordering of value_counts()
sample_bias = pd.DataFrame({"Overall": overall,
                            "Stratified": stratified,
                            "Random": random,
                            "Rnd. % error": random - overall,
                            "Strat. % error": stratified - overall})
sample_bias
| | Overall | Stratified | Random | Rnd. % error | Strat. % error |
|---|---|---|---|---|---|
| 3 | 0.551066 | 0.550562 | 0.565008 | 0.013942 | -0.000504 |
| 1 | 0.242424 | 0.242376 | 0.223114 | -0.019310 | -0.000049 |
| 2 | 0.206510 | 0.207063 | 0.211878 | 0.005368 | 0.000553 |
3. Explore the data
This phase will be very similar to the one in my previous try at the Titanic problem, but I'll go through it again to try to spot new insights, and for this part I'll work on the original dataset to get a better picture of the whole thing.
There are 12 features per passenger; 7 of them are numerical and the remaining 5 are categorical, although some of them might not be relevant to determine the fate of each passenger. These attributes are:
- PassengerId: just a number to identify each passenger
- Survived: binary variable that tells whether the passenger survived or not
- Pclass: the passenger's class on the ship
- Name
- Sex
- Age
- SibSp: number of siblings or spouses aboard
- Parch: number of parents or children aboard
- Ticket: the ticket number
- Fare: the amount they paid for the ticket
- Cabin: the identifier of their cabin
- Embarked: the port where they embarked
As we can see there are some missing values in "Age", "Cabin" and the port of embarkation ("Embarked"); we shall develop a strategy to fill in these values if we think they will be relevant later on. Beyond that, the quick description of the data does not provide much insight, except for the fact that the average age was about 29 years.
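A quick way to confirm which columns have gaps is to count the nulls directly; a minimal sketch:
# Count missing values per column: Age, Cabin and Embarked
# are the only columns with nulls in this dataset.
data.isnull().sum().sort_values(ascending=False)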
3.1 Target attributes
Even though many models will be used for the survival prediction, we shall identify beforehand the most relevant attributes to feed them, as it is clear that some of the data provided does not serve for much. Some reflections about them all:
- PassengerId: just an identifier that can be discarded
- Survived: target label
- Pclass: this one will likely be relevant, as it could be that the higher classes were located higher up in the ship or had access to better rescue equipment
- Name: not relevant
- Sex: very relevant
- Age: very relevant (recalling the famous phrase: "women and children first")
- SibSp and Parch: these could be combined into a single attribute counting all the relatives aboard (see the sketch after this list)
- Ticket: not relevant
- Fare: this one is likely strongly related to the passenger class, so I'll just go with that one
- Cabin: mostly missing, so I'll leave it out
- Embarked: maybe the position of the passengers within the ship was determined by their order of arrival, so I'll use this one too
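A quick illustration of the SibSp + Parch combination mentioned above (a scratch sketch on the raw data; the actual feature engineering is done with a custom transformer in section 5):
# Total number of relatives aboard per passenger (illustrative only)
relatives = data["SibSp"] + data["Parch"]
relatives.value_counts()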
3.2 Data Visualization and Data Correlation
From the numerical data visualized below we can see that:
- There is a clear influence of sex and age on the chances of survival, so selecting the "Age" and "Sex" attributes was the right call
- The ticket class also had a big influence, as can be seen in the next graph
- The port of embarkation was really influential too, and its effect differed depending on whether the passenger was male or female
- The total number of relatives also had a big influence: the more relatives, the lower the chances of surviving
data.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'], dtype='object')
data.shape
(891, 12)
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  891 non-null    int64
 1   Survived     891 non-null    int64
 2   Pclass       891 non-null    int64
 3   Name         891 non-null    object
 4   Sex          891 non-null    object
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64
 7   Parch        891 non-null    int64
 8   Ticket       891 non-null    object
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object
 11  Embarked     889 non-null    object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
data.describe()
| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.000000 | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
| mean | 446.000000 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.000000 | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 223.500000 | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
| 50% | 446.000000 | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
| 75% | 668.500000 | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
| max | 891.000000 | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
attributes = ["Survived", "Pclass", "Sex", "Age", "SibSp", "Parch", "Embarked"]
survived = 'survived'
not_survived = 'not survived'
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12, 5))
women = data[data['Sex']=='female']
men = data[data['Sex']=='male']
ax = sns.histplot(women[women['Survived']==1].Age.dropna(), bins=18, label=survived, ax=axes[0], kde=False, color="blue")
ax = sns.histplot(women[women['Survived']==0].Age.dropna(), bins=40, label=not_survived, ax=axes[0], kde=False, color="red")
ax.legend()
ax.set_title('Female')
ax = sns.histplot(men[men['Survived']==1].Age.dropna(), bins=18, label=survived, ax=axes[1], kde=False, color="blue")
ax = sns.histplot(men[men['Survived']==0].Age.dropna(), bins=40, label=not_survived, ax=axes[1], kde=False, color="red")
ax.legend()
_ = ax.set_title('Male')
sns.set_theme(style="whitegrid")
ax = sns.barplot(x='Pclass', y='Survived', data=data, palette="hls")
ax.set(xlabel='Passenger Class', ylabel='Percentage of survivors')
plt.show()
FacetGrid = sns.FacetGrid(data, row='Embarked', height=4, aspect=1.6)
FacetGrid.map(sns.pointplot, 'Pclass', 'Survived', 'Sex', order=None, hue_order=None )
FacetGrid.add_legend()
<seaborn.axisgrid.FacetGrid at 0x1bf1c8aa520>
data.loc[(data['SibSp'] + data["Parch"]) > 0, 'travelled_alone'] = 'No'
data.loc[(data['SibSp'] + data["Parch"]) == 0, 'travelled_alone'] = 'Yes'
# factorplot was renamed to catplot; kind='point' preserves the old default
axes = sns.catplot(x='SibSp', y='Survived', data=data, kind='point', aspect=3)
4. Data Preparation
4.1 Get copies of the data and the labels
The key here is to work on copies of the data and to use functions that will likely be reusable for future applications. In this case the dataset will not evolve (unless new casualties are discovered, which is very unlikely), and the outcome will certainly not be modified.
4.2 Data Cleaning
I will fill the empty values of the numerical attributes with SimpleImputer.
4.3 Handle Categorical Attributes
Those attributes that are not a number but a piece of text must be handled carefully; luckily, Scikit-Learn has tools to work with them easily.
4.4 Custom Transforms
Sometimes it's useful to modify the data to create new attributes that we think are going to be more useful for predicting the final outcome that we want to obtain.
4.5 Scaling
Sometimes normalizing or standardizing features is useful, since some models do not work well with attributes that have very disparate numerical scales.
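As a reminder of what standardization does (a toy sketch, separate from the pipeline below): StandardScaler rescales each feature to zero mean and unit variance.
# Toy illustration: each column of the output has mean 0 and std 1,
# regardless of how different the original scales were
toy = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
print(StandardScaler().fit_transform(toy))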
data = strat_train_set.drop("Survived", axis = 1)
data_labels = strat_train_set["Survived"].copy()
data_test = strat_test_set.drop("Survived", axis = 1)
data_test_labels = strat_test_set["Survived"].copy()
imputer = SimpleImputer(strategy = "median")
data_num = data.drop(["Name", "Sex", "Cabin", "Embarked", "Pclass","Ticket"], axis = 1)
imputer.fit(data_num)
SimpleImputer(strategy='median')
X = imputer.transform(data_num)
data_tr = pd.DataFrame(X, columns = data_num.columns, index = data_num.index)
cat_encoder = OneHotEncoder()
data_cat = data[["Sex", "Embarked"]].copy()  # explicit copy to avoid SettingWithCopyWarning
data_cat.fillna(method = "ffill", inplace = True)
data_cat_1hot = cat_encoder.fit_transform(data_cat)
cat_encoder.categories_
[array(['female', 'male'], dtype=object), array(['C', 'Q', 'S'], dtype=object)]
5. Pipelines of Data
SibSp_ix, Parch_ix = 2, 3  # column indices of SibSp and Parch in data_num (PassengerId, Age, SibSp, Parch, Fare)
class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
def __init__(self, add_relatives = True):
self.add_relatives = add_relatives
def fit(self, X, y = None):
return self
def transform(self, X):
relatives = X[:, SibSp_ix] + X[:, Parch_ix]
if self.add_relatives:
return np.c_[X, relatives]
else:
return np.c_[X]
attr_adder = CombinedAttributesAdder(add_relatives = False)
data_extra_attribs = attr_adder.transform(data.values)
num_pipeline = Pipeline([
("imputer", SimpleImputer(strategy = "median")),
("attribs_adder", CombinedAttributesAdder()),
("std_scaler", StandardScaler()),
])
data_num_tr = num_pipeline.fit_transform(data_num)
cat_pipeline = Pipeline([
("imputer",SimpleImputer(strategy = "most_frequent"))
])
data_cat_pr = cat_pipeline.fit_transform(data_cat)
num_attribs = list(data_num)
cat_attribs = ["Sex", "Embarked"]
full_pipeline = ColumnTransformer([
("num", num_pipeline, num_attribs),
("cat", OneHotEncoder(), cat_attribs),
])
data["Embarked"].fillna(method = "ffill", inplace = True)
data_prepared = full_pipeline.fit_transform(data)
data_test.fillna(method = "ffill", inplace = True)
6. Promising ML Models
This is clearly a classification problem in which we need to decide whether a person came out of the Titanic disaster alive or dead. So we could use any model suited to this task, or a combination of them:
- Support Vector Machines
- Logistic Regression
- Decision Trees
- Random Forest
- Ensemble models (bagging or pasting)
- etc.
6.1 Logistic Regression
Let's start with one of the simplest and easiest models to understand, in which the output is the estimated probability that a row belongs to one category or the other.
X = full_pipeline.fit_transform(data)
y = data_labels
X_test = full_pipeline.transform(data_test)  # transform only: the pipeline was already fitted on the training data
y_test = data_test_labels
log_reg = LogisticRegression()
log_reg.fit(X, y)
LogisticRegression()
y_pred = log_reg.predict(X)
lin_rmse_1 = np.sqrt(mean_squared_error(y_pred, y))
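Since this is really a classification task, accuracy is a more natural sanity check than RMSE; a minimal sketch on the same training predictions:
from sklearn.metrics import accuracy_score

# Fraction of correctly classified training instances
print(accuracy_score(y, y_pred))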
6.2 Decision Tree Regressor
tree_reg = DecisionTreeRegressor()
tree_reg.fit(X, y)
DecisionTreeRegressor()
y_pred = tree_reg.predict(X)
lin_rmse_2 = np.sqrt(mean_squared_error(y_pred, y))
6.3 Random Forest Regressor
forest_reg = RandomForestRegressor()
forest_reg.fit(X,y)
RandomForestRegressor()
y_pred = forest_reg.predict(X)
lin_rmse_3 = np.sqrt(mean_squared_error(y_pred, y))
6.4 Use of Cross-Validation
scores_1 = cross_val_score(log_reg, X, y, scoring = "neg_mean_squared_error", cv = 10)
scores_2 = cross_val_score(tree_reg, X, y, scoring = "neg_mean_squared_error", cv = 10)
scores_3 = cross_val_score(forest_reg, X, y, scoring = "neg_mean_squared_error", cv = 10)
log_reg_scores = np.sqrt(-scores_1)
tree_reg_scores = np.sqrt(-scores_2)
forest_reg_scores = np.sqrt(-scores_3)
def display_scores(scores):
print("Scores: ", scores)
print("Mean: ", scores.mean())
print("Standard deviation: ", scores.std())
display_scores(log_reg_scores)
Scores: [0.47140452 0.39840954 0.45425676 0.47519096 0.38100038 0.43994135 0.43994135 0.52363494 0.50800051 0.43994135]
Mean: 0.45317216454089754
Standard deviation: 0.0418598230964969
display_scores(tree_reg_scores)
Scores: [0.50395263 0.39840954 0.51946248 0.52363494 0.52363494 0.49186938 0.52363494 0.43994135 0.49186938 0.50800051]
Mean: 0.49244100697797677
Standard deviation: 0.03952561844409002
display_scores(forest_reg_scores)
Scores: [0.40811607 0.37866725 0.36490703 0.36523744 0.4041678 0.33325859 0.41970228 0.34637291 0.39294997 0.39491118]
Mean: 0.3808290529951755
Standard deviation: 0.02664035632783443
7. Fine-Tune the Model
param_grid = [
{"n_estimators": [3, 10, 30, 50, 100], "max_features": [2, 4, 6, 8]},
{"bootstrap": [False], "n_estimators": [3, 10], "max_features": [2, 3, 4]},
]
grid_search = GridSearchCV(forest_reg, param_grid, cv = 5, scoring = "neg_mean_squared_error", return_train_score = True)
grid_search.fit(X,y)
GridSearchCV(cv=5, estimator=RandomForestRegressor(),
             param_grid=[{'max_features': [2, 4, 6, 8],
                          'n_estimators': [3, 10, 30, 50, 100]},
                         {'bootstrap': [False], 'max_features': [2, 3, 4],
                          'n_estimators': [3, 10]}],
             return_train_score=True, scoring='neg_mean_squared_error')
grid_search.best_params_
{'max_features': 4, 'n_estimators': 100}
grid_search.best_estimator_
RandomForestRegressor(max_features=4)
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
print(np.sqrt(-mean_score), params)
0.4202115869151846 {'max_features': 2, 'n_estimators': 3}
0.39704619053625845 {'max_features': 2, 'n_estimators': 10}
0.37847337183739116 {'max_features': 2, 'n_estimators': 30}
0.38471919881321737 {'max_features': 2, 'n_estimators': 50}
0.378215901816802 {'max_features': 2, 'n_estimators': 100}
0.4380847858233523 {'max_features': 4, 'n_estimators': 3}
0.3930451833174301 {'max_features': 4, 'n_estimators': 10}
0.3803265150689814 {'max_features': 4, 'n_estimators': 30}
0.37766529643194946 {'max_features': 4, 'n_estimators': 50}
0.37698688434445826 {'max_features': 4, 'n_estimators': 100}
0.4169886927633328 {'max_features': 6, 'n_estimators': 3}
0.4040407197620141 {'max_features': 6, 'n_estimators': 10}
0.38432929056244125 {'max_features': 6, 'n_estimators': 30}
0.38124982242195765 {'max_features': 6, 'n_estimators': 50}
0.3804899778257063 {'max_features': 6, 'n_estimators': 100}
0.42325416243848724 {'max_features': 8, 'n_estimators': 3}
0.4043056971824581 {'max_features': 8, 'n_estimators': 10}
0.38868490707673137 {'max_features': 8, 'n_estimators': 30}
0.3805330522713733 {'max_features': 8, 'n_estimators': 50}
0.38205274927136235 {'max_features': 8, 'n_estimators': 100}
0.4290491278651709 {'bootstrap': False, 'max_features': 2, 'n_estimators': 3}
0.40584153907493214 {'bootstrap': False, 'max_features': 2, 'n_estimators': 10}
0.4354439912710361 {'bootstrap': False, 'max_features': 3, 'n_estimators': 3}
0.40586967557134285 {'bootstrap': False, 'max_features': 3, 'n_estimators': 10}
0.4271821482852719 {'bootstrap': False, 'max_features': 4, 'n_estimators': 3}
0.39733466839640036 {'bootstrap': False, 'max_features': 4, 'n_estimators': 10}
8. Evaluate system on the Test Set
final_model = grid_search.best_estimator_
X_test = strat_test_set.drop("Survived", axis = 1)
y_test = strat_test_set["Survived"].copy()
X_test.fillna(method = "ffill", inplace = True)
X_test_prepared = full_pipeline.transform(X_test)
final_predictions = final_model.predict(X_test_prepared)
final_rmse = np.sqrt(mean_squared_error(y_test, final_predictions))
# Threshold the regressor's continuous predictions at 0.5 to get class labels
result = [0 if p < 0.5 else 1 for p in final_predictions]
sum(result) / len(result)
0.376865671641791
sum(y)/len(y)  # actual survival rate in the training set, for comparison
0.39807383627608345
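The two ratios above only compare the predicted and actual survival rates. As a final sanity check, a minimal sketch computing the test-set accuracy of the thresholded predictions:
from sklearn.metrics import accuracy_score

# Share of test passengers whose survival was predicted correctly
print(accuracy_score(y_test, result))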