%matplotlib inline
import os
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedShuffleSplit
from pandas.plotting import scatter_matrix
import seaborn as sns
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from scipy import stats
Titanic problem - 2nd try
This time I'm going to follow the Machine Learning Project Checklist that appears in "Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow" by Aurélien Géron, and I'm going to try to put into practice everything I have learnt so far that applies to this problem:
- Regression
- Support Vector Machines
- Decision trees
- Ensemble learning
1. Frame the problem and look at the big picture
The objective is to determine whether a passenger survived the sinking of the Titanic or not. This is clearly a classification problem, in which we have to decide whether someone belongs to one category (survived) or the other (did not survive). The problem will use supervised ML models; there is a single dataset that has already been produced and will not be updated anymore, so the training will be done in batch (offline).
Currently there are many solutions proposed to this problem, but the goal is not to get the best one; it is to work through the problem myself and learn as much as possible.
To measure the performance of my models I can use the scoring tools provided by Scikit-Learn.
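For example, a minimal sketch of what such scoring could look like for a classifier (toy data only; the real Titanic features and labels are prepared later in this notebook):
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

# Toy data just to illustrate the scoring API; `X_toy`/`y_toy` are
# placeholders, not the Titanic data prepared further below.
X_toy, y_toy = make_classification(n_samples=200, random_state=42)
scores = cross_val_score(LogisticRegression(), X_toy, y_toy, scoring="accuracy", cv=10)
print(scores.mean(), scores.std())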
2. Get the data
The data is obtained from "https://www.kaggle.com/c/titanic", and in this section I will load it and check for constraints due to its size, legal implications, the type of data provided, etc.
In the following lines of this section I import the data and check whether it was correctly imported or not.
There are two datasets, one for training the algorithm and a second one for testing. However, the second one has had the labels removed, so unless you submit it to the competition it cannot really be used. Therefore I will only use the training one, and shuffle it to use it as both train and test data.
2.1 Create a test and train set
Now we'll have a train_set and a test_set; we'll set the latter aside and not open it until the end. However, if we limited ourselves to a purely random split we could introduce sampling bias, especially since this dataset is not very large. So we are going to proceed with stratified sampling, so that there is the right proportion of instances from each stratum (in this case the passenger class, "Pclass"), mainly because the categories are not equally distributed: there are many more 3rd class passengers than passengers of the other classes.
In the table below it can be seen that StratifiedShuffleSplit matches the original distribution better than the random split, therefore introducing less bias into the sample.
FILEPATH = "C:/Users/Usuario/Documents/Kaggle/titanic"
DATA = FILEPATH + "/train.csv"
data = pd.read_csv(DATA)
data.head()
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
train_set, test_set = train_test_split(data, test_size = 0.3, random_state = 42)
split = StratifiedShuffleSplit(n_splits = 1, test_size = 0.3, random_state = 42)
for train_index, test_index in split.split(data, data["Pclass"]):
strat_train_set = data.loc[train_index]
strat_test_set = data.loc[test_index]
overall = data["Pclass"].value_counts() / len(data)
stratified = strat_train_set["Pclass"].value_counts() / len(strat_train_set)
random = train_set["Pclass"].value_counts() / len(train_set)
# Compare the class proportions of each split against the full dataset;
# keeping everything as Series aligns rows by Pclass value instead of
# relying on the ordering of value_counts()
sample_bias = pd.DataFrame({"Overall": overall,
                            "Stratified": stratified,
                            "Random": random,
                            "Rnd. % error": random - overall,
                            "Strat. % error": stratified - overall})
sample_bias
| | Overall | Stratified | Random | Rnd. % error | Strat. % error |
|---|---|---|---|---|---|
| 3 | 0.551066 | 0.550562 | 0.565008 | 0.013942 | -0.000504 |
| 1 | 0.242424 | 0.242376 | 0.223114 | -0.019310 | -0.000049 |
| 2 | 0.206510 | 0.207063 | 0.211878 | 0.005368 | 0.000553 |
3. Explore the data
This phase will be very similar to the one in my previous try at the Titanic problem, but I'll go through it again to try to spot new insights, and for this part I'll work on the original dataset to get a better picture of the whole thing.
There are 12 features per passenger; 7 of them are numerical and the remaining 5 are categorical, although some of them might not be relevant to determine the fate of each passenger. These attributes are:
- PassengerId: just a number to identify each passenger
- Survived: binary variable that tells whether the passenger survived or not
- Pclass: the passenger's class on the ship
- Name
- Sex
- Age
- SibSp: number of siblings or spouses aboard
- Parch: number of parents or children aboard
- Ticket: the ticket number
- Fare: the amount they paid for the ticket
- Cabin: the identifier of their cabin
- Embarked: the port where they embarked
As we can see there are some missing values in "Age", "Cabin" and the port of embarkation ("Embarked"); we shall develop a strategy to fill in these values if we think they will be relevant later on. Beyond that, the quick description of the data does not provide much insight, except for the fact that the average age was about 29 years.
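A quick way to confirm which columns have gaps is to count the nulls directly; a minimal sketch:
# Count missing values per column: Age, Cabin and Embarked
# are the only columns with nulls in this dataset.
data.isnull().sum().sort_values(ascending=False)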
3.1 Target attributes
Even though many models will be used for the survival prediction, we shall identify beforehand the most relevant attributes to feed them, as it is clear that some of the data provided does not serve for much. Some reflections about them all:
- PassengerId: just an identifier that can be discarded
- Survived: target label
- Pclass: this one will likely be relevant, as it could be that the higher classes were located higher up in the ship or had access to better rescue equipment
- Name: not relevant
- Sex: very relevant
- Age: very relevant (recalling the famous phrase: "women and children first")
- SibSp and Parch: these could be combined into a single attribute counting all the relatives aboard (see the sketch after this list)
- Ticket: not relevant
- Fare: this one is likely strongly related to the passenger class, so I'll just go with that one
- Cabin: mostly missing, so I'll leave it out
- Embarked: maybe the position of the passengers within the ship was determined by their order of arrival, so I'll use this one too
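A quick illustration of the SibSp + Parch combination mentioned above (a scratch sketch on the raw data; the actual feature engineering is done with a custom transformer in section 5):
# Total number of relatives aboard per passenger (illustrative only)
relatives = data["SibSp"] + data["Parch"]
relatives.value_counts()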
3.2 Data Visualization and Data Correlation
From the numerical data visualized below we can see that:
- There is a clear influence of sex and age on the chances of survival, so selecting the "Age" and "Sex" attributes was the right call
- The ticket class also had a big influence, as can be seen in the next graph
- The port of embarkation was really influential too, and its effect differed depending on whether the passenger was male or female
- The total number of relatives also had a big influence: the more relatives, the lower the chances of surviving
data.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'], dtype='object')
data.shape
(891, 12)
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  891 non-null    int64
 1   Survived     891 non-null    int64
 2   Pclass       891 non-null    int64
 3   Name         891 non-null    object
 4   Sex          891 non-null    object
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64
 7   Parch        891 non-null    int64
 8   Ticket       891 non-null    object
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object
 11  Embarked     889 non-null    object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
data.describe()
| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.000000 | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
| mean | 446.000000 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.000000 | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 223.500000 | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
| 50% | 446.000000 | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
| 75% | 668.500000 | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
| max | 891.000000 | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
attributes = ["Survived", "Pclass", "Sex", "Age", "SibSp", "Parch", "Embarked"]
survived = 'survived'
not_survived = 'not survived'
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12, 5))
women = data[data['Sex']=='female']
men = data[data['Sex']=='male']
ax = sns.histplot(women[women['Survived']==1].Age.dropna(), bins=18, label=survived, ax=axes[0], kde=False, color="blue")
ax = sns.histplot(women[women['Survived']==0].Age.dropna(), bins=40, label=not_survived, ax=axes[0], kde=False, color="red")
ax.legend()
ax.set_title('Female')
ax = sns.histplot(men[men['Survived']==1].Age.dropna(), bins=18, label=survived, ax=axes[1], kde=False, color="blue")
ax = sns.histplot(men[men['Survived']==0].Age.dropna(), bins=40, label=not_survived, ax=axes[1], kde=False, color="red")
ax.legend()
_ = ax.set_title('Male')
sns.set_theme(style="whitegrid")
ax = sns.barplot(x='Pclass', y='Survived', data=data, palette="hls")
ax.set(xlabel='Passenger Class', ylabel='Percentage of survivors')
plt.show()
FacetGrid = sns.FacetGrid(data, row='Embarked', height=4, aspect=1.6)
FacetGrid.map(sns.pointplot, 'Pclass', 'Survived', 'Sex', order=None, hue_order=None )
FacetGrid.add_legend()
<seaborn.axisgrid.FacetGrid at 0x1bf1c8aa520>
data.loc[(data['SibSp'] + data["Parch"]) > 0, 'travelled_alone'] = 'No'
data.loc[(data['SibSp'] + data["Parch"]) == 0, 'travelled_alone'] = 'Yes'
# factorplot was renamed to catplot; kind='point' preserves the old default
axes = sns.catplot(x='SibSp', y='Survived', data=data, kind='point', aspect=3)
4. Data Preparation
4.1 Get copies of the data and the labels
The key here is to work on copies of the data and to use functions that will likely be reusable for future applications. In this case the dataset will not evolve (unless new casualties are discovered, which is very unlikely), and the outcome will certainly not be modified.
4.2 Data Cleaning
I will fill the empty values of the numerical attributes with SimpleImputer.
4.3 Handle Categorical Attributes
Those attributes that are not a number but a piece of text must be handled carefully; luckily, Scikit-Learn has tools to work with them easily.
4.4 Custom Transforms
Sometimes it's useful to modify the data to create new attributes that we think are going to be more useful for predicting the final outcome that we want to obtain.
4.5 Scaling
Sometimes normalizing or standardizing features is useful, since some models do not work well with attributes that have very disparate numerical scales.
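As a reminder of what standardization does (a toy sketch, separate from the pipeline below): StandardScaler rescales each feature to zero mean and unit variance.
# Toy illustration: each column of the output has mean 0 and std 1,
# regardless of how different the original scales were
toy = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
print(StandardScaler().fit_transform(toy))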
data = strat_train_set.drop("Survived", axis = 1)
data_labels = strat_train_set["Survived"].copy()
data_test = strat_test_set.drop("Survived", axis = 1)
data_test_labels = strat_test_set["Survived"].copy()
imputer = SimpleImputer(strategy = "median")
data_num = data.drop(["Name", "Sex", "Cabin", "Embarked", "Pclass","Ticket"], axis = 1)
imputer.fit(data_num)
SimpleImputer(strategy='median')
X = imputer.transform(data_num)
data_tr = pd.DataFrame(X, columns = data_num.columns, index = data_num.index)
cat_encoder = OneHotEncoder()
data_cat = data[["Sex", "Embarked"]].copy()  # explicit copy to avoid SettingWithCopyWarning
data_cat.fillna(method = "ffill", inplace = True)
data_cat_1hot = cat_encoder.fit_transform(data_cat)
cat_encoder.categories_
[array(['female', 'male'], dtype=object), array(['C', 'Q', 'S'], dtype=object)]
5. Pipelines of Data
SibSp_ix, Parch_ix = 2, 3  # column indices of SibSp and Parch in data_num (PassengerId, Age, SibSp, Parch, Fare)
class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
def __init__(self, add_relatives = True):
self.add_relatives = add_relatives
def fit(self, X, y = None):
return self
def transform(self, X):
relatives = X[:, SibSp_ix] + X[:, Parch_ix]
if self.add_relatives:
return np.c_[X, relatives]
else:
return np.c_[X]
attr_adder = CombinedAttributesAdder(add_relatives = False)
data_extra_attribs = attr_adder.transform(data.values)
num_pipeline = Pipeline([
("imputer", SimpleImputer(strategy = "median")),
("attribs_adder", CombinedAttributesAdder()),
("std_scaler", StandardScaler()),
])
data_num_tr = num_pipeline.fit_transform(data_num)
cat_pipeline = Pipeline([
("imputer",SimpleImputer(strategy = "most_frequent"))
])
data_cat_pr = cat_pipeline.fit_transform(data_cat)
num_attribs = list(data_num)
cat_attribs = ["Sex", "Embarked"]
full_pipeline = ColumnTransformer([
("num", num_pipeline, num_attribs),
("cat", OneHotEncoder(), cat_attribs),
])
data["Embarked"].fillna(method = "ffill", inplace = True)
data_prepared = full_pipeline.fit_transform(data)
data_test.fillna(method = "ffill", inplace = True)
6. Promising ML Models
This is clearly a classification problem in which we need to decide whether a person came out of the Titanic disaster alive or dead. So we could use any model suited to this task, or a combination of them:
- Support Vector Machines
- Logistic Regression
- Decision Trees
- Random Forest
- Ensemble models (bagging or pasting)
- etc.
6.1 Logistic Regression
Let's start with one of the simplest and easiest models to understand, in which the output is the estimated probability that a row belongs to one category or the other.
X = full_pipeline.fit_transform(data)
y = data_labels
X_test = full_pipeline.transform(data_test)  # transform only: the pipeline was already fitted on the training data
y_test = data_test_labels
log_reg = LogisticRegression()
log_reg.fit(X, y)
LogisticRegression()
y_pred = log_reg.predict(X)
lin_rmse_1 = np.sqrt(mean_squared_error(y_pred, y))
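Since this is really a classification task, accuracy is a more natural sanity check than RMSE; a minimal sketch on the same training predictions:
from sklearn.metrics import accuracy_score

# Fraction of correctly classified training instances
print(accuracy_score(y, y_pred))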
6.2 Decision Tree Regressor
tree_reg = DecisionTreeRegressor()
tree_reg.fit(X, y)
DecisionTreeRegressor()
y_pred = tree_reg.predict(X)
lin_rmse_2 = np.sqrt(mean_squared_error(y_pred, y))
6.3 Random Forest Regressor
forest_reg = RandomForestRegressor()
forest_reg.fit(X,y)
RandomForestRegressor()
y_pred = forest_reg.predict(X)
lin_rmse_3 = np.sqrt(mean_squared_error(y_pred, y))
6.4 Use of Cross-Validation
scores_1 = cross_val_score(log_reg, X, y, scoring = "neg_mean_squared_error", cv = 10)
scores_2 = cross_val_score(tree_reg, X, y, scoring = "neg_mean_squared_error", cv = 10)
scores_3 = cross_val_score(forest_reg, X, y, scoring = "neg_mean_squared_error", cv = 10)
log_reg_scores = np.sqrt(-scores_1)
tree_reg_scores = np.sqrt(-scores_2)
forest_reg_scores = np.sqrt(-scores_3)
def display_scores(scores):
print("Scores: ", scores)
print("Mean: ", scores.mean())
print("Standard deviation: ", scores.std())
display_scores(log_reg_scores)
Scores: [0.47140452 0.39840954 0.45425676 0.47519096 0.38100038 0.43994135 0.43994135 0.52363494 0.50800051 0.43994135]
Mean: 0.45317216454089754
Standard deviation: 0.0418598230964969
display_scores(tree_reg_scores)
Scores: [0.50395263 0.39840954 0.51946248 0.52363494 0.52363494 0.49186938 0.52363494 0.43994135 0.49186938 0.50800051]
Mean: 0.49244100697797677
Standard deviation: 0.03952561844409002
display_scores(forest_reg_scores)
Scores: [0.40811607 0.37866725 0.36490703 0.36523744 0.4041678 0.33325859 0.41970228 0.34637291 0.39294997 0.39491118]
Mean: 0.3808290529951755
Standard deviation: 0.02664035632783443
7. Fine-Tune the Model
param_grid = [
{"n_estimators": [3, 10, 30, 50, 100], "max_features": [2, 4, 6, 8]},
{"bootstrap": [False], "n_estimators": [3, 10], "max_features": [2, 3, 4]},
]
grid_search = GridSearchCV(forest_reg, param_grid, cv = 5, scoring = "neg_mean_squared_error", return_train_score = True)
grid_search.fit(X,y)
GridSearchCV(cv=5, estimator=RandomForestRegressor(),
             param_grid=[{'max_features': [2, 4, 6, 8],
                          'n_estimators': [3, 10, 30, 50, 100]},
                         {'bootstrap': [False], 'max_features': [2, 3, 4],
                          'n_estimators': [3, 10]}],
             return_train_score=True, scoring='neg_mean_squared_error')
grid_search.best_params_
{'max_features': 4, 'n_estimators': 100}
grid_search.best_estimator_
RandomForestRegressor(max_features=4)
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
print(np.sqrt(-mean_score), params)
0.4202115869151846 {'max_features': 2, 'n_estimators': 3}
0.39704619053625845 {'max_features': 2, 'n_estimators': 10}
0.37847337183739116 {'max_features': 2, 'n_estimators': 30}
0.38471919881321737 {'max_features': 2, 'n_estimators': 50}
0.378215901816802 {'max_features': 2, 'n_estimators': 100}
0.4380847858233523 {'max_features': 4, 'n_estimators': 3}
0.3930451833174301 {'max_features': 4, 'n_estimators': 10}
0.3803265150689814 {'max_features': 4, 'n_estimators': 30}
0.37766529643194946 {'max_features': 4, 'n_estimators': 50}
0.37698688434445826 {'max_features': 4, 'n_estimators': 100}
0.4169886927633328 {'max_features': 6, 'n_estimators': 3}
0.4040407197620141 {'max_features': 6, 'n_estimators': 10}
0.38432929056244125 {'max_features': 6, 'n_estimators': 30}
0.38124982242195765 {'max_features': 6, 'n_estimators': 50}
0.3804899778257063 {'max_features': 6, 'n_estimators': 100}
0.42325416243848724 {'max_features': 8, 'n_estimators': 3}
0.4043056971824581 {'max_features': 8, 'n_estimators': 10}
0.38868490707673137 {'max_features': 8, 'n_estimators': 30}
0.3805330522713733 {'max_features': 8, 'n_estimators': 50}
0.38205274927136235 {'max_features': 8, 'n_estimators': 100}
0.4290491278651709 {'bootstrap': False, 'max_features': 2, 'n_estimators': 3}
0.40584153907493214 {'bootstrap': False, 'max_features': 2, 'n_estimators': 10}
0.4354439912710361 {'bootstrap': False, 'max_features': 3, 'n_estimators': 3}
0.40586967557134285 {'bootstrap': False, 'max_features': 3, 'n_estimators': 10}
0.4271821482852719 {'bootstrap': False, 'max_features': 4, 'n_estimators': 3}
0.39733466839640036 {'bootstrap': False, 'max_features': 4, 'n_estimators': 10}
8. Evaluate system on the Test Set
final_model = grid_search.best_estimator_
X_test = strat_test_set.drop("Survived", axis = 1)
y_test = strat_test_set["Survived"].copy()
X_test.fillna(method = "ffill", inplace = True)
X_test_prepared = full_pipeline.transform(X_test)
final_predictions = final_model.predict(X_test_prepared)
final_rmse = np.sqrt(mean_squared_error(y_test, final_predictions))
# Threshold the regressor's continuous predictions at 0.5 to get class labels
result = [0 if p < 0.5 else 1 for p in final_predictions]
sum(result) / len(result)
0.376865671641791
sum(y)/len(y)  # actual survival rate in the training set, for comparison
0.39807383627608345
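The two ratios above only compare the predicted and actual survival rates. As a final sanity check, a minimal sketch computing the test-set accuracy of the thresholded predictions:
from sklearn.metrics import accuracy_score

# Share of test passengers whose survival was predicted correctly
print(accuracy_score(y_test, result))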