Over the last week I have been working through a course on Machine Learning with Python that I bought on Udemy.com about a year ago (https://www.udemy.com/course/machinelearningpython/). I had already completed roughly a third of it and thought it would be a good idea to try to do something by myself. Up to that point I had learnt the basics of data wrangling and exploratory analysis.
I went to Kaggle, a website I really recommend if you want to learn Machine Learning or Data Science, and signed up for the Titanic competition.
The Titanic problem¶
The Titanic problem seems to be the print("Hello World") of Machine Learning or the Benchy of 3D printing: it is the kind of thing you do when you are just starting out.
You are given two data sets (https://www.kaggle.com/c/titanic/data):
- test.csv
- train.csv

and a third one that shows what a submission file should look like:
- gender_submission.csv

Your goal is to predict which passengers survived the sinking of the Titanic, training a Machine Learning model with the train data set.
To do so, as I still haven't learnt about predictive models, my own contribution was mainly the way in which I handled the data and what I thought were the most relevant inputs to train the model with.
import os
from IPython.display import Image
Image("C:/Users/Usuario/Documents/Kaggle/titanic/tita.jpg")
Let's start by bringing in all the modules that we need for the problem and by importing the data sets we are given
import os
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from matplotlib import pyplot as plt
import seaborn as sns
The ones above are the basic packages that we need to solve the problem, and the ones below are the filepaths to the data sets
filepath_train = "C:/Users/Usuario/Documents/Kaggle/titanic/train.csv"
filepath_test = "C:/Users/Usuario/Documents/Kaggle/titanic/test.csv"
Now we are going to convert the .csv files into dataframes, which is the kind of object you work with in pandas. The csv files are nicely formatted, so no further arguments are needed in this case, and the data is already quite clean, so no heavy data cleaning is required for this beginners' problem.
train = pd.read_csv(filepath_train)
test = pd.read_csv(filepath_test)
Basic Data Exploration¶
Here you can see the looks of the first rows of both data sets
train.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
test.head()
PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0000 | NaN | S |
2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |
As you can see, the test set lacks the "Survived" column. Below you can see the columns of each one.
'Train columns:', train.columns.tolist()
('Train columns:', ['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'])
'Test columns:', test.columns.tolist()
('Test columns:', ['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'])
Now I am going to check the survival rate for men and women in the train data set with this function I created, which selects the passengers of a given sex and computes the fraction of them that survived
def SurvivalRate(sex):
    # Fraction of the passengers of the given sex that survived
    subset = train.loc[train["Sex"] == sex]
    rate = subset.loc[subset["Survived"] == 1].shape[0] / subset.shape[0]
    print(f"{round(rate*100,3)}% {sex} survived")
Now, call the function to check those rates
SurvivalRate("female")
SurvivalRate("male")
74.204% female survived 18.891% male survived
As we can see, the survival rates are quite different, so already we can tell that being a woman on the Titanic gave you quite a survival advantage.
Age: digging deeper into the data¶
Let's see whether age was a relevant factor for survival rates on the Titanic. To do so I have to look at the train data set, of course.
First of all, I need to make sure that age data is available for every passenger; otherwise we will need to come up with a way to fill in the missing values.
train['Age'].isnull().values.any()
True
mean_age = train["Age"].mean()
std_age = train["Age"].std()
f"The mean age is {mean_age} years with and standard deviation of {std_age} years"
'The mean age is 29.69911764705882 years with and standard deviation of 14.526497332334044 years'
OK, gotta admit that's maybe too much precision, but let's continue. Now I have to create an array with a size equal to the number of NaNs in the "Age" column, and then I will assign an age value to each of those passengers
rand_age = np.random.normal(loc=mean_age, scale=std_age, size=train["Age"].isnull().sum())
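Quick aside: these ages are drawn at random, so every run of the notebook will impute slightly different values. If you want the imputation to be reproducible, you can seed NumPy's random generator just before drawing (a tiny optional tweak, not part of the original run):
np.random.seed(0)  # any fixed integer seed makes the imputed ages repeatable
rand_age = np.random.normal(loc=mean_age, scale=std_age, size=train["Age"].isnull().sum())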
Let's visualize the ages we have created to check that they look alright, and compare them with the ages that are known
plt.hist(train["Age"], bins = int(np.ceil(1+np.log2(train["Age"].notnull().sum()))))
plt.hist(rand_age, bins = int(np.ceil(1+np.log2(train["Age"].isnull().sum()))))
plt.xlabel("Age")
plt.ylabel("Number of people with that age")
plt.title("Distribution of ages in the Titanic")
plt.legend(['Known Ages',"Guessed Ages"])
<matplotlib.legend.Legend at 0x14e78646850>
The results are quite satisfactory: as you can see, the shapes of both histograms are quite similar, so we can move forward with the analysis. Just a curiosity: if you want to improve the visualization of your histograms you can use the Sturges formula (https://en.wikipedia.org/wiki/Histogram#Number_of_bins_and_width) to determine the number of bins.
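For reference, that formula is exactly what the np.ceil(1 + np.log2(n)) expressions in the plotting cell above compute; written out on its own it looks like this (just a restatement, using the known ages as n):
n = train["Age"].notnull().sum()
bins_sturges = int(np.ceil(1 + np.log2(n)))  # Sturges' rule: k = ceil(1 + log2(n))
print(f"Sturges' rule suggests {bins_sturges} bins for the {n} known ages")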
So we can now replace the NaNs with the generated ages
age_slice = train["Age"].copy()
age_slice[np.isnan(age_slice)] = rand_age
train["Age"] = age_slice
train["Age"] = train["Age"].astype(int)
Now we can check that there are no NaNs left in the "Age" column of the train data set
train['Age'].isnull().values.any()
False
survived = 'survived'
not_survived = 'not survived'
fig, axes = plt.subplots(nrows=1, ncols=2,figsize=(12, 6))
women = train[train['Sex']=='female']
men = train[train['Sex']=='male']
ax = sns.histplot(women[women['Survived']==1].Age.dropna(), bins=18, label = survived, ax = axes[0], kde =False, color="blue")
ax = sns.histplot(women[women['Survived']==0].Age.dropna(), bins=40, label = not_survived, ax = axes[0], kde =False, color="red")
ax.legend()
ax.set_title('Female')
ax = sns.histplot(men[men['Survived']==1].Age.dropna(), bins=18, label = survived, ax = axes[1], kde = False, color="blue")
ax = sns.histplot(men[men['Survived']==0].Age.dropna(), bins=40, label = not_survived, ax = axes[1], kde = False, color="red")
ax.legend()
_ = ax.set_title('Male');
As we can see, the classic phrase "women and children first" applies perfectly in this case
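To put a rough number on the "children" part of that phrase, here is a quick check I added (assuming, arbitrarily, that a child is anyone under 16):
children = train[train["Age"] < 16]  # arbitrary cut-off for "child"
print(f"{round(children['Survived'].mean()*100, 3)}% of passengers under 16 survived")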
Ticket Class¶
Was the ticket class relevant to the survival rate?
sns.set_theme(style="whitegrid")
ax = sns.barplot(x='Pclass', y='Survived', data=train,palette="hls")
ax.set(xlabel='Passenger Class', ylabel='Percentage of survivors')
plt.show()
Clearly, as with everything in life, those with more money had a higher chance of being successful, what a coincidence :D
Port of embarkation¶
I guess that the port where you embarked determined where you were situated within the ship, so we will probably be able to see some patterns when relating the embarkation port to the survival rate. Let's see.
To begin, we are going to discard the passengers whose port of embarkation we don't know, so the NaNs, as before
train['Embarked'].isnull().values.any()
True
Now we are sure that there are some null values so let's remove them
train.dropna(subset = ["Embarked"], inplace=True)
train['Embarked'].isnull().values.any()
False
FacetGrid = sns.FacetGrid(train, row='Embarked', height=4, aspect=1.6)
FacetGrid.map(sns.pointplot, 'Pclass', 'Survived', 'Sex', order=None, hue_order=None )
FacetGrid.add_legend();
In fact, we can see that the port of embarkation really did make a difference: men from Cherbourg survived at a much higher rate, while the opposite holds for women. Is this enough to say that French men were less gentlemanly than those from Ireland or England? In my humble opinion, it really is.
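If you prefer numbers to plots, a quick groupby (my own addition, just to double-check the figure) gives the survival rate for each combination of port and sex:
# Mean of the 0/1 "Survived" column per (Embarked, Sex) group = survival rate
train.groupby(["Embarked", "Sex"])["Survived"].mean()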
Relatives¶
My hypothesis here is that the fewer relatives you travel with, the higher your chances of survival, as you can take care of yourself instead of having to worry about what's happening to your mum or your little brother. But what I think is not really relevant; let's see if the data corroborates my hunch
We need to create a new column that combines the column for siblings/spouses (SibSp) with the one for parents/children (Parch); that's quite easy
train["Relatives"] = train["SibSp"] + train["Parch"]
train.tail()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Relatives | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27 | 0 | 0 | 211536 | 13.00 | NaN | S | 0 |
887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19 | 0 | 0 | 112053 | 30.00 | B42 | S | 0 |
888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | 38 | 1 | 2 | W./C. 6607 | 23.45 | NaN | S | 3 |
889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26 | 0 | 0 | 111369 | 30.00 | C148 | C | 0 |
890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32 | 0 | 0 | 370376 | 7.75 | NaN | Q | 0 |
Ok, now that the new column is created let's see whether the number of relatives made a difference
train.loc[train['Relatives'] > 0, 'travelled_alone'] = 'No'
train.loc[train['Relatives'] == 0, 'travelled_alone'] = 'Yes'
axes = sns.catplot(x='Relatives', y='Survived', data=train, kind='point', aspect=2.5)
For some reason, the optimum number of relatives for survival was 3, while big families were almost guaranteed a fatal ending.
As the number of relatives seems to help determine whether you lived or not, we'll include it in the training of the predictive model
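The travelled_alone column created above isn't used by the plot, but it gives a quick sanity check of the same idea (another small addition of mine):
# Survival rate for passengers travelling alone vs. travelling with relatives
train.groupby("travelled_alone")["Survived"].mean()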
Training the Machine Learning model¶
To train the model we have to decide on the features that best determine whether you lived or not. After the exploratory analysis we have done, the most relevant features are:
- Class
- Sex
- Relatives:
- Siblings
- Parents
- Age
- Port of embarkation
These will be the features that we take into account for the predictive model
features = ["Pclass", "Sex", "SibSp", "Parch","Embarked","Relatives"]
X = pd.get_dummies(train[features])
y = train["Survived"]
How I understand this step is like so:
- I know the input to the problem (X)
- I know the output to the problem (y)
- I know the variables that seem to be relevant to get from X to y
- I don't know the exact function that gets me from one to the other (and here's where Machine Learning comes in handy: I will create a statistical model that acts as a transfer function) --> Random Forest Classifier (https://en.wikipedia.org/wiki/Random_forest)
I definitely don't know all the ins and outs of this model, but I have some parameters I can play with, so I will try to tweak them until the results on the testing data set are similar to those on the training data set
model = RandomForestClassifier(n_estimators = 200, max_depth = 80, random_state = 10)
model.fit(X,y)
RandomForestClassifier(max_depth=80, n_estimators=200, random_state=10)
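As an aside: I said above that I would tweak the parameters until the results looked reasonable. A more systematic way to compare parameter choices (just a sketch using scikit-learn's cross_val_score, not something I actually ran for this notebook) is cross-validation on the training set:
from sklearn.model_selection import cross_val_score
# 5-fold cross-validation accuracy on the training data; a higher mean and a
# smaller spread suggest the chosen parameters generalise better
scores = cross_val_score(model, X, y, cv=5)
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")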
This first time I'm just gonna try with these params and see what the output is
test["Relatives"] = test["SibSp"] + test["Parch"]
test.head()
PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Relatives | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q | 0 |
1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0000 | NaN | S | 1 |
2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q | 0 |
3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S | 0 |
4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S | 2 |
Now, we're going to prepare the test data frame for the model to work with it.
X_test = pd.get_dummies(test[features])
# Note: dropna() and the isin() filter below return new DataFrames; without
# assigning the result back to X_test they don't change anything. With the
# selected features there are no missing values anyway, so nothing is dropped.
X_test.dropna()
X_test[~X_test.isin([np.nan, np.inf, -np.inf]).any(axis=1)]
X_test.head()
Pclass | SibSp | Parch | Relatives | Sex_female | Sex_male | Embarked_C | Embarked_Q | Embarked_S | |
---|---|---|---|---|---|---|---|---|---|
0 | 3 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
1 | 3 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 1 |
2 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
3 | 3 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
4 | 3 | 1 | 1 | 2 | 1 | 0 | 0 | 0 | 1 |
And finally we'll see what the output for the test data frame is, putting the results in a data frame called y_test. So to sum up, we've now done this:
- I have the inputs (X_test)
- I have a function (the Random Forest Classifier) that converts X_test into y_test
- y_test is now a guess of whether each of those passengers survived or not
I am then gonna compare these survival rates with the ones in the first data set.
y_test = model.predict(X_test)
y_test = pd.DataFrame(
{
"PassengerId" : test.PassengerId,
"Survived": y_test
}
)
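By the way, if you want to submit these predictions to Kaggle, the competition expects a CSV with just the PassengerId and Survived columns (the same format as gender_submission.csv), which is exactly what the y_test data frame above contains, so something like this should do (the file name is arbitrary):
# Write the predictions without the pandas index column
y_test.to_csv("submission.csv", index=False)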
Now that I have the output of the transfer function, I'm gonna merge it with the ID of each passenger and look at the survival rates we got
Results¶
results = test.join(y_test["Survived"])
This function below just computes the predicted survival rate for males and females
def SurvivalRate_expected(sex):
    # Fraction of the passengers of the given sex that the model predicts survived
    subset = results.loc[results["Sex"] == sex]
    rate_expected = subset.loc[subset["Survived"] == 1].shape[0] / subset.shape[0]
    print(f"{round(rate_expected*100,3)} % {sex} survived")
Now just calling the function
SurvivalRate_expected("female")
SurvivalRate_expected("male")
78.289 % female survived 7.519 % male survived
Let's compare the results with the known ones from the training data set
barWidth = 0.3
bars1 = [74.204, 18.891]
bars2 = [78.289, 7.519]
r1 = np.arange(len(bars1))
r2 = [x + barWidth for x in r1]
plt.bar(r1, bars1, width = barWidth, color = 'blue', edgecolor = 'black', capsize=7, label='Real')
plt.bar(r2, bars2, width = barWidth, color = 'cyan', edgecolor = 'black', capsize=7, label='Expected')
plt.xlabel('Women vs Men')
plt.ylabel('Percentage of Survivors')
plt.legend()
plt.show()
Conclusions¶
Your chances of surviving the Titanic were pretty high if you were a kid or a woman; the higher the class of your ticket, the better your chances of surviving; and that magical number of 3 relatives on the boat raised your chances quite a lot too.
Comparing the real values with the predicted ones, we can see that the algorithm was maybe a bit too optimistic about women and a little pessimistic about men.
This has been my first approach to Data Science and Machine Learning. I've learnt a lot and hope to keep doing so in the upcoming weeks; hopefully I'll tackle the same problem again and get a more accurate result, who knows?