Over the last week I have been working through a course on Machine Learning with Python that I bought on Udemy.com about a year ago (https://www.udemy.com/course/machinelearningpython/). I had already completed roughly a third of it and thought it would be a good idea to try to do something by myself. Up to that point I had learnt the basics of data wrangling and exploratory analysis.
I went to Kaggle, a website I really recommend if you want to learn Machine Learning or Data Science, and signed up for the Titanic competition.
The Titanic problem¶
The Titanic problem seems to be the print("Hello World") of Machine Learning or the Benchy of 3D printing: it is the kind of thing you do when you are just starting out.
You are given two data sets (https://www.kaggle.com/c/titanic/data):
- test.csv
- train.csv

and a third one that shows what a submission file should look like:
- gender_submission.csv

Your goal is to predict which passengers survived the sinking of the Titanic, training a Machine Learning model with the train data set.
To do so, as I still haven't learnt about predictive models, my own contribution was mainly the way in which I handled the data and what I thought were the most relevant inputs to train the model with.
import os
from IPython.display import Image
Image("C:/Users/Usuario/Documents/Kaggle/titanic/tita.jpg")
Let's start by bringing in all the modules that we need for the problem and by importing the data sets we are given
import os
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from matplotlib import pyplot as plt
import seaborn as sns
The ones above are the basic packages that we need to solve the problem, and the ones below are the filepaths to the data sets
filepath_train = "C:/Users/Usuario/Documents/Kaggle/titanic/train.csv"
filepath_test = "C:/Users/Usuario/Documents/Kaggle/titanic/test.csv"
Now we are going to convert the .csv files into dataframes, which is the kind of object you work with in pandas. The csv files are nicely formatted, so no further arguments are needed in this case, and the data is already quite clean, so no heavy data cleaning is required for this beginners' problem.
train = pd.read_csv(filepath_train)
test = pd.read_csv(filepath_test)
Basic Data Exploration¶
Here you can see the looks of the first rows of both data sets
train.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
test.head()
PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0000 | NaN | S |
2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |
As you can see, the test set lacks the "Survived" column. Below you can see the columns of each one.
'Train columns:', train.columns.tolist()
('Train columns:', ['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'])
'Test columns:', test.columns.tolist()
('Test columns:', ['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'])
Now I am going to check the survival rate for men and women in the train data set with this function I created, which selects the passengers of a given sex and computes the fraction of them that survived
def SurvivalRate(sex):
    # Fraction of the passengers of the given sex that survived
    subset = train.loc[train["Sex"] == sex]
    rate = subset.loc[subset["Survived"] == 1].shape[0] / subset.shape[0]
    print(f"{round(rate*100,3)}% {sex} survived")
Now, call the function to check those rates
SurvivalRate("female")
SurvivalRate("male")
74.204% female survived 18.891% male survived
As we can see, the survival rates are quite different, so already we can tell that being a woman on the Titanic gave you quite a survival advantage.
Age: digging deeper into the data¶
Let's see whether age was a relevant factor for survival rates on the Titanic. To do so I have to look at the train data set, of course.
First of all, I need to make sure that age data is available for every passenger; otherwise we will need to come up with a way to fill in the missing values.
train['Age'].isnull().values.any()
True
mean_age = train["Age"].mean()
std_age = train["Age"].std()
f"The mean age is {mean_age} years with and standard deviation of {std_age} years"
'The mean age is 29.69911764705882 years with and standard deviation of 14.526497332334044 years'
OK, gotta admit that's maybe too much precision, but let's continue. Now I have to create an array with a size equal to the number of NaNs in the "Age" column, and then I will assign an age value to each of those passengers
rand_age = np.random.normal(loc=mean_age, scale=std_age, size=train["Age"].isnull().sum())
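Quick aside: these ages are drawn at random, so every run of the notebook will impute slightly different values. If you want the imputation to be reproducible, you can seed NumPy's random generator just before drawing (a tiny optional tweak, not part of the original run):
np.random.seed(0)  # any fixed integer seed makes the imputed ages repeatable
rand_age = np.random.normal(loc=mean_age, scale=std_age, size=train["Age"].isnull().sum())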
Let's visualize the ages we have created to check that they look alright, and compare them with the ages that are known
plt.hist(train["Age"], bins = int(np.ceil(1+np.log2(train["Age"].notnull().sum()))))
plt.hist(rand_age, bins = int(np.ceil(1+np.log2(train["Age"].isnull().sum()))))
plt.xlabel("Age")
plt.ylabel("Number of people with that age")
plt.title("Distribution of ages in the Titanic")
plt.legend(['Known Ages',"Guessed Ages"])
<matplotlib.legend.Legend at 0x14e78646850>
The results are quite satisfactory: as you can see, the shapes of both histograms are quite similar, so we can move forward with the analysis. Just a curiosity: if you want to improve the visualization of your histograms you can use the Sturges formula (https://en.wikipedia.org/wiki/Histogram#Number_of_bins_and_width) to determine the number of bins.
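For reference, that formula is exactly what the np.ceil(1 + np.log2(n)) expressions in the plotting cell above compute; written out on its own it looks like this (just a restatement, using the known ages as n):
n = train["Age"].notnull().sum()
bins_sturges = int(np.ceil(1 + np.log2(n)))  # Sturges' rule: k = ceil(1 + log2(n))
print(f"Sturges' rule suggests {bins_sturges} bins for the {n} known ages")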
So we can now replace the NaNs with the generated ages
age_slice = train["Age"].copy()
age_slice[np.isnan(age_slice)] = rand_age
train["Age"] = age_slice
train["Age"] = train["Age"].astype(int)
Now we can check that there are no NaNs left in the "Age" column of the train data set
train['Age'].isnull().values.any()
False
survived = 'survived'
not_survived = 'not survived'
fig, axes = plt.subplots(nrows=1, ncols=2,figsize=(12, 6))
women = train[train['Sex']=='female']
men = train[train['Sex']=='male']
ax = sns.histplot(women[women['Survived']==1].Age.dropna(), bins=18, label = survived, ax = axes[0], kde =False, color="blue")
ax = sns.histplot(women[women['Survived']==0].Age.dropna(), bins=40, label = not_survived, ax = axes[0], kde =False, color="red")
ax.legend()
ax.set_title('Female')
ax = sns.histplot(men[men['Survived']==1].Age.dropna(), bins=18, label = survived, ax = axes[1], kde = False, color="blue")
ax = sns.histplot(men[men['Survived']==0].Age.dropna(), bins=40, label = not_survived, ax = axes[1], kde = False, color="red")
ax.legend()
_ = ax.set_title('Male');
As we can see, the classic phrase "women and children first" applies perfectly in this case
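To put a rough number on the "children" part of that phrase, here is a quick check I added (assuming, arbitrarily, that a child is anyone under 16):
children = train[train["Age"] < 16]  # arbitrary cut-off for "child"
print(f"{round(children['Survived'].mean()*100, 3)}% of passengers under 16 survived")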
Ticket Class¶
Was the ticket class relevant to the survival rate?
sns.set_theme(style="whitegrid")
ax = sns.barplot(x='Pclass', y='Survived', data=train,palette="hls")
ax.set(xlabel='Passenger Class', ylabel='Percentage of survivors')
plt.show()
Clearly, as with everything in life, those with more money had a higher chance of being successful, what a coincidence :D
Port of embarkation¶
I guess that the port where you embarked determined where you were situated within the ship, so we will probably be able to see some patterns when relating the embarkation port to the survival rate. Let's see.
To begin, we are going to discard the passengers whose port of embarkation we don't know, so the NaNs, as before
train['Embarked'].isnull().values.any()
True
Now we are sure that there are some null values so let's remove them
train.dropna(subset = ["Embarked"], inplace=True)
train['Embarked'].isnull().values.any()
False
FacetGrid = sns.FacetGrid(train, row='Embarked', height=4, aspect=1.6)
FacetGrid.map(sns.pointplot, 'Pclass', 'Survived', 'Sex', order=None, hue_order=None )
FacetGrid.add_legend();
In fact, we can see that the port of embarkation really did make a difference: men from Cherbourg survived at a much higher rate, while the opposite holds for women. Is this enough to say that French men were less gentlemanly than those from Ireland or England? In my humble opinion, it really is.
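If you prefer numbers to plots, a quick groupby (my own addition, just to double-check the figure) gives the survival rate for each combination of port and sex:
# Mean of the 0/1 "Survived" column per (Embarked, Sex) group = survival rate
train.groupby(["Embarked", "Sex"])["Survived"].mean()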
Relatives¶
My hypothesis here is that the fewer relatives you travel with, the higher your chances of survival, as you can take care of yourself instead of having to worry about what's happening to your mum or your little brother. But what I think is not really relevant; let's see if the data corroborates my hunch
We need to create a new column that combines the column for siblings/spouses (SibSp) with the one for parents/children (Parch); that's quite easy
train["Relatives"] = train["SibSp"] + train["Parch"]
train.tail()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Relatives | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27 | 0 | 0 | 211536 | 13.00 | NaN | S | 0 |
887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19 | 0 | 0 | 112053 | 30.00 | B42 | S | 0 |
888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | 38 | 1 | 2 | W./C. 6607 | 23.45 | NaN | S | 3 |
889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26 | 0 | 0 | 111369 | 30.00 | C148 | C | 0 |
890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32 | 0 | 0 | 370376 | 7.75 | NaN | Q | 0 |
Ok, now that the new column is created let's see whether the number of relatives made a difference
train.loc[train['Relatives'] > 0, 'travelled_alone'] = 'No'
train.loc[train['Relatives'] == 0, 'travelled_alone'] = 'Yes'
axes = sns.catplot(x='Relatives', y='Survived', data=train, kind='point', aspect=2.5)
For some reason, the optimum number of relatives for survival was 3, while big families were almost guaranteed a fatal ending.
As the number of relatives seems to help determine whether you lived or not, we'll include it in the training of the predictive model
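The travelled_alone column created above isn't used by the plot, but it gives a quick sanity check of the same idea (another small addition of mine):
# Survival rate for passengers travelling alone vs. travelling with relatives
train.groupby("travelled_alone")["Survived"].mean()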
Training the Machine Learning model¶
To train the model we have to decide on the features that best determine whether you lived or not. After the exploratory analysis we have done, the most relevant features are:
- Class
- Sex
- Relatives:
- Siblings
- Parents
- Age
- Port of embarkation
These will be the features that we take into account for the predictive model
features = ["Pclass", "Sex", "SibSp", "Parch","Embarked","Relatives"]
X = pd.get_dummies(train[features])
y = train["Survived"]
How I understand this step is like so:
- I know the input to the problem (X)
- I know the output to the problem (y)
- I know the variables that seem to be relevant to get from X to y
- I don't know the exact function that gets me from one to the other (and here's where Machine Learning comes in handy: I will create a statistical model that acts as a transfer function) --> Random Forest Classifier (https://en.wikipedia.org/wiki/Random_forest)
I definitely don't know all the ins and outs of this model, but I have some parameters I can play with, so I will try to tweak them until the results on the testing data set are similar to those on the training data set
model = RandomForestClassifier(n_estimators = 200, max_depth = 80, random_state = 10)
model.fit(X,y)
RandomForestClassifier(max_depth=80, n_estimators=200, random_state=10)
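As an aside: I said above that I would tweak the parameters until the results looked reasonable. A more systematic way to compare parameter choices (just a sketch using scikit-learn's cross_val_score, not something I actually ran for this notebook) is cross-validation on the training set:
from sklearn.model_selection import cross_val_score
# 5-fold cross-validation accuracy on the training data; a higher mean and a
# smaller spread suggest the chosen parameters generalise better
scores = cross_val_score(model, X, y, cv=5)
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")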
This first time I'm just gonna try with these params and see what the output is
test["Relatives"] = test["SibSp"] + test["Parch"]
test.head()
PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Relatives | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q | 0 |
1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0000 | NaN | S | 1 |
2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q | 0 |
3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S | 0 |
4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S | 2 |
Now, we're going to prepare the test data frame for the model to work with it.
X_test = pd.get_dummies(test[features])
# Note: dropna() and the isin() filter below return new DataFrames; without
# assigning the result back to X_test they don't change anything. With the
# selected features there are no missing values anyway, so nothing is dropped.
X_test.dropna()
X_test[~X_test.isin([np.nan, np.inf, -np.inf]).any(axis=1)]
X_test.head()
Pclass | SibSp | Parch | Relatives | Sex_female | Sex_male | Embarked_C | Embarked_Q | Embarked_S | |
---|---|---|---|---|---|---|---|---|---|
0 | 3 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
1 | 3 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 1 |
2 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
3 | 3 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
4 | 3 | 1 | 1 | 2 | 1 | 0 | 0 | 0 | 1 |
And finally we'll see what the output for the test data frame is, putting the results in a data frame called y_test. So to sum up, we've now done this:
- I have the inputs (X_test)
- I have a function (the Random Forest Classifier) that converts X_test into y_test
- y_test is now a guess of whether each of those passengers survived or not
I am then gonna compare these survival rates with the ones in the first data set.
y_test = model.predict(X_test)
y_test = pd.DataFrame(
{
"PassengerId" : test.PassengerId,
"Survived": y_test
}
)
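By the way, if you want to submit these predictions to Kaggle, the competition expects a CSV with just the PassengerId and Survived columns (the same format as gender_submission.csv), which is exactly what the y_test data frame above contains, so something like this should do (the file name is arbitrary):
# Write the predictions without the pandas index column
y_test.to_csv("submission.csv", index=False)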
Now that I have the output of the transfer function, I'm gonna merge it with the ID of each passenger and look at the survival rates we got
Results¶
results = test.join(y_test["Survived"])
This function below just computes the predicted survival rate for males and females
def SurvivalRate_expected(sex):
    # Fraction of the passengers of the given sex that the model predicts survived
    subset = results.loc[results["Sex"] == sex]
    rate_expected = subset.loc[subset["Survived"] == 1].shape[0] / subset.shape[0]
    print(f"{round(rate_expected*100,3)} % {sex} survived")
Now just calling the function
SurvivalRate_expected("female")
SurvivalRate_expected("male")
78.289 % female survived 7.519 % male survived
Let's compare the results with the known ones from the training data set
barWidth = 0.3
bars1 = [74.204, 18.891]
bars2 = [78.289, 7.519]
r1 = np.arange(len(bars1))
r2 = [x + barWidth for x in r1]
plt.bar(r1, bars1, width = barWidth, color = 'blue', edgecolor = 'black', capsize=7, label='Real')
plt.bar(r2, bars2, width = barWidth, color = 'cyan', edgecolor = 'black', capsize=7, label='Expected')
plt.xlabel('Women vs Men')
plt.ylabel('Percentage of Survivors')
plt.legend()
plt.show()
Conclusions¶
Your chances of surviving the Titanic were pretty high if you were a kid or a woman; the higher the class of your ticket, the better your chances of surviving; and that magical number of 3 relatives on the boat raised your chances quite a lot too.
Comparing the real values with the predicted ones, we can see that the algorithm was maybe a bit too optimistic about women and a little pessimistic about men.
This has been my first approach to Data Science and Machine Learning. I've learnt a lot and hope to keep doing so in the upcoming weeks; hopefully I'll tackle the same problem again and get a more accurate result, who knows?