RTG #2: Titanic

In this article, we’ll take a look at what is probably the most well-known competition on Kaggle: Titanic - Machine Learning from Disaster.

The problem

The task here is pretty simple: given some information about a passenger on the Titanic, predict whether that passenger survived. The data is already split into train and test sets, and to make a submission, competitors send a CSV file containing the labels predicted for the test set.

The data

Apart from information regarding the family members of each passenger (which was recorded in one of the weirdest ways possible), everything else was pretty standard. The data included some specific information about a person (age, sex, etc.) and their trip (port of embarkation, ticket class, passenger fare, etc.).

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.05 | | S |
| 6 | 0 | 3 | Moran, Mr. James | male | | 0 | 0 | 330877 | 8.4583 | | Q |
| 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54 | 0 | 0 | 17463 | 51.8625 | E46 | S |
| 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2 | 3 | 1 | 349909 | 21.075 | | S |
| 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27 | 0 | 2 | 347742 | 11.1333 | | S |

The family information was encoded in two variables: SibSp and Parch. Their names hint at what they measure, but I’ll explain just in case: SibSp stands for the number of siblings and spouses a passenger had on board, and Parch stands for the number of parents and children a passenger had on board.

The training set had a total of 891 records, and the test set had 418. Most of the data cleaning wasn’t all that challenging: just removing some columns that yielded no useful information (all distinct values) or were too irregular to make sense of.
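
For concreteness, here is a minimal pandas sketch of that cleaning step. Exactly which columns counted as "no useful information" is my reading of the post; PassengerId, Name, and Ticket (all or mostly distinct) and Cabin (mostly missing and irregular) are the usual candidates:

```python
import pandas as pd

# train.csv / test.csv are the files the competition provides
train = pd.read_csv("train.csv")

# Drop identifiers with all-distinct values and columns too irregular to use
train = train.drop(columns=["PassengerId", "Name", "Ticket", "Cabin"])
```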

The imputation of missing values, however, was quite interesting. Usually, I would just drop rows with missing values, or impute something like the mean of a feature. This time I checked some notebooks beforehand and found this one, in which the author applied a more nuanced strategy.

He first searched among available data for columns that had a significant influence on the distribution of the column he wanted to impute. Then he grouped the data by influential_col and used statistics from each of the groups to impute col_with_missing_values. I applied a similar (exactly the same) strategy in my imputation process.

Stats by Pclass. The Age distribution is different depending on the value of Pclass: to impute missing Age values, we check the data point’s Pclass and use the median age (purple) for that Pclass.
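
A minimal sketch of that grouped imputation, assuming the DataFrame from the cleaning step above. (For the test set, the medians should of course be computed on the training data, not the test data.)

```python
# Fill each missing Age with the median Age of that passenger's Pclass group
train["Age"] = train.groupby("Pclass")["Age"].transform(
    lambda s: s.fillna(s.median())
)
```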

The last thing worth noting was a little bit of feature engineering. After some analysis, I noticed differences between the distribution of people who had missing age information and those who didn’t, so I decided to add a new feature tagging people whose age got imputed, as sketched below.
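
Something like the one-liner below; the column name Age_was_missing is my own (the post doesn’t name the feature), and it has to be computed before the imputation step above overwrites the missing values:

```python
# Flag rows whose Age was missing prior to imputation
train["Age_was_missing"] = train["Age"].isna().astype(int)
```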

The solution

The problem is a binary classification one, and the algorithms I chose to tackle it were logistic regression, random forest, and AdaBoost. Nothing fancy. One cool thing about this lineup is that none of them strictly requires standardization of the input data: logistic regression can fit unscaled features (scaling mainly affects convergence speed and regularization), and the trees used in random forest and AdaBoost don’t care about the scale of the features at all.

It was only a matter of selecting the best performer among those. Scikit-learn’s GridSearchCV was the tool for the job, as it allowed me to try all the classifiers on the same input data, each with a different set of hyperparameters, all in one go.
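
One way to set that up is a Pipeline whose final step is swappable, so a single GridSearchCV run covers all three classifiers. This is a sketch, not the post’s actual code: the hyperparameter grids are illustrative, and X_train/y_train are assumed to be the already-encoded feature matrix and labels from the cleaning steps above.

```python
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# The "clf" step is a placeholder; each grid entry swaps in a different model
pipe = Pipeline([("clf", LogisticRegression())])

param_grid = [
    {"clf": [LogisticRegression(max_iter=1000)],
     "clf__C": [0.1, 1, 10]},
    {"clf": [RandomForestClassifier()],
     "clf__n_estimators": [100, 300],
     "clf__max_depth": [3, 5, None]},
    {"clf": [AdaBoostClassifier()],
     "clf__n_estimators": [50, 100]},
]

# 5-fold cross-validation over every model/hyperparameter combination
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)  # X_train, y_train: encoded features and labels
print(search.best_estimator_, search.best_score_)
```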

The best performer ended up being random forest, perhaps unsurprisingly. In the initial tests it overfit quite a bit, but after tuning some hyperparameters and applying cross-validation (thanks to GridSearchCV), the training accuracy was about 83% and the final submission scored 77.272%.

This post is licensed under CC BY 4.0 by the author.