Project - Titanic passenger survival

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. In this project, you build a model to predict which passengers survived the tragedy. The goal is to predict whether or not a passenger survived based on attributes such as their age, sex, passenger class, where they embarked and so on.

Objectives:

First, log in to Kaggle and go to the Titanic challenge to download train.csv and test.csv. Save them to the datasets/titanic directory.

Next, let's load the data:

import os

TITANIC_PATH = os.path.join("datasets", "titanic")

import pandas as pd

def load_titanic_data(filename, titanic_path=TITANIC_PATH):
    csv_path = os.path.join(titanic_path, filename)
    return pd.read_csv(csv_path)

train_data = load_titanic_data("train.csv")
test_data = load_titanic_data("test.csv")

The data is already split into a training set and a test set. However, the test data does not contain the labels: your goal is to train the best model you can using the training data, then make your predictions on the test data and upload them to Kaggle to see your final score.

Let's take a peek at the top few rows of the training set:

train_data.head()

The attributes have the following meaning:

  * Survived: that's the target; 0 means the passenger did not survive, while 1 means they survived.
  * Pclass: passenger class.
  * Name, Sex, Age: self-explanatory.
  * SibSp: how many siblings and spouses of the passenger were aboard the Titanic.
  * Parch: how many children and parents of the passenger were aboard the Titanic.
  * Ticket: ticket id.
  * Fare: price paid (in pounds).
  * Cabin: passenger's cabin number.
  * Embarked: where the passenger embarked the Titanic.

  1. Use the info() function to get more info about the data.
  2. Use the describe() function to look at the numerical attributes.
  3. Check that the target is indeed 0 or 1 using value_counts().
  4. Now take a quick look at all the categorical attributes (Pclass, Sex, Embarked) using value_counts().
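For instance, on a small toy DataFrame with the same columns (hypothetical values for illustration, not the real dataset), the checks in steps 1-4 look like this:

```python
import pandas as pd

# Toy frame standing in for train_data (hypothetical values)
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0],
    "Pclass":   [3, 1, 2, 3, 3],
    "Sex":      ["male", "female", "female", "male", "male"],
    "Embarked": ["S", "C", "S", "S", "Q"],
})

toy.info()                             # step 1: dtypes and non-null counts
print(toy.describe())                  # step 2: summary stats for numeric columns
print(toy["Survived"].value_counts())  # step 3: target takes only values 0 and 1
for col in ["Pclass", "Sex", "Embarked"]:
    print(toy[col].value_counts())     # step 4: category frequencies
```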
  5. Now build the preprocessing pipelines. You can use the following DataFrameSelector to select specific attributes from the DataFrame:

    from sklearn.base import BaseEstimator, TransformerMixin
    
    class DataFrameSelector(BaseEstimator, TransformerMixin):
        def __init__(self, attribute_names):
            self.attribute_names = attribute_names
        def fit(self, X, y=None):
            return self
        def transform(self, X):
            return X[self.attribute_names]
    
  6. Build the pipeline for the numerical attributes:

    from sklearn.pipeline import Pipeline
    from sklearn.impute import SimpleImputer
    
    num_pipeline = Pipeline([
            ("select_numeric", DataFrameSelector(["Age", "SibSp", "Parch", "Fare"])),
            ("imputer", SimpleImputer(strategy="median")),
        ])
    
    num_pipeline.fit_transform(train_data)
    
  7. Create an imputer for the string categorical columns (the regular SimpleImputer does not work on those):

    class MostFrequentImputer(BaseEstimator, TransformerMixin):
        def fit(self, X, y=None):
            self.most_frequent_ = pd.Series([X[c].value_counts().index[0] for c in X],
                                            index=X.columns)
            return self
        def transform(self, X, y=None):
            return X.fillna(self.most_frequent_)
    
    
    from sklearn.preprocessing import OneHotEncoder
    
    cat_pipeline = Pipeline([
            ("select_cat", DataFrameSelector(["Pclass", "Sex", "Embarked"])),
            ("imputer", MostFrequentImputer()),
            ("cat_encoder", OneHotEncoder(sparse_output=False)),  # use sparse=False on scikit-learn < 1.2
        ])
    
    cat_pipeline.fit_transform(train_data)
    
  8. Now let's join the numerical and categorical pipelines into a single preprocess_pipeline variable.
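One way to do this is with scikit-learn's FeatureUnion, which runs both pipelines side by side and concatenates their outputs column-wise. The sketch below is self-contained, so it repeats the DataFrameSelector and uses a toy two-row frame; in the exercise you would pass the num_pipeline and cat_pipeline built above instead:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Same selector as above, repeated so this sketch runs on its own
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names]

# Toy frame standing in for train_data (hypothetical values)
toy = pd.DataFrame({"Age": [22.0, None], "Fare": [7.25, 71.28],
                    "Sex": ["male", "female"]})

num_pipeline = Pipeline([
        ("select_numeric", DataFrameSelector(["Age", "Fare"])),
        ("imputer", SimpleImputer(strategy="median")),
    ])
cat_pipeline = Pipeline([
        ("select_cat", DataFrameSelector(["Sex"])),
        ("cat_encoder", OneHotEncoder()),  # default output is sparse
    ])

# FeatureUnion runs each pipeline and stacks the results column-wise
preprocess_pipeline = FeatureUnion(transformer_list=[
        ("num_pipeline", num_pipeline),
        ("cat_pipeline", cat_pipeline),
    ])

X = preprocess_pipeline.fit_transform(toy)  # 2 numeric + 2 one-hot columns
```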

  9. Then view the data:

    X_train = preprocess_pipeline.fit_transform(train_data)
    X_train
    
    y_train = train_data["Survived"]
    
  10. Now create an SVC classifier with gamma set to "auto" and store it in the svm_clf variable.

  11. Fit the SVC on the training data.
  12. Use it to predict on the test data.
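Steps 10-12 can be sketched as follows. This minimal example uses small synthetic arrays so it runs on its own; in the exercise you would fit on X_train and y_train from above, and predict on the preprocessed test set:

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic stand-ins for the preprocessed training data (hypothetical)
rng = np.random.RandomState(42)
X_train_toy = rng.rand(20, 4)
y_train_toy = (X_train_toy[:, 0] > 0.5).astype(int)

svm_clf = SVC(gamma="auto")              # step 10: SVC with gamma="auto"
svm_clf.fit(X_train_toy, y_train_toy)    # step 11: fit on the training data
y_pred = svm_clf.predict(X_train_toy)    # step 12: predict (use the test set in the exercise)
```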
