Project - Titanic passenger survival

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. In this project, you build a model to predict which passengers survived the tragedy. The goal is to predict whether or not a passenger survived based on attributes such as their age, sex, passenger class, where they embarked and so on.

Objectives:

First, log in to Kaggle and go to the Titanic challenge to download train.csv and test.csv. Save them to the datasets/titanic directory.

Next, let's load the data:

import os

TITANIC_PATH = os.path.join("datasets", "titanic")

import pandas as pd

def load_titanic_data(filename, titanic_path=TITANIC_PATH):
    csv_path = os.path.join(titanic_path, filename)
    return pd.read_csv(csv_path)

train_data = load_titanic_data("train.csv")
test_data = load_titanic_data("test.csv")

The data is already split into a training set and a test set. However, the test data does not contain the labels: your goal is to train the best model you can using the training data, then make your predictions on the test data and upload them to Kaggle to see your final score.

Let's take a peek at the top few rows of the training set:

train_data.head()

The attributes have the following meaning:

  * Survived: that's the target; 0 means the passenger did not survive, while 1 means they survived.
  * Pclass: passenger class.
  * Name, Sex, Age: self-explanatory.
  * SibSp: how many siblings and spouses of the passenger were aboard the Titanic.
  * Parch: how many children and parents of the passenger were aboard the Titanic.
  * Ticket: ticket id.
  * Fare: price paid (in pounds).
  * Cabin: passenger's cabin number.
  * Embarked: where the passenger embarked the Titanic.

  1. Use the info() function to get more info about the data.
  2. Use the describe() function to look at the numerical attributes.
  3. Check that the target is indeed 0 or 1 using value_counts().
  4. Now take a quick look at all the categorical attributes (Pclass, Sex, Embarked) using value_counts().
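For instance, on a small toy DataFrame with the same columns (hypothetical values for illustration, not the real dataset), the checks in steps 1-4 look like this:

```python
import pandas as pd

# Toy frame standing in for train_data (hypothetical values)
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0],
    "Pclass":   [3, 1, 2, 3, 3],
    "Sex":      ["male", "female", "female", "male", "male"],
    "Embarked": ["S", "C", "S", "S", "Q"],
})

toy.info()                             # step 1: dtypes and non-null counts
print(toy.describe())                  # step 2: summary stats for numeric columns
print(toy["Survived"].value_counts())  # step 3: target takes only values 0 and 1
for col in ["Pclass", "Sex", "Embarked"]:
    print(toy[col].value_counts())     # step 4: category frequencies
```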
  5. Now build the preprocessing pipelines. You can use the following DataFrameSelector to select specific attributes from the DataFrame:

    from sklearn.base import BaseEstimator, TransformerMixin
    
    class DataFrameSelector(BaseEstimator, TransformerMixin):
        def __init__(self, attribute_names):
            self.attribute_names = attribute_names
        def fit(self, X, y=None):
            return self
        def transform(self, X):
            return X[self.attribute_names]
    
  6. Build the pipeline for the numerical attributes:

    from sklearn.pipeline import Pipeline
    from sklearn.impute import SimpleImputer
    
    num_pipeline = Pipeline([
            ("select_numeric", DataFrameSelector(["Age", "SibSp", "Parch", "Fare"])),
            ("imputer", SimpleImputer(strategy="median")),
        ])
    
    num_pipeline.fit_transform(train_data)
    
  7. Create an imputer for the string categorical columns (the regular SimpleImputer does not work on those):

    class MostFrequentImputer(BaseEstimator, TransformerMixin):
        def fit(self, X, y=None):
            self.most_frequent_ = pd.Series([X[c].value_counts().index[0] for c in X],
                                            index=X.columns)
            return self
        def transform(self, X, y=None):
            return X.fillna(self.most_frequent_)
    
    
    from sklearn.preprocessing import OneHotEncoder
    
    cat_pipeline = Pipeline([
            ("select_cat", DataFrameSelector(["Pclass", "Sex", "Embarked"])),
            ("imputer", MostFrequentImputer()),
            ("cat_encoder", OneHotEncoder(sparse_output=False)),  # use sparse=False on scikit-learn < 1.2
        ])
    
    cat_pipeline.fit_transform(train_data)
    
  8. Now let's join the numerical and categorical pipelines into a single preprocess_pipeline variable.
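One way to do this is with scikit-learn's FeatureUnion, which runs both pipelines side by side and concatenates their outputs column-wise. The sketch below is self-contained, so it repeats the DataFrameSelector and uses a toy two-row frame; in the exercise you would pass the num_pipeline and cat_pipeline built above instead:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Same selector as above, repeated so this sketch runs on its own
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names]

# Toy frame standing in for train_data (hypothetical values)
toy = pd.DataFrame({"Age": [22.0, None], "Fare": [7.25, 71.28],
                    "Sex": ["male", "female"]})

num_pipeline = Pipeline([
        ("select_numeric", DataFrameSelector(["Age", "Fare"])),
        ("imputer", SimpleImputer(strategy="median")),
    ])
cat_pipeline = Pipeline([
        ("select_cat", DataFrameSelector(["Sex"])),
        ("cat_encoder", OneHotEncoder()),  # default output is sparse
    ])

# FeatureUnion runs each pipeline and stacks the results column-wise
preprocess_pipeline = FeatureUnion(transformer_list=[
        ("num_pipeline", num_pipeline),
        ("cat_pipeline", cat_pipeline),
    ])

X = preprocess_pipeline.fit_transform(toy)  # 2 numeric + 2 one-hot columns
```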

  9. Then view the data:

    X_train = preprocess_pipeline.fit_transform(train_data)
    X_train
    
    y_train = train_data["Survived"]
    
  10. Now create an SVC classifier with gamma set to "auto" and store it in the svm_clf variable.

  11. Fit the SVC on the training data.
  12. Use it to predict on the test data.
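Steps 10-12 can be sketched as follows. This minimal example uses small synthetic arrays so it runs on its own; in the exercise you would fit on X_train and y_train from above, and predict on the preprocessed test set:

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic stand-ins for the preprocessed training data (hypothetical)
rng = np.random.RandomState(42)
X_train_toy = rng.rand(20, 4)
y_train_toy = (X_train_toy[:, 0] > 0.5).astype(int)

svm_clf = SVC(gamma="auto")              # step 10: SVC with gamma="auto"
svm_clf.fit(X_train_toy, y_train_toy)    # step 11: fit on the training data
y_pred = svm_clf.predict(X_train_toy)    # step 12: predict (use the test set in the exercise)
```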
