The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. In this project, you build a model to predict which passengers survived the tragedy. The goal is to predict whether or not a passenger survived based on attributes such as their age, sex, passenger class, where they embarked and so on.
Objectives:
First, login to Kaggle and go to the Titanic challenge to download train.csv and test.csv. Save them to the datasets/titanic directory.
Next, let's load the data:
import os
import pandas as pd

TITANIC_PATH = os.path.join("datasets", "titanic")

def load_titanic_data(filename, titanic_path=TITANIC_PATH):
    csv_path = os.path.join(titanic_path, filename)
    return pd.read_csv(csv_path)

train_data = load_titanic_data("train.csv")
test_data = load_titanic_data("test.csv")
The data is already split into a training set and a test set. However, the test data does not contain the labels: your goal is to train the best model you can using the training data, then make your predictions on the test data and upload them to Kaggle to see your final score.
Let's take a peek at the top few rows of the training set:
train_data.head()
The attributes have the following meaning:
- Survived: the target; 0 means the passenger did not survive, 1 means he/she survived.
- Pclass: passenger class.
- Name, Sex, Age: self-explanatory.
- SibSp: how many siblings & spouses of the passenger were aboard the Titanic.
- Parch: how many children & parents of the passenger were aboard the Titanic.
- Ticket: ticket id.
- Fare: price paid (in pounds).
- Cabin: passenger's cabin number.
- Embarked: where the passenger embarked the Titanic.
Now let's build the preprocessing pipelines. You can use the following DataFrameSelector class to select specific attributes from the DataFrame:
from sklearn.base import BaseEstimator, TransformerMixin

# Selects the given attributes from a DataFrame, dropping the rest
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names]
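As a quick sanity check, the selector can be exercised on a toy DataFrame (a made-up stand-in for train_data; the class definition is repeated so the snippet runs standalone):

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameSelector(BaseEstimator, TransformerMixin):
    """Selects the given columns from a DataFrame, dropping the rest."""
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names]

# Toy frame standing in for train_data
df = pd.DataFrame({"Age": [22.0, 38.0],
                   "Sex": ["male", "female"],
                   "Fare": [7.25, 71.28]})
selected = DataFrameSelector(["Age", "Fare"]).fit_transform(df)
# selected now holds only the Age and Fare columns
```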
Build the pipeline for the numerical attributes:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
num_pipeline = Pipeline([
    ("select_numeric", DataFrameSelector(["Age", "SibSp", "Parch", "Fare"])),
    ("imputer", SimpleImputer(strategy="median")),
])
num_pipeline.fit_transform(train_data)
Create an imputer for the string categorical columns (the regular SimpleImputer does not work on those):
# Fills missing values in each column with that column's most frequent value
class MostFrequentImputer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        self.most_frequent_ = pd.Series([X[c].value_counts().index[0] for c in X],
                                        index=X.columns)
        return self
    def transform(self, X, y=None):
        return X.fillna(self.most_frequent_)
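A quick check on a toy column (a made-up stand-in for the Embarked attribute) shows the imputer filling a missing value with the most frequent category; the class definition is repeated so the snippet runs standalone:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class MostFrequentImputer(BaseEstimator, TransformerMixin):
    """Fills missing values in each column with that column's most frequent value."""
    def fit(self, X, y=None):
        self.most_frequent_ = pd.Series([X[c].value_counts().index[0] for c in X],
                                        index=X.columns)
        return self
    def transform(self, X, y=None):
        return X.fillna(self.most_frequent_)

# Toy column standing in for the Embarked attribute
df = pd.DataFrame({"Embarked": ["S", "S", None, "C"]})
filled = MostFrequentImputer().fit_transform(df)
# The missing value becomes "S", the most frequent port
```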
from sklearn.preprocessing import OneHotEncoder
cat_pipeline = Pipeline([
    ("select_cat", DataFrameSelector(["Pclass", "Sex", "Embarked"])),
    ("imputer", MostFrequentImputer()),
    ("cat_encoder", OneHotEncoder(sparse_output=False)),  # use sparse=False on scikit-learn < 1.2
])
cat_pipeline.fit_transform(train_data)
Now let's join the numerical and categorical pipelines into a single preprocess_pipeline variable.
Then transform the training data and view the result:
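One way to build preprocess_pipeline is with scikit-learn's FeatureUnion, which concatenates the outputs of the two pipelines side by side. The sketch below repeats the class and pipeline definitions from above so it runs standalone, and exercises them on a tiny made-up stand-in for train_data:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

class DataFrameSelector(BaseEstimator, TransformerMixin):
    """Selects the given columns from a DataFrame, dropping the rest."""
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names]

class MostFrequentImputer(BaseEstimator, TransformerMixin):
    """Fills missing values in each column with that column's most frequent value."""
    def fit(self, X, y=None):
        self.most_frequent_ = pd.Series([X[c].value_counts().index[0] for c in X],
                                        index=X.columns)
        return self
    def transform(self, X, y=None):
        return X.fillna(self.most_frequent_)

try:  # scikit-learn >= 1.2
    encoder = OneHotEncoder(sparse_output=False)
except TypeError:  # older scikit-learn
    encoder = OneHotEncoder(sparse=False)

num_pipeline = Pipeline([
    ("select_numeric", DataFrameSelector(["Age", "SibSp", "Parch", "Fare"])),
    ("imputer", SimpleImputer(strategy="median")),
])
cat_pipeline = Pipeline([
    ("select_cat", DataFrameSelector(["Pclass", "Sex", "Embarked"])),
    ("imputer", MostFrequentImputer()),
    ("cat_encoder", encoder),
])

# Concatenate the numeric and one-hot-encoded categorical features column-wise
preprocess_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("cat_pipeline", cat_pipeline),
])

# Tiny made-up stand-in for train_data
sample = pd.DataFrame({
    "Age": [22.0, None, 26.0], "SibSp": [1, 1, 0], "Parch": [0, 0, 0],
    "Fare": [7.25, 71.28, 7.92], "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"], "Embarked": ["S", "C", None],
})
X_sample = preprocess_pipeline.fit_transform(sample)
# 4 numeric columns + one-hot columns for Pclass, Sex and Embarked
```

On the real train_data, the same fit_transform call produces the X_train matrix used below.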
X_train = preprocess_pipeline.fit_transform(train_data)
X_train
y_train = train_data["Survived"]
Now train an SVC classifier with gamma set to "auto", storing it in the svm_clf variable.
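A minimal sketch of that step, using a tiny made-up feature matrix in place of the X_train and y_train produced above:

```python
import numpy as np
from sklearn.svm import SVC

# Tiny made-up stand-in for the preprocessed X_train and y_train
X_small = np.array([[22.0, 1.0], [38.0, 0.0], [26.0, 0.0], [35.0, 1.0]])
y_small = np.array([0, 1, 1, 0])

# gamma="auto" sets the RBF kernel coefficient to 1 / n_features
svm_clf = SVC(gamma="auto")
svm_clf.fit(X_small, y_small)
predictions = svm_clf.predict(X_small)
```

On the real data, fit on X_train and y_train instead, then predict on the preprocessed test set for the Kaggle submission.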