
End to End ML Project - Fashion MNIST - Fine-Tuning the Model - Grid Search - Dimensionality Reduction

Having finished cross-validation and selected the XGBoost model, we will now fine-tune it using the 'Grid Search' technique.

Grid search takes a lot of time on large datasets. Hence, let us apply 'Dimensionality Reduction' to the training dataset to reduce its number of features, so that both the grid search and the predictions take less time. We will also calculate the scores on these reduced features.
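As a rough illustration of the speed-up (a sketch with synthetic data and a simple stand-in classifier, not our XGBoost pipeline — the sizes and model here are hypothetical), fitting the same model on fewer features takes noticeably less time:

```python
import time
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import SGDClassifier

# Synthetic stand-in for the 784-feature Fashion-MNIST training set
rng = np.random.RandomState(0)
X = rng.rand(2000, 784)
y = rng.randint(0, 10, 2000)

# Reduce to 50 features (an arbitrary choice for this sketch)
pca = PCA(n_components=50)
X_reduced = pca.fit_transform(X)

# Fit the same classifier on the full and the reduced data, and time it
for name, data in [("full (784 features)", X), ("reduced (50 features)", X_reduced)]:
    t0 = time.time()
    SGDClassifier(max_iter=5, tol=None, random_state=0).fit(data, y)
    print(name, "->", round(time.time() - t0, 2), "s")
```

A grid search repeats such fits once per hyperparameter combination and cross-validation fold, so the per-fit saving multiplies accordingly.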

We will also check whether dimensionality reduction leads to any significant loss of information from the images in our training dataset. If it does, we will not use dimensionality reduction for this training dataset (and hence for this problem).

Our dataset is not like a Swiss roll, so we do not need to unroll a 3-dimensional dataset onto a 2-dimensional plane. Hence, we won't be using a Manifold Learning technique for dimensionality reduction here.

Instead, we will be using a Projection technique (PCA) for dimensionality reduction in this problem.

We will use Scikit-Learn's PCA class, which internally uses SVD (Singular Value Decomposition) to find the principal components and then performs the projection.
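To see what "uses SVD internally" means, here is a small sketch on synthetic data (not Fashion-MNIST): Scikit-Learn's PCA projection matches the projection computed manually from NumPy's SVD of the centered data, up to the sign of each component.

```python
import numpy as np
from sklearn.decomposition import PCA

# Small synthetic dataset, for illustration only
rng = np.random.RandomState(42)
X = rng.rand(100, 5)

# PCA: Scikit-Learn centers the data, then applies SVD and projects
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# The same projection computed manually via SVD
X_centered = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)
X_svd = X_centered @ Vt[:2].T  # project onto the top-2 right singular vectors

# The two agree, up to the sign of each component
print(np.allclose(np.abs(X_pca), np.abs(X_svd)))
```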

You can experiment with various values of n_components (variance ratio).

For the current problem, with n_components=0.95 the reduced dataset (X_train_reduced) retained only 187 features (out of the original 784), and there was a significant loss of information (quality) in the 'recovered' (decompressed) images. Hence, we selected n_components=0.99, which retains 459 features (out of the original 784) with no significant loss of information (quality) in the 'recovered' images.

Comparing the images of the 'original' dataset with the 'recovered' images (obtained after decompression) shows that there is not much information loss from dimensionality reduction with a 0.99 variance ratio. Hence, we will go ahead and perform the Grid Search using this 'reduced' training dataset (X_train_reduced).
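For reference, when n_components is a float between 0 and 1, Scikit-Learn picks the smallest number of components whose cumulative explained variance reaches that ratio. A small sketch on synthetic data illustrates this (the 187 and 459 figures above come from Fashion-MNIST and are not reproduced here):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic correlated data as a stand-in for X_train
rng = np.random.RandomState(0)
X = rng.rand(200, 50) @ rng.rand(50, 50)

full = PCA().fit(X)  # keep all components, to inspect the variance spectrum
cumulative = np.cumsum(full.explained_variance_ratio_)

for ratio in (0.95, 0.99):
    pca = PCA(n_components=ratio).fit(X)
    # Smallest d whose cumulative explained variance reaches the ratio
    d = int(np.argmax(cumulative >= ratio)) + 1
    print(f"ratio={ratio}: {pca.n_components_} of {X.shape[1]} components")
```

A higher target ratio always keeps at least as many components, which is why 0.99 gives 459 features where 0.95 gave only 187.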

INSTRUCTIONS

For dimensionality reduction, please follow the steps below:

Import PCA from Scikit-Learn

from <<your code comes here>> import PCA

Create an instance of PCA called 'pca', by passing to it the parameter n_components=0.99 (i.e. variance ratio of 0.99)

pca = PCA(<<your code comes here>>)

Apply PCA on the training dataset X_train and save the result in a variable called X_train_reduced

X_train_reduced = pca.<<your code comes here>>(X_train)

Please check the number of components (features) present in the X_train_reduced dataset

pca.<<your code comes here>>

Please check if you have hit a total of 99% explained variance ratio with the selected number of components:

np.sum(pca.<<your code comes here>>)

Please check whether there is any loss of information due to dimensionality reduction. You can do this by recovering (decompressing) some of the images (instances) of the X_train_reduced dataset and inspecting them.

Please use the inverse_transform function to decompress the compressed dataset (X_train_reduced) back to 784 dimensions, and save the resulting dataset in the X_train_recovered variable.

X_train_recovered = pca.<<your code comes here>>(<<your code comes here>>)

Please use the code and function below as they are. They will display the original images alongside the corresponding 'compressed' images (recovered after decompression).

import matplotlib
import matplotlib.pyplot as plt
import numpy as np

def plot_digits(instances, images_per_row=5, **options):
    size = 28
    images_per_row = min(len(instances), images_per_row)
    images = [instance.reshape(size,size) for instance in instances]
    n_rows = (len(instances) - 1) // images_per_row + 1
    row_images = []
    n_empty = n_rows * images_per_row - len(instances)
    images.append(np.zeros((size, size * n_empty)))
    for row in range(n_rows):
        rimages = images[row * images_per_row : (row + 1) * images_per_row]
        row_images.append(np.concatenate(rimages, axis=1))
    image = np.concatenate(row_images, axis=0)
    plt.imshow(image, cmap = matplotlib.cm.binary, **options)
    plt.axis("off")

plt.figure(figsize=(7, 4))
plt.subplot(121)
# Plotting 'original' image
plot_digits(X_train[::2100])
plt.title("Original", fontsize=16)
plt.subplot(122)
# Plotting the corresponding 'recovered' image
plot_digits(X_train_recovered[::2100])
plt.title("Compressed", fontsize=16)
plt.show()

The comparison above of the 'original' dataset images and the 'recovered' images confirms that there is not much information loss from dimensionality reduction with a 0.99 variance ratio.
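Beyond eyeballing the plots, the information loss can also be quantified as the mean squared reconstruction error between the original and recovered data. A sketch on synthetic data (64 features standing in for our 784; the exact error on Fashion-MNIST will differ):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for X_train (64 features instead of 784)
rng = np.random.RandomState(1)
X = rng.rand(300, 64)

pca = PCA(n_components=0.99)
X_reduced = pca.fit_transform(X)
X_recovered = pca.inverse_transform(X_reduced)  # back to 64 dimensions

# With a 0.99 variance ratio, at most ~1% of the total variance is
# discarded, so the per-entry reconstruction error is tiny
mse = np.mean((X - X_recovered) ** 2)
print(X_reduced.shape, "->", X_recovered.shape, "mse =", round(mse, 5))
```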

