End-to-End ML Project - California Housing

End-to-End ML Project - Explore the dataset

Now we will explore the dataset. We will use the hist method to plot histograms of the data. A histogram visually represents the distribution of the data rather than the individual values; simply put, it summarizes discrete or continuous data.

The hist method takes a bins parameter. Bins, sometimes also called classes, intervals, or buckets, are the groups of equal width into which the data is separated. Each bin is plotted as a bar whose height corresponds to the number of data points in that bin.
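To make the idea of bins concrete, here is a minimal sketch that computes the counts behind a histogram without drawing it. It uses NumPy's histogram function on a handful of made-up values; the array `values` and the choice of 3 bins are assumptions for illustration only and are not part of the housing dataset.

```python
import numpy as np

# Hypothetical sample values, chosen only for illustration.
values = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 7.0, 8.0, 9.5])

# Ask for 3 equal-width bins spanning the range of the data.
counts, edges = np.histogram(values, bins=3)

# Width of each bin; with an integer `bins`, all widths are equal.
widths = np.diff(edges)

print("bin edges:", edges)
print("bin widths:", widths)
print("points per bin:", counts)  # these counts are the bar heights
```

Every value lands in exactly one bin, so the counts add up to the number of values. Increasing the number of bins gives narrower bars and a finer view of the distribution.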

We will also use the cut method from Pandas, which segments and sorts data values into bins. cut is also helpful for converting a continuous variable into a categorical one; for example, it could convert ages into groups of age ranges. It supports binning into a given number of equal-width bins or into a pre-specified array of bin edges. The labels parameter specifies the labels for the returned bins and must be the same length as the resulting bins. Note also that the list of bins used here ends with np.inf, the floating-point representation of positive infinity, so the last bin has no upper limit.
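The age example mentioned above can be sketched as follows. The `ages` values, the bin edges, and the label names are hypothetical and chosen only to show how bins, labels, and np.inf work together.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, for illustration only.
ages = pd.Series([3, 17, 25, 40, 64, 90])

# Pre-specified bin edges: (0, 18], (18, 65], (65, inf].
# Three bins, so exactly three labels are required.
age_group = pd.cut(ages,
                   bins=[0, 18, 65, np.inf],
                   labels=["child", "adult", "senior"])
print(age_group.tolist())

# Passing an integer instead asks for that many equal-width bins.
equal_width = pd.cut(ages, bins=3)
print(equal_width.cat.categories)
```

By default each bin includes its right edge, so a value of exactly 18 would fall into the first bin. The result is a categorical Series, which is what makes cut useful for turning a continuous variable into a categorical one.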

INSTRUCTIONS
  • Use the info method to get more information on the dataset

    housing.<<your code goes here>>
    
  • Get a better understanding of the mean, standard deviation, maximum value, and other such statistics of the dataset by using the describe method

    housing.<<your code goes here>>
    
  • Plot histograms of all the features using the hist method

    housing.<<your code goes here>>(bins=50, figsize=(20,15))
    plt.show()
    
  • Plot a histogram of the median income attribute of the dataset

    housing["median_income"].<<your code goes here>>
    
  • Divide the median income attribute into labeled bins using the cut method, and then plot a histogram of the resulting categories

    housing["income_cat"] = pd.<<your code goes here>>(housing["median_income"],
                                   bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                                   labels=[1, 2, 3, 4, 5])
    
    housing["income_cat"].hist()
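Besides plotting, one way to inspect a binned column is to count how many rows fall into each category. The sketch below applies the same bin edges and labels as the last instruction to a small, made-up Series; the `incomes` values are hypothetical stand-ins and do not come from the housing dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical median income values, for illustration only.
incomes = pd.Series([0.9, 2.1, 2.8, 3.2, 4.4, 5.0, 7.5, 12.0])

# Same bin edges and labels as in the exercise above.
income_cat = pd.cut(incomes,
                    bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                    labels=[1, 2, 3, 4, 5])

# Count the rows in each category, ordered by category label.
counts = income_cat.value_counts().sort_index()
print(counts)
```

These counts are the bar heights that the histogram of the categorical column would show. Because the last bin edge is np.inf, even very large incomes are assigned to category 5 rather than being left without a category.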
    