End-to-End ML Project- Beginner friendly

Scatter Matrix

The correlation matrix shows how the attributes of our dataset relate to one another. For example, the attribute population has a strong positive correlation with the attributes total_rooms, total_bedrooms, and households. This makes sense: where there are more people, more rooms are needed.

Remember, correlation only measures the linear relationship between variables. A correlation close to zero doesn't mean that the variables are unrelated; it only means that there is no linear relationship between them. There can still be a non-linear relationship.
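To make this concrete, here is a minimal sketch using made-up data (not the housing dataset): y is completely determined by x, yet because the relationship is a symmetric parabola, the Pearson correlation comes out at essentially zero.

```python
import numpy as np

# x values placed symmetrically around zero (illustrative data only)
x = np.linspace(-1.0, 1.0, 201)

# y depends entirely on x, but the dependence is not linear
y = x ** 2

# Pearson correlation coefficient between x and y
r = np.corrcoef(x, y)[0, 1]
print(f"Pearson r = {r:.3f}")
```

The printed value is zero up to floating-point rounding, even though knowing x tells us y exactly.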

We can also visualize correlations with the scatter_matrix() function from pandas.plotting. Its syntax is:

pd.plotting.scatter_matrix(DataFrame)

where DataFrame is the DataFrame whose attributes we want to plot.

It plots every numerical attribute against every other numerical attribute. On the diagonal, where an attribute would be plotted against itself, scatter_matrix() plots a histogram of that attribute by default instead of a scatter plot.
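As a rough illustration, the sketch below calls scatter_matrix() on a small synthetic DataFrame. The column names a, b, and c and the random data are assumptions made up for this example; they are not part of the housing dataset. The Agg backend line is only there so the code runs without a display and can be dropped in a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook

import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Synthetic data: b is roughly linear in a, c is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.5, size=200),
    "c": rng.normal(size=200),
})

# One subplot per pair of columns; histograms on the diagonal by default
axes = scatter_matrix(demo, figsize=(8, 6))
print(axes.shape)
```

With 3 numerical columns, the returned array of axes is 3 by 3, matching the "every attribute against every other attribute" layout described above.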

Refer to scatter_matrix documentation for further details about the method.

As there are 11 features, we would get 11*11, i.e., 121 plots, which would be difficult to fit on a page. So, we will only take the top 4 attributes that are most strongly correlated with our target attribute.
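One possible way to rank attributes by the strength of their correlation with a target, regardless of sign, is to take absolute values before sorting. The sketch below uses a toy DataFrame with hypothetical column names (target, strong_pos, strong_neg, weak, noise) and picks the top 2; it is an illustration of the idea, not the course's reference solution.

```python
import numpy as np
import pandas as pd

# Toy data with known relationships to a hypothetical "target" column
rng = np.random.default_rng(42)
n = 500
target = rng.normal(size=n)
toy = pd.DataFrame({
    "target": target,
    "strong_pos": target + rng.normal(scale=0.3, size=n),
    "strong_neg": -target + rng.normal(scale=0.5, size=n),
    "weak": 0.2 * target + rng.normal(size=n),
    "noise": rng.normal(size=n),
})

# Correlation of every other column with the target
corr_with_target = toy.corr()["target"].drop("target")

# Sort by absolute value so strong negative correlations also rank high
ranked = corr_with_target.abs().sort_values(ascending=False)
top2 = ranked.head(2).index.tolist()
print(top2)
```

Sorting the raw correlations without abs() would push strongly negative attributes to the bottom, which is why the absolute value matters when direction is irrelevant.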

INSTRUCTIONS

Plot the scatter_matrix of the DataFrame train_copy for the attribute median_house_value together with the top 4 attributes that are most strongly correlated with it (irrespective of the direction of correlation). That makes a total of 5 attributes, generating 5*5 = 25 plots.

Specify the parameter: figsize = (12,10).

Note: We can select specific attributes by passing a list of attribute names, each within single or double quotes and separated by commas, like this:

DataFrame_name[["attribute_name1", "attribute_name2", "attribute_name3",.....]]
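For instance, with a small made-up DataFrame (the columns x, y, and z are hypothetical), double brackets select several columns and return a DataFrame, while single brackets with one name return a Series:

```python
import pandas as pd

# Small illustrative DataFrame
df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6], "z": [7, 8, 9]})

subset = df[["x", "z"]]  # list of names -> DataFrame with those columns
single = df["x"]         # single name -> Series

print(type(subset).__name__, list(subset.columns))
print(type(single).__name__)
```

Since scatter_matrix() expects a DataFrame, the double-bracket form is the one to use when passing a subset of attributes to it.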