{"id":713,"date":"2017-09-13T11:53:32","date_gmt":"2017-09-13T11:53:32","guid":{"rendered":"http:\/\/blog.cloudxlab.com\/?p=713"},"modified":"2017-09-13T12:11:41","modified_gmt":"2017-09-13T12:11:41","slug":"predicting-income-level-case-study-r","status":"publish","type":"post","link":"https:\/\/cloudxlab.com\/blog\/predicting-income-level-case-study-r\/","title":{"rendered":"Predicting Income Level, An Analytics Casestudy in R"},"content":{"rendered":"<h1><img class=\"aligncenter wp-image-734 size-full\" src=\"http:\/\/blog.cloudxlab.com\/wp-content\/uploads\/2017\/09\/Percentage-of-Income-more-than-50k-Country-wise.png\" alt=\"Percentage of Income more than 50k Country wise\" width=\"614\" height=\"359\" \/><\/h1>\n<h1>1. Introduction<\/h1>\n<p>In this data analytics case study, we will use the US census data to build a model to predict if the income of any individual in the US is greater than or less than USD 50000 based on the information available about that individual in the census data.<\/p>\n<p>The dataset used for the analysis is an extraction from the 1994 census data by Barry Becker and donated to the public site <a href=\"http:\/\/archive.ics.uci.edu\/ml\/datasets\/Census+Income\">http:\/\/archive.ics.uci.edu\/ml\/datasets\/Census+Income<\/a>. This dataset is popularly called the \u201cAdult\u201d data set. We will go through this case study in the following order:<\/p>\n<ol>\n<li><strong>Describe the data-\u00a0<\/strong>Specifically the predictor variables (also called independent variables or features) from the Census data and the dependent variable which is the level of income (either \u201cgreater than USD 50000\u201d or \u201cless than USD 50000\u201d).<\/li>\n<li><strong>Acquire and Read the data-\u00a0<\/strong>Downloading the data directly from the source and reading it.<\/li>\n<li><strong>Clean the data-\u00a0<\/strong>Any data from the real world is always messy and noisy. 
The data needs to be reshaped to aid exploration and the modeling that predicts the income level.<\/li>\n<li><strong>Explore the independent variables of the data-\u00a0<\/strong>Exploring the independent variables is a crucial step before modeling. Exploration gives the analyst insight into the predictive power of each variable. The analyst looks at the distribution of the variable, how well it separates the income levels, how skewed it is, etc. In most analytics projects, the analyst then goes back to get more data, better context, or clarity on the findings.<\/li>\n<li><strong>Build the prediction model with the training data-\u00a0<\/strong>Since data like the Census data can have many weak predictors, for this particular case study I have chosen the non-parametric algorithm of Boosting. Boosting is used here as a classification algorithm (we classify if an individual\u2019s income is \u201cgreater than USD 50000\u201d or \u201cless than USD 50000\u201d) and is well suited to combining many weak predictors into a stronger one. Cross validation, a mechanism to reduce overfitting while modeling, is also used with Boosting.<\/li>\n<li><strong>Validate the prediction model with the testing data-\u00a0<\/strong>Here the built model is applied on test data that the model has never seen. This is performed to estimate the accuracy the model would have in the field once deployed. Since this is a case study, only the crucial steps are retained to keep the content concise and readable.<\/li>\n<\/ol>\n<p><!--more--><\/p>\n<h2>2. About the Data<\/h2>\n<p>As mentioned earlier, the data set is from\u00a0<a href=\"http:\/\/archive.ics.uci.edu\/ml\/datasets\/Census+Income\">http:\/\/archive.ics.uci.edu\/ml\/datasets\/Census+Income<\/a>.<\/p>\n<h3>2.1 Dependent Variable<\/h3>\n<p>The dependent variable is \u201cincomelevel\u201d, representing the level of income. 
A value of \u201c&lt;=50K\u201d indicates \u201cless than or equal to USD 50000\u201d and \u201c&gt;50K\u201d indicates \u201cgreater than USD 50000\u201d.<\/p>\n<h3>2.2 Independent Variable<\/h3>\n<p>Below are the independent variables (features or predictors) from the Census Data<\/p>\n\n<table id=\"tablepress-2\" class=\"tablepress tablepress-id-2\">\n<thead>\n<tr class=\"row-1 odd\">\n\t<th class=\"column-1\">Variable Name<\/th><th class=\"column-2\">Description<\/th><th class=\"column-3\">Type<\/th><th class=\"column-4\">Possible Values<br \/>\n<\/th>\n<\/tr>\n<\/thead>\n<tbody class=\"row-hover\">\n<tr class=\"row-2 even\">\n\t<td class=\"column-1\">Age<\/td><td class=\"column-2\">Age of the individual<br \/>\n<\/td><td class=\"column-3\">Continuous<\/td><td class=\"column-4\">Numeric<\/td>\n<\/tr>\n<tr class=\"row-3 odd\">\n\t<td class=\"column-1\">Workclass<\/td><td class=\"column-2\">Class of Work<br \/>\n<\/td><td class=\"column-3\">Categorical<\/td><td class=\"column-4\">Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked<\/td>\n<\/tr>\n<tr class=\"row-4 even\">\n\t<td class=\"column-1\">fnlwgt<\/td><td class=\"column-2\">Final Weight Determined by Census Org<\/td><td class=\"column-3\">Continuous<\/td><td class=\"column-4\">Numeric<\/td>\n<\/tr>\n<tr class=\"row-5 odd\">\n\t<td class=\"column-1\">Education<\/td><td class=\"column-2\">Education of the individual<\/td><td class=\"column-3\">Ordered Factor<\/td><td class=\"column-4\">Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool<\/td>\n<\/tr>\n<tr class=\"row-6 even\">\n\t<td class=\"column-1\">Education-num<\/td><td class=\"column-2\">Number of years of education<\/td><td class=\"column-3\">Continuous<\/td><td class=\"column-4\">Numeric<\/td>\n<\/tr>\n<tr class=\"row-7 odd\">\n\t<td class=\"column-1\">Marital-status<br \/>\n<\/td><td 
class=\"column-2\">Marital status of the individual<\/td><td class=\"column-3\">Categorical<\/td><td class=\"column-4\">Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse<\/td>\n<\/tr>\n<tr class=\"row-8 even\">\n\t<td class=\"column-1\">Occupation<\/td><td class=\"column-2\">Occupation of the individual<\/td><td class=\"column-3\">Categorical<\/td><td class=\"column-4\">Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces<\/td>\n<\/tr>\n<tr class=\"row-9 odd\">\n\t<td class=\"column-1\">Relationship<\/td><td class=\"column-2\">Present relationship<\/td><td class=\"column-3\">Categorical<\/td><td class=\"column-4\">Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried<\/td>\n<\/tr>\n<tr class=\"row-10 even\">\n\t<td class=\"column-1\">Race<\/td><td class=\"column-2\">Race of the individual<\/td><td class=\"column-3\">Categorical<\/td><td class=\"column-4\">White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black<\/td>\n<\/tr>\n<tr class=\"row-11 odd\">\n\t<td class=\"column-1\">Sex<\/td><td class=\"column-2\">Sex of the individual<\/td><td class=\"column-3\">Categorical<\/td><td class=\"column-4\">Female, Male<\/td>\n<\/tr>\n<tr class=\"row-12 even\">\n\t<td class=\"column-1\">Capital-gain<\/td><td class=\"column-2\">Capital gain made by the individual<\/td><td class=\"column-3\">Continuous<\/td><td class=\"column-4\">Numeric<\/td>\n<\/tr>\n<tr class=\"row-13 odd\">\n\t<td class=\"column-1\">Capital-loss<br \/>\n<\/td><td class=\"column-2\">Capital loss made by the individual<\/td><td class=\"column-3\">Continuous<\/td><td class=\"column-4\">Numeric<\/td>\n<\/tr>\n<tr class=\"row-14 even\">\n\t<td class=\"column-1\">Hours-per-week<br \/>\n<\/td><td class=\"column-2\">Average number of hours spent by the individual on work<\/td><td 
class=\"column-3\">Continuous<\/td><td class=\"column-4\">Numeric<\/td>\n<\/tr>\n<tr class=\"row-15 odd\">\n\t<td class=\"column-1\">Native-country<\/td><td class=\"column-2\">Native country of the individual<\/td><td class=\"column-3\">Categorical<\/td><td class=\"column-4\">United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&amp;Tobago, Peru, Hong, Holand-Netherlands<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<!-- #tablepress-2 from cache -->\n<h2>3. Download and Read the Data<\/h2>\n<p>Training data and test data are available as separate files at the UCI source. Both data files are downloaded as below. The test file is set aside until model validation.<\/p>\n<pre class=\"lang:r decode:true \">trainFileName = \"adult.data\"; testFileName = \"adult.test\"\r\n\r\nif (!file.exists (trainFileName))\r\n    download.file (url = \"http:\/\/archive.ics.uci.edu\/ml\/machine-learning-databases\/adult\/adult.data\", \r\n                   destfile = trainFileName)\r\n\r\nif (!file.exists (testFileName))\r\n    download.file (url = \"http:\/\/archive.ics.uci.edu\/ml\/machine-learning-databases\/adult\/adult.test\", \r\n                   destfile = testFileName)<\/pre>\n<p>As the training data file does not contain the variable names, the variable names are explicitly specified while reading the data set. 
While reading the data, extra spaces are stripped.<\/p>\n<pre class=\"lang:r decode:true \">colNames = c (\"age\", \"workclass\", \"fnlwgt\", \"education\", \r\n              \"educationnum\", \"maritalstatus\", \"occupation\",\r\n              \"relationship\", \"race\", \"sex\", \"capitalgain\",\r\n              \"capitalloss\", \"hoursperweek\", \"nativecountry\",\r\n              \"incomelevel\")\r\n\r\ntrain = read.table (trainFileName, header = FALSE, sep = \",\",\r\n                       strip.white = TRUE, col.names = colNames,\r\n                        na.strings = \"?\", stringsAsFactors = TRUE)<\/pre>\n<p>Dataset is read and stored as train data frame of 32561 rows and 15 columns. A high level summary of the data is below. All the variables have been read in their expected classes.<\/p>\n<pre class=\"lang:r decode:true \">str (train)<\/pre>\n<pre class=\"lang:r decode:true \">## 'data.frame':    32561 obs. of  15 variables:\r\n##  $ age          : int  39 50 38 53 28 37 49 52 31 42 ...\r\n##  $ workclass    : Factor w\/ 8 levels \"Federal-gov\",..: 7 6 4 4 4 4 4 6 4 4 ...\r\n##  $ fnlwgt       : int  77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...\r\n##  $ education    : Factor w\/ 16 levels \"10th\",\"11th\",..: 10 10 12 2 10 13 7 12 13 10 ...\r\n##  $ educationnum : int  13 13 9 7 13 14 5 9 14 13 ...\r\n##  $ maritalstatus: Factor w\/ 7 levels \"Divorced\",\"Married-AF-spouse\",..: 5 3 1 3 3 3 4 3 5 3 ...\r\n##  $ occupation   : Factor w\/ 14 levels \"Adm-clerical\",..: 1 4 6 6 10 4 8 4 10 4 ...\r\n##  $ relationship : Factor w\/ 6 levels \"Husband\",\"Not-in-family\",..: 2 1 2 1 6 6 2 1 2 1 ...\r\n##  $ race         : Factor w\/ 5 levels \"Amer-Indian-Eskimo\",..: 5 5 5 3 3 5 3 5 5 5 ...\r\n##  $ sex          : Factor w\/ 2 levels \"Female\",\"Male\": 2 2 2 2 1 1 1 2 1 2 ...\r\n##  $ capitalgain  : int  2174 0 0 0 0 0 0 0 14084 5178 ...\r\n##  $ capitalloss  : int  0 0 0 0 0 0 0 0 0 0 ...\r\n##  $ hoursperweek : int  40 13 40 
40 40 40 16 45 50 40 ...\r\n##  $ nativecountry: Factor w\/ 41 levels \"Cambodia\",\"Canada\",..: 39 39 39 39 5 39 23 39 39 39 ...\r\n##  $ incomelevel  : Factor w\/ 2 levels \"&lt;=50K\",\"&gt;50K\": 1 1 1 1 1 1 1 2 2 2 ...<\/pre>\n<h2>4. Cleaning the Data<\/h2>\n<p>The training data set is cleaned for missing or invalid data.<\/p>\n<p>About 7% (2399\/32561) of the rows in the dataset contain NAs. In most of these rows, the \u2018workclass\u2019 and \u2018occupation\u2019 variables are missing together; the remaining rows have the \u2018nativecountry\u2019 variable missing. We could handle the missing values by imputing the data. However, since \u2018workclass\u2019, \u2018occupation\u2019 and \u2018nativecountry\u2019 could potentially be very good predictors of income, imputing may simply skew the model.<\/p>\n<p>Also, since most of the rows with missing data, 2066\/2399 (~86%), pertain to the \u201c&lt;=50K\u201d incomelevel and the dataset is predominantly of \u201c&lt;=50K\u201d incomelevel, there will not be much information loss for the predictive model if we remove the rows with NAs.<\/p>\n<pre class=\"lang:r decode:true \">table (complete.cases (train))<\/pre>\n<pre class=\"lang:r decode:true \">## \r\n## FALSE  TRUE \r\n##  2399 30162<\/pre>\n<pre class=\"lang:r decode:true \"># Summarize all data sets with NAs only\r\nsummary  (train [!complete.cases(train),])<\/pre>\n<pre class=\"lang:r decode:true \">##       age                   workclass        fnlwgt      \r\n##  Min.   :17.00   Private         : 410   Min.   : 12285  \r\n##  1st Qu.:22.00   Self-emp-inc    :  42   1st Qu.:121804  \r\n##  Median :36.00   Self-emp-not-inc:  42   Median :177906  \r\n##  Mean   :40.39   Local-gov       :  26   Mean   :189584  \r\n##  3rd Qu.:58.00   State-gov       :  19   3rd Qu.:232669  \r\n##  Max.   :90.00   (Other)         :  24   Max.   
:981628  \r\n##                  NA's            :1836                   \r\n##         education    educationnum                 maritalstatus\r\n##  HS-grad     :661   Min.   : 1.00   Divorced             :229  \r\n##  Some-college:613   1st Qu.: 9.00   Married-AF-spouse    :  2  \r\n##  Bachelors   :311   Median :10.00   Married-civ-spouse   :911  \r\n##  11th        :127   Mean   : 9.57   Married-spouse-absent: 48  \r\n##  10th        :113   3rd Qu.:11.00   Never-married        :957  \r\n##  Masters     : 96   Max.   :16.00   Separated            : 86  \r\n##  (Other)     :478                   Widowed              :166  \r\n##            occupation           relationship                 race     \r\n##  Prof-specialty : 102   Husband       :730   Amer-Indian-Eskimo:  25  \r\n##  Other-service  :  83   Not-in-family :579   Asian-Pac-Islander: 144  \r\n##  Exec-managerial:  74   Other-relative: 92   Black             : 307  \r\n##  Craft-repair   :  69   Own-child     :602   Other             :  40  \r\n##  Sales          :  66   Unmarried     :234   White             :1883  \r\n##  (Other)        : 162   Wife          :162                            \r\n##  NA's           :1843                                                 \r\n##      sex        capitalgain       capitalloss       hoursperweek  \r\n##  Female: 989   Min.   :    0.0   Min.   :   0.00   Min.   : 1.00  \r\n##  Male  :1410   1st Qu.:    0.0   1st Qu.:   0.00   1st Qu.:25.00  \r\n##                Median :    0.0   Median :   0.00   Median :40.00  \r\n##                Mean   :  897.1   Mean   :  73.87   Mean   :34.23  \r\n##                3rd Qu.:    0.0   3rd Qu.:   0.00   3rd Qu.:40.00  \r\n##                Max.   :99999.0   Max.   :4356.00   Max.   
:99.00  \r\n##                                                                   \r\n##        nativecountry  incomelevel \r\n##  United-States:1666   &lt;=50K:2066  \r\n##  Mexico       :  33   &gt;50K : 333  \r\n##  Canada       :  14               \r\n##  Philippines  :  10               \r\n##  Germany      :   9               \r\n##  (Other)      :  84               \r\n##  NA's         : 583<\/pre>\n<pre class=\"lang:r decode:true \"># Distribution of the income level factor in the entire training data set.\r\ntable (train$incomelevel)<\/pre>\n<pre class=\"lang:r decode:true \">## \r\n## &lt;=50K  &gt;50K \r\n## 24720  7841<\/pre>\n<p>Rows with NAs are removed below:<\/p>\n<pre class=\"lang:r decode:true \">myCleanTrain = train [!is.na (train$workclass) &amp; !is.na (train$occupation), ]\r\nmyCleanTrain = myCleanTrain [!is.na (myCleanTrain$nativecountry), ]<\/pre>\n<p>The \u2018fnlwgt\u2019 final weight estimate refers to population totals derived from CPS by creating \u201cweighted tallies\u201d of any specified socio-economic characteristics of the population. This variable is removed from the training data set due to its diminished impact on income level.<\/p>\n<pre class=\"lang:default decode:true \">myCleanTrain$fnlwgt = NULL<\/pre>\n<p>The cleaned data set is now myCleanTrain.<\/p>\n<h2>5. Explore the Data<\/h2>\n<p>Each of the variables is explored for quirks, distribution, variance, and predictability.<\/p>\n<h3>5.1 Explore the Continuous Variables<\/h3>\n<p>Since the model of choice here is Boosting, which is non-parametric (it does not assume any statistical distribution for the data), we will not be transforming variables to address skewness. We will, however, try to understand the data to determine each variable\u2019s predictability.<\/p>\n<h4>5.1.1 Explore the Age variable<\/h4>\n<p>The Age variable has a wide range and variability. 
The distribution and mean are quite different for income level &lt;=50K and &gt;50K, implying that \u2018age\u2019 will be a good predictor of \u2018incomelevel\u2019.<\/p>\n<pre class=\"lang:r decode:true \">summary (myCleanTrain$age)<\/pre>\n<pre class=\"lang:r decode:true \">##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. \r\n##   17.00   28.00   37.00   38.44   47.00   90.00<\/pre>\n<pre class=\"lang:r decode:true \">boxplot (age ~ incomelevel, data = myCleanTrain, \r\n         main = \"Age distribution for different income levels\",\r\n         xlab = \"Income Levels\", ylab = \"Age\", col = \"salmon\")<\/pre>\n<p><img class=\"alignnone size-full wp-image-722\" src=\"http:\/\/blog.cloudxlab.com\/wp-content\/uploads\/2017\/09\/Age-distribution-for-diffrent-levels.png\" alt=\"Age distribution for different levels\" width=\"675\" height=\"500\" \/><\/p>\n<pre class=\"lang:r decode:true \">library (ggplot2)    # provides qplot\r\nlibrary (gridExtra)  # provides grid.arrange\r\n\r\n# Histograms of age for each income level, drawn on common axes\r\nincomeBelow50K = (myCleanTrain$incomelevel == \"&lt;=50K\")\r\nxlimit = c (min (myCleanTrain$age), max (myCleanTrain$age))\r\nylimit = c (0, 1600)\r\n\r\nhist1 = qplot (age, data = myCleanTrain[incomeBelow50K,], margins = TRUE, \r\n           binwidth = 2, xlim = xlimit, ylim = ylimit, colour = incomelevel)\r\n\r\nhist2 = qplot (age, data = myCleanTrain[!incomeBelow50K,], margins = TRUE, \r\n           binwidth = 2, xlim = xlimit, ylim = ylimit, colour = incomelevel)\r\n\r\ngrid.arrange (hist1, hist2, nrow = 2)<\/pre>\n<p>&nbsp;<\/p>\n<p><img class=\"alignnone size-full wp-image-723\" src=\"http:\/\/blog.cloudxlab.com\/wp-content\/uploads\/2017\/09\/Income-levels.png\" alt=\"Income levels\" width=\"675\" height=\"500\" \/><\/p>\n<h4>5.1.2 Explore the Years of Education Variable<\/h4>\n<p>The Years of Education variable has good variability. 
The statistics are quite different for income level &lt;=50K and &gt;50K, implying that \u2018educationnum\u2019 will be a good predictor of \u2018incomelevel\u2019.<\/p>\n<pre class=\"lang:r decode:true \">summary (myCleanTrain$educationnum)<\/pre>\n<pre class=\"lang:r decode:true\">##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. \r\n##    1.00    9.00   10.00   10.12   13.00   16.00<\/pre>\n<pre class=\"lang:r decode:true \">boxplot (educationnum ~ incomelevel, data = myCleanTrain, \r\n         main = \"Years of Education distribution for different income levels\",\r\n         xlab = \"Income Levels\", ylab = \"Years of Education\", col = \"blue\")<\/pre>\n<p><img class=\"alignnone size-full wp-image-725\" src=\"http:\/\/blog.cloudxlab.com\/wp-content\/uploads\/2017\/09\/Years-of-education-distribution-for-diffrent-income-levels.png\" alt=\"Years of education distribution for different income levels\" width=\"675\" height=\"500\" \/><\/p>\n<h4>5.1.3 Explore the Capital Gain and Capital Loss variables<\/h4>\n<p>The capital gain and capital loss variables show near-zero variance at both income levels, as the metrics and summaries below indicate (most values are zero). However, the means differ between the two income levels, so these variables can still be used for prediction.<\/p>\n<pre class=\"lang:r decode:true \">library (caret)  # provides nearZeroVar\r\nnearZeroVar (myCleanTrain[, c(\"capitalgain\", \"capitalloss\")], saveMetrics = TRUE)\r\n<\/pre>\n<pre class=\"lang:r decode:true \">##             freqRatio percentUnique zeroVar  nzv\r\n## capitalgain  81.97033     0.3912207   FALSE TRUE\r\n## capitalloss 148.11856     0.2983887   FALSE TRUE<\/pre>\n<pre class=\"lang:r decode:true \">summary (myCleanTrain[ myCleanTrain$incomelevel == \"&lt;=50K\", \r\n                       c(\"capitalgain\", \"capitalloss\")])<\/pre>\n<pre class=\"lang:r decode:true \">##   capitalgain       capitalloss     \r\n##  Min.   :    0.0   Min.   
:   0.00  \r\n##  1st Qu.:    0.0   1st Qu.:   0.00  \r\n##  Median :    0.0   Median :   0.00  \r\n##  Mean   :  148.9   Mean   :  53.45  \r\n##  3rd Qu.:    0.0   3rd Qu.:   0.00  \r\n##  Max.   :41310.0   Max.   :4356.00<\/pre>\n<pre class=\"lang:r decode:true\">summary (myCleanTrain[ myCleanTrain$incomelevel == \"&gt;50K\", \r\n                       c(\"capitalgain\", \"capitalloss\")])<\/pre>\n<pre class=\"lang:r decode:true\">##   capitalgain     capitalloss    \r\n##  Min.   :    0   Min.   :   0.0  \r\n##  1st Qu.:    0   1st Qu.:   0.0  \r\n##  Median :    0   Median :   0.0  \r\n##  Mean   : 3938   Mean   : 193.8  \r\n##  3rd Qu.:    0   3rd Qu.:   0.0  \r\n##  Max.   :99999   Max.   :3683.0<\/pre>\n<h4>5.1.4 Explore the Hours Per Week variable<\/h4>\n<p>The Hours Per Week variable has a good variability implying that \u2018hoursperweek\u2019 will be a good predictor of \u2018incomelevel\u2019.<\/p>\n<pre class=\"lang:r decode:true \">summary (myCleanTrain$hoursperweek)<\/pre>\n<pre class=\"lang:r decode:true \">##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
\r\n##    1.00   40.00   40.00   40.93   45.00   99.00<\/pre>\n<pre class=\"lang:default decode:true \">boxplot (hoursperweek ~ incomelevel, data = myCleanTrain, \r\n         main = \"Hours Per Week distribution for different income levels\",\r\n         xlab = \"Income Levels\", ylab = \"Hours Per Week\", col = \"salmon\")<\/pre>\n<p><img class=\"alignnone size-full wp-image-727\" src=\"http:\/\/blog.cloudxlab.com\/wp-content\/uploads\/2017\/09\/Hours-per-Week-Distribution-for-Diffrent-Income-Levels.png\" alt=\"Hours Per Week Distribution For Different Income Levels\" width=\"675\" height=\"500\" \/><\/p>\n<pre class=\"lang:r decode:true \">nearZeroVar (myCleanTrain[, \"hoursperweek\"], saveMetrics = TRUE)<\/pre>\n<pre class=\"lang:r decode:true \">##   freqRatio percentUnique zeroVar   nzv\r\n## 1  5.243194     0.3116504   FALSE FALSE<\/pre>\n<h4>5.1.5 Explore the correlation between continuous variables<\/h4>\n<p>The correlation matrix below shows only weak pairwise correlations between the continuous variables (the largest is about 0.15), so none of them is redundant with another.<\/p>\n<pre class=\"lang:r decode:true \">corMat = cor (myCleanTrain[, c(\"age\", \"educationnum\", \"capitalgain\", \"capitalloss\", \"hoursperweek\")])\r\ndiag (corMat) = 0 #Remove self correlations\r\ncorMat<\/pre>\n<pre class=\"lang:r decode:true \">##                     age educationnum capitalgain capitalloss hoursperweek\r\n## age          0.00000000   0.04352609  0.08015423  0.06016548   0.10159876\r\n## educationnum 0.04352609   0.00000000  0.12441600  0.07964641   0.15252207\r\n## capitalgain  0.08015423   0.12441600  0.00000000 -0.03222933   0.08043180\r\n## capitalloss  0.06016548   0.07964641 -0.03222933  0.00000000   0.05241705\r\n## hoursperweek 0.10159876   0.15252207  0.08043180  0.05241705   0.00000000<\/pre>\n<h3>5.2 Explore Categorical Variables<\/h3>\n<h4>5.2.1 Exploring the Sex variable<\/h4>\n<p>The sex variable is generally not treated as a good predictor, and the same is assumed for income level prediction here. 
This variable will not be used for the model.<\/p>\n<pre class=\"lang:r decode:true \">table (myCleanTrain[,c(\"sex\", \"incomelevel\")])<\/pre>\n<pre class=\"lang:r decode:true \">##         incomelevel\r\n## sex      &lt;=50K  &gt;50K\r\n##   Female  8670  1112\r\n##   Male   13984  6396<\/pre>\n<h4>5.2.2 Exploring the work class, occupation, marital status, relationship and education variables<\/h4>\n<p>The variables workclass, occupation, maritalstatus, relationship all show good predictability of the incomelevel variable.<\/p>\n<pre class=\"lang:r decode:true \">qplot (incomelevel, data = myCleanTrain, fill = workclass) + facet_grid (. ~ workclass)\r\n<\/pre>\n<p><img class=\"alignnone size-full wp-image-729\" src=\"http:\/\/blog.cloudxlab.com\/wp-content\/uploads\/2017\/09\/Income-level.png\" alt=\"Income level\" width=\"675\" height=\"500\" \/><\/p>\n<pre class=\"lang:r decode:true \">qplot (incomelevel, data = myCleanTrain, fill = occupation) + facet_grid (. ~ occupation)\r\n<\/pre>\n<p><img class=\"alignnone size-full wp-image-730\" src=\"http:\/\/blog.cloudxlab.com\/wp-content\/uploads\/2017\/09\/Income-level-occupation.png\" alt=\"Income level occupation\" width=\"675\" height=\"500\" \/><\/p>\n<pre class=\"lang:r decode:true \">qplot (incomelevel, data = myCleanTrain, fill = maritalstatus) + facet_grid (. ~ maritalstatus)\r\n<\/pre>\n<p><img class=\"alignnone size-full wp-image-731\" src=\"http:\/\/blog.cloudxlab.com\/wp-content\/uploads\/2017\/09\/Income-level-maritalstatus.png\" alt=\"Income level marital status\" width=\"675\" height=\"500\" \/><\/p>\n<pre class=\"lang:r decode:true \">qplot (incomelevel, data = myCleanTrain, fill = relationship) + facet_grid (. 
~ relationship)\r\n<\/pre>\n<p><img class=\"alignnone size-full wp-image-732\" src=\"http:\/\/blog.cloudxlab.com\/wp-content\/uploads\/2017\/09\/Income-level-relationship.png\" alt=\"Income level relationship\" width=\"675\" height=\"500\" \/><\/p>\n<p>The education variable, however, needs to be reordered and marked as an ordinal variable (ordered factor variable). The new ordinal variable also shows good predictability of incomelevel.<\/p>\n<pre class=\"lang:default decode:true \"># Modify the levels to be ordinal\r\nmyCleanTrain$education = ordered (myCleanTrain$education,\r\n    levels (myCleanTrain$education) [c(14, 4:7, 1:3, 12, 15, 8:9, 16, 10, 13, 11)])\r\n\r\nprint (levels (myCleanTrain$education))<\/pre>\n<pre class=\"lang:r decode:true \">##  [1] \"Preschool\"    \"1st-4th\"      \"5th-6th\"      \"7th-8th\"     \r\n##  [5] \"9th\"          \"10th\"         \"11th\"         \"12th\"        \r\n##  [9] \"HS-grad\"      \"Prof-school\"  \"Assoc-acdm\"   \"Assoc-voc\"   \r\n## [13] \"Some-college\" \"Bachelors\"    \"Masters\"      \"Doctorate\"<\/pre>\n<pre class=\"lang:r decode:true \">qplot (incomelevel, data = myCleanTrain, fill = education) + facet_grid (. ~ education)<\/pre>\n<p><img class=\"alignnone size-full wp-image-733\" src=\"http:\/\/blog.cloudxlab.com\/wp-content\/uploads\/2017\/09\/Income-level-education.png\" alt=\"Income level education\" width=\"675\" height=\"500\" \/><\/p>\n<h4>5.2.3 Exploring the nativecountry variable<\/h4>\n<p>Plotting the percentage of individuals with income above USD 50000 by nativecountry shows that nativecountry is a good predictor of incomelevel.<\/p>\n<p>The names of the countries are cleaned for display on the world map. 
The code for this cleaning is not shown, to keep the article concise.<\/p>\n<pre class=\"lang:r decode:true \">## 40 codes from your data successfully matched countries in the map\r\n## 0 codes from your data failed to match with a country code in the map\r\n##      failedCodes failedCountries\r\n## 204 codes from the map weren't represented in your data<\/pre>\n<p><img class=\"alignnone size-full wp-image-734\" src=\"http:\/\/blog.cloudxlab.com\/wp-content\/uploads\/2017\/09\/Percentage-of-Income-more-than-50k-Country-wise.png\" alt=\"Percentage of Income more than 50k Country wise\" width=\"614\" height=\"359\" \/><\/p>\n<h3>5.3 Building the Prediction Model<\/h3>\n<p>Finally, we come to building the prediction model. We will use all the remaining independent variables except the Sex variable to build a model that predicts whether the income level of an individual is greater than USD 50000 or not, using Census data.<\/p>\n<p>Since Census data typically contains many weak predictors, the <strong>Boosting algorithm is used for this classification modeling<\/strong>.<\/p>\n<p>I have also used Cross Validation (CV), where the training data is partitioned into a specific number of folds and a boosted model is built with each fold held out in turn. In caret, the held-out performance is used to choose the tuning parameters, and the final model is then refit on the full training data. 
This helps avoid overfitting the model to the training data.<\/p>\n<pre class=\"lang:r decode:true \">set.seed (32323)\r\ntrCtrl = trainControl (method = \"cv\", number = 10)\r\n\r\nboostFit = train (incomelevel ~ age + workclass + education + educationnum +\r\n                      maritalstatus + occupation + relationship +\r\n                      race + capitalgain + capitalloss + hoursperweek +\r\n                      nativecountry, trControl = trCtrl, \r\n                  method = \"gbm\", data = myCleanTrain, verbose = FALSE)<\/pre>\n<p>The confusion matrix below shows an in-sample overall accuracy of ~86%, a sensitivity of ~88% and a specificity of ~79%.<\/p>\n<p>This implies that the model classifies the income level correctly about 86% of the time. Note that confusionMatrix expects the predicted classes as its first argument and the observed classes as its second; the call below passes them in the opposite order, so the rows labeled Prediction hold the observed classes and the columns labeled Reference hold the predicted ones. Read that way, about 88% of the individuals predicted as \u201c&lt;=50K\u201d and about 79% of those predicted as \u201c&gt;50K\u201d actually belong to that income level.<\/p>\n<pre class=\"lang:r decode:true \">confusionMatrix (myCleanTrain$incomelevel, predict (boostFit, myCleanTrain))\r\n<\/pre>\n<pre class=\"lang:r decode:true \">## Confusion Matrix and Statistics\r\n## \r\n##           Reference\r\n## Prediction &lt;=50K  &gt;50K\r\n##      &lt;=50K 21415  1239\r\n##      &gt;50K   2900  4608\r\n##                                           \r\n##                Accuracy : 0.8628          \r\n##                  95% CI : (0.8588, 0.8666)\r\n##     No Information Rate : 0.8061          \r\n##     P-Value [Acc &gt; NIR] : &lt; 2.2e-16       \r\n##                                           \r\n##                   Kappa : 0.6037          \r\n##  Mcnemar's Test P-Value : &lt; 2.2e-16       \r\n##                                           \r\n##             Sensitivity : 0.8807          \r\n##             Specificity : 0.7881          \r\n##          Pos Pred Value : 0.9453          \r\n##          Neg Pred Value : 0.6137          \r\n##              Prevalence : 0.8061          \r\n##          
Detection Rate : 0.7100          \r\n##    Detection Prevalence : 0.7511          \r\n##       Balanced Accuracy : 0.8344          \r\n##                                           \r\n##        'Positive' Class : &lt;=50K           \r\n##<\/pre>\n<h3>5.4 Validating the Prediction Model<\/h3>\n<p>The prediction model is applied to the test data to validate its true performance. The test data is cleaned in the same way as the training data before applying the model.<\/p>\n<pre class=\"lang:r decode:true \">## \r\n## FALSE \r\n## 15060<\/pre>\n<p>The cleaning is not shown to keep the case study concise. The cleaned test dataset has 15060 rows and 14 columns with no missing data.<\/p>\n<p>The prediction model is applied to the test data. From the confusion matrix below, the out-of-sample performance measures are an overall accuracy of ~86%, a sensitivity of ~88% and a specificity of ~78%, which are quite similar to the in-sample measures.<\/p>\n<pre class=\"lang:r decode:true \">myCleanTest$predicted = predict (boostFit, myCleanTest)\r\nconfusionMatrix (myCleanTest$incomelevel, myCleanTest$predicted)<\/pre>\n<pre class=\"lang:r decode:true \">## Confusion Matrix and Statistics\r\n## \r\n##           Reference\r\n## Prediction &lt;=50K  &gt;50K\r\n##      &lt;=50K 10731   629\r\n##      &gt;50K   1446  2254\r\n##                                           \r\n##                Accuracy : 0.8622          \r\n##                  95% CI : (0.8566, 0.8677)\r\n##     No Information Rate : 0.8086          \r\n##     P-Value [Acc &gt; NIR] : &lt; 2.2e-16       \r\n##                                           \r\n##                   Kappa : 0.5984          \r\n##  Mcnemar's Test P-Value : &lt; 2.2e-16       \r\n##                                           \r\n##             Sensitivity : 0.8813          \r\n##             Specificity : 0.7818          \r\n##          Pos Pred Value : 0.9446          \r\n##          Neg Pred Value : 0.6092          \r\n##              Prevalence : 
0.8086          \r\n##          Detection Rate : 0.7125          \r\n##    Detection Prevalence : 0.7543          \r\n##       Balanced Accuracy : 0.8315          \r\n##                                           \r\n##        'Positive' Class : &lt;=50K           \r\n##<\/pre>\n<h2>6. Executive Summary<\/h2>\n<p>During a data analytics exercise, it is very important to understand how the built model has performed with respect to a baseline model. This helps the analyst understand whether the new model really adds any value.<\/p>\n<p>The baseline accuracy (here, the accuracy of selection by random chance, as there is no prior model) is 75% for income less than USD 50000 (sensitivity) and 25% for income more than USD 50000 (specificity), with an overall accuracy of 68% (refer to the skewed number of records for the two income levels in the cleaned test data).<\/p>\n<p>The prediction model built using the boosting algorithm can predict a less than USD 50000 income level with 88% accuracy (sensitivity) and a more than USD 50000 income level with 78% accuracy (specificity), with an overall accuracy of 86%.<\/p>\n<p>So the prediction model does perform better than the baseline model.<\/p>\n<p>The maps below show the prediction overall accuracy, sensitivity and specificity by native country (to keep the report concise, the computations for plotting the maps are not shown).<\/p>\n<pre class=\"lang:r decode:true \">## 39 codes from your data successfully matched countries in the map\r\n## 0 codes from your data failed to match with a country code in the map\r\n## 205 codes from the map weren't represented in your data<\/pre>\n<p><img class=\"alignnone size-full wp-image-734\" src=\"http:\/\/blog.cloudxlab.com\/wp-content\/uploads\/2017\/09\/Percentage-of-Income-more-than-50k-Country-wise.png\" alt=\"Percentage of Income more than 50k Country wise\" width=\"614\" height=\"359\" \/><\/p>\n<p><img class=\"alignnone size-full wp-image-737\" 
src=\"http:\/\/blog.cloudxlab.com\/wp-content\/uploads\/2017\/09\/Prediction-Sensitivity-Nativecountry-Wise-1.png\" alt=\"Prediction sensitivity native country-wise\" width=\"619\" height=\"411\" \/><\/p>\n<p><img class=\"alignnone size-full wp-image-738\" src=\"http:\/\/blog.cloudxlab.com\/wp-content\/uploads\/2017\/09\/Prediction-Specificity-Nativecountry-Wise.png\" alt=\"Prediction specificity native country-wise\" width=\"619\" height=\"411\" \/><\/p>\n<p>Hope this guide was helpful. Please feel free to leave your comments. Follow\u00a0<a href=\"https:\/\/twitter.com\/CloudxLab\" target=\"_blank\" rel=\"noopener\">CloudxLab on Twitter<\/a>\u00a0to get updates on new blogs and videos.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>1. Introduction In this data analytics case study, we will use the US census data to build a model to predict if the income of any individual in the US is greater than or less than USD 50000 based on the information available about that individual in the census data. 
The dataset used for the &hellip; <a href=\"https:\/\/cloudxlab.com\/blog\/predicting-income-level-case-study-r\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Predicting Income Level, An Analytics Casestudy in R&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[13],"tags":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v16.2 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Predicting Income Level, An Analytics Casestudy in R | CloudxLab Blog<\/title>\n<meta name=\"description\" content=\"In this case study, we will use the US census data to build a model to predict if the income level in the US is greater than or less than USD 5000\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/cloudxlab.com\/blog\/predicting-income-level-case-study-r\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Predicting Income Level, An Analytics Casestudy in R | CloudxLab Blog\" \/>\n<meta property=\"og:description\" content=\"In this case study, we will use the US census data to build a model to predict if the income level in the US is greater than or less than USD 5000\" \/>\n<meta property=\"og:url\" content=\"https:\/\/cloudxlab.com\/blog\/predicting-income-level-case-study-r\/\" \/>\n<meta property=\"og:site_name\" content=\"CloudxLab Blog\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/cloudxlab\" \/>\n<meta property=\"article:published_time\" content=\"2017-09-13T11:53:32+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2017-09-13T12:11:41+00:00\" \/>\n<meta property=\"og:image\" 
content=\"http:\/\/blog.cloudxlab.com\/wp-content\/uploads\/2017\/09\/Percentage-of-Income-more-than-50k-Country-wise.png\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@CloudxLab\" \/>\n<meta name=\"twitter:site\" content=\"@CloudxLab\" \/>\n<meta name=\"twitter:label1\" content=\"Est. reading time\">\n\t<meta name=\"twitter:data1\" content=\"12 minutes\">\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebSite\",\"@id\":\"https:\/\/cloudxlab.com\/blog\/#website\",\"url\":\"https:\/\/cloudxlab.com\/blog\/\",\"name\":\"CloudxLab Blog\",\"description\":\"Learn AI, Machine Learning, Deep Learning, Devops &amp; Big Data\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":\"https:\/\/cloudxlab.com\/blog\/?s={search_term_string}\",\"query-input\":\"required name=search_term_string\"}],\"inLanguage\":\"en-US\"},{\"@type\":\"ImageObject\",\"@id\":\"https:\/\/cloudxlab.com\/blog\/predicting-income-level-case-study-r\/#primaryimage\",\"inLanguage\":\"en-US\",\"url\":\"http:\/\/blog.cloudxlab.com\/wp-content\/uploads\/2017\/09\/Percentage-of-Income-more-than-50k-Country-wise.png\",\"contentUrl\":\"http:\/\/blog.cloudxlab.com\/wp-content\/uploads\/2017\/09\/Percentage-of-Income-more-than-50k-Country-wise.png\"},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/cloudxlab.com\/blog\/predicting-income-level-case-study-r\/#webpage\",\"url\":\"https:\/\/cloudxlab.com\/blog\/predicting-income-level-case-study-r\/\",\"name\":\"Predicting Income Level, An Analytics Casestudy in R | CloudxLab 
Blog\",\"isPartOf\":{\"@id\":\"https:\/\/cloudxlab.com\/blog\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/cloudxlab.com\/blog\/predicting-income-level-case-study-r\/#primaryimage\"},\"datePublished\":\"2017-09-13T11:53:32+00:00\",\"dateModified\":\"2017-09-13T12:11:41+00:00\",\"author\":{\"@id\":\"https:\/\/cloudxlab.com\/blog\/#\/schema\/person\/0efa3c54df68406de820ea466f002d3c\"},\"description\":\"In this case study, we will use the US census data to build a model to predict if the income level in the US is greater than or less than USD 5000\",\"breadcrumb\":{\"@id\":\"https:\/\/cloudxlab.com\/blog\/predicting-income-level-case-study-r\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/cloudxlab.com\/blog\/predicting-income-level-case-study-r\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/cloudxlab.com\/blog\/predicting-income-level-case-study-r\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"item\":{\"@type\":\"WebPage\",\"@id\":\"https:\/\/cloudxlab.com\/blog\/\",\"url\":\"https:\/\/cloudxlab.com\/blog\/\",\"name\":\"Home\"}},{\"@type\":\"ListItem\",\"position\":2,\"item\":{\"@id\":\"https:\/\/cloudxlab.com\/blog\/predicting-income-level-case-study-r\/#webpage\"}}]},{\"@type\":\"Person\",\"@id\":\"https:\/\/cloudxlab.com\/blog\/#\/schema\/person\/0efa3c54df68406de820ea466f002d3c\",\"name\":\"Abhinav Singh\",\"image\":{\"@type\":\"ImageObject\",\"@id\":\"https:\/\/cloudxlab.com\/blog\/#personlogo\",\"inLanguage\":\"en-US\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/fc74fe31169bf872f6ab11bbab621d53?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/fc74fe31169bf872f6ab11bbab621d53?s=96&d=mm&r=g\",\"caption\":\"Abhinav Singh\"},\"sameAs\":[\"https:\/\/cloudxlab.com\"]}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","_links":{"self":[{"href":"https:\/\/cloudxlab.com\/blog\/wp-json\/wp\/v2\/posts\/713"}],"collection":[{"href":"https:\/\/cloudxlab.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/cloudxlab.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/cloudxlab.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/cloudxlab.com\/blog\/wp-json\/wp\/v2\/comments?post=713"}],"version-history":[{"count":15,"href":"https:\/\/cloudxlab.com\/blog\/wp-json\/wp\/v2\/posts\/713\/revisions"}],"predecessor-version":[{"id":747,"href":"https:\/\/cloudxlab.com\/blog\/wp-json\/wp\/v2\/posts\/713\/revisions\/747"}],"wp:attachment":[{"href":"https:\/\/cloudxlab.com\/blog\/wp-json\/wp\/v2\/media?parent=713"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/cloudxlab.com\/blog\/wp-json\/wp\/v2\/categories?post=713"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/cloudxlab.com\/blog\/wp-json\/wp\/v2\/tags?post=713"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}