Machine learning has been called the sexiest job of the 21st century by many websites. Looking at the demand for machine learning professionals and the ever-growing pay packages for such roles, we can safely say it is indeed true. The job opportunities in machine learning are endless in both academic and corporate settings. So if you are one of the aspirants who would like to join the machine learning bandwagon, you’ve come to the right place.

In this post, we have collected the machine learning interview questions most frequently asked at startups and corporates. We also don’t want you to struggle with finding the answers, so we’ve tried to provide a simple explanation for each question.

## What is regularization and how is it used to solve the problem of overfitting?

In statistical models, overfitting is a very common problem, and regularization is one of the methods to solve it. Before I go further and write a plain definition of regularization, it is very important for you to understand the problem of overfitting.

Let’s take an example. Say you’ve been given a problem to predict the genre of music a person likes based on their age. You first try a linear regression model with age as the independent variable and music genre as the dependent one. Unfortunately, this model will mostly fail because it is far too simplistic.

You then naturally want to add more explanatory variables to make your model more interesting, so you go ahead and add the sex and the education of each individual in your dataset. Now, you measure its accuracy with a loss metric L(X, Y), where X is your design matrix and Y denotes the targets (music genre in your case). You find that the results are good but not very accurate.

So you go ahead and add more variables like marital status, location, and profession. Much to your surprise, you find that your model has poor prediction power. You have just experienced the problem of overfitting: your model sticks too closely to the training data and has likely learned the background noise. In other words, your model has high variance and low bias.

To overcome this problem, we use a technique called regularization. Basically, you penalize the loss function by adding a multiple of a norm of the weight vector w, such as the L1 (Lasso) norm. You then come up with the following equation:

L(X, Y) + λN(w), where λ is the regularization strength and N is either the L1 (Lasso) norm, the L2 (Ridge) norm, or any other norm.
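As a minimal sketch of what the penalized loss looks like in practice (the toy design matrix X, targets y, weights w, and λ below are made up purely for illustration):

```python
import numpy as np

# Toy design matrix X (3 samples, 2 features), targets y, candidate weights w.
X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.5]])
y = np.array([1.0, 2.0, 3.0])
w = np.array([0.9, 0.1])
lam = 0.5  # regularization strength λ

def squared_loss(X, y, w):
    """Plain (unregularized) loss L(X, Y): sum of squared residuals."""
    return np.sum((X @ w - y) ** 2)

# Penalized losses: L(X, Y) + λ * N(w)
l1_penalized = squared_loss(X, y, w) + lam * np.sum(np.abs(w))  # L1 / Lasso
l2_penalized = squared_loss(X, y, w) + lam * np.sum(w ** 2)     # L2 / Ridge
```

The penalty grows with the size of the weights, so minimizing the penalized loss pushes the model toward smaller (and, for L1, sparser) weights, which is what tames the variance of an overfit model.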

## Explain how a ROC curve works.

The Receiver Operating Characteristic (ROC) curve is a graphical representation of the performance of a binary classifier system at various discrimination thresholds.

Let’s unpack this definition step by step. First, we need to understand what the discrimination threshold is.

A binary classifier outputs the probability of an observation belonging to class 0 or 1, and you pick a discrimination threshold to turn that probability into a class label. For example, suppose your model needs to classify a tumor as cancerous or non-cancerous. If you set the threshold at 0.8, every tumor whose predicted probability is above 0.8 is classified as cancerous. Notice that the performance of the system varies as you change the threshold.

Now that you understand what the discrimination threshold is, let’s look at two more terms that are needed to understand how the ROC curve works.

### True Positive Rate:

This tells you what fraction of actual positives your model classifies as positive, i.e. TPR = TP / (TP + FN). It is also known as sensitivity or recall.

### False Positive Rate:

This tells you what fraction of actual negatives your model wrongly classifies as positive, i.e. FPR = FP / (FP + TN).

Now, to get the ROC curve, you plot the True Positive Rate against the False Positive Rate at various threshold settings.
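A minimal sketch of how points on the ROC curve are computed (the scores and labels below are made up for illustration):

```python
# Toy predicted probabilities and true labels (1 = positive class).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1,   1,   0,   1,   0,   1,   0,   0]

def tpr_fpr(scores, labels, threshold):
    """True/false positive rates at one discrimination threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    tn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 0)
    return tp / (tp + fn), fp / (fp + tn)

# Sweeping the threshold traces out the ROC curve as (FPR, TPR) points.
curve = [tpr_fpr(scores, labels, t) for t in (0.1, 0.5, 0.9)]
```

A very low threshold classifies everything as positive (TPR = FPR = 1), a very high one classifies almost nothing as positive, and the interesting trade-offs lie in between.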

## What is the MAP Hypothesis?

You need to understand Bayes’ theorem first. Bayes’ theorem gives a formula that calculates the conditional probability that an event will happen, given that another event has already happened.

The formula for calculating conditional probability is

P(h | d) = P(d | h) * P(h) / P(d)

Where

• P(h|d) is the probability of hypothesis h given the data d. This is also called the posterior probability
• P(d|h) is the probability of data d given that the hypothesis h was true.
• P(h) is the probability of hypothesis h being true. This is also called the prior probability of h.
• P(d) is the probability of the data.

Since P(d) is the same for every hypothesis, you can simply compare P(d|h) * P(h) across hypotheses. We calculate the posterior probability for the different hypotheses and select the hypothesis with the highest probability. What you get is the most probable hypothesis, also called the maximum a posteriori (MAP) hypothesis.
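A tiny numerical sketch of selecting the MAP hypothesis (the coin hypotheses, priors, and likelihoods below are invented for illustration):

```python
# Two hypotheses about a coin (fair vs biased) and one observation: "heads".
priors = {"fair": 0.7, "biased": 0.3}            # P(h)
likelihood_heads = {"fair": 0.5, "biased": 0.9}  # P(d | h)

# Posterior is proportional to P(d | h) * P(h); P(d) is the same for every h,
# so the MAP hypothesis maximizes the numerator alone.
unnormalised = {h: likelihood_heads[h] * priors[h] for h in priors}
evidence = sum(unnormalised.values())            # P(d)
posterior = {h: v / evidence for h, v in unnormalised.items()}

map_hypothesis = max(posterior, key=posterior.get)
```

Here the strong prior on "fair" outweighs the higher likelihood under "biased", so "fair" remains the MAP hypothesis after a single heads.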

## Give the difference between concordant and discordant pairs with an example.

Concordant and discordant pairs are calculated for ordinal variables, and they tell you whether two sets of scores agree or disagree. Please note that you must order your data and place it into pairs before calculating concordance or discordance.

Let’s look at an example to show the difference between the two.

Say you have the scores of 5 job applicants given by two interviewers.

| Candidate | Interviewer 1 | Interviewer 2 |
|-----------|---------------|---------------|
| A | 1 | 1 |
| B | 2 | 2 |
| C | 3 | 4 |
| D | 4 | 3 |
| E | 5 | 6 |

By and large, you are checking whether both interviewers ranked the candidates in the same or the opposite order. Consider candidates A and C: interviewer 1 ranks A above C (1 vs 3) and so does interviewer 2 (1 vs 4), so A and C form a concordant pair. Similarly, C and D form a discordant pair, because interviewer 1 ranks C above D (3 vs 4) while interviewer 2 ranks D above C (4 vs 3).
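Counting every pair programmatically makes the definition concrete; here is a small sketch using the scores from the table above (pure Python, no external libraries):

```python
from itertools import combinations

# Scores from the table: (interviewer 1, interviewer 2) per candidate.
scores = {"A": (1, 1), "B": (2, 2), "C": (3, 4), "D": (4, 3), "E": (5, 6)}

def count_pairs(scores):
    """Count concordant and discordant pairs over all candidate pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(scores.values(), 2):
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:       # both rankings agree on the order
            concordant += 1
        elif direction < 0:     # the rankings disagree
            discordant += 1
        # direction == 0 would be a tie; none occur in this table
    return concordant, discordant

concordant, discordant = count_pairs(scores)
```

For the 10 possible pairs in this table, 9 are concordant and only (C, D) is discordant.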

## Describe a situation when you would use logistic regression vs random forest vs SVM.

When there is a clear margin of separation between the classes and the feature space is high-dimensional, you should go with an SVM.

If model interpretability is not that important and your data contains outliers or non-linear relationships, you should go with a random forest, which handles both well.

When the dataset is small, the task is a two-class classification, and the features are roughly normally distributed within each class, you should go with logistic regression.

## What is Homoscedasticity and how is it different from Heteroscedasticity?

In linear regression, you must ensure that the data is homoscedastic, i.e. the variance of the errors is the same for all points. You can check for homoscedasticity by observing the spread of the points around the regression line: this spread should stay roughly constant across the data.

As a common rule of thumb, the data can be treated as homoscedastic if the ratio of the largest variance to the smallest variance is below 1.5.

But in reality, you often have to deal with heteroscedastic data, where the variance is not constant across the data points. In a scatter plot, heteroscedastic data has a cone shape that spreads out in one direction, either left to right or right to left.

One example of such data is the prediction of annual income from age. More often than not, people in their teens earn close to the minimum wage, so the variance of those data points is small. But the income gap widens with age: one fifty-year-old could be driving a Ferrari while another cannot even afford a car.
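A quick way to see heteroscedasticity in code: simulate income whose noise grows with age, and compare the residual spread between the younger and older halves of the sample (the age range, income trend, and noise model below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
age = rng.uniform(16, 60, size=500)

# Income whose noise scale grows with age: a classic heteroscedastic pattern.
trend = 20_000 + 1_000 * age                 # the "regression line"
income = trend + rng.normal(0, 200 * age)

# Residual spread in the young vs old halves of the data.
residuals = income - trend
young = residuals[age < 38]
old = residuals[age >= 38]

ratio = old.std() / young.std()  # well above the 1.5 rule of thumb here
```

With constant-variance (homoscedastic) noise, this ratio would hover near 1; here the cone shape makes the older half visibly noisier.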

## What is bagging?

Bagging is another term for bootstrap aggregation. To understand it better, you should first have a clear understanding of a statistical method called the bootstrap. Let’s see an example to make the concept easier to grasp.

Let’s say you want to estimate the mean of a sample of 100 values. Unless you are living under a rock, you already know that you can calculate the mean directly from the following formula:

Mean = Sum of all values / Total no. of values

But if your sample is small, your estimate of the mean will almost surely carry some error. What should you do to improve your estimate? This is where the bootstrap method comes into the picture. Below are the steps of the bootstrap method:

• Create many sub-samples of your dataset by sampling with replacement. For example, if your dataset is (1, 2, 3, 4, 5), one sub-sample could be (1, 3, 3, 4, 5). Note that 3 appears twice, because each drawn value is put back (“replaced”) before the next draw.
• Calculate the mean of each sub-sample.
• Average the means of all the sub-samples.
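The steps above can be sketched in a few lines (the sample itself is randomly generated here purely for illustration):

```python
import random

random.seed(42)
data = [random.gauss(50, 10) for _ in range(100)]  # a sample of 100 values

def bootstrap_mean(data, n_resamples=1000):
    """Average the means of many resamples drawn with replacement."""
    means = []
    for _ in range(n_resamples):
        resample = random.choices(data, k=len(data))  # with replacement
        means.append(sum(resample) / len(resample))
    return sum(means) / len(means)

estimate = bootstrap_mean(data)
plain_mean = sum(data) / len(data)
```

For the mean itself the bootstrap estimate lands very close to the plain sample mean; the real payoff of the resamples is that their spread also gives you an error estimate for free.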

Now, let’s come back to bagging. Bagging is the application of the bootstrap method to machine learning algorithms: you train one model, such as a decision tree, on each bootstrap sub-sample and aggregate their predictions by voting or averaging. Random Forest is the best-known algorithm built on bagging.

## What are the different ways of model assessment?

• Complexity parameter (e.g., the cost-complexity parameter used to prune decision trees)
• AIC (Akaike Information Criterion)
• BIC (Bayesian Information Criterion)
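As a small illustration of the two information criteria (the log-likelihoods and parameter counts below are made-up numbers):

```python
import math

def aic(log_likelihood, k):
    """Akaike Information Criterion: 2k - 2 ln(L). Lower is better."""
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood, k, n):
    """Bayesian Information Criterion: k ln(n) - 2 ln(L). Lower is better."""
    return k * math.log(n) - 2 * log_likelihood

# Two hypothetical fitted models on n = 100 points:
# model 1 has 3 parameters and log-likelihood -120;
# model 2 fits slightly better (-118) but uses 6 parameters.
model1 = (aic(-120, 3), bic(-120, 3, 100))
model2 = (aic(-118, 6), bic(-118, 6, 100))
```

Both criteria trade goodness of fit against model complexity; here the extra parameters of model 2 do not buy enough likelihood, so model 1 wins under both AIC and BIC.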