Top Machine Learning Interview Questions for 2018 (Part-1)


These machine learning interview questions are real questions that are asked in top interviews.

For hiring machine learning engineers or data scientists, the typical process has multiple rounds.

  1. A basic screening round – The objective of this round is to check the candidate’s minimum fitness for the role.
  2. Algorithm design round – Some companies have this round but most don’t. It checks the coding and algorithmic skills of the interviewee.
  3. ML case study – In this round, you are given a machine learning case study problem along the lines of a Kaggle competition. You have to solve it in an hour.
  4. Bar raiser / hiring manager – This interview is generally with the most senior person in the team or a very senior person from another team (at Amazon it is called the Bar Raiser round), who checks whether the candidate meets the company-wide technical bar. This is generally the last round.

A typical first round of interview consists of three parts. First, a brief intro about yourself. 

Second, a brief about your relevant projects.

A typical interviewer will start by asking about the relevant work from your profile. For a past machine learning project, the interviewer might ask how you would improve it.

Say you have done a project on recommendation; the interviewer might ask:

  • How would you improve the recommendations?
  • How would you do ranking?
  • Have you done any end-to-end machine learning project? If yes, what were the challenges you faced? How would you solve the cold start problem?
  • How would you improve upon the speed of recommendation?

Afterwards (third part), the interviewer would proceed to check your basic knowledge of machine learning on the following lines.

Q1: What is machine learning?

Machine learning is the field of study that gives computers the ability to learn and improve from experience without being explicitly taught or programmed.

In traditional programs, the rules are coded for a program to make decisions, but in machine learning, the program learns based on the data to make decisions.

Q2. Why do we need machine learning?

The most intuitive and prominent example is self-driving cars, but let’s answer this question in a more structured way. Machine learning is needed to solve problems that fall into the categories below:

  • Problems for which a traditional solution requires a long and complex set of rules and frequent hand-tuning. An example of such a problem is an email spam filter. You notice a few words such as 4U, promotion, credit card, free, amazing etc. and figure out that the email is spam. This list can be really long and can change once the spammer notices that you have started filtering on these words. It becomes hard to deal with this problem using a traditional programming approach. A machine learning algorithm learns to detect spam emails from examples and works better.
  • Complex problems for which there is no good solution at all using the traditional approach. Speech recognition is an example of this category of problems.
    Machine learning algorithms can find a good solution to these problems.
  • Fluctuating environments: a machine learning system can adapt to new data and learn to do well on it.
  • Getting insights from large amounts of complex data. For example, your business collects a large amount of data from its customers. A machine learning algorithm can find insights in this data which are otherwise not easy to figure out.

For more details visit Machine Learning Specialization

Q3. What is the difference between the supervised and unsupervised learning? Give examples of both.

By definition, the supervised and unsupervised learning algorithms are categorized based on the supervision required while training. Supervised learning algorithms work on data which are labelled i.e. the data has a desired solution or a label.

On the other hand, unsupervised learning algorithms work on unlabeled data, meaning that the data does not contain the desired solution for the algorithm to learn from.

Supervised algorithm examples:

  • Linear Regression
  • Neural Networks/Deep Learning
  • Decision Trees
  • Support Vector Machine (SVM)
  • K-Nearest neighbours

Unsupervised algorithm examples:

  • Clustering Algorithms – K-means, Hierarchical Clustering Analysis (HCA)
  • Visualization and dimensionality reduction – Principal Component Analysis (PCA)
  • Association rule learning

Q4. Is recommendation supervised or unsupervised learning?

Recommendation algorithms are interesting because some of them are supervised and some are unsupervised. Recommendations based on your profile, previous purchases and page views fall under supervised learning. But there are also recommendations based on hot-selling products or on country/location, which fall under unsupervised learning.

Q5. Explain PCA?

PCA stands for Principal Component Analysis. PCA is a procedure to reduce the dimensionality of data that consists of many variables which are correlated with each other, heavily or lightly, while retaining as much of the variation in the data as possible. The data on which PCA is applied has to be scaled, because the result of PCA is sensitive to the relative scaling of the variables.

Figure: PCA – Hyperplanes

For example, say you have a dataset in 2D space and you need to choose a hyperplane onto which to project the dataset. The hyperplane must be chosen such that as much of the variance as possible is preserved. In the figure, when converting from one representation to the other (left to right), the hyperplane C1 (solid line) preserves the maximum variance in the dataset, while C2 (dotted line) preserves very little of it.
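As a rough illustration of the idea, the sketch below finds the direction of maximum variance for a handful of 2D points using only the Python standard library. The function name `first_principal_component` and the sample points are made up for this example; in practice you would normally rely on a library implementation.

```python
import math

def first_principal_component(points):
    """Return (unit direction, share of variance kept) for a list of 2D points."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    # Entries of the 2x2 sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((p[0] - mean_x) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - mean_y) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / (n - 1)
    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    half_trace = (sxx + syy) / 2
    root = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    largest, smallest = half_trace + root, half_trace - root
    # Angle of the eigenvector that belongs to the largest eigenvalue.
    angle = 0.5 * math.atan2(2 * sxy, sxx - syy)
    direction = (math.cos(angle), math.sin(angle))
    return direction, largest / (largest + smallest)

# Hypothetical points that lie close to the line y = x.
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.1)]
direction, variance_kept = first_principal_component(points)
print(direction, variance_kept)
```

Projecting the points onto `direction` keeps almost all of the variance here, which is what the solid line in the figure is meant to convey.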



Q6. Which supervised learning algorithms do you know?

Answer this question according to your own comfort level with the algorithms. There are many supervised learning algorithms, such as regression, decision trees, neural networks, SVM etc. Of these, the most popular and simplest supervised learning algorithm is linear regression. Let me explain it quickly.

Say we need to predict the income of residents of a county based on some historical data. Linear regression can be used for this problem.
The linear regression model is a linear function of the input features, with weights that define the model and a bias term:

$$\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$$

In this equation $\hat{y}$ is the predicted outcome, $x_i$ are the inputs and $\theta_i$ are the model parameters or weights. $\theta_0$ is the bias.
The performance of this model is measured by evaluating the Root Mean Square Error (RMSE). In practice, the Mean Square Error (MSE) is minimized instead, since the parameter values that minimize the MSE also minimize the RMSE. For $m$ training instances with predictions $\hat{y}^{(j)}$ and targets $y^{(j)}$, the MSE is:

$$\mathrm{MSE}(\theta) = \frac{1}{m} \sum_{j=1}^{m} \left( \hat{y}^{(j)} - y^{(j)} \right)^2$$
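The linear model and its mean squared error can be sketched in a few lines of plain Python for the single-feature case. The helper names `fit_line` and `mse`, the feature (years of experience) and the numbers are all invented for illustration; they are not taken from any real county data.

```python
import math

def fit_line(xs, ys):
    """Least-squares fit of y_hat = theta_0 + theta_1 * x for one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    theta_1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
               / sum((x - mean_x) ** 2 for x in xs))
    theta_0 = mean_y - theta_1 * mean_x  # bias term
    return theta_0, theta_1

def mse(xs, ys, theta_0, theta_1):
    """Mean squared error of the fitted line on (xs, ys)."""
    return sum((theta_0 + theta_1 * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Made-up data: years of experience vs. income in thousands.
years = [1, 2, 3, 4, 5]
income = [30, 35, 41, 44, 50]

theta_0, theta_1 = fit_line(years, income)
error = mse(years, income, theta_0, theta_1)
print("bias:", theta_0, "weight:", theta_1)
print("MSE:", error, "RMSE:", math.sqrt(error))
```

With more than one feature the same idea applies, but the weights are usually found with a linear algebra routine or gradient descent rather than this closed form.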

Q7. Can you compare Decision Trees and linear regression? Can decision trees be used for non-linear classification?

Decision trees are mainly a supervised learning technique, and they are used for classification as well as regression. In a decision tree, we form the tree by splitting nodes. Initially, all of the instances are divided into two parts based on a boundary such that the instances on either side of the boundary are very similar to the other instances on the same side: the instances on the left-hand side should be very similar to each other, and the same is true for the right-hand side. Unlike linear regression, which fits a single linear function of the features, a tree combines many such threshold splits, so it can also be used for non-linear classification.
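To make the splitting step concrete, here is a minimal sketch of how one split might be chosen for a single numeric feature. It assumes a regression setting and uses the weighted variance of the targets on each side as the similarity measure; real decision tree libraries support other criteria as well. The names `variance` and `best_split` and the data are hypothetical.

```python
def variance(values):
    """Population variance of a non-empty list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def best_split(xs, ys):
    """Return the threshold on x that makes the y values on each side most similar."""
    pairs = sorted(zip(xs, ys))
    best_threshold, best_cost = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary can separate identical feature values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for x, y in pairs if x <= threshold]
        right = [y for x, y in pairs if x > threshold]
        # Weighted average of the variance on the two sides of the boundary.
        cost = (len(left) * variance(left) + len(right) * variance(right)) / len(pairs)
        if cost < best_cost:
            best_threshold, best_cost = threshold, cost
    return best_threshold

xs = [1, 2, 3, 10, 11, 12]
ys = [5, 6, 5, 20, 21, 19]
print(best_split(xs, ys))
```

A full tree repeats this search on each side of the chosen boundary until a stopping rule such as a maximum depth is reached.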

The figure below shows decision trees of max depth 2 and max depth 3; you can see that as the max depth of the decision tree increases, you get better coverage of the available data.

Figure: Decision Trees with Different Depths

One more aspect of decision trees worth highlighting is their stability. Decision trees are sensitive to rotations of the dataset. The picture below demonstrates the instability of a decision tree when the data is rotated.

Figure: Decision Tree Sensitivity with Rotation of Data


Q8. Explain overfitting and underfitting? What causes overfitting?

Say there are two kids, Jack and Jill, in a maths exam. Jack only learnt addition, and Jill memorized the questions and their answers from the maths book. Now, who will succeed in the exam? The answer is neither. In machine learning lingo, Jack is underfitting and Jill is overfitting.

Overfitting is the failure of the algorithm to generalize to new examples which are not in the training set, while it works very well on the training set data, just as Jill can answer the questions which are in the book but nothing besides them. Underfitting, on the other hand, refers to a model that does not capture the underlying trend of the data (training data as well as test data). The remedy, in general, is to choose a better (more complex) machine learning algorithm.

So, underfitting models are the ones that give bad performance on both training and test data. Overfitting is very important to keep a tab on while developing machine learning algorithms. This is because, by intuition, if the model fits the training set very well, developers tend to think that the algorithm is working well, sometimes failing to account for overfitting. Overfitting occurs when the model is too complex relative to the amount and noisiness of the training data. It also means that the algorithm does not work well on test data, sometimes because the test data does not come from the same distribution as the training data. Below are some of the ways to avoid overfitting:

  • Simplify the model or constrain it with regularization (the amount of regularization is controlled by a hyperparameter)
  • Gather more training data
  • Reduce the noise in the training data

Below are some of the ways to avoid underfitting:

  • Selecting a more powerful model
  • Feeding better features to the learning algorithm
  • Reducing the constraints on the model (reduce regularization hyperparameter)

Q9. What is cross-validation technique?

Let’s first understand what a validation set is, and then we will go to cross-validation. When building the model, the training set is used to tune the weights (for a neural network, by means of backpropagation). These weights are chosen such that the training error is minimum.

Now you need data to evaluate the model and the hyperparameters, and this data cannot be the same as the training set data. Hence a portion of the training set data is reserved for validation and is called the validation set. When comparing different models, keeping a separate validation set wastes too much training data, so the cross-validation technique is used instead. In cross-validation, the training data is divided into complementary subsets, and each model is trained on different combinations of these subsets and validated on the remaining ones.

Then finally the best model is tested with test data.
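One common way to build those complementary subsets is k-fold cross-validation. The sketch below only produces the index splits; the function name `k_fold_indices` is made up, and the choice of contiguous, unshuffled folds is a simplification for readability.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so that sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

folds = list(k_fold_indices(10, 3))
for train, validation in folds:
    print("train:", train, "validate:", validation)
```

Each sample is used for validation exactly once, and the validation error of a model is typically averaged over the k folds before models are compared.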

Q10. How would you detect overfitting and underfitting?

This is one of the most important questions in practical machine learning. To answer it, let’s understand the concepts of bias and variance.

In order to conclude whether the algorithm is overfitting or underfitting, you need to find out the training set error (E_train) and the cross-validation set error (E_cv). If E_train is high and E_cv is in the same range, i.e. both E_train and E_cv are high, this is the case of high bias and the algorithm is underfitting. If, on the other hand, E_train is low but E_cv is high, this is the case of high variance and the algorithm is overfitting.
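The rule of thumb above can be written down as a small helper. This is only a sketch: the function name `diagnose_fit` and both thresholds are arbitrary assumptions, and what counts as a "high" error depends entirely on the problem.

```python
def diagnose_fit(e_train, e_cv, acceptable_error=0.1, gap_tolerance=0.05):
    """Classify a model from its training and cross-validation errors.

    acceptable_error and gap_tolerance are illustrative, problem-specific values.
    """
    if e_train > acceptable_error:
        # The model cannot even fit the data it was trained on.
        return "underfitting (high bias)"
    if e_cv - e_train > gap_tolerance:
        # Low training error but a much higher validation error.
        return "overfitting (high variance)"
    return "reasonable fit"

print(diagnose_fit(e_train=0.30, e_cv=0.32))
print(diagnose_fit(e_train=0.02, e_cv=0.25))
print(diagnose_fit(e_train=0.05, e_cv=0.07))
```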

Q11. What’s the trade-off between bias and variance?

Figure: Bias vs Variance

In simple terms, a very simple algorithm (which does not capture the underlying details of the data) underfits and has high bias, while a very complex algorithm overfits and has high variance. There has to be a balance between the two. The figure depicts how they are related in terms of the trade-off between them.

Q12. How would you overcome overfitting in the algorithms that you mentioned above?

As mentioned above, the ways to overcome overfitting are as below:

  • Simplify the model or constrain it with regularization (the amount of regularization is controlled by a hyperparameter)
  • Gather more training data
  • Reduce the noise in the training data

Q13. There is a colleague who claims to have achieved 99.99% accuracy in the classifier that he has built? Would you believe him? If not, what could be the prime suspects? How would you solve it?

99.99% is a very high accuracy in general, and should be treated with suspicion. At the very least, the data set and any flaw in modelling the solution around it should be checked thoroughly. My prime suspects would be the data set and the problem statement. For example, take a set of handwritten characters with digits from 0 to 9, and a model built to detect whether a digit is 5 or not 5. A faulty model that never recognizes a digit as 5 (say, one that always answers 8) will still give 90% accuracy, given that all digits have an equal number of images in the data set.
In this case, the data set is too imbalanced for accuracy to be a meaningful measure for the problem of detecting 5 or not 5. Looking at precision and recall instead of accuracy alone would expose such a model.
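The arithmetic behind the 90% figure is easy to check. The sketch below assumes 100 images per digit (an arbitrary number chosen for illustration) and a classifier that never predicts 5.

```python
# 100 hypothetical images for each digit 0..9.
digits = [digit for digit in range(10) for _ in range(100)]
is_five = [d == 5 for d in digits]

# A faulty "5 or not 5" detector that always answers "not 5".
predictions = [False] * len(digits)

accuracy = sum(p == t for p, t in zip(predictions, is_five)) / len(digits)
# Recall: the share of actual fives that the detector finds.
recall = sum(p and t for p, t in zip(predictions, is_five)) / sum(is_five)

print("accuracy:", accuracy)
print("recall:", recall)
```

The accuracy looks respectable while the recall shows that not a single 5 is detected, which is why accuracy alone is a poor measure on skewed data.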

Q14. Explain how a ROC curve works?

Figure: ROC Curve

ROC stands for Receiver Operating Characteristic. The ROC curve plots the true positive rate against the false positive rate at different classification thresholds and is used to compare the performance of different classifiers. The area under the curve (AUC) summarizes this performance: the larger the area, the better the model.
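For a small set of scored examples, the curve and its area can be computed by hand. The following is a minimal sketch with made-up labels and scores; the helper names `roc_points` and `auc` are illustrative, and the area uses the trapezoidal rule.

```python
def roc_points(labels, scores):
    """Return ROC points (false positive rate, true positive rate), one per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    # Lower the threshold step by step; everything scored >= threshold is predicted positive.
    for threshold in sorted(set(scores), reverse=True):
        predicted = [s >= threshold for s in scores]
        tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
        fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

labels = [1, 1, 0, 1, 0, 0]               # hypothetical ground truth
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]   # hypothetical classifier scores

points = roc_points(labels, scores)
area = auc(points)
print(points)
print(area)
```

A classifier that ranks every positive above every negative would reach an area of 1.0, while random scoring gives about 0.5.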

Q15. Explain the ensemble methods? What is the basic principle?

Say you ask a question to thousands of people and then aggregate their answers; many times this aggregated answer is better than an expert’s answer. Ensemble methods combine the predictions of different predictors, such as classifiers or regressors, to achieve higher accuracy. The aggregate prediction is often better than that of the best individual predictor. Such a group of predictors is called an ensemble, and the technique is called ensemble learning.
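A tiny majority-vote example shows the principle. The three "models" below are just hand-written prediction lists, constructed so that each one is wrong on different examples; that independence of errors is an assumption of this sketch, and the benefit shrinks when models make the same mistakes.

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """Combine several prediction lists by taking the most common label per example."""
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*predictions_per_model)]

def accuracy(predictions, truth):
    return sum(p == t for p, t in zip(predictions, truth)) / len(truth)

truth   = [1, 0, 1, 1, 0, 1, 0, 0]
model_a = [1, 0, 1, 1, 0, 0, 1, 0]  # wrong on examples 5 and 6
model_b = [1, 1, 1, 0, 0, 1, 0, 0]  # wrong on examples 1 and 3
model_c = [0, 0, 1, 1, 1, 1, 0, 0]  # wrong on examples 0 and 4

ensemble = majority_vote([model_a, model_b, model_c])
for name, preds in [("a", model_a), ("b", model_b), ("c", model_c), ("ensemble", ensemble)]:
    print(name, accuracy(preds, truth))
```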

Q16. Say, you have a dataset having city id as the feature, what would you do?

When you collect the data for your machine learning project, you need to carefully select the features from the data collected. City id is just a serial number which does not represent any property of the city unless otherwise stated, so I would just drop city id from the features list.

Q17. In a dataset, there is a feature hour_of_the_day which goes from 0 to 23. Do you think it is okay?

This feature cannot be used as it is. hour_of_the_day may carry useful information for your problem, but there is a flaw in using the raw value. Consider 0 and 23: these two numbers have a large numeric difference, but they are in fact close together in the day, hence the algorithm may not produce the desired results. There are two ways to solve this. The first is to apply a sine function with a period of 24 (hours in a day), usually together with the matching cosine so that every hour gets a distinct value; this turns the discontinuous feature into a continuous one.

The second approach is to divide the hours of the day into categories such as morning, afternoon, evening, night etc. or in the split of peak hours and non-peak hours based on your knowledge of the domain of your problem.
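The first approach can be sketched as follows. Pairing the sine with a cosine is an addition made here, because the sine alone maps different hours (for example 3 and 9) to the same value; the helper names `encode_hour` and `hour_distance` are hypothetical.

```python
import math

def encode_hour(hour):
    """Map an hour in 0..23 onto a circle so that 23 and 0 end up next to each other."""
    angle = 2 * math.pi * hour / 24
    return math.sin(angle), math.cos(angle)

def hour_distance(a, b):
    """Euclidean distance between two encoded hours."""
    (sin_a, cos_a), (sin_b, cos_b) = encode_hour(a), encode_hour(b)
    return math.hypot(sin_a - sin_b, cos_a - cos_b)

print(hour_distance(23, 0))   # neighbouring hours across midnight
print(hour_distance(0, 1))    # neighbouring hours after midnight
print(hour_distance(0, 12))   # opposite sides of the day
```

In the encoded space, 23 and 0 are exactly as close as 0 and 1, which is the property the raw 0 to 23 value lacks.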

Q18. If you have a small dataset, how would you handle it?

There are multiple ways to deal with this problem. Below are a few techniques.

  1. Data augmentation
  2. Pretrained Models
  3. Better algorithm
  4. Get started with generating the data
  5. Download from internet


To learn more in detail, join the course on machine learning.


Phrase matching using Apache Spark

Recently, a friend whose company is working on a large-scale project reached out to us to seek a solution to a seemingly simple problem: finding a list of phrases (approximately 80,000) in a huge set of rich text documents (approximately 6 million).

The problem at first looked simple. The engineers had solved it by simply loading the two datasets into Apache Spark DataFrames and joining them using “like”. Something along these lines:

select * from phrases, docs where docs.txt like concat('%', phrases.phrase, '%')

But it was taking a huge amount of time even on a small subset of the data, even though the processing was done in a distributed fashion. Any guesses why?

They had also tried to use Apache Spark’s broadcast mechanism on the smaller dataset, but it was still taking a long while to finish even a small task.

So, how did we finally solve it? Here is one of my approaches. Please feel free to provide your input.

We first brought together the phrase and documents where there is at least one match.  Then we grouped the data based on the pair of phrase id and document id. And finally, we filtered the results based on whether all of the words in the phrase are found in the document or not and in the same order.
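The three steps can be illustrated with a small, single-machine sketch in plain Python; it is not the Spark code from the project. It assumes whitespace tokenisation, case-insensitive matching, and that "in the same order" means the phrase words appear next to each other. The function name `find_phrases` and the sample data are made up.

```python
from collections import defaultdict

def find_phrases(phrases, docs):
    """phrases and docs map id -> text. Returns a set of (phrase_id, doc_id) matches."""
    # Step 1: index document ids by word; the word plays the role of the join key.
    docs_by_word = defaultdict(set)
    doc_tokens = {}
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        doc_tokens[doc_id] = tokens
        for token in tokens:
            docs_by_word[token].add(doc_id)

    matches = set()
    for phrase_id, phrase in phrases.items():
        words = phrase.lower().split()
        if not words:
            continue
        # Step 2: group by (phrase, document) and keep documents containing every word.
        candidates = set.intersection(*(docs_by_word.get(w, set()) for w in words))
        # Step 3: keep only documents where the words appear together and in order.
        for doc_id in candidates:
            tokens = doc_tokens[doc_id]
            if any(tokens[i:i + len(words)] == words
                   for i in range(len(tokens) - len(words) + 1)):
                matches.add((phrase_id, doc_id))
    return matches

phrases = {1: "machine learning", 2: "apache spark", 3: "learning machine"}
docs = {10: "Machine learning with Apache Spark", 20: "learning about the machine"}
print(sorted(find_phrases(phrases, docs)))
```

The expensive order check in step 3 only runs on the few (phrase, document) pairs that survive the word-level join, instead of on every possible pair.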

You can take a look at the project here. The Scala version is not yet finished, though the Python version is done.

You may be wondering whether this really makes it faster, and if so, why.

Say you have m phrases and n documents, where each phrase has w words and each document has k words.

With the “like” join, the total complexity is of the order of m*w * n * k, because each word of every phrase is compared with each word of every document.

The complexity of our approach is not as straightforward to compute. Let me try.

First, it is going to sort the data. The total number of words is m*w + n*k. Let’s call it W:

W = m*w + n*k

The complexity of sorting it is: W log W

Then we sort the data based on (phrase id, document id). If every phrase were found in every document, there would be a total of m * n records to be sorted:

m*n log (m*n)

In practice the number of records is going to be far smaller and can be approximated by n.

So the final sort will take approximately: n * log(n)

We can safely ignore the other processing steps as those are linear. In the worst case, the overall complexity or time consumption is of the order of:

(m*w + n*k) log(m*w + n*k)  +  m*n log (m*n)

This is way better than m*w * n * k whenever w * k is larger than log(m*n), as it is for documents of any reasonable length.
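Plugging in numbers gives a feel for the gap. The phrase and document counts below come from the problem statement, while the words per phrase (`w`) and words per document (`k`) are pure assumptions picked for illustration; the results are operation counts up to constant factors, not run times.

```python
import math

m, n = 80_000, 6_000_000   # phrases and documents, as described in the post
w, k = 3, 1_000            # assumed average words per phrase / per document

brute_force = m * w * n * k                    # every phrase word against every document word

total_words = m * w + n * k
sort_words = total_words * math.log2(total_words)
sort_pairs_worst = m * n * math.log2(m * n)    # every phrase found in every document
sort_pairs_typical = n * math.log2(n)          # number of matching pairs approximated by n

worst_case = sort_words + sort_pairs_worst
typical_case = sort_words + sort_pairs_typical

print(f"like-join:     {brute_force:.2e}")
print(f"worst case:    {worst_case:.2e}")
print(f"typical case:  {typical_case:.2e}")
```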

I hope you find it useful.


How To Optimise A Neural Network?

When we are solving an industry problem involving neural networks, very often we end up with bad performance. Here are some suggestions on what should be done in order to improve the performance.

Is your model underfitting or overfitting?

You must break down the input data set into two parts – training and test. The general practice is to have 80% for training and 20% for testing.

You should train your neural network with the training set and test with the testing set. This sounds like common sense but we often skip it.

Compare the performance (MSE in case of regression and accuracy/f1/recall/precision in case of classification) of your model with the training set and with the test set.

If it performs badly on both the training and test sets, it is underfitting; if it performs well on the training set but not on the test set, it is overfitting.
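The 80/20 split itself takes only a few lines. This is a minimal standard-library sketch; the function name `train_test_split` mirrors a common convention but is defined here, and the fixed seed is only there to make the example repeatable.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows and return (train_rows, test_rows)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(100))   # stand-in for 100 labelled examples
train, test = train_test_split(rows)
print(len(train), len(test))
```

The model is then fitted on `train` only, and the same metric is computed on both `train` and `test` for the comparison described above.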

In case of Underfitting

If the performance on the test set is continuously improving over the iterations or epochs, it means you need to increase the number of iterations/epochs. If training is taking too much time, you may want to use GPUs. You can also try an optimizer such as Adam instead of plain gradient descent.

If the performance isn’t improving, it means you have a true case of underfitting. In such cases, there are three possibilities:

  1. Insufficient data
  2. No correlation in data – random data
  3. You need a better model

If the data is insufficient, you can do the following:

  • You can generate more data. This is called data augmentation. For example, you could take more pictures from different angles, reshape them a bit, apply more colour filters, remove some pixels from the border etc.
  • You can download similar data from the internet. Say you want to build a neural network to recognize the faces in your office. You can download more pictures of faces from across the globe, first train the model on those faces and then train it using the faces from your office.
  • You can download a pre-trained neural network, add a layer on top of it and further train it using your data.

If there is no correlation in the data, you can’t do much. You can just recheck the labels. A common error is label mismatch. Imagine that there are two files, one containing the features and the other containing the labels, and they are in different orders, or a single line was skipped in either file, causing the label mismatch. So, recheck whether the labels are in the same order as the features. Also, check with the data gathering team whether there is something wrong with the data.

The last case where you need to improve upon the model is the hardest. In case of neural networks, you can do the following:

  • Add more layers
  • Add more neurons to fully connected / dense layers, but prefer adding more layers to adding more neurons
  • Add more filters
  • Experiment with different strides
  • Add ReLU if you aren’t using it already
  • If you have the vanishing or exploding gradients problem,
    • use batch normalization.
    • Try initializing the weights using the xavier_initializer or other heuristics
    • Also, try gradient clipping
  • Normalize the features either using the min-max scaling or standardization
  • Try normalizing the labels too, though this should not be the first thing you try.
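The two feature normalization options from the list, min-max scaling and standardization, can be sketched with the standard library as follows. The function names and the sample values are illustrative, and both helpers assume the values are not all identical.

```python
import statistics

def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lowest, highest = min(values), max(values)
    return [(v - lowest) / (highest - lowest) for v in values]

def standardize(values):
    """Shift and scale values to zero mean and unit (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

feature = [2.0, 4.0, 6.0, 8.0]
scaled = min_max_scale(feature)
standardized = standardize(feature)
print(scaled)
print(standardized)
```

Whichever option you choose, the statistics (minimum, maximum, mean, standard deviation) should be computed on the training set and then reused unchanged for the test set.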

In case of overfitting

If you notice that your model is overfitting, you should apply regularization and also make sure that you are shuffling the training set at every epoch, so that the batches are different every time.

For regularization, you can use L1 or L2 regularization or a dropout layer.
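To show what L1 and L2 regularization add to the training objective, here is a minimal sketch that computes the penalty terms for a list of weights. The function names, the weights and the `strength` value are hypothetical; deep learning frameworks provide their own built-in ways to apply these penalties.

```python
def l1_penalty(weights, strength):
    """L1 term: strength times the sum of absolute weights."""
    return strength * sum(abs(w) for w in weights)

def l2_penalty(weights, strength):
    """L2 term: strength times the sum of squared weights."""
    return strength * sum(w * w for w in weights)

def regularized_loss(data_loss, weights, strength, kind="l2"):
    """Loss that the optimizer would minimize once the penalty is included."""
    penalty = l1_penalty if kind == "l1" else l2_penalty
    return data_loss + penalty(weights, strength)

weights = [0.5, -1.0, 2.0]
print(regularized_loss(1.0, weights, strength=0.1, kind="l1"))
print(regularized_loss(1.0, weights, strength=0.1, kind="l2"))
```

A larger `strength` pushes the weights harder towards zero, which constrains the model more and reduces overfitting at the risk of underfitting.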

These are my quick notes. Feel free to let us know if you observe any errors in this post.

If you liked it, share it with your friends.

How to Install Hortonworks Data Platform – HDP 2.6 on AWS

In this post, we will show you how you can install Hortonworks Data Platform on AWS.

You can also watch the video of this tutorial here


We start with three machines. We could install Hadoop on these machines by manually downloading and configuring it, but that’s very inefficient. So we could use either Cloudera Manager or Ambari. In this tutorial, we are going to use Ambari.

On the first machine, we are going to install the Ambari server. For that, we need to get these three instances from Amazon, and we will follow the Ambari guidelines.

Ambari will then install all the components that are required on the other two machines.

Please note, we will use machines with 16 GB of RAM so that the installation goes smoothly.

Let’s get started.

A Simple Tutorial on Scala – Part – 2

Welcome back to the Scala tutorial.

This post is the continuation of A Simple Tutorial on Scala – Part – 1

In Part-1 we learned the following topics on Scala

  • Scala Features
  • Variables and Methods
  • Condition and Loops
  • Variables and Type Inference
  • Classes and Objects

Keeping up the same pace, we will learn the following topics in the 2nd part of the Scala series.

  • Functions Representation
  • Collections
  • Sequence and Sets
  • Tuples and Maps
  • Higher Order Functions
  • Build Tool – SBT

Functions Representation

We have already discussed functions. We can write a function in different styles in Scala. The first style is the usual way of defining a function.

Please note that the return type is specified as Int.

In the second style, please note that the return type is omitted, also there is no “return” keyword. The Scala compiler will infer the return type of the function in this case.

If the function body has just one statement, then the curly braces are optional. In the third style, please note that there are no curly braces.

A Simple Tutorial on Scala – Part – 1

Welcome to the Scala tutorial. We will cover Scala in a two-part blog series. In this part, we will learn the following topics

  • Scala Features
  • Variables and Methods
  • Condition and Loops
  • Variables and Type Inference
  • Classes and Objects

For better understanding, try this tutorial hands-on. We’ve written this post in such a way that the reader will find it easy to follow along while typing in the examples.

Scala Features

Scala is a modern multi-paradigm programming language designed to express common programming patterns in a concise, elegant, and type-safe way.

It is a statically typed language, which means it does type checking at compile time as opposed to run time. Let me give you an example to better understand this concept.

When we deploy jobs which will run for hours in production, we do not want to discover midway that the code has unexpected runtime errors. With Scala, type errors are caught when the code is compiled, so they cannot surprise you while the job is running in production.

Since Scala is statically typed, we also generally get better performance and speed than with dynamic languages.

How is Scala different than Java?

Unlike Java, in Scala, we do not have to write quite as much code to perform simple tasks and its syntax is very similar to other data-centric languages. You could say that Scala is the modified version of Java with less boilerplate code.

A Simple Tutorial on Linux – Part-2

This post is the continuation of A Simple Tutorial on Linux – Part-1

In Part-1 we learned the following topics on Linux.

  • Linux Operating System
  • Linux Files & Process
  • The Directory Structure
  • Permissions
  • Process

Keeping up the same pace, we will learn the following topics in the 2nd part of the Linux series.

  • Shell Scripting
  • Networking
  • Files & Directories
  • Chaining Unix Commands
  • Pipes
  • Filters
  • Word Count Exercise
  • Special System commands
  • Environment variables

Writing first shell script

A shell script is a file containing a list of commands. Let’s create a simple script that prints two words:

1. Open a text editor to create a file

2. Write the following into the editor:

Note: In Unix, the extension doesn’t dictate the program to be used while executing a script. It is the first line of the script that would dictate which program to use. In the example above, the program is “/bin/bash” which is a Unix shell.

3. Press Ctrl+X to save and then “y” to exit

4. Now, by default, it would not have executable permission. You can make it executable like this:

5. To run the script, use:

A Simple Tutorial on Linux – Part-1

We have started this series of tutorials on Linux, which is divided into two blog posts. Each of them will cover basic concepts with practical examples. We have also provided quizzes on some of the topics that you can take for free.

In the first part of the series, we will learn the following topics in detail

  • Linux Operating System
  • Linux Files & Process
  • The Directory Structure
  • Permissions
  • Process


Linux is a Unix-like operating system. It is open source and free. We might sometimes use the word “Unix” instead of Linux.

A user can interact with Linux either using a ‘graphical interface’ or using the ‘command line interface’.

Learning to use the command line interface has a steeper learning curve than the graphical interface, but the command line can be automated very easily. Also, most server-side work is generally done using the command line interface.

Linux Operating System

The operating system is made of three parts:

1. The Programs

A user executes programs. AngryBird, for example, is a program that gets executed by the kernel. When a program is launched, it creates processes. The terms program and process will be used interchangeably.

2. The Kernel

The Kernel handles the main work of an operating system:

  • Allocates time & memory to programs
  • Handles File System
  • Responds to various Calls

3. The Shell

A user interacts with the Kernel via the Shell. The console as opened in the previous slide is the shell. A user writes instructions in the shell to execute commands. Shell is also a program that keeps asking you to type the name of other programs to run.

NumPy and Pandas Tutorial – Data Analysis with Python

Python is increasingly being used as a scientific language. Matrix and vector manipulations are extremely important for scientific computations. Both NumPy and Pandas have emerged as essential libraries for any scientific computation in Python, including machine learning, due to their intuitive syntax and high-performance matrix computation capabilities.

In this post, we will provide an overview of the common functionalities of NumPy and Pandas. We will see the similarity of these libraries to existing toolboxes in R and MATLAB. This similarity and the added flexibility have resulted in wide acceptance of Python in the scientific community lately. The topics covered in this blog are:

  1. Overview of NumPy
  2. Overview of Pandas
  3. Using Matplotlib

This post is an excerpt from a live hands-on training conducted by CloudxLab on 25th Nov 2017. It was attended by more than 100 learners from around the globe. The participants were from the following countries: United States, Canada, Australia, Indonesia, India, Thailand, Philippines, Malaysia, Macao, Japan, Hong Kong, Singapore, United Kingdom, Saudi Arabia, Nepal and New Zealand.

Python Setup Using Anaconda For Machine Learning and Data Science Tools

Python for Machine Learning

In this post, we will learn how to configure tools required for CloudxLab’s Python for Machine Learning course. We will use Python 3 and Jupyter notebooks for hands-on practicals in the course. Jupyter notebooks provide a really good user interface to write code, equations, and visualizations.

Please choose one of the options listed below for practicals during the course.
