Machine Learning (or Deep Learning) projects generally differ from traditional web application projects because of their inherent complexity.
The goal of this article is not to walk through the full project management life cycle, but to discuss a few complexities and finer points that can affect different project management phases and aspects of a Machine Learning (or Deep Learning) project, and that should be addressed early to avoid surprises later.
Below is a quick ready reckoner for the topics that we will be discussing in this article.
In this article, the term ‘Machine Learning’ covers both ‘Machine Learning’ and ‘Deep Learning’.
Let us say you have trained, fine-tuned and tested a Machine Learning (ML) model, sgd_clf, using SGD Classifier on the MNIST dataset. Now you want to deploy it in production so that consumers of this model can use it. What options do you have for deploying your ML model in production?
One of the classic problems that has been used in the Machine Learning world for quite some time is the MNIST problem. The objective is to identify the digit shown in an image. But MNIST is not a very good benchmark, because we can reach high accuracy even when looking at only a few pixels of the image. So another common example problem against which we test algorithms is Fashion-MNIST.
Fashion-MNIST is a dataset of Zalando’s fashion article images, consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each instance is a 28×28 grayscale image, associated with a label from 10 classes.
These Machine Learning interview questions are real questions that are asked in top interviews.
For hiring machine learning engineers or data scientists, the typical process has multiple rounds.
Basic screening round – The objective of this round is to check that the candidate meets the minimum requirements.
Algorithm Design Round – Some companies have this round but most don’t. This involves checking the coding / algorithmic skills of the interviewee.
ML Case Study – In this round, you are given a machine learning case study problem along the lines of a Kaggle competition. You have to solve it in an hour.
Bar Raiser / Hiring Manager – This interview is generally with the most senior person in the team or a very senior person from another team (at Amazon it is called the Bar Raiser round), who will check whether the candidate meets the company-wide technical bar. This is generally the last round.
Recently, a friend whose company is working on a large-scale project reached out to us to seek a solution to a seemingly simple problem: finding a list of phrases (approximately 80,000) in a huge set of rich text documents (approximately 6 million).
The problem at first looked simple. The engineers had solved it by loading the two datasets into Apache Spark DataFrames and joining them using “like”. Something along these lines:
select phrases.id, docs.id from phrases, docs where docs.txt like concat('%', phrases.phrase, '%')
But it was taking a huge amount of time even on a small subset of the data, even though the processing was done in a distributed fashion. Any guesses why?
They had also tried to use Apache Spark’s broadcast mechanism on the smaller dataset, but it still took a long time to finish even a small task.
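One way to see the cost: a “like” join has to compare every phrase with every document, which is roughly 80,000 × 6,000,000 = 480 billion substring checks. The sketch below shows one possible alternative, not necessarily the approach the engineers ended up with: index the phrases once in a word-level trie, then scan each document a single time. The names tokenize, build_trie and find_phrases are hypothetical, and the sketch matches whole words only, so it is stricter than a substring “like”.

```python
import re

# Assumption: None is used as the trie key that marks "a phrase ends here".
# It can never collide with a word, because words are always strings.
END = None


def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_trie(phrases):
    """Build a word-level trie from a dict of phrase_id -> phrase text."""
    root = {}
    for phrase_id, phrase in phrases.items():
        node = root
        for word in tokenize(phrase):
            node = node.setdefault(word, {})
        node.setdefault(END, []).append(phrase_id)
    return root


def find_phrases(trie, text):
    """Return the ids of all phrases that occur in text as whole words."""
    words = tokenize(text)
    found = set()
    for start in range(len(words)):
        node = trie
        # Walk the trie from this position until no phrase can continue.
        for word in words[start:]:
            node = node.get(word)
            if node is None:
                break
            found.update(node.get(END, ()))
    return found


phrases = {1: "machine learning", 2: "apache spark", 3: "deep learning"}
trie = build_trie(phrases)
print(find_phrases(trie, "We use Apache Spark for machine learning."))
```

With this layout the work per document depends on the document length and the longest phrase, not on the number of phrases, and the trie is small enough to be shipped to every worker.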
Can a machine create a quiz that is good enough to test a person’s knowledge of a subject?
So, last Friday, we wrote a program which can create simple ‘Fill in the blank’ type questions based on any valid English text.
This program splits a text into sentences and then, for each sentence, first tries to remove a proper noun; if there is no proper noun, it removes a noun.
We are using TextBlob, which is basically a wrapper over NLTK. The Natural Language Toolkit, more commonly known as NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing for English, written in the Python programming language.
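The blanking rule can be sketched in a few lines. This is a minimal illustration and not the program itself: it assumes each sentence has already been tagged into (word, tag) pairs with Penn Treebank tags, which is the form TextBlob's tags property returns, and it assumes the first matching word is the one to blank, since the description does not say which noun is picked. The function name make_blank is hypothetical.

```python
PROPER_NOUN_TAGS = {"NNP", "NNPS"}
COMMON_NOUN_TAGS = {"NN", "NNS"}


def make_blank(tagged_sentence):
    """Turn a tagged sentence into a (question, answer) pair.

    tagged_sentence is a list of (word, pos_tag) pairs. Returns None
    when the sentence contains no noun to blank out.
    """
    def first_index(tags):
        for i, (_, tag) in enumerate(tagged_sentence):
            if tag in tags:
                return i
        return None

    # Prefer a proper noun; fall back to a common noun.
    idx = first_index(PROPER_NOUN_TAGS)
    if idx is None:
        idx = first_index(COMMON_NOUN_TAGS)
    if idx is None:
        return None

    words = [word for word, _ in tagged_sentence]
    answer = words[idx]
    words[idx] = "_____"
    return " ".join(words), answer


# Hand-tagged example; in practice the tags would come from the tagger.
sentence = [("Paris", "NNP"), ("is", "VBZ"), ("a", "DT"), ("city", "NN")]
print(make_blank(sentence))
```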
In this data analytics case study, we will use the US census data to build a model to predict if the income of any individual in the US is greater than or less than USD 50000 based on the information available about that individual in the census data.
The dataset used for the analysis is an extract of the 1994 census data prepared by Barry Becker and donated to the public site http://archive.ics.uci.edu/ml/datasets/Census+Income. This dataset is popularly called the “Adult” data set. We will go through this case study in the following order:
Describe the data- Specifically the predictor variables (also called independent variables or features) from the Census data and the dependent variable, which is the level of income (either “greater than USD 50000” or “less than USD 50000”).
Acquire and Read the data- Downloading the data directly from the source and reading it.
Clean the data- Real-world data is usually messy and noisy. The data needs to be reshaped to aid exploration and the modeling that predicts the income level.
Explore the independent variables of the data- A crucial step before modeling is the exploration of the independent variables. Exploration gives an analyst insight into the predictive power of each variable. An analyst looks at the distribution of the variable, how much it varies across the income levels, what skew it has, etc. In most analytics projects, the analyst goes back to get more data, better context, or clarity based on these findings.
Build the prediction model with the training data- Since data like the Census data can have many weak predictors, for this particular case study I have chosen the non-parametric predicting algorithm of Boosting. Boosting is used here as a classification algorithm (we classify whether an individual’s income is “greater than USD 50000” or “less than USD 50000”) that combines many weak predictors into a stronger one and often gives good prediction accuracy. Cross validation, a mechanism to detect and limit overfitting while modeling, is also used with Boosting.
Validate the prediction model with the testing data- Here the built model is applied on test data that the model has never seen. This is performed to determine the accuracy of the model in the field when it would be deployed. Since this is a case study, only the crucial steps are retained to keep the content concise and readable.
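The cross validation mentioned in the modeling step can be illustrated with a small sketch of how k-fold splits are formed. This is only an illustration of the idea, not the code used in the case study, and the function name k_fold_indices is hypothetical. It assumes the rows have already been shuffled, so contiguous blocks of indices make fair folds.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds.

    Each row lands in exactly one validation fold; fold sizes differ
    by at most one when n_samples is not divisible by k.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")

    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds take one additional row each.
        stop = start + base + (1 if fold < extra else 0)
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, validation
        start = stop


for train, validation in k_fold_indices(10, 3):
    print(len(train), validation)
```

The model is fitted on each training part and scored on the matching validation part. Averaging the k scores gives a steadier estimate of accuracy than a single split, which is what makes overfitting easier to spot.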