What are the pre-requisites to learn big data?

Pre-requisites for Big Data Hadoop

We, at CloudxLab, keep getting a lot of questions online, sometimes offline, asking us

“I want to learn big data. But, just don’t know whether I am eligible or not.”

“I am so and so, can I learn big data?”

We have compiled the most common questions here. And, we will answer each one of them.

So, here we go.

What are those questions?

  1. I am from a non-technical background. Can I learn big data?
  2. Do I need to know programming languages such as Java, Python, PHP, etc.?
  3. Or, since it is big data, do I need to know any other relational databases such as Oracle or in general do I need to be well versed with SQL?
  4. And also, do I need to know Unix or Linux?

The first question, I don’t have any technical background or programming experience.

Well, the answer is, you don’t compulsorily have to have a technical background. That said, if you brush up on a few programming basics, it would be more than enough. And to do this, you just need a few hours to get familiar.

The second question, do I need to know any programming languages, such as Java, Python, etc?

The answer is, you don’t have to be a hard-core programmer. That said, you should know the fundamentals of programming, which again takes a few hours to get to know.

For example, we offer a free Java course and a free self-paced Python course. You can check more details on our website.

The third question, do I need to know SQL or any other RDBMS?

Well, the answer is yes. You should know at least SQL. If you don’t, there are many free resources available online.

The final question here, do I need to have Linux or Unix skills?

The answer is, it is not compulsory, but it is good if you know it.

Some generic questions:

  1. I am from a mainframe background, will learning big data help me?
  2. I am from telecom/pharma/manufacturing/FMCG background, will learning big data help me?
  3. I have not been in the job for the last few years, will learning big data help me find a job?
  4. I have been working in SAP field and now want to change my career to the big data, can a big data course help me?
  5. I am an MBA, will learning big data help me shift my career?

I am from a mainframe background, will learning big data help me shift my career?

Being in mainframes, you might have a good grasp of programming languages such as COBOL. Also, you might be comfortable with SQL by now. This would accelerate your learning of big data. And since mainframes are not progressing much, it is very important to upgrade your technical skills to suit the new generation of technologies. We have seen many of our students from mainframes enrolling in our courses and successfully transitioning their careers.

I am from telecom/pharma/manufacturing background, will learning big data help me?

In telecom, pharma or manufacturing, the data that is being generated has become big data. Earlier, to derive insights or predictions, we were able to use traditional tools. But the same can’t be done anymore because data has grown exponentially. So, naturally, the industry is adopting big data technologies.

I have not been in the job for the last few years, will learning big data help me find a job?

From time to time, the technology landscape changes, giving a fresh opportunity to those who have been away from the industry. Before it is too late, it is better to equip yourself with new technologies and new skills to get a job in the current scenario. Long story short – learning big data along with a few other skills will definitely help.

I have been working in SAP field and now want to change my career to the big data, can a big data course help me?

It’s a slightly tricky question, and the answer depends on whether you are a functional consultant or a technical consultant in SAP. Learning big data does help, but the transition may take some time.

I am an MBA, will learning big data help me shift my career?

If you are at the beginning of your career, learning big data will definitely help you. If you have been in the job for a while and want to switch your career, it takes additional effort to master the skills we discussed above.

So, to put it in a nutshell,

You need to know the fundamentals of a programming language such as Java or Python. We have a free course for both. Please visit our website www.cloudxlab.com and enroll yourself.

And also, you do need to know SQL. Again, we have a free course for this as well. Please visit our website for further details.

And, a little bit of Linux or Unix will complete the equation.

More than anything else, you need to have a great passion, ambition to succeed in your career, and willingness to put in sincere efforts and hard work.

Before we wrap up, please visit www.cloudxlab.com to know more details about our big data courses. We have an instructor-led course on big data and a few self-paced courses as well.

Hope we answered all your questions. If you have any other questions, please put them here in the comments or add your questions on the discussion forum on our website.

Top Machine Learning Interview Questions for 2018 (Part-1)


These machine learning interview questions are real questions asked in top interviews.

For hiring machine learning engineers or data scientists, the typical process has multiple rounds.

  1. A basic screening round – The objective of this round is to check the candidate’s minimum fitness.
  2. Algorithm Design Round – Some companies have this round but most don’t. This involves checking the coding / algorithmic skills of the interviewee.
  3. ML Case Study – In this round, you are given a case study problem of machine learning on the lines of Kaggle. You have to solve it in an hour.
  4. Bar Raiser / Hiring Manager  – This interview is generally with the most senior person in the team or a very senior person from another team (at Amazon it is called Bar raiser round) who will check if the candidate fits in the company-wide technical capabilities. This is generally the last round.

A typical first round of interview consists of three parts. First, a brief intro about yourself. 

Second, a brief about your relevant projects.

A typical interviewer will start by asking about the relevant work from your profile. Based on your past machine learning projects, the interviewer might ask how you would improve them.

Say, you have done a project on recommendation, the interviewer might ask:

  • How would you improve the recommendations?
  • How would you do ranking?
  • Have you done any end-to-end machine learning project? If yes, what were the challenges you faced? How would you solve the cold-start problem?
  • How would you improve upon the speed of recommendation?

Afterwards (third part), the interviewer would proceed to check your basic knowledge of machine learning on the following lines.

Q1: What is machine learning?

Machine learning is the field of study that gives computers the ability to learn and improve from experience without being explicitly taught or programmed.

In traditional programs, the rules are coded for a program to make decisions, but in machine learning, the program learns based on the data to make decisions.

Q2. Why do we need machine learning?

The most intuitive and prominent example is self-driving cars, but let’s answer this question in a more structured way. Machine learning is needed to solve problems in the following categories:

  • Problems for which a traditional solution requires a long and complex set of rules and frequent hand-tuning. An example of such a problem is an email spam filter. You notice a few words such as 4U, promotion, credit card, free, amazing, etc. and figure out that the email is spam. This list can get really long and can change once the spammer notices that you have started ignoring these words. It becomes hard to deal with this problem with a traditional programming approach. A machine learning algorithm learns to detect spam emails very well and works better.
  • Complex problems for which there is no good solution at all using the traditional approach. Speech recognition is an example of this category of the problems.
    Machine learning algorithms can find a good solution to these problems.
  • Fluctuating environment: a machine learning system can adapt to new data and learn to do well in this new set of data.
  • Getting insights into complex, large amounts of data. For example, your business collects a large amount of data from the customers. A machine learning algorithm can find insights into this data which otherwise is not easy to figure out.

For more details visit Machine Learning Specialization

Q3. What is the difference between the supervised and unsupervised learning? Give examples of both.

By definition, supervised and unsupervised learning algorithms are categorized based on the supervision required while training. Supervised learning algorithms work on data which is labelled, i.e. the data has a desired solution, or label.

On the other hand, unsupervised learning algorithms work on unlabeled data, meaning that the data does not contain the desired solution for the algorithm to learn from.

Supervised algorithm examples:

  • Linear Regression
  • Neural Networks/Deep Learning
  • Decision Trees
  • Support Vector Machine (SVM)
  • K-Nearest neighbours

Unsupervised algorithm examples:

  • Clustering Algorithms – K-means, Hierarchical Clustering Analysis (HCA)
  • Visualization and Dimensionality reduction – Principal Component Analysis (PCA)
  • Association rule learning

Q4. Is recommendation supervised or unsupervised learning?

Recommendation algorithms are interesting as some of these are supervised and some are unsupervised. The recommendations based on your profile, previous purchases, page views fall under supervised learning. But there are recommendations based on hot selling products, country/location-based recommendations, which are unsupervised learning.

For more details visit Machine Learning Specialization

Q5. Explain PCA?

PCA stands for Principal Component Analysis. PCA is a procedure to reduce the dimensionality of data, which may consist of many variables correlated with each other heavily or lightly, while retaining as much of the variation in the data as possible. The data should be scaled before applying PCA, because the result of PCA is sensitive to the relative scaling of the variables.

PCA Hyperplane depiction
PCA – Hyperplanes

For example, say you have a dataset in 2D space and you need to choose a hyperplane onto which to project it. The hyperplane must be chosen such that the variance is preserved to the maximum. In the figure, when converting from one representation to another (left to right), the hyperplane C1 (solid line) preserves the maximum variance in the dataset while C2 (dotted line) preserves very little.
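To make this concrete, here is a small pure-Python sketch for the 2D case (the dataset and numbers are made up for illustration, not taken from the figure): the eigenvalues of the 2x2 covariance matrix give the variance along each principal component, so the ratio of the largest eigenvalue to their sum is the fraction of variance preserved by the best projection.

```python
import math

# Small 2D dataset lying mostly along one direction
data = [(x, 0.5 * x + e) for x, e in zip(
    [-2.0, -1.0, 0.0, 1.0, 2.0],
    [0.05, -0.05, 0.0, 0.05, -0.05])]

n = len(data)
mx = sum(p[0] for p in data) / n
my = sum(p[1] for p in data) / n

# Covariance matrix entries of the centred data
a = sum((p[0] - mx) ** 2 for p in data) / n
c = sum((p[1] - my) ** 2 for p in data) / n
b = sum((p[0] - mx) * (p[1] - my) for p in data) / n

# Eigenvalues of the 2x2 covariance matrix; each eigenvalue is the
# variance along one principal component
d = math.sqrt((a - c) ** 2 + 4 * b ** 2)
lam1 = (a + c + d) / 2   # variance along the best direction (C1)
lam2 = (a + c - d) / 2   # variance along the worst direction (C2)

explained = lam1 / (lam1 + lam2)
```

For higher-dimensional data the same idea is computed via an eigendecomposition or SVD of the covariance matrix rather than this closed 2x2 formula.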



Q6. Which supervised learning algorithms do you know?

Answer this question based on your own comfort level with the algorithms. There are many supervised learning algorithms, such as regression, decision trees, neural networks, SVMs, etc. Of these, the simplest and most popular supervised learning algorithm is linear regression. Let me explain it quickly.

Say we need to predict the income of residents of a county based on some historical data. Linear regression can be used for this problem.
The linear regression model is a linear function of the input features, with weights that define the model and a bias term, as shown below:

y_hat = theta_0 + theta_1 * x_1 + theta_2 * x_2 + … + theta_n * x_n

In this equation, y_hat is the predicted outcome, x_i are the inputs, and theta_i are the model parameters or weights; theta_0 is the bias.
The performance of this model is measured by evaluating the Root Mean Square Error (RMSE). In practice, the Mean Square Error (MSE) is minimized to find the parameter values. MSE is given as below:

MSE = (1/m) * Σ_{i=1..m} (y_hat^(i) − y^(i))^2
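For a single feature, the MSE-minimizing parameters have a simple closed form: the weight is the covariance of x and y divided by the variance of x, and the bias follows from the means. A minimal sketch with made-up data:

```python
# Tiny illustrative dataset: years of experience (x) vs income in $k (y)
xs = [1, 2, 3, 4, 5]
ys = [30, 35, 41, 44, 50]   # roughly y = 25 + 5x

m = len(xs)
x_mean = sum(xs) / m
y_mean = sum(ys) / m

# Closed-form least squares for one feature:
# theta_1 = covariance(x, y) / variance(x), theta_0 = y_mean - theta_1 * x_mean
theta_1 = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) \
          / sum((x - x_mean) ** 2 for x in xs)
theta_0 = y_mean - theta_1 * x_mean

predictions = [theta_0 + theta_1 * x for x in xs]
mse = sum((p - y) ** 2 for p, y in zip(predictions, ys)) / m
rmse = mse ** 0.5
```

With many features, the same minimization is done with the normal equation or gradient descent rather than this one-feature formula.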

Q7. Can you compare Decision Trees and linear regression? Can decision trees be used for non-linear classification?

Decision trees are used for both classification and regression in supervised machine learning problems. In decision trees, we form the tree by splitting nodes. Initially, all of the instances are divided into two parts based on a boundary, such that the instances on either side of the boundary are very similar to the other instances on the same side: the instances on the left-hand side should be very similar to other instances on the left-hand side, and the same is true for the right-hand side.

The figure below shows decision trees of max depth 2 and max depth 3; you can see that as the max depth of the decision tree increases, you get a better fit to the available data.

Decision tree depth explanation
Decision Trees with Different Depths

One more aspect of decision trees worth highlighting is their stability. Decision trees are sensitive to rotations of the dataset. The picture below demonstrates the instability of a decision tree when the data is rotated.

Decision Tree Sensitivity with Rotation of Data
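To make the node-splitting idea concrete, here is a small sketch of how a single split on one feature can be chosen. Real libraries do this recursively over all features; the weighted Gini impurity used here is one common criterion (an assumption for illustration, since other criteria such as entropy are also used):

```python
def gini(labels):
    """Gini impurity of a list of class labels (0 = pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for c in labels:
        counts[c] = counts.get(c, 0) + 1
    return 1.0 - sum((k / n) ** 2 for k in counts.values())

def best_split(xs, ys):
    """Find the threshold on a single feature that minimizes the
    weighted Gini impurity of the two resulting child nodes."""
    best_t, best_score = None, float("inf")
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

# Two well-separated classes along one feature
xs = [1.0, 1.2, 1.5, 4.0, 4.2, 4.5]
ys = [0, 0, 0, 1, 1, 1]
t, score = best_split(xs, ys)   # splits cleanly between the two groups
```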


For more details visit Machine Learning Specialization

Q8. Explain overfitting and underfitting? What causes overfitting?

Say there are two kids, Jack and Jill, preparing for a maths exam. Jack only learnt addition, and Jill memorized the questions and their answers from the maths book. Now, who will succeed in the exam? The answer is neither. In machine learning lingo, Jack is underfitting and Jill is overfitting.

Overfitting is the failure of a model to generalize to new examples outside the training set, even though it works very well on the training set, just as Jill can answer any question that is in the book but nothing beyond it. Underfitting, on the other hand, refers to a model that does not capture the underlying trend of the data (training data as well as test data). The remedy, in general, is to choose a more powerful (more complex) machine learning model.

So, underfitting models are the ones that give bad performance on both training and test data. Overfitting is very important to keep tabs on while developing machine learning algorithms, because, intuitively, if the model fits the training set very well, developers tend to think that the algorithm is working well, sometimes failing to account for overfitting. Overfitting occurs when the model is too complex relative to the amount and noisiness of the training data; the symptom is that the algorithm works well on the training data but poorly on test data. Below are some of the ways to avoid overfitting:

  • Simplify the model: regularization, controlled by hyperparameter
  • Gather more training data
  • Reduce the noise in the training data

Below are some of the ways to avoid underfitting:

  • Selecting a more powerful model
  • Feeding better features to the learning algorithm
  • Reducing the constraints on the model (reduce regularization hyperparameter)
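A toy sketch (with made-up numbers) of how these two failure modes show up in training versus test error. The "overfit" model simply memorizes the training answers (a 1-nearest-neighbour lookup), the "underfit" model ignores the input entirely, and a reasonable model follows the underlying trend:

```python
# Noisy linear data (y ≈ 2x), split into train and test halves
train = [(0, 0.1), (1, 2.2), (2, 3.9), (3, 6.1), (4, 7.8)]
test  = [(0.5, 1.1), (1.5, 2.9), (2.5, 5.2), (3.5, 6.9)]

def mse(model, points):
    return sum((model(x) - y) ** 2 for x, y in points) / len(points)

# Underfitting model: always predicts the training mean, ignoring x
mean_y = sum(y for _, y in train) / len(train)
underfit = lambda x: mean_y

# Overfitting model: memorizes training answers (1-nearest neighbour)
def overfit(x):
    return min(train, key=lambda p: abs(p[0] - x))[1]

# Reasonable model: the underlying trend y = 2x
good = lambda x: 2 * x

train_under, test_under = mse(underfit, train), mse(underfit, test)
train_over, test_over = mse(overfit, train), mse(overfit, test)
train_good, test_good = mse(good, train), mse(good, test)
# Underfit: high error everywhere. Overfit: zero train error, high test error.
```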

Q9. What is cross-validation technique?

Let’s first understand what a validation set is, and then we will come to cross-validation. When building a model, the training set is used to tune the weights (for example, by means of backpropagation in neural networks), and these weights are chosen such that the training error is minimal.

Now you need data to evaluate the model and the hyperparameters, and this data cannot be the same as the training set. Hence a portion of the training data is reserved for validation and is called the validation set. When testing different models, to avoid wasting too much data on separate validation sets, the cross-validation technique is used: the training data is divided into complementary subsets, and a different combination of training and validation subsets is used for each model.

Then finally the best model is tested with test data.
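The splitting scheme described above can be sketched as a small generator: for k-fold cross-validation, each sample serves in the validation set exactly once across the k folds.

```python
def k_fold_splits(n, k):
    """Yield (train_indices, validation_indices) pairs for k-fold
    cross-validation over n samples. Each sample is used for
    validation exactly once across the k folds."""
    indices = list(range(n))
    fold_size = n // k
    for i in range(k):
        start = i * fold_size
        stop = (i + 1) * fold_size if i < k - 1 else n
        val = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, val

splits = list(k_fold_splits(10, 5))   # 5 folds of 2 validation samples each
```

In practice, the indices are usually shuffled (and often stratified by class) before folding; this sketch keeps them in order for clarity.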

For more details visit Machine Learning Specialization

Q10. How would you detect overfitting and underfitting?

This is one of the most important questions of practical machine learning. For answering this question, let’s understand the concept of bias and variance.

In order to conclude whether the algorithm is overfitting or underfitting, you need to find out the training set error (E_train) and the cross-validation set error (E_cv). If E_train is high and E_cv is in the same range as E_train, i.e. both are high, it is a case of high bias and the algorithm is underfitting. If, on the other hand, E_train is low but E_cv is high, it is a case of high variance and the algorithm is overfitting.
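This rule of thumb can be captured in a tiny helper. The threshold values here are illustrative assumptions, not fixed rules; what counts as "high" error depends on the problem.

```python
def diagnose(e_train, e_cv, acceptable=0.1):
    """Rough overfitting/underfitting diagnosis from training error and
    cross-validation error. 'acceptable' is the error level we would be
    happy with (an illustrative threshold, chosen per problem)."""
    if e_train > acceptable and e_cv >= e_train:
        return "underfitting (high bias)"
    if e_train <= acceptable and e_cv > 2 * e_train:
        return "overfitting (high variance)"
    return "looks reasonable"
```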

Q11. What’s the trade-off between bias and variance?

Explanation of Bias and Variance trade-off
Bias vs Variance

In simple terms, a very simple algorithm (one that does not capture the underlying details of the data) underfits and has high bias, while a very complex algorithm overfits and has high variance. There has to be a balance between the two. The picture depicts how they are related in terms of the trade-off between them.

Q12. How would you overcome overfitting in the algorithms that you mentioned above?

As mentioned above, the ways to overcome overfitting are as below:

  • Simplify the model: regularization, controlled by hyperparameter
  • Gather more training data
  • Reduce the noise in the training data

Q13. A colleague claims to have achieved 99.99% accuracy in a classifier he has built. Would you believe him? If not, what would be the prime suspects? How would you solve it?

99.99% is a very high accuracy in general and should be treated with suspicion. At the very least, the data set and any flaw in how the solution is modelled around it should be checked thoroughly. My prime suspects would be the data set and the problem statement. For example, take a set of handwritten characters containing the digits 0 to 9, and a model built to detect whether a digit is 5 or not 5. A faulty model that always answers “not 5” (say, one that always recognizes a digit as 8) will still achieve 90% accuracy, given that all digits have an equal number of images in the data set. In this case, the data set does not have a good distribution for the problem of detecting 5 versus not 5.
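The arithmetic behind this example, as a quick sketch: with a balanced set of digits, a detector that never says "yes" to 5 is still right 90% of the time.

```python
# Illustrative: 100 handwritten digits, 10 of each class (0-9)
labels = [d for d in range(10) for _ in range(10)]

# A faulty "is it a 5?" detector that never answers yes
# (equivalently, a model that always predicts some non-5 digit)
predictions = [False] * len(labels)

truth = [d == 5 for d in labels]
accuracy = sum(p == t for p, t in zip(predictions, truth)) / len(labels)
# accuracy == 0.9: high, yet the model never detects a single 5
```

This is why, on imbalanced problems, metrics such as precision, recall, or the ROC curve (next question) are more informative than raw accuracy.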

Q14. Explain how a ROC curve works?

ROC Curve

ROC stands for Receiver Operating Characteristic. The ROC curve is used to compare the performance of different algorithms: it plots the true positive rate against the false positive rate at various decision thresholds. The performance measure is the area under this curve (AUC) – the larger the area, the better the model.
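A small sketch of how the curve and its area are computed: sweep the decision threshold over the classifier's scores, record (FPR, TPR) at each step, and integrate with the trapezoidal rule. A classifier that separates the classes perfectly gets AUC = 1.0.

```python
def roc_points(scores, labels):
    """Compute (fpr, tpr) points by sweeping the decision threshold
    over every distinct score, highest first."""
    pos = sum(labels)
    neg = len(labels) - pos
    pts = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        preds = [s >= t for s in scores]
        tp = sum(p and l for p, l in zip(preds, labels))
        fp = sum(p and not l for p, l in zip(preds, labels))
        pts.append((fp / neg, tp / pos))
    return pts

def auc(points):
    """Area under the curve via the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Perfect separation: all positives score above all negatives
points = roc_points([0.9, 0.8, 0.3, 0.1], [True, True, False, False])
```

A random classifier's curve hugs the diagonal, giving AUC around 0.5.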

For more details visit Machine Learning Specialization

Q15. Explain the ensemble methods? What is the basic principle?

Say you ask a question to thousands of people and then aggregate their answers; many times this aggregated answer is better than an expert’s answer. Ensemble methods combine the predictions of different learning algorithms, for classification, regression, etc., to achieve higher accuracy; the aggregate prediction is often better than the best individual predictor. Such a group of predictors is called an ensemble, and the technique is called ensemble learning.
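A quick sketch of why aggregation helps, under the (idealized) assumption that the predictors err independently: a majority vote of three classifiers that are each 70% accurate is correct about 78% of the time.

```python
from math import comb

def majority_vote_accuracy(p, n):
    """Probability that a majority vote of n independent classifiers,
    each correct with probability p, is correct."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

acc = majority_vote_accuracy(0.7, 3)   # 0.784 > 0.7
```

Real classifiers trained on the same data are correlated, so the gain is smaller in practice; this is why ensemble methods work hardest at making predictors diverse (different algorithms, different data subsets).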

Q16. Say, you have a dataset having city id as the feature, what would you do?

When you collect the data for your machine learning project, you need to carefully select the features from the data collected. City id is just a serial number which does not represent any property of the city unless otherwise stated, so I would just drop city id from the features list.

Q17. In a dataset, there is a feature hour_of_the_day which goes from 0 to 23. Do you think it is okay?

This feature cannot be used as is, for a simple reason: consider 0 and 23. These two numbers have a large numeric difference, but in fact they are close in their actual occurrence in the day, so the algorithm may not produce the desired results. There are two ways to solve this. The first is to apply a sine (and, to keep every hour distinct, a cosine) transform with a period of 24 hours, which turns the discontinuous values into continuous ones.

The second approach is to divide the hours of the day into categories such as morning, afternoon, evening and night, or into peak and non-peak hours, based on your knowledge of the problem domain.
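A sketch of the first approach. Note that the pairing with cosine is an addition to what the text states: sine alone maps some distinct hours (e.g. 3 and 9) to the same value, while the (sin, cos) pair gives each hour a unique point on a circle.

```python
import math

def encode_hour(hour):
    """Map an hour (0-23) onto a circle so that 23 and 0 end up
    close together. Using both sine and cosine gives each hour a
    unique point; sine alone would map e.g. 3 and 9 to the same value."""
    angle = 2 * math.pi * hour / 24
    return (math.sin(angle), math.cos(angle))

def distance(a, b):
    """Distance between two encoded hours in the transformed space."""
    return math.dist(encode_hour(a), encode_hour(b))

# 23:00 and 00:00 are now as close as 00:00 and 01:00,
# and far closer than 00:00 and 12:00
```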

Q18. If you have a smaller dataset, how would you handle it?

There are multiple ways to deal with this problem. Below are a few techniques.

  1. Data augmentation
  2. Pretrained Models
  3. Better algorithm
  4. Generate the data yourself
  5. Download data from the internet


To learn more in detail, join the course on machine learning


Financial Aid, Scholarship Test & Free Resources

Financial Aid

At CloudxLab, we have always believed that quality education must be affordable for everyone, so that we can help learners achieve their career goals and build innovative products.

If you can’t afford to pay for a course, you can apply for financial aid using this form. Learners with Financial Aid in a course will be able to access all of the course content and complete all work required to earn a certificate. Financial Aid only applies to the course that the Financial Aid application was approved for. Most courses offer Financial Aid, but Financial Aid may not be available for certain courses. It will take a minimum of 7 days for us to review your financial aid application. When your application is reviewed, you’ll get an email letting you know whether it’s been approved or denied.


How to Teach Online Effectively


I founded KnowBigData.com in 2014 after working at Amazon. Teaching is my passion, and technology, specifically large-scale computing, is my forte, thanks to my work experience with Amazon, InMobi, D. E. Shaw and my own startup tBits Global. Therefore, I wanted to help people learn technology online. I launched KnowBigData.com, offering online instructor-led training on MongoDB, followed by Big Data and Machine Learning. Eventually, we innovated a lot in learning and shaped KnowBigData into CloudxLab.com, which is currently a major gamified learning environment for Machine Learning, AI, and Big Data.

Teaching online has a lot of advantages. You can teach from anywhere in the world, your students can be from any corner of the globe, you save your daily travel time, and you get to talk to students across the world. By teaching, you become more compassionate to the learner and you work hard on self-learning every aspect of the subject that you are teaching. Teaching also helps you improve your ability to express.

Since Apr 2014 when I started KnowBigData (now CloudxLab), I have taught for around 2000 hours (4 years * 52 weeks * 3 classes avg. * 3 hours) and more than 3000 people.  I was able to get wonderful reviews from people with an average rating of above 4.8 out of 5. You can read our public reviews here: (Quora, Quora, Quora, Quora, Facebook, Facebook)

I would love to share with you my learnings that may help you in teaching effectively online. If you are able to follow these and are interested in teaching with CloudxLab, please apply here.

1. Deep Dive into the Subject

The first and foremost thing to do is to make sure you know the subject you are about to teach. That may need new learning, re-learning, or even a certain amount of unlearning. Let go of your ego that you know everything about your subject. Pick up some of the best books, go through each topic, and know each topic in and out. Challenge yourself with questions. Be prepared with the thought that ‘the learner may ask a question on this’ and get yourself ready.

The way I stay on top of my knowledge base is by keeping a list called “To Learn.” Any topic that I need to study goes into this list. During my research, I pick topics from the top of the list, and if I come across subtopics which need brushing up, I append them to the master list. In computing, we call this the ‘breadth-first’ approach, but it is an incredible approach for deep diving into a topic. There are often dark spots in our learning that we tend to avoid. It is important to be strong enough to figure out those worrisome dark spots and study them. Nothing beats the thrill of learning something we have been avoiding for long.

2. Prepare Slides Well

While teaching, it becomes really easy to explain if you have a good set of slides. Slides guide you and keep you on track during your teaching sessions. A good set of slides should have less text, more graphics and bullet points.

If you have too much text on any slide, break it into multiple slides and work hard on what can be removed. Sacrifice grammatical completeness for brevity. Slides should also detail everything about the topic: the prerequisites, concepts and the references from where the learner can learn more.

Add images to slides. Images should be used in the following order of preference: Humans, Animals, Things, Text/Block diagrams. If you have to choose between the image of a man and the image of an animal for your slides, use the man’s photo.

Similarly, give preference to photos of things over the screenshot of text or block diagrams.

It is also advisable to share the topic deck with learners in advance.

3. Hands-on approach first

A lot of teachers may not agree with this point, but I’ve found it particularly useful. The implementation of this point is also difficult. Have your students work hands-on on something before jumping to the theory. The idea is to make the class work on practicals before you teach. Teaching concepts can be boring for learners. Let the learner figure out the concepts as much as they can instead of a teacher spoon-feeding them. For example, if I’m teaching a class how to code, I would ask my class to just follow the steps and get the first very basic code running. It is amazing to see how much a class can grasp the basics just by running the first code.

This is exactly why our Machine Learning course has the End-to-End project right at the beginning and not at the end of the course. That turned out to be the best way to teach my class.

When we were researching how to develop our courses, we realized that students were hesitant to try hands-on exercises because setting up the environment was time-consuming, needed high-end hardware and installation permissions, and consumed too many resources on the machine. Therefore, we decided to set up a shared online lab, which solved all of these challenges, changed the face of online learning for good, and became a product in itself. This lab is available 24/7 to all users. The hugely positive response to our lab convinced us to rebrand KnowBigData.com to CloudxLab.com.

4. Answer every question

There is no dearth of study material on the internet. Text or videos on topics are widely available for learners. What paralyzes self-study are unanswered questions. When learners face questions, they stop to look for answers but eventually get lost in the online noise. That is precisely why students join instructor-led classes – to get their questions answered. Questions don’t really delay a course. They become the purpose of a class.

The key to happy learners in a class is to answer their questions and encourage them to ask more. Here is my way of handling questions.

First, the following are clarified in the first few classes:

  • Q&A is the purpose of the class and not an obstruction
  • It is okay to delay a class but not okay to not ask questions.
  • Listen to the questions from other classmates. Try to answer or make a hypothesis. Never turn down your classmate’s questions.
  • No question is stupid. I repeat. No question is stupid.
  • Often, questions asked in a class are valid interview questions.  So, pay attention to the questions asked by your fellow-learners.

Second, I always acknowledge a question before answering it. I explicitly mention “It is a good question” and then I try to answer the question with real-life examples from my career. Some of you may be familiar with ‘Oss,’ the greeting Karate students and teachers use while bowing to each other. It is not only an affirmation of positive spirits; ‘Oss’ is also a mark of mutual respect and admiration. In a similar fashion, before you answer a question, you should bow and say “Good question.”

During a certain corporate training, I noticed that there were no questions in class because the class had a mix of teams, including managers and their reportees. The learners were either shy to ask questions or afraid of being judged. Since it was an online class, I asked everyone to use pseudonyms while joining the webinar. The strategy helped, and we had a lot of questions to answer during the entire session.

5. Work on your accent

For a lot of instructors, the accent can be a problem especially if their students are scattered across the globe. I’ve grown up in a small village in India, and therefore, I needed to work on my accent to deliver my classes to students worldwide.

The best way I found to improve my accent was by attending sales calls from all over the globe. I would talk in detail to prospective students and lab users and figure out the areas for my improvement. This turned out to be the most effective way to improve my pronunciation. Though it was a lot of hard work considering time zone differences, the exercise turned out extremely fruitful. You may want to compare the oldest videos from KnowBigData on our YouTube channel with the latest videos from CloudxLab.com to see the positive changes.

If there is a different method that works for you, feel free to implement it. Nevertheless, hold 1-1 discussions with your learners, and respond to sales and service calls.

Also, in conversation, be slow and clear, and ask if you were clear and the listeners understood you. Be mindful of every word that you speak. Avoid fillers such as ‘umm’ and ‘aahs’. If you are mindful of your words and how you speak them, your instruction will be crisp in class.

6. Repeat with a different selection of words

A number of times, you may not be very clear to your audience. Often, words we use may have different connotations in different countries. In a class with students from all over the world, it becomes inevitable to explain repeatedly using a different selection of words. I follow this strategy. I repeat the same sentence using a different selection of words. Keeping the meaning intact, I choose another set of words. (I just did it:))

7. Avoid monotone

The word ‘monotonous’ comes from ‘monotone’ which means sound of the same tone. And monotonous is boring and sleep-inducing.

A voice artist once advised me to find a comfort zone of my voice during long speaking hours and to stay in that comfort zone without any modulation. That advice was good for a voice artist, but not for a teacher.

My thumb rule is to keep varying my tone to prevent students from getting sleepy. Again, varying the tone doesn’t mean shouting. Keeping the acoustics conducive (good mic, small room), simply altering the cadence of my voice to avoid monotone keeps the class alive and active.

Also, I make sure to have sips of lukewarm water while teaching. That keeps my throat from getting sore after hours of speaking.

8. Don’t sit

The human mind is amazing in that it can infer the state of an unseen speaker’s mind from sound alone. Note that my camera is turned off during a class. Nevertheless, my learners can figure out from my voice if I am feeling active or lazy, or happy, or angry. To have an active and responsive class, I stand and teach. I do pace around my desk while teaching. I make sure that I wear most comfortable shoes (preferably jogging shoes) during my class even if I am delivering it from my bedroom.

To make this easier, I have invested in a standing desk for my lectures.

9. Explain concepts by asking questions

This is probably the hardest part of teaching. In order to explain a concept, break it down into a series of questions. You need to come up with a set of questions such that the answers to these questions lead to an understanding of the concept. For example, to explain the concept of ‘refraction’ in physics, you could break it down into questions like:

  1. How do we see an object?
  2. If there is some other material between us and the object, can we still see it?
  3. If that “material” is glass, can we see through it?
  4. Is there some change in the view when we see an object through glass?
  5. What happens when we put a pen in a glass half full of water?

By the end of this set of questions, students are naturally led towards understanding the concept. Students retain concepts longer and better by deducing them than by direct instruction.

In addition, while answering a question from the audience, it is always advisable to induce curiosity by giving an open-ended answer that sparks off another question, leading to a chain until the concept is clear.

Let me try to give you an example. When I was with D.E.Shaw, I delivered tech talks on Secure Programming. I would show a piece of code to my audience and ask them to find out what was wrong with it. The audience would come up with many flaws, and I would note all of them on a whiteboard. Next, I would ask them how they would fix these flaws. Their fixes would in turn surface new flaws, and eventually the chain of flaws and fixes would clarify the entire concept. It is flattering to me that my audience still remembers my Secure Programming classes even today.

Breaking a large concept into a series of questions is hard, and it might seem time-consuming in class, but the positive impact lasts longer. As a passionate teacher, I am absolutely okay with taking even five or six hours to teach a tiny concept. This is one of the reasons my class expanded to twice the promised duration, and the students were happy to cooperate because the concepts were crystal clear to them.

10. Smile

Here is a quick test. Notice your state of mind now. And then smile. Yes, smile. Smile for no reason. Think of a beautiful moment and force yourself into a smile. How does it feel? Compare your state of mind with the previous moment. Do you notice a change? I am sure you will.

A smile is a great hack to change your state of mind. It is true that when you are happy, you smile, but it is also true that when you smile, you become happy. And when you are happy, you are more capable of solving problems.

I smile before my class starts. Smiling reflects enthusiasm in my voice and peps up my students. I keep a sticky note at my standing desk that reads “Smile”.

11. Lifestyle changes

This may be very specific to me, but I find that eating fruits such as pears and apples makes me happier, more active and energetic, and less sleepy. Caffeine doesn’t work for me. The purpose is to stay energetic during a class. You may choose what works for you.

The more energetic you are, the better you sound, and your students will reflect your enthusiasm and positivity.

12. Avoid alcohol

Do not drink for at least six hours before a class. Teaching for three straight hours requires extreme focus, and you can’t achieve that focus if you are under the influence of alcohol.

Also, avoid everything that can cause a cold or a cough or that affects your throat adversely. I avoid ice cream and smoking.

13. Have a Good Setup

Here are the details of my setup that help me deliver my classes stress-free:

  1. A power backup, specifically a UPS for both my computer and the internet router. Extremely important.
  2. A backup internet connection with good speed.
  3. Good headphones. I prefer USB headphones because they are less prone to loose connections. Also, since I walk around my desk while teaching, I use headphones with a long wire.
  4. During online classes, I record every session. I keep multiple recordings so that if one fails, there is another.
  5. For teaching a class, I prefer two screens – one shared with the class, another for the questions window. An additional monitor connected to my MacBook works perfectly for this.
  6. I keep a cooling pad for my laptop because, during longer sessions, laptops tend to heat up.
  7. A remote presenter helps me change slides while I am walking around my desk.
  8. A drawing pad and whiteboard software. Get whiteboard software along with a drawing pad and make yourself comfortable with it. It is very much required in class.
  9. I always prefer a LAN cable over wifi.
  10. A one-to-many USB converter, since I have a lot of hardware connecting to my laptop.

14. Avoid Distractions

Teaching a class requires extreme focus and a zero-distraction environment. I teach in a closed room and face a wall while teaching.

15. It is about the learner, not you

While we are on the topic of distractions, teachers often want to show their face on the screen all the time. I have different views on that.

I found that showing my face distracts learners from looking at my screen, where I might be drawing to depict a concept or showing a presentation.
I believe it is fine to show yourself during an introductory session, or an introductory hour, and no more than that. Also, showing your face consumes bandwidth. It is safer to just focus on people’s questions.

16. Address individually in the class

Always address everyone individually in the class; students definitely feel more connected when you do. Also, in the first session, I let everyone introduce themselves. This helps learners connect with each other.

17. Seamless process for conducting a class

Make a good process for conducting the session. Here are some of my practices:

  1. Have at least one more team member on the panel during a session. This helps solve the administrative issues faced by users.
  2. I really like online meeting tools. Even in an offline class, I ask every attendee to join an online meeting. It makes it easy to see everyone’s screen and to share code and commands.

18. Have Online Auto Assessment with LMS

An LMS, or Learning Management System, is an essential part of our online learning exercises. There are quite a few learning management systems available for free online, for example, Moodle. Even for evaluating subjective questions, LMS plugins are hugely resourceful. We’ve created a module for auto assessment of coding exercises too! Feel free to use it.

19. Be regular and punctual

Never miss a class, and be on time. This is something I learnt the hard way. I realized that if I am late by even one minute, students will be late by five minutes in subsequent classes, and it only gets worse from there.

It is important to be available at least an hour before the class starts. This gives you ample time for preparation. For an 8 pm class, I am available by 7 pm.

20. Make notes and improve every time

During every class, I make quick notes, and as the class ends, I formalize them. These notes point me towards the improvements needed for further sessions, helping me constantly improve the quality of my course. Also, I have a system in place for feedback: a feedback survey is sent to all students as soon as a class ends.


How To Optimise A Neural Network?

When we solve an industry problem involving neural networks, we very often end up with bad performance. Here are some suggestions on what to do to improve it.

Is your model underfitting or overfitting?

You must break down the input data set into two parts – training and test. The general practice is to have 80% for training and 20% for testing.

You should train your neural network with the training set and test with the testing set. This sounds like common sense but we often skip it.

Compare the performance (MSE in case of regression and accuracy/f1/recall/precision in case of classification) of your model with the training set and with the test set.

If it performs badly on both the training and test sets, it is underfitting; if it performs great on the training set but not on the test set, it is overfitting.
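
This comparison can be sketched as a tiny helper function. The thresholds here (85% accuracy as "good", a 10-point train/test gap) are illustrative assumptions, not fixed rules:

```python
def diagnose(train_score, test_score, good=0.85, gap=0.10):
    """Rough diagnosis from train/test accuracy (thresholds are hypothetical)."""
    if train_score < good and test_score < good:
        return "underfitting"          # bad on both sets
    if train_score - test_score > gap:
        return "overfitting"           # great on train, poor on test
    return "ok"

print(diagnose(0.62, 0.60))   # → underfitting
print(diagnose(0.98, 0.71))   # → overfitting
print(diagnose(0.91, 0.88))   # → ok
```

In practice you would feed in whichever metric suits your problem (MSE for regression, accuracy/F1 for classification) and tune the thresholds to your domain.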

In case of Underfitting

If the performance on the test set is continuously improving over the iterations or epochs, it means you need to increase the iterations/epochs. If training is taking too much time, you may want to use GPUs. You can also try an optimizer such as Adam instead of plain gradient descent.
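
To see why an optimizer like Adam can help, here is a toy comparison of plain gradient descent and a hand-rolled Adam update minimizing the one-dimensional function f(x) = (x − 3)². The learning rates and step counts are illustrative assumptions, not recommended settings:

```python
import math

def grad(x):                  # gradient of f(x) = (x - 3)^2
    return 2.0 * (x - 3.0)

# Plain gradient descent: fixed step in the negative gradient direction
x = 0.0
for _ in range(500):
    x -= 0.01 * grad(x)

# Adam: keeps running averages of the gradient (m) and its square (v)
y, m, v = 0.0, 0.0, 0.0
lr, b1, b2, eps = 0.1, 0.9, 0.999, 1e-8
for t in range(1, 501):
    g = grad(y)
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)          # bias correction
    v_hat = v / (1 - b2 ** t)
    y -= lr * m_hat / (math.sqrt(v_hat) + eps)

print(x, y)   # both approach the minimum at x = 3
```

The point of the sketch is that Adam adapts its effective step size per parameter; in a real network you would simply select the optimizer in your framework rather than implement it by hand.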

If the performance isn’t improving, it means you have a true case of underfitting. In such cases, there are three possibilities:

  1. Insufficient data
  2. No correlation in data – random data
  3. You need a better model

If the data is insufficient, you can do the following:

  • You can generate more data from what you have. This is called data augmentation. For example, you could take more pictures from different angles, reshape them a bit, apply colour filters, remove some pixels from the borders, etc.
  • You can download similar data from the internet. Say you want to build a neural network to recognize the faces in your office. You can download more pictures of faces from across the globe, first train the model on those faces, and then train it further using the faces from your office.
  • You can download a pre-trained neural network, add a layer on top of it, and train it further using your data.
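
As a toy illustration of the simplest kind of augmentation, here is a horizontal flip applied to a tiny 2×3 “image” represented as nested lists. Real pipelines would use an image library with rotations, crops, and colour jitter, but the idea is the same: each transform yields a new, valid training example.

```python
def hflip(img):
    """Mirror an image (a list of pixel rows) left-to-right."""
    return [row[::-1] for row in img]

dataset = [
    [[1, 2, 3],
     [4, 5, 6]],
]

# Original images plus their mirrored copies: the dataset size doubles
augmented = dataset + [hflip(img) for img in dataset]
print(len(augmented))     # → 2
print(augmented[1][0])    # → [3, 2, 1]
```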

If there is no correlation in the data, you can’t do much. You can just recheck the labels. A common error is label mismatch. Imagine two files, one containing the features and the other containing the labels, that end up in different orders, or where a single skipped line in either file shifts every label after it. So, recheck that the labels are in the same order as the features. Also, check with the data gathering team whether there is something wrong with the data.
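
A minimal sketch of such a sanity check, assuming a hypothetical format in which both files carry a shared id column:

```python
def check_alignment(feature_rows, label_rows, key=0):
    """Verify features and labels line up by a shared id column."""
    if len(feature_rows) != len(label_rows):
        return False                    # a skipped line shifts everything after it
    return all(f[key] == l[key] for f, l in zip(feature_rows, label_rows))

features = [("img_001", 0.3), ("img_002", 0.7), ("img_003", 0.1)]
labels   = [("img_001", "cat"), ("img_003", "dog"), ("img_002", "cat")]
print(check_alignment(features, labels))   # → False (rows are in different orders)
```

If your files have no shared id, spot-check a handful of rows by hand instead; a check like this only works when some key ties the two files together.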

The last case, where you need to improve the model, is the hardest. In the case of neural networks, you can do the following:

  • Add more layers
  • Add more neurons to the fully connected/dense layers, though adding more layers is usually preferable to adding more neurons
  • Add more filters
  • Experiment with different strides
  • Add ReLU if you aren’t using it already
  • If you have the vanishing or exploding gradients problem,
    • Use batch normalization
    • Try initializing the weights using the xavier_initializer or other heuristics
    • Also, try gradient clipping
  • Normalize the features using either min-max scaling or standardization
  • Try normalizing the labels too, though this is usually not the first thing to try
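
The two feature-scaling options in the list can be sketched in a few lines of plain Python (real code would typically use a library scaler; this just shows the arithmetic):

```python
def min_max_scale(xs):
    """Map values linearly into [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def standardize(xs):
    """Shift to zero mean and scale to unit variance."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    return [(x - mean) / var ** 0.5 for x in xs]

heights = [150.0, 160.0, 170.0, 180.0, 190.0]
print(min_max_scale(heights))   # → [0.0, 0.25, 0.5, 0.75, 1.0]
print(standardize(heights))     # zero mean, unit variance
```

Min-max scaling is sensitive to outliers (a single extreme value squashes everything else), while standardization is usually the safer default for neural network inputs.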

In case of overfitting

If you notice that your model is overfitting, you should apply regularization and also make sure that you shuffle the training set at every iteration so that every batch is different every time.

For regularization, you can use L1 or L2 regularization or a dropout layer.
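
As a sketch of what dropout and per-epoch shuffling do under the hood (frameworks provide both as built-ins; this is purely illustrative), here is inverted dropout on a vector of activations:

```python
import random

def dropout(activations, rate, training=True):
    """Inverted dropout: zero a fraction `rate` of units and scale survivors up."""
    if not training or rate == 0:
        return activations[:]          # at inference time, pass through unchanged
    keep = 1.0 - rate
    return [a / keep if random.random() < keep else 0.0 for a in activations]

random.seed(0)
acts = [1.0] * 10
out = dropout(acts, rate=0.5)          # survivors are scaled to 1.0 / 0.5 = 2.0
print(out)

# Shuffle the training set every epoch so each batch differs between epochs
data = list(range(8))
for epoch in range(2):
    random.shuffle(data)
```

Scaling the surviving activations by 1/(1 − rate) keeps their expected sum the same, which is why the dropout layer can simply be switched off at inference time.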

These are my quick notes. Feel free to let us know if you observe any errors in this post.

If you liked this post, share it with your friends.

10 Things to Look for When Choosing a Big Data course / Institute

Every now and then, I see a new company coming up with Hadoop classes/courses, and my friends keep asking me which of these courses is good to take. I gave them a few tips for choosing the course best suited to them. Here are those tips to help you decide which course to attend:

1. Does the instructor have domain expertise?

Know your instructor’s background. Has (s)he done any big-data-related work? I have seen a lot of people who just attend a course somewhere and become instructors.

If the instructor never worked in the domain, do not take such classes. Also, avoid training institutes that do not tell you details about the instructor.

2. Is the instructor hands on? When did she/he code last time?

In the domain of technology, there is a humongous difference between an instructor who is hands-on with code and one who delivers from theoretical knowledge alone. Also, find out when the instructor last wrote code. If the instructor has never coded, do not attend the class.

3. Does the instructor encourage & answer your questions?

There are many recorded free videos available across the internet. The only reason you would go for live classes would be to get your questions answered and doubts cleared immediately.

If the instructor does not encourage questions and answers, such classes are fairly useless.

Continue reading “10 Things to Look for When Choosing a Big Data course / Institute”

6 Reasons Why Big Data Career is a Smart Choice

Confused about whether to take up a career in Big Data? Planning to invest your time in getting certified and acquiring expertise in related frameworks like Hadoop, Spark, etc., and worried that you are making a huge mistake? Just spend a few minutes reading this blog and you will get six reasons why selecting a career in big data is a smart choice.

Why Big Data?

There are several people out there who believe that Big Data is the next big thing that will help companies rise above others and position themselves as best in class in their respective sectors.

Companies these days generate a gigantic amount of information, irrespective of which industry they belong to, and they need to store the data being generated so that it can be processed without missing important information that could lead to a new breakthrough in their sector. Atul Butte, of the Stanford School of Medicine, has stressed the importance of data by saying, “Hiding within those mounds of data is the knowledge that could change the life of a patient, or change the world”. And this is where Big Data analytics plays a very crucial role.

With the use of Big Data platforms, a gigantic amount of data can be brought together and processed to uncover patterns that help a company make better decisions, grow, increase its productivity, and create value in its products and services.

Continue reading “6 Reasons Why Big Data Career is a Smart Choice”

One Day Machine Learning Bootcamp at IITB – CloudxLab

Our past two Bootcamps on Machine Learning, at the National University of Singapore and R.V. College of Engineering, were very interesting, and all the attendees found them very useful. This feedback prompted us to conduct more Bootcamps like these.

Thanks to Prof. Alankar, who invited us to conduct yet another Machine Learning Bootcamp at the Indian Institute of Technology, Bombay. Before we move on to the details of the Bootcamp, let us give you a brief introduction to Prof. Alankar. He is an Assistant Professor in the Mechanical Engineering Department at IIT Bombay and works in the area of Multiscale Modeling of Deformation. He is a graduate of IIT Roorkee, holds a master’s degree from the University of British Columbia (Canada) and a doctoral degree from Washington State University (USA). He has previously worked at the Max Planck Institute (Germany), Los Alamos National Laboratory (USA), and Modumetal, Inc (USA).

Machine Learning Bootcamp

So it all happened on Mar 17, when Machine Learning enthusiasts, including professors and students from every branch of IIT, gathered to attend the one-day workshop on Machine Learning. The presenter was none other than Mr. Sandeep Giri, who has over 15 years of experience in the domains of Machine Learning and Big Data technologies. He has worked at companies like Amazon, InMobi, and D. E. Shaw.

Continue reading “One Day Machine Learning Bootcamp at IITB – CloudxLab”

A Successful Machine Learning Bootcamp by CloudxLab

CloudxLab has hosted several webinars in the past, and all of them have been successful. But this time we thought we would try something different, so we all sat together and decided to do an offline meetup for Machine Learning. Though we had done some in the past, the engagement and interaction of an in-person session are simply not comparable to an online webinar. We then got in touch with Drupal Bangalore, which was holding an event at R.V. College of Engineering, and one of the topics was Introduction to Machine Learning. We found this a good opportunity to bring our knowledge into the offline circle too.

Machine Learning Bootcamp

So it all happened on Nov 17, when Machine Learning enthusiasts gathered to attend the one-day workshop on Machine Learning. The presenter was none other than Mr. Sandeep Giri, who has over 15 years of experience in the domains of Machine Learning and Big Data technologies. He has worked at companies like Amazon, InMobi, and D. E. Shaw.

Continue reading “A Successful Machine Learning Bootcamp by CloudxLab”

What is Big Data? An Easy Introduction to Big Data Terminologies

Unless you’ve been living under a rock, you must have heard or read the term – Big Data. But many people don’t know what Big Data actually means, and even those who do are often not clear on its definition. If you’re one of them, don’t be disheartened. By the time you finish reading this article, you will have a clear idea about Big Data and its terminology.

What is Big Data?

In very simple words, Big Data is data of such a big size that it cannot be processed with the usual tools like file systems and relational databases. To process such data we need a distributed architecture – in other words, multiple systems working together to achieve a common goal.

Continue reading “What is Big Data? An Easy Introduction to Big Data Terminologies”