Phrase matching using Apache Spark

Recently, a friend whose company is working on a large-scale project reached out to us for a solution to a seemingly simple problem: finding a list of phrases (approximately 80,000) in a huge set of rich text documents (approximately 6 million).

At first, the problem looked simple. The engineers had solved it by loading the two datasets into Apache Spark DataFrames and joining them using “like”. Something along these lines:

select * from phrases, docs where docs.txt like concat('%', phrases.phrase, '%')

But it was taking a huge amount of time even on a small subset of the data, even though the processing was done in a distributed fashion. Any guesses why?

They had also tried Apache Spark’s broadcast mechanism on the smaller dataset, but even a small task still took a long time to finish.

So, how did we finally solve it? Here is one of my approaches. Please feel free to provide your input.

We first brought together the phrases and documents that have at least one word in common. Then we grouped the data by the pair of phrase id and document id. Finally, we filtered the results based on whether all of the words in the phrase are found in the document, in the same order.
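The three steps above can be sketched in plain Python. This is only an illustration, not the project's Spark code: the names `tokenize` and `match_phrases` are hypothetical, tokenization is a naive lowercase split on whitespace, and "in the same order" is assumed to mean that the phrase words appear consecutively in the document.

```python
from collections import defaultdict


def tokenize(text):
    # Naive tokenizer (an assumption): lowercase and split on whitespace.
    return text.lower().split()


def match_phrases(phrases, docs):
    """phrases: {phrase_id: text}, docs: {doc_id: text}.

    Returns the set of (phrase_id, doc_id) pairs where the phrase occurs
    in the document as a run of consecutive words.
    """
    # Step 1: index every document word as word -> [(doc_id, position)].
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for pos, word in enumerate(tokenize(text)):
            index[word].append((doc_id, pos))

    # Step 2: "join" phrases and documents on the word, grouping the hits
    # by (phrase_id, doc_id). Each hit implies a phrase start position,
    # pos - offset; we remember which word offsets support each start.
    groups = defaultdict(lambda: defaultdict(set))
    phrase_len = {}
    for phrase_id, text in phrases.items():
        words = tokenize(text)
        phrase_len[phrase_id] = len(words)
        for offset, word in enumerate(words):
            for doc_id, pos in index.get(word, ()):
                groups[(phrase_id, doc_id)][pos - offset].add(offset)

    # Step 3: keep a pair only if some start position is supported by
    # every word offset of the phrase, i.e. all words appear in order.
    return {
        pair
        for pair, starts in groups.items()
        if any(len(offsets) == phrase_len[pair[0]] for offsets in starts.values())
    }


phrases = {1: "apache spark", 2: "spark apache", 3: "big data"}
docs = {
    10: "We use Apache Spark daily",
    20: "spark and apache are words",
    30: "big data with apache spark",
}
print(sorted(match_phrases(phrases, docs)))
```

In Spark, the same idea would roughly correspond to exploding both datasets into words, joining on the word, and grouping by the (phrase id, document id) pair; the dictionaries here only stand in for those distributed steps.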

You can take a look at the project here. The Scala version is not yet finished, though the Python version is done.

You may be wondering whether this is really faster and, if so, what makes it faster.

Suppose you have m phrases and n documents, where each phrase has w words and each document has k words.

With the “like” join, the total complexity will be of the order of m*w * n * k, because each word from the phrases is compared with each word in the documents.
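For contrast, the cross-join that the “like” query performs can be mimicked in a few lines of plain Python. This is a hypothetical baseline (the name `naive_match` is not from the project) and it uses a case-insensitive substring test as a stand-in for the SQL pattern match.

```python
def naive_match(phrases, docs):
    """phrases: {phrase_id: text}, docs: {doc_id: text}.

    Compares every phrase against every document, like the cross join
    with a '%phrase%' pattern: m * n substring scans in total.
    """
    matches = set()
    for phrase_id, phrase in phrases.items():
        for doc_id, text in docs.items():
            # Substring scan over the whole document for every phrase.
            if phrase.lower() in text.lower():
                matches.add((phrase_id, doc_id))
    return matches


phrases = {1: "apache spark", 2: "big data"}
docs = {10: "We use Apache Spark daily", 20: "Nothing relevant here"}
print(sorted(naive_match(phrases, docs)))
```

Every one of the m * n pairs is examined regardless of whether the phrase and the document share a single word, which is the work the word-level join avoids.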

The complexity of our approach is not as straightforward to compute. Let me try.

First, it is going to sort the data. The total number of words is m*w + n*k. Let’s call it W.

W = m*w + n*k

The complexity of sorting it is: W log W

Then we are going to sort the data based on (phrase id, document id). If every phrase were found in every document, there would be a total of m * n records to be sorted:

m*n log (m*n)

In practice, the number of records is going to be far smaller and can be approximated by n. So, the final sorting will take approximately: n * log(n)

We can safely ignore the other processing steps as those are linear. In the worst case, the overall complexity, or time consumption, is going to be of the order of:

(m*w + n*k) log(m*w + n*k)  +  m*n log (m*n)

which is far better than m*w * n * k.
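To get a feel for the difference, the two expressions can be evaluated for the problem size mentioned at the start (80,000 phrases and 6 million documents). The words per phrase (w = 3) and words per document (k = 500) below are assumptions picked only for illustration, as is the base-2 logarithm; the function names are hypothetical.

```python
import math


def naive_cost(m, w, n, k):
    # Every phrase word compared with every document word.
    return m * w * n * k


def sort_based_cost(m, w, n, k, matched_pairs):
    # Sort all words, then sort the matched (phrase id, document id) pairs.
    total_words = m * w + n * k
    return (total_words * math.log2(total_words)
            + matched_pairs * math.log2(matched_pairs))


m, n = 80_000, 6_000_000   # phrases and documents, from the problem statement
w, k = 3, 500              # assumed words per phrase and per document

naive = naive_cost(m, w, n, k)
# Matched pairs approximated by n, as in the estimate above.
ours = sort_based_cost(m, w, n, k, matched_pairs=n)

print(f"naive join:  {naive:.2e} operations")
print(f"sort-based:  {ours:.2e} operations")
print(f"ratio:       {naive / ours:.0f}x")
```

These are order-of-magnitude counts rather than running times; constant factors, shuffles, and data skew in a real Spark job will change the actual numbers.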

I hope you find it useful. Please visit CloudxLab to see various courses and lab offerings.


Scholarship Test for Machine Learning Course

After receiving a huge response to our last scholarship test, we are back with a basic conceptual test for scholarships to our upcoming Specialization course on Machine Learning and Deep Learning.

Concepts to be tested: Linear algebra, probability theory, statistics, multivariable calculus, algorithms and complexity, aptitude and Data Interpretation.

  • Date and Time: September 2, 2018, 8:00 am PDT (8:30 pm IST)
  • Type: objective (MCQ)
  • Number of questions: 25
  • Duration: 90 minutes
  • Mode: Online

If you have a good aptitude and general problem-solving skills, this test is for you. So, go ahead and earn what you deserve.

If you have any questions on the test or if anything else comes up, just click here to let us know. We’re always happy to help.


How to Teach Online Effectively


I founded KnowBigData (now CloudxLab) in 2014 after working at Amazon. Teaching is my passion, and technology, specifically large-scale computing, is my forte, thanks to my working experience with Amazon, InMobi, D. E. Shaw and my own startup tBits Global. Therefore, I wanted to help people learn technology online. I launched an online instructor-led training on MongoDB, followed by Big Data and Machine Learning. Eventually, we innovated a lot in learning and shaped KnowBigData into CloudxLab, which is currently a major gamified learning environment for Machine Learning, AI, and Big Data.

Teaching online has a lot of advantages. You can teach from anywhere in the world, your students can be from any corner of the globe, you save your daily travel time, and you get to talk to students across the world. By teaching, you become more compassionate to the learner and you work hard on self-learning every aspect of the subject that you are teaching. Teaching also helps you improve your ability to express.

Since Apr 2014, when I started KnowBigData (now CloudxLab), I have taught for around 2000 hours (4 years * 52 weeks * 3 classes avg. * 3 hours) and reached more than 3000 people. I have received wonderful reviews, with an average rating above 4.8 out of 5. You can read our public reviews here: (Quora, Quora, Quora, Quora, Facebook, Facebook)

I would love to share with you my learnings that may help you in teaching effectively online. If you are able to follow these and are interested in teaching with CloudxLab, please apply here.

1. Deep Dive into the Subject

The first and foremost thing to do is to make sure you know the subject you are about to teach. That may need new learning, re-learning, or even a certain amount of unlearning. Let go of your ego that you know everything about your subject. Pick up some of the best books, go through each topic, and know each topic in and out. Challenge yourself with questions. Be prepared with the thought that ‘the learner may ask a question on this’ and get yourself ready.

The way I stay on top of my knowledge base is by keeping a list called “To Learn.” Any topic that I need to study goes into this list. During my research, I pick topics from the top of the list, and if I come across subtopics which need brushing up, I append them to the master list. In computing, we call this the ‘breadth-first’ approach, but it is an incredible approach for deep diving into a topic. There are often dark spots in our learning that we tend to avoid. It is important to be strong enough to identify those worrisome dark spots and study them. Nothing beats the thrill of learning something we have been avoiding for long.

2. Prepare Slides Well

While teaching, it becomes really easy to explain if you have a good set of slides. Slides guide you and keep you on track during your teaching sessions. A good set of slides should have less text, more graphics and bullet points.

If you have too much text on any slide, break it into multiple slides and work hard on what can be removed. Sacrifice grammatical completeness for brevity. Slides should also detail everything about the topic: the prerequisites, concepts and the references from where the learner can learn more.

Add images to slides. Images should be used in the following order of preference: Humans, Animals, Things, Text/Block diagrams. If you have to choose between the image of a man and the image of an animal for your slides, use the man’s photo.

Similarly, give preference to photos of things over the screenshot of text or block diagrams.

It is also advisable to share the topic deck with learners in advance.

3. Hands-on approach first

A lot of teachers may not agree with this point, but I’ve found it particularly useful. The implementation of this point is also difficult. Have your students work hands-on on something before jumping to the theory. The idea is to make the class work on practicals before you teach. Teaching concepts can be boring for learners. Let the learner figure out the concepts as much as they can instead of a teacher spoon-feeding them. For example, if I’m teaching a class how to code, I would ask my class to just follow the steps and get the first very basic code running. It is amazing to see how much a class can grasp the basics just by running the first code.

This is exactly why our Machine Learning course has the End-to-End project right at the beginning and not at the end of the course. That turned out to be the best way to teach my class.

When we were researching how to develop our courses, we realized that students were hesitant to try hands-on exercises because setting up the environment was time-consuming, needed high-end hardware and installation permissions, and consumed too many resources on the machine. Therefore, we decided to set up a shared online lab, which solved all these challenges, changed the face of online learning for good, and became a product in itself. This lab is available 24/7 to all users. The hugely positive response to our lab convinced us to rebrand to CloudxLab.

4. Answer every question

There is no dearth of study material on the internet. Text or videos on topics are widely available for learners. What paralyzes self-study are unanswered questions. When learners face questions, they stop to look for answers but eventually get lost in the online noise. That is precisely why students join instructor-led classes – to get their questions answered. Questions don’t really delay a course. They become the purpose of a class.

The key to happy learners in a class is to answer their questions and encourage them to ask more. Here is my way of handling questions.

First, the following are clarified in the first few classes:

  • Q&A is the purpose of the class and not an obstruction
  • It is okay to delay a class but not okay to not ask questions.
  • Listen to the questions from other classmates. Try to answer or make a hypothesis. Never turn down your classmate’s questions.
  • No question is stupid. I repeat. No question is stupid.
  • Often, questions asked in a class are valid interview questions.  So, pay attention to the questions asked by your fellow-learners.

Second, I always acknowledge a question before answering it. I explicitly mention “It is a good question” and then I try to answer the question with real-life examples from my career. Some of you may be familiar with ‘Oss,’ the greeting Karate students and teachers use while bowing to each other. It is not only an affirmation of positive spirits; ‘Oss’ is also a mark of mutual respect and admiration. In a similar fashion, before you answer a question, you should bow and say “Good Question.”

During a certain corporate training, I noticed that there were no questions in class because the class had a mix of teams including managers and reportees. The learners were either shy to ask questions or afraid of being judged. Since it was an online class, I asked everyone to use pseudonyms while joining the webinar. The strategy helped, and we had a lot of questions to answer during the entire session.

5. Work on your accent

For a lot of instructors, the accent can be a problem especially if their students are scattered across the globe. I’ve grown up in a small village in India, and therefore, I needed to work on my accent to deliver my classes to students worldwide.

The best way I found to improve my accent was by attending sales calls from all over the globe. I would talk in detail to prospective students and lab users and figure out the areas where I needed to improve. This turned out to be the most effective way to improve my pronunciation. Though it was a lot of hard work considering time zone differences, the exercise turned out to be extremely fruitful. You may want to compare the oldest video from KnowBigData on our YouTube channel with the latest video from CloudxLab to see the positive changes.

If there is a different method that works for you, feel free to implement it. Nevertheless, hold 1-1 discussions with your learners and respond to sales and service calls.

Also, in conversation, be slow and clear, and ask if you were clear and the listeners understood you. Be mindful of every word that you speak. Avoid fillers such as ‘umm’ and ‘aahs’. If you are mindful of your words and how you speak them, your instruction will be crisp in class.

6. Repeat with a different selection of words

A number of times, you may not be very clear to your audience. Often, words we use may have different connotations in different countries. In a class with students from all over the world, it becomes inevitable to explain repeatedly using a different selection of words. I follow this strategy. I repeat the same sentence using a different selection of words. Keeping the meaning intact, I choose another set of words. (I just did it:))

7. Avoid monotone

The word ‘monotonous’ comes from ‘monotone’ which means sound of the same tone. And monotonous is boring and sleep-inducing.

A voice artist once advised me to find a comfort zone of my voice during long speaking hours and to stay in that comfort zone without any modulation. That advice was good for a voice artist, but not for a teacher.

My rule of thumb is to keep varying my tone to prevent students from getting sleepy. Again, varying the tone doesn’t mean shouting. Keeping the acoustics conducive (good mic, small room) and simply altering the cadence of my voice to avoid monotone keeps the class alive and active.

Also, I make sure to have sips of lukewarm water while teaching. That keeps my throat from getting sore after hours of speaking.

8. Don’t sit

The human mind is amazing in that it can infer the state of an unseen speaker’s mind from sound alone. Note that my camera is turned off during a class. Nevertheless, my learners can figure out from my voice if I am feeling active or lazy, happy or angry. To have an active and responsive class, I stand and teach. I pace around my desk while teaching. I make sure that I wear the most comfortable shoes (preferably jogging shoes) during my class, even if I am delivering it from my bedroom.

To make this even easier, I have invested in a standing desk for my lectures.

9. Explain concepts by asking questions

This is probably the hardest part of teaching. In order to explain a concept, break it down into a series of questions. You need to come up with a set of questions such that the answers to these questions lead to the understanding of the concept. For example, to explain the concept of ‘refraction’ in physics, you should break it down into questions like:

  1. How do we see an object?
  2. If there is some other material between us and the object, can we still see it?
  3. If that “material” is glass, can we see it?
  4. Is there some change in the view when we see an object through glass?
  5. What happens when we put a pen in a glass half full with water?

By the end of this set of questions, students are naturally led towards understanding the concept. Students retain concepts longer and better by deducing them than by direct instruction.

In addition, while answering a question from the audience, it is always advisable to induce curiosity by giving an open-ended answer that sparks off another question leading to a chain until the concept is clear.

Let me try to give you an example. When I was with D.E.Shaw, I delivered tech talks on Secure Programming. I would show a piece of code to my audience and ask them to find out what was wrong with it.  The audience would come up with many flaws and I would note all of them on a whiteboard. Next, I would ask them how they would fix these flaws. A few more suggestions for fixes would lead to more suggestions for fixing the code and eventually, the chain of flaws and fixes would clarify the entire concept. It is flattering to me that my audience still remembers my Secure Programming classes even today.

Breaking a large concept into a series of questions is hard, and it might seem time-consuming in class, but the positive impact lasts longer. As a passionate teacher, I am absolutely okay with taking even five or six hours to teach a tiny concept. This is one of the reasons that my class expanded to twice the promised duration, and the students were happy to cooperate because the concepts were crystal clear to them.

10. Smile

Here is a quick test. Notice your state of mind now. And then smile. Yes, smile. Smile for no reason. Think of a beautiful moment and force yourself into a smile. How does it feel? Compare your state of mind with the previous moment. Do you notice a change? I am sure you will.

A smile is a great hack to change your state of mind. While it is true that you smile when you are happy, it is also true that you turn happy when you smile. And when you are happy, you are more capable of solving problems.

I smile before my class starts. The smile shows up as enthusiasm in my voice and peps up my students. I keep a sticky note at my standing desk that reads “Smile”.

11. Lifestyle changes

This may be very specific to me but I find that eating fruits such as pears and apples makes me happier, more active, energetic, and less sleepy. Caffeine doesn’t work for me. The purpose is to stay energetic during a class. You may choose what works for you.

The more energetic you are the better you sound and your students will reflect your enthusiasm and positivity.

12. Avoid alcohol

You should not drink for at least 6 hours before a class. Being able to teach for three hours straight requires extreme focus. And you can’t achieve that focus if you are under the influence of alcohol.

Also, avoid everything that can cause a cold or a cough or affect your throat adversely. I avoid ice creams and smoking.

13. Have a Good Setup

Here are the details of my setup that helps me deliver my classes stress-free:

  1. A power backup, specifically UPS for both my computer and the internet router. Extremely important.
  2. A backup internet connection with a good speed
  3. Good headphones. I prefer USB headphones because they have lower chances of loose connections. Also, since I walk around my desk while teaching, I have headphones with a longer wire.
  4. During online classes, I record every session. I keep multiple recordings so if one fails, there is another.
  5. For teaching a class, I prefer two screens – one shared with the class, another for the questions window. An additional monitor connected to my MacBook works perfectly for this.
  6. I keep a cooling pad for my laptop because, during longer sessions, laptops tend to heat up.
  7. A remote presenter helps me change slides if I am walking around my desk while teaching.
  8. A drawing pad and whiteboard software. Get whiteboard software along with a drawing pad and make yourself comfortable with it. It is very much required in the class.
  9. I always prefer LAN cable over wifi.
  10. A one-to-many USB converter since I have a lot of hardware connecting to my laptop.

14. Avoid Distractions

Teaching a class needs a zero-distraction environment. It requires extreme focus and absolutely no distractions. I teach in a closed room and face a wall while teaching.

15. It is about the learner, not you

While we are on the topic of distractions, teachers often want to show their face on the screen all the time. I have different views on that.

I found that showing my face distracts users from looking at my screen where I might be drawing to depict a concept or showing a presentation.
I believe it is fine to show yourself during an introductory session, or an introductory hour, and no more than that. Also, showing the face could consume the bandwidth. It is safer to just focus on people’s questions.

16. Address individually in the class

Always address everyone individually in the class. Students feel more connected when you do. Also, in the first session, I let everyone introduce themselves. This helps learners connect with each other.

17. Seamless process for conducting a class

Make a good process for conducting the session. Here are some of the practices:

  1. Have at least one more team member in the panel during a session. This helps to solve the administrative issues faced by users.
  2. I really like online meeting tools. Even in an offline class, I ask every attendee to join an online meeting. It makes it easy to see everyone’s screen and share the code and commands.

18. Have Online Auto Assessment with LMS

LMS or a Learning Management System is an inevitable part of our online learning exercises. There are quite a few learning management systems available for free online, for example, Moodle. Even for evaluating subjective questions, LMS plugins are hugely resourceful. We’ve created a module for auto assessment of coding exercises too! Feel free to use it.

19. Be regular and punctual

Never miss a class and be on time. This is something I learnt the hard way. I realized that if I am late by even one minute, students will be late by 5 minutes in subsequent classes, and it only gets worse from there.

It is important to be available at least an hour before the class starts. This gives you ample amount of time for preparation. For an 8 pm class, I am available by 7 pm.

20. Make notes and improve every time.

During every class, I make quick notes and, as the class ends, I formalize those notes. These notes point me to the improvements needed for further sessions, which helps me constantly improve the quality of my course. Also, I have a system in place for feedback. A feedback survey is sent to all students as soon as a class gets over.


Introduction to Apache Flume in 30 minutes

What is Apache Flume?

Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of data from many different sources to a centralized data store.

Flume supports a large variety of sources, including:

  • tail (like unix tail -f),
  • syslog,
  • log4j – allowing Java applications to write logs to HDFS via flume

Flume Nodes

Flume nodes can be arranged in arbitrary topologies. Typically there is a node running on each source machine, with tiers of aggregating nodes that the data flows through on its way to HDFS.

Topics Covered

  • What is Flume
  • Flume: Use Case
  • Flume: Agents
  • Flume: Use Case – Agents
  • Flume: Multiple Agents
  • Flume: Sources
  • Flume: Delivery Reliability
  • Flume: Hands-on

Introduction to Flume Presentation


Please feel free to leave your comments in the comment box so that we can improve the guide and serve you better. Also, follow CloudxLab on Twitter to get updates on new blogs and videos.

If you wish to learn Hadoop and Spark technologies such as MapReduce, Hive, HBase, Sqoop, Flume, Oozie, Spark RDD, Spark Streaming, Kafka, DataFrames, SparkSQL, SparkR, MLlib and GraphX, and build a career in the BigData and Spark domain, then check out our signature course on Big Data with Apache Spark and Hadoop, which comes with:

  • Online instructor-led training by professionals having years of experience in building world-class BigData products
  • High-quality learning content including videos and quizzes
  • Automated hands-on assessments
  • 90 days of lab access so that you can learn by doing
  • 24×7 support and forum access to answer all your queries throughout your learning journey
  • Real-world projects
  • A certificate which you can share on LinkedIn

Machine Learning with Mahout

[This blog is from It is pretty old. Many things have changed since then. People have moved to MLLib. We have also moved to]

What is Machine Learning?

Machine Learning is programming computers to optimize a performance criterion using example data or past experience. It is a branch of Artificial Intelligence.

Types of Machine Learning

Machine learning is broadly categorized into three buckets:

  • Supervised Learning – Using labeled training data to create a classifier that can predict the output for unseen inputs.
  • Unsupervised Learning – Using unlabeled training data to discover structure in the data, such as groups of similar items.
  • Semi-Supervised Learning – Making use of unlabeled data for training, typically a small amount of labeled data with a large amount of unlabeled data.

Machine Learning Applications

  • Recommend Friends, Dates, Products to end-user.
  • Classify content into pre-defined groups.
  • Find Similar content based on Object Properties.
  • Identify key topics in large Collections of Text.
  • Detect Anomalies within given data.
  • Ranking Search Results with User Feedback Learning.
  • Classifying DNA sequences.
  • Sentiment Analysis/ Opinion Mining
  • Computer Vision.
  • Natural Language Processing.
  • BioInformatics.
  • Speech and HandWriting Recognition.


Mahout means keeper/driver of elephants. Mahout is a scalable Machine Learning library built on Hadoop and written in Java. It is driven by Ng et al.’s paper “MapReduce for Machine Learning on Multicore”. Development of Mahout started as a Lucene sub-project, and it became an Apache TLP (top-level project) in Apr’10.

Topics Covered

  • Introduction to Machine Learning and Mahout
  • Machine Learning- Types
  • Machine Learning- Applications
  • Machine Learning- Tools
  • Mahout – Recommendation Example
  • Mahout – Use Cases
  • Mahout Live Example
  • Mahout – Other Recommender Algos

Machine Learning with Mahout Presentation

Machine Learning with Mahout Video

A Successful Machine Learning Bootcamp by CloudxLab – Singapore

CloudxLab has conducted many successful online events on Machine Learning and Big Data, because it is relatively easy to host many attendees simultaneously. Furthermore, it eliminates the need for a tiring visit to the event location. One can simply log in from the comfort of one’s house and start learning.

Sure, online events have their own perks. But that hasn’t stopped us from conducting offline events. Our Machine Learning session at R. V. College of Engineering was one such successful example.

This time, we wanted to conduct a slightly bigger event. Therefore, CloudxLab joined hands with IOTSG and NUS Enterprise (National University of Singapore) to organize another successful Machine Learning Bootcamp.

The Venue

CloudxLab was organizing the Machine Learning Bootcamp for the first time in Singapore. To be frank, we were a little nervous as we did not know how welcoming the country would be. But all our minor doubts were cleared once we experienced the warm welcome from everyone there. So much so that we would like to do one more Bootcamp in Singapore in the near future.

The National University of Singapore was very cooperative in helping us organize the Bootcamp on their campus.


Machine Learning & IoT Bootcamp – Singapore

Have you ever wondered how you can apply various Machine Learning and IOT techniques for everyday business problems? Or, are you someone who has heard of Machine Learning but couldn’t get a chance to dig a little deeper?  If your answer is Yes, then you’ve come to the right place.

Cloudxlab is conducting a Machine Learning & IoT Bootcamp in Singapore.

  • Date: Saturday, Feb 10, 2018
  • Place: NUS Enterprise, #02-01, 71 Ayer Rajah Crescent, Singapore
  • Time: 9:30 AM to 5:00 PM

What will be covered?

An exposure to Machine Learning using Python to analyze, draw intelligence and build powerful models using real-world datasets. You’ll also gain the insights to apply data processing and Machine Learning techniques in real time.

After completing this workshop, you will be able to build and optimize your own automated classifier to extract insights from real-world data sets.


The Pursuit of Education – A Story of Strength

Today, we will not talk tech or discuss our regular tutorials. Instead, we will take you on a different journey – a journey about strength, a journey about hope, and a journey on life.

It was a regular working day for us when an email caught our attention. It was from an individual who had faced unimaginable hardships in his life but still hoped for a better future by pursuing his passion for learning.

His message was rather long and it clearly showed that he was in desperate need of a higher education. We thought he was a student, and we offered him the student’s discount on one of our self-paced courses on Big Data. But much to our surprise, he was not in a state to pay even the discounted price.

We were not clear on why he would be requesting a free course. However, we came to know about the kind of hardship that he had recently gone through, and about his real mission to move back to his native place and help poor and needy students by providing free education.

He was a Rohingya refugee and had lost his entire family in the recent clashes in Myanmar. He managed to survive the traumatic ordeal, but thinking of a new life was more of an impossible dream for him. However, he stepped up and decided to move on with his life.

He wanted to continue his education, and therefore, started to look out for a Big Data course that he could do for free because of his terrible financial crisis. He came across CloudxLab and got in touch with us explaining his situation. He also mentioned that he wanted to help the needy back in his country for which he needed to go through the course.

We were in awe of this person’s strength of mind. He came across as an epitome of strength, relentlessly following his dream despite all odds and ordeals.

We offered him our course at no cost, but we did not know how much it meant to him until he sent his reply:

I can’t explain my feelings in words how much happy I am now. You are an angel for me who help me to stand on my feet. Sir thank you for believing in me and giving me a chance to continue my dream. I promise I will do my best and complete the course as fast as I can. Thank you.

This is probably our biggest achievement as a team.

We salute this individual for his unthinkable strength in facing such a catastrophe in his life while nurturing a selfless desire to help others. We wish him good days ahead and hope that he completes his education and embarks on the journey to help his people.

Streaming Twitter Data using Flume

In this blog post, we will learn how to stream Twitter data using Flume on CloudxLab.

To download tweets from Twitter, we first have to configure a Twitter app.

Create Twitter App

Step 1

Navigate to Twitter app URL and sign in with your Twitter account

Step 2

Click on “Create New App”


Machine Learning Bootcamp – Introduction and Hands-on @ RV College of Engineering, Bangalore

We have a one-day workshop on Introduction to Machine Learning with Drupal Bangalore. In this workshop, you will learn how to apply various Machine Learning techniques for everyday business problems.

  • Date: Saturday, Dec 16, 2017
  • Place: R. V. College of Engineering, Bangalore
  • Time: 11.30 am – 1.30 pm: Presentation and Demo, 2.30 pm – 4.30 pm: Hands-on

What will be covered?

An exposure to Machine Learning using Python to analyze, draw intelligence and build powerful models using real-world datasets. You’ll also gain the insights to apply data processing and Machine Learning techniques in real time.

After completing this workshop, you will be able to build and optimize your own automated classifier to extract insights from real-world data sets.
