How to create an Apache Thrift Service – Tutorial


Say you come up with a wonderful idea such as a really great phone service. You would want this phone service to be available to the APIs in various languages. Whether people are using Python, C++, Java or any other programming language, the users should be able to use your service. Also, you would want the users to be able to access globally. In such scenarios, you should create the Thrift Service. Thrift lets you create a generic interface which can be implemented on the server. The clients of this generic interface can be automatically generated in all kinds of languages.

Let us get started! Here we are going to create a very simple service that just prints the server time.

Continue reading “How to create an Apache Thrift Service – Tutorial”

Use-cases of Machine Learning in E-Commerce

What computing did to the usual industry earlier, Machine Learning is doing the same to usual rule-based computing now. It is eating the market of the same. Earlier, in organizations, there used to be separate groups for Image Processing, Audio Processing, Analytics and Predictions. Now, these groups are merged because machine learning is basically overlapping with every domain of computing. Let us discuss how machine learning is impacting e-commerce in particular.

The first use case of Machine Learning that became really popular was Amazon Recommendations. Afterwards, the Netflix launched a challenge of Movie Recommendations which gave birth to Kaggle, now an online platform of various machine learning challenges.

Before I dive deep into the details further, lets quickly brief the terms that are found often confusing. AI stands for Artificial Intelligence which means being able to display human-like intelligence. AI is basically an objective. Machine learning is making computers learn based on historical or empirical data instead of explicitly writing the rules. Artificial Neural networks are the computing constructs designed on a similar structure like the animal brain. Deep Learning is a branch of machine learning where we use a complex Artificial Neural network for predictions.

Continue reading “Use-cases of Machine Learning in E-Commerce”

Phrase matching using Apache Spark

Recently, a friend whose company is working on large scale project reached out to us to seek a solution to a simple problem of finding a list of phrases (approximately 80,000) in a huge set of rich text documents (approx 6 million).

The problem at first looked simple. The way engineers had solved it is by simply loading the two documents in Apache Spark’s DataFrame and joining those using “like”. Something on these lines:

select, from phrases, docs where docs.txt like ‘%’ + phrases.phrase + ‘%’

But it was taking huge time even on the small subset of the data and processing is done in distributed fashion. Any Guesses, why?

They had also tried to use Apache Spark’s broadcast mechanism on the smaller dataset but still, it was taking a long while finishing even a small task.

Continue reading “Phrase matching using Apache Spark”

Scholarship Test for Machine Learning Course

After receiving a huge response in our last scholarship test, we are once again back with a basic conceptual test to attain scholarship for our upcoming Specialization course on Machine Learning and Deep Learning.

Concepts to be tested: Linear algebra, probability theory, statistics, multivariable calculus, algorithms and complexity, aptitude and Data Interpretation.

  • Date and Time: September 2, 2018, 8:00 am PDT (8:30 pm IST)
  • Type: objective (MCQ)
  • Number of questions: 25
  • Duration: 90 minutes
  • Mode: Online

If you have a good aptitude and general problem-solving skills, this test is for you. So, go ahead and earn what you deserve.

If you have any questions on the test or if anything else comes up, just click here to let us know. We’re always happy to help.


How to Teach Online Effectively


I founded in 2014 after working in Amazon. Teaching is my passion, and technology, specifically large-scale computing my forte, thanks to my working experience with Amazon, InMobi, D. E. Shaw and my own startup tBits Global. Therefore, I wanted to help people learn technology online. I launched, an online instructor-led training on MongoDB followed by Big Data and Machine learning. Eventually, we innovated a lot in learning and shaped KnowBigData into which is currently a major gamified learning environment for Machine Learning, AI, and Big Data.

Continue reading “How to Teach Online Effectively”

How To Optimise A Neural Network?

When we are solving an industry problem involving neural networks, very often we end up with bad performance. Here are some suggestions on what should be done in order to improve the performance.

Is your model underfitting or overfitting?

You must break down the input data set into two parts – training and test. The general practice is to have 80% for training and 20% for testing.

You should train your neural network with the training set and test with the testing set. This sounds like common sense but we often skip it.

Compare the performance (MSE in case of regression and accuracy/f1/recall/precision in case of classification) of your model with the training set and with the test set.

If it is performing badly for both test and training it is underfitting and if it is performing great for the training set but not test set, it is overfitting.

Continue reading “How To Optimise A Neural Network?”

Machine Learning with Mahout

[This blog is from It is pretty old. Many things have changed since then. People have moved to MLLib. We have also moved to]

What is Machine Learning?

Machine Learning is programming computers to optimize a Performance using example data or past experience, it is a branch of Artificial Intelligence.

Types of Machine Learning

Machine learning is broadly categorized into three buckets:

  • Supervised Learning – Using Labeled training data, to create a classifier that can predict the output for unseen inputs.
  • Unsupervised Learning – Using Unlabeled training data to create a function that can predict the output.
  • Semi-Supervised Learning – Make use of unlabeled data for training – typically a small amount of labeled data with a large amount of unlabeled data.

Machine Learning Applications

  • Recommend Friends, Dates, Products to end-user.
  • Classify content into pre-defined groups.
  • Find Similar content based on Object Properties.
  • Identify key topics in large Collections of Text.
  • Detect Anomalies within given data.
  • Ranking Search Results with User Feedback Learning.
  • Classifying DNA sequences.
  • Sentiment Analysis/ Opinion Mining
  • Computer Vision.
  • Natural Language Processing,
  • BioInformatics.
  • Speech and HandWriting Recognition.


Mahout – Keeper/Driver of Elephants. Mahout is a Scalable Machine Learning Library built on Hadoop, written in Java and its Driven by Ng et al.’s paper “MapReduce for Machine Learning on Multicore”. Development of Mahout Started as a Lucene sub-project and it became Apache TLP in Apr’10.

Topics Covered

  • Introduction to Machine Learning and Mahout
  • Machine Learning- Types
  • Machine Learning- Applications
  • Machine Learning- Tools
  • Mahout – Recommendation Example
  • Mahout – Use Cases
  • Mahout Live Example
  • Mahout – Other Recommender Algos

Machine Learning with Mahout Presentation

Machine Learning with Mahout Video

AutoQuiz: Generating ‘Fill in the Blank’ Type Questions with NLP

Can a machine create quiz which is good enough for testing a person’s knowledge of a subject?

So, last Friday, we wrote a program which can create simple ‘Fill in the blank’ type questions based on any valid English text.

This program basically figures out sentences in a text and then for each sentence it would first try to delete a proper noun and if there is no proper noun, it deletes a noun.

We are using textblob which is basically a wrapper over NLTK – The Natural Language Toolkit, or more commonly NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing for English written in the Python programming language.

Continue reading “AutoQuiz: Generating ‘Fill in the Blank’ Type Questions with NLP”

Top 50 Apache Spark Interview Questions And Answers

Here are the top Apache Spark interview questions and answers. There is a massive growth in the big data space, and job opportunities are skyrocketing, making this the perfect time to launch your career in this space.

Our experts have curated these questions to give you an idea of the type of questions which may be asked in an interview. Hope these Apache Spark interview questions and answers guide will help you in getting prepared for your next interview.

Spark Interview Questions
Spark Interview Questions

1. What is Apache Spark and what are the benefits of Spark over MapReduce?

  • Spark is really fast. If run in-memory it is 100x faster than Hadoop MapReduce.
  • In Hadoop MapReduce, you write many MapReduce jobs and then tie these jobs together using Oozie/shell script. This mechanism is very time consuming and MapReduce tasks have heavy latency. Between two consecutive MapReduce jobs, the data has to be written to HDFS and read from HDFS. This is time-consuming. In case of Spark, this is avoided using RDDs and utilizing memory (RAM). And quite often, translating the output of one MapReduce job into the input of another MapReduce job might require writing another code because Oozie may not suffice.
  • In Spark, you can basically do everything from single code or console (PySpark or Scala console) and get the results immediately. Switching between ‘Running something on cluster’ and ‘doing something locally’ is fairly easy and straightforward. This also leads to less context switch of the developer and more productivity.
  • Spark kind of equals to MapReduce and Oozie put together.

Watch this video to learn more about benefits of using Apache Spark over MapReduce.

Continue reading “Top 50 Apache Spark Interview Questions And Answers”

GraphFrames on CloudxLab

GraphFrames is quite a useful library of spark which helps in bringing Dataframes and GraphX package together.

From the website of Graphframes:

GraphFrames is a package for Apache Spark which provides DataFrame-based Graphs. It provides high-level APIs in Scala, Java, and Python. It aims to provide both the functionality of GraphX and extended functionality taking advantage of Spark DataFrames. This extended functionality includes motif finding, DataFrame-based serialization, and highly expressive graph queries.

You can use graph frames very easily with spark-shell at CloudxLab by using —package option in the following way. Continue reading “GraphFrames on CloudxLab”