CloudxLab Blog | Page 12 of 16 | Learn AI, Machine Learning, Deep Learning, Devops & Big Data

Phrase matching using Apache Spark

Recently, a friend whose company is working on a large-scale project reached out to us seeking a solution to a seemingly simple problem: finding a list of phrases (approximately 80,000) in a huge set of rich-text documents (approximately 6 million).

The problem looked simple at first. The engineers had solved it by loading the two datasets into Apache Spark DataFrames and joining them using “like”, something along these lines:

select phrases.id, docs.id from phrases, docs where docs.txt like concat('%', phrases.phrase, '%')

But it was taking a huge amount of time even on a small subset of the data, even though the processing was done in a distributed fashion. Any guesses why?

They had also tried to use Apache Spark’s broadcast mechanism on the smaller dataset, but it was still taking a long time to finish even a small task.
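
To get a feel for why a join on “like” is expensive, note that it has no join key, so every phrase has to be checked against every document. The sketch below only illustrates that pairing in plain Python; it is not the approach taken in the original post, and the function name naive_match and the toy data are hypothetical.

```python
def naive_match(phrases, docs):
    """Return (phrase_id, doc_id) pairs where the phrase occurs in the document.

    Mirrors the semantics of a join on LIKE '%phrase%': with no join key,
    every phrase is compared against every document.
    """
    matches = []
    comparisons = 0
    for phrase_id, phrase in phrases.items():
        for doc_id, text in docs.items():
            comparisons += 1
            if phrase in text:  # substring scan over the whole document
                matches.append((phrase_id, doc_id))
    return matches, comparisons


# Toy data (hypothetical) standing in for the real phrase and document sets.
phrases = {1: "apache spark", 2: "phrase matching"}
docs = {
    10: "phrase matching using apache spark",
    11: "an introduction to apache flume",
    12: "apache spark dataframes",
}

matches, comparisons = naive_match(phrases, docs)
print(matches)      # [(1, 10), (1, 12), (2, 10)]
print(comparisons)  # 6, i.e. len(phrases) * len(docs)

# At the scale quoted above, the same pairing means roughly:
full_scale_comparisons = 80_000 * 6_000_000
print(full_scale_comparisons)  # 480000000000
```

Broadcasting the smaller dataset avoids shuffling it across the cluster, but it does not reduce the number of phrase-document pairs that have to be scanned.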


Scholarship Test for Machine Learning Course

After receiving a huge response to our last scholarship test, we are back once again with a basic conceptual test through which you can earn a scholarship for our upcoming Specialization course on Machine Learning and Deep Learning.

Concepts to be tested: Linear algebra, probability theory, statistics, multivariable calculus, algorithms and complexity, aptitude and Data Interpretation.

Date and Time: September 2, 2018, 8:00 am PDT (8:30 pm IST)
Type: objective (MCQ)
Number of questions: 25
Duration: 90 minutes
Mode: Online


If you have a good aptitude and general problem-solving skills, this test is for you. So, go ahead and earn what you deserve.

If you have any questions on the test or if anything else comes up, just click here to let us know. We’re always happy to help.

How to Teach Online Effectively

I founded KnowBigData.com in 2014 after working at Amazon. Teaching is my passion, and technology, specifically large-scale computing, is my forte, thanks to my working experience with Amazon, InMobi, D. E. Shaw and my own startup tBits Global. I therefore wanted to help people learn technology online. I launched KnowBigData.com as an online instructor-led training on MongoDB, followed by Big Data and Machine Learning. Eventually, we innovated a lot in learning and shaped KnowBigData into CloudxLab.com, which is currently a major gamified learning environment for Machine Learning, AI, and Big Data.


How To Optimise A Neural Network?

When we solve an industry problem involving neural networks, we very often end up with bad performance. Here are some suggestions on what to do to improve it.

Is your model underfitting or overfitting?

You must break down the input data set into two parts – training and test. The general practice is to have 80% for training and 20% for testing.

You should train your neural network with the training set and test with the testing set. This sounds like common sense but we often skip it.

Compare the performance (MSE in case of regression and accuracy/f1/recall/precision in case of classification) of your model with the training set and with the test set.

If it performs badly on both the training set and the test set, it is underfitting; if it performs well on the training set but not on the test set, it is overfitting.
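
The check described above can be sketched in a few lines of Python. This is a minimal illustration, assuming a score where higher is better (such as accuracy; for MSE the comparisons would be reversed). The helper names train_test_split and diagnose, and the thresholds good_enough and max_gap, are hypothetical choices rather than standard values.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle rows and split them into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def diagnose(train_score, test_score, good_enough=0.9, max_gap=0.1):
    """Classify a model from its training and test scores (higher is better).

    good_enough and max_gap are illustrative thresholds, not standard values.
    """
    if train_score < good_enough:
        return "underfitting"   # poor even on the data it was trained on
    if train_score - test_score > max_gap:
        return "overfitting"    # good on training data, much worse on test data
    return "ok"


train, test = train_test_split(range(100))
print(len(train), len(test))   # 80 20
print(diagnose(0.62, 0.60))    # underfitting
print(diagnose(0.99, 0.71))    # overfitting
print(diagnose(0.94, 0.92))    # ok
```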


10 Things to Look for When Choosing a Big Data course / Institute

Every now and then, I see a new company coming up with Hadoop classes/courses. My friends also keep asking me which of these courses are worth taking. I gave them a few tips for choosing the course that suits them best. Here are those tips to help you decide which course to attend:

1. Does the instructor have domain expertise?

Know your instructor. You must know about the instructor’s background. Has (s)he done any big data related work? I have seen a lot of instructors who just attend a course somewhere and become instructors.

If the instructor has never worked in the domain, do not take such classes. Also, avoid training institutes that do not give you details about the instructor.

2. Is the instructor hands-on? When did (s)he last code?

In the domain of technology, there is a humongous difference between an instructor who is hands-on in coding and one who delivers based on theoretical knowledge. Also, find out when the instructor last worked on code. If the instructor has never coded, do not attend the class.

3. Does the instructor encourage & answer your questions?

There are many recorded free videos available across the internet. The only reason you would go for live classes would be to get your questions answered and doubts cleared immediately.

If the instructor does not encourage questions and answer them, such classes are fairly useless.


Introduction to Apache Flume in 30 minutes

What is Apache Flume?

Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating, and moving large amounts of data from many different sources to a centralized data store.

Flume supports a large variety of sources, including:

tail (like unix tail -f),
syslog,
log4j – allowing Java applications to write logs to HDFS via Flume

Flume Nodes

Flume nodes can be arranged in arbitrary topologies. Typically there is a node running on each source machine, with tiers of aggregating nodes that the data flows through on its way to HDFS.

Topics Covered

What is Flume
Flume: Use Case
Flume: Agents
Flume: Use Case – Agents
Flume: Multiple Agents
Flume: Sources
Flume: Delivery Reliability
Flume: Hands-on

Introduction to Flume Presentation

Please feel free to leave your comments in the comment box so that we can improve the guide and serve you better. Also, follow CloudxLab on Twitter to get updates on new blogs and videos.

If you wish to learn Hadoop and Spark technologies such as MapReduce, Hive, HBase, Sqoop, Flume, Oozie, Spark RDD, Spark Streaming, Kafka, DataFrames, SparkSQL, SparkR, MLlib, and GraphX, and build a career in the Big Data and Spark domain, then check out our signature course on Big Data with Apache Spark and Hadoop, which comes with:

Online instructor-led training by professionals having years of experience in building world-class BigData products
High-quality learning content including videos and quizzes
Automated hands-on assessments
90 days of lab access so that you can learn by doing
24×7 support and forum access to answer all your queries throughout your learning journey
Real-world projects
A certificate which you can share on LinkedIn

Machine Learning with Mahout

[This blog is from KnowBigData.com. It is pretty old. Many things have changed since then. People have moved to MLlib. We have also moved to CloudxLab.com.]

What is Machine Learning?

Machine Learning is programming computers to optimize a performance criterion using example data or past experience. It is a branch of Artificial Intelligence.

Types of Machine Learning

Machine learning is broadly categorized into three buckets:

Supervised Learning – Using labeled training data to create a model (such as a classifier) that can predict the output for unseen inputs.
Unsupervised Learning – Using unlabeled training data to create a function that describes structure in the data, such as groups of similar items.
Semi-Supervised Learning – Making use of unlabeled data for training, typically a small amount of labeled data with a large amount of unlabeled data.
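
As a rough illustration of the first two buckets, the plain-Python sketch below trains a tiny nearest-centroid classifier on labeled points (supervised) and then groups the same points without labels using two-cluster k-means (unsupervised). It is a toy, not Mahout code; the function names and the data are hypothetical, and two_means assumes the input contains at least two distinct values.

```python
def nearest_centroid_fit(points, labels):
    """Supervised: learn one centroid per label from labeled 1-D points."""
    sums, counts = {}, {}
    for x, y in zip(points, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def nearest_centroid_predict(centroids, x):
    """Predict the label whose centroid is closest to x."""
    return min(centroids, key=lambda y: abs(centroids[y] - x))


def two_means(points, iterations=10):
    """Unsupervised: split unlabeled 1-D points into two clusters (k-means, k=2)."""
    lo, hi = min(points), max(points)  # initial centroids
    for _ in range(iterations):
        left = [x for x in points if abs(x - lo) <= abs(x - hi)]
        right = [x for x in points if abs(x - lo) > abs(x - hi)]
        lo, hi = sum(left) / len(left), sum(right) / len(right)
    return sorted(left), sorted(right)


points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
labels = ["small", "small", "small", "large", "large", "large"]

# Supervised: labels are available, so we can predict a label for unseen inputs.
centroids = nearest_centroid_fit(points, labels)
print(nearest_centroid_predict(centroids, 1.5))  # small
print(nearest_centroid_predict(centroids, 4.0))  # large

# Unsupervised: no labels, the algorithm only discovers the two groups.
clusters = two_means(points)
print(clusters)  # ([0.8, 1.0, 1.2], [4.9, 5.0, 5.3])
```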

Machine Learning Applications

Recommend friends, dates, and products to end users.
Classify content into pre-defined groups.
Find similar content based on object properties.
Identify key topics in large collections of text.
Detect anomalies within given data.
Rank search results using user feedback.
Classify DNA sequences.
Sentiment analysis / opinion mining.
Computer vision.
Natural language processing.
Bioinformatics.
Speech and handwriting recognition.

Mahout

Mahout – Keeper/Driver of Elephants. Mahout is a scalable Machine Learning library built on Hadoop and written in Java, and it is driven by Ng et al.’s paper “Map-Reduce for Machine Learning on Multicore”. Development of Mahout started as a Lucene sub-project, and it became an Apache top-level project (TLP) in April 2010.

Topics Covered

Introduction to Machine Learning and Mahout
Machine Learning – Types
Machine Learning – Applications
Machine Learning – Tools
Mahout – Recommendation Example
Mahout – Use Cases
Mahout Live Example
Mahout – Other Recommender Algos

Machine Learning with Mahout Presentation

Machine Learning with Mahout Video: https://www.youtube.com/embed/PZsTLIlSZhI

6 Reasons Why Big Data Career is a Smart Choice

Confused about whether to take up a career in Big Data? Planning to invest your time in getting certified and acquiring expertise in related frameworks like Hadoop and Spark, and worried that you are making a huge mistake? Just spend a few minutes reading this blog and you will get six reasons why you are making a smart choice by selecting a career in Big Data.

Why Big Data?

Many people believe that Big Data is the next big thing, one that will help companies rise above others and position themselves as best in class in their respective sectors.

Companies these days generate a gigantic amount of information, irrespective of the industry they belong to. This data needs to be stored so that it can be processed and so that important information, which could lead to a breakthrough in their sector, is not missed. Atul Butte, of Stanford School of Medicine, has stressed the importance of data by saying “Hiding within those mounds of data is the knowledge that could change the life of a patient, or change the world”. This is where Big Data analytics plays a crucial role.

With Big Data platforms, a gigantic amount of data can be brought together and processed to uncover patterns that help a company make better decisions, grow, increase its productivity, and add value to its products and services.


One Day Machine Learning Bootcamp at IITB

Our past two Bootcamps on Machine Learning, at the National University of Singapore and RV College of Engineering, were very interesting, and all the attendees found them very useful. This feedback prompted us to hold more Bootcamps like these.

Thanks to Prof. Alankar, who invited us to conduct yet another Machine Learning Bootcamp at the Indian Institute of Technology, Bombay. Before we move on to the details of the Bootcamp, let us give you a brief introduction to Prof. Alankar. He is an Assistant Professor in the Mechanical Engineering Department at IIT Bombay and works in the area of Multiscale Modeling of Deformation. He is a graduate of IIT Roorkee and holds a master’s degree from the University of British Columbia (Canada) and a doctoral degree from Washington State University (USA). He has previously worked at Max-Planck Institute (Germany), Los Alamos National Laboratory (USA) and Modumetal, Inc (USA).

Machine Learning Bootcamp

It all happened on March 17, when Machine Learning enthusiasts, including professors and students from every branch of IIT, gathered to attend the one-day workshop on Machine Learning. The presenter was none other than Mr. Sandeep Giri, who has over 15 years of experience in the domain of Machine Learning and Big Data technologies. He has worked at companies like Amazon, InMobi, and D. E. Shaw.


How to Install Hortonworks Data Platform – HDP 2.6 on AWS

In this post, we will show you how you can install Hortonworks Data Platform on AWS.

You can also watch the video of this tutorial here

We start with three machines. We could install Hadoop on these machines by manually downloading and configuring it, but that’s very inefficient. So we could use either Cloudera Manager or Ambari. In this tutorial, we are going to use Ambari.

On the first machine, we are going to install the Ambari server. For that, we need to buy these three instances on Amazon, and we will follow the Ambari guidelines.

Ambari will then install all the required components on the other two machines.

Please note that we will use machines with 16 GB of RAM so that the installation goes smoothly.

Let’s get started.
