Big Data Archives | Page 2 of 2

Understanding Big Data Stack – Apache Hadoop and Spark

Introduction

There are many Big Data solution stacks.

One of the most popular and powerful stacks is Apache Hadoop and Spark together. Hadoop provides storage for structured and unstructured data, while Spark provides the computational capability on top of Hadoop.

Introduction to Big Data and Distributed Systems

Introduction

Big Data is one of the most talked-about terms in present-day computing. Big Data skills are in high demand in today’s IT industry, and the field is widely believed to be transforming how technical solutions are built.

Big Data vs Machine Learning

Every day the world is advancing to a new level of industrialization, and this has resulted in the production of a vast amount of data. At first, people considered this data a bane, but later they found that it is a boon and started using it productively. Big data and machine learning are both terms built on the idea of analyzing and using this data. Let’s get into more detail.

Use-cases of Machine Learning in E-Commerce

What computing once did to traditional industry, Machine Learning is now doing to traditional rule-based computing: it is eating into that market. Earlier, organizations had separate groups for Image Processing, Audio Processing, Analytics and Predictions. Now these groups are being merged because machine learning overlaps with nearly every domain of computing. Let us discuss how machine learning is impacting e-commerce in particular.

The first use case of Machine Learning that became really popular was Amazon Recommendations. Afterwards, Netflix launched a movie recommendation challenge, which helped popularize machine learning competitions of the kind now hosted on Kaggle, an online platform for machine learning challenges.

Before diving deeper into the details, let’s quickly go over the terms that are often confused. AI stands for Artificial Intelligence, which means being able to display human-like intelligence. AI is basically an objective. Machine learning is making computers learn from historical or empirical data instead of explicitly writing the rules. Artificial neural networks are computing constructs whose structure is loosely modeled on the animal brain. Deep Learning is a branch of machine learning where we use complex, multi-layer artificial neural networks for predictions.


What are the pre-requisites to learn big data?

We, at CloudxLab, keep getting a lot of questions online, sometimes offline, asking us

“I want to learn big data. But, just don’t know whether I am eligible or not.”

“I am so and so, can I learn big data?”

We have compiled the most common questions here. And, we will answer each one of them.

So, here we go.

What are those questions?

I am from a non-technical background. Can I learn big data?
Do I need to know programming languages such as Java, Python, PHP, etc.?
Or, since it is big data, do I need to know a relational database such as Oracle, or more generally, do I need to be well versed in SQL?
And also, do I need to know Unix or Linux?


Phrase matching using Apache Spark

Recently, a friend whose company is working on a large-scale project reached out to us for a solution to a seemingly simple problem: finding a list of phrases (approximately 80,000) in a huge set of rich text documents (approximately 6 million).

The problem at first looked simple. The engineers had solved it by loading the two datasets into Apache Spark DataFrames and joining them using “like”. Something along these lines:

SELECT phrases.id, docs.id FROM phrases, docs WHERE docs.txt LIKE concat('%', phrases.phrase, '%')

But it was taking a very long time even on a small subset of the data, even though the processing was done in a distributed fashion. Any guesses why?

They had also tried using Apache Spark’s broadcast mechanism on the smaller dataset, but it still took a long while to finish even a small task.
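One plausible reason, offered here as an assumption rather than the diagnosis from the original post: a “like” join has no equality key to partition on, so every phrase has to be tested against every document, and broadcasting the smaller table does not reduce that number of tests. The following is a minimal plain-Python sketch of that pairwise work. The toy data and the helper name naive_like_join are hypothetical and are not Spark APIs.

```python
# Toy stand-ins for the real tables; ids and text are made up for illustration.
phrases = [(1, "big data"), (2, "machine learning"), (3, "apache spark")]
docs = [
    (10, "an intro to big data with apache spark"),
    (11, "machine learning at scale"),
    (12, "cooking for beginners"),
]


def naive_like_join(phrases, docs):
    """Mimic `docs.txt LIKE '%phrase%'` by testing every (phrase, doc) pair."""
    matches = []
    comparisons = 0
    for phrase_id, phrase in phrases:
        for doc_id, text in docs:
            comparisons += 1
            # Substring test, the plain-Python analogue of LIKE '%phrase%'
            if phrase in text:
                matches.append((phrase_id, doc_id))
    return matches, comparisons


matches, comparisons = naive_like_join(phrases, docs)
print(matches)      # [(1, 10), (2, 11), (3, 10)]
print(comparisons)  # 9, i.e. len(phrases) * len(docs)

# The same product at the scale quoted in the post: 80,000 phrases x 6 million docs.
full_scale = 80_000 * 6_000_000
print(f"{full_scale:,}")  # 480,000,000,000 substring tests
```

Distributing the work splits these tests across machines, but the total stays the product of the two table sizes, which is why the job can be slow even on a cluster.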


Introduction to Apache Flume in 30 minutes

What is Apache Flume?

Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating, and moving large amounts of data from many different sources to a centralized data store.

Flume supports a large variety of sources, including:

tail (like Unix tail -f),
syslog,
log4j – allowing Java applications to write logs to HDFS via Flume

Flume Nodes

Flume nodes can be arranged in arbitrary topologies. Typically there is a node running on each source machine, with tiers of aggregating nodes that the data flows through on its way to HDFS.
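The tiered layout described above can be pictured with a small toy model. This is only an illustrative Python sketch, not Flume code or Flume configuration: the Node class, the node names, and the sample events are all hypothetical, and the final node merely stands in for HDFS.

```python
class Node:
    """A toy stand-in for a Flume node: records events and forwards them downstream."""

    def __init__(self, name, downstream=None):
        self.name = name
        self.downstream = downstream  # next tier, or None for the final store
        self.received = []

    def send(self, event):
        self.received.append(event)
        if self.downstream is not None:
            self.downstream.send(event)


# One node per source machine, one aggregating tier, one final store.
store = Node("hdfs-stand-in")
aggregator = Node("aggregator", downstream=store)
web1 = Node("web1", downstream=aggregator)
web2 = Node("web2", downstream=aggregator)

web1.send("web1: GET /index.html")
web2.send("web2: GET /about.html")
web1.send("web1: POST /login")

print(len(aggregator.received))  # 3: events from both source machines
print(len(store.received))       # 3: everything ends up in the central store
```

Adding another tier in this model is just another Node placed between the sources and the store, which mirrors how aggregating tiers fan data in toward HDFS.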

Topics Covered

What is Flume
Flume: Use Case
Flume: Agents
Flume: Use Case – Agents
Flume: Multiple Agents
Flume: Sources
Flume: Delivery Reliability
Flume: Hands-on

Introduction to Flume Presentation


Machine Learning with Mahout

[This blog is from KnowBigData.com. It is pretty old. Many things have changed since then. People have moved to MLlib. We have also moved to CloudxLab.com.]

What is Machine Learning?

Machine Learning is programming computers to optimize a performance criterion using example data or past experience. It is a branch of Artificial Intelligence.

Types of Machine Learning

Machine learning is broadly categorized into three buckets:

Supervised Learning – Using labeled training data to create a classifier that can predict the output for unseen inputs.
Unsupervised Learning – Using unlabeled training data to discover structure in the data, such as clusters of similar items.
Semi-Supervised Learning – Using both kinds of data for training, typically a small amount of labeled data with a large amount of unlabeled data.
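To make the difference between the first two buckets concrete, here is a minimal sketch in plain Python. It is not Mahout code: the toy data, the nearest-centroid classifier, and the two-cluster k-means are illustrative choices, and the clustering helper assumes the points are not all identical.

```python
from statistics import mean

# Supervised: labeled examples (feature, label) are used to train a classifier.
labeled = [(1.0, "small"), (1.5, "small"), (8.0, "large"), (9.0, "large")]


def train_centroids(examples):
    """Compute the mean feature value per label (a nearest-centroid classifier)."""
    by_label = {}
    for x, label in examples:
        by_label.setdefault(label, []).append(x)
    return {label: mean(xs) for label, xs in by_label.items()}


def predict(centroids, x):
    """Predict the label whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(centroids[label] - x))


centroids = train_centroids(labeled)
print(predict(centroids, 2.0))  # small
print(predict(centroids, 7.0))  # large

# Unsupervised: no labels; group the points into two clusters (1-D k-means).
unlabeled = [1.0, 1.5, 2.0, 8.0, 9.0, 10.0]


def two_means(points, iterations=10):
    """Alternate between assigning points to the nearest center and updating centers."""
    centers = [min(points), max(points)]
    clusters = [[], []]
    for _ in range(iterations):
        clusters = [[], []]
        for p in points:
            nearest = 0 if abs(p - centers[0]) <= abs(p - centers[1]) else 1
            clusters[nearest].append(p)
        centers = [mean(c) for c in clusters]
    return clusters


print(two_means(unlabeled))  # [[1.0, 1.5, 2.0], [8.0, 9.0, 10.0]]
```

The supervised half needs the labels to learn what "small" and "large" mean, while the unsupervised half only finds that the points fall into two groups and leaves naming them to a human.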

Machine Learning Applications

Recommending friends, dates, and products to end users.
Classifying content into pre-defined groups.
Finding similar content based on object properties.
Identifying key topics in large collections of text.
Detecting anomalies within given data.
Ranking search results by learning from user feedback.
Classifying DNA sequences.
Sentiment analysis and opinion mining.
Computer vision.
Natural language processing.
Bioinformatics.
Speech and handwriting recognition.

Mahout

A mahout is a keeper or driver of elephants. Mahout is a scalable machine learning library built on Hadoop and written in Java, and it was driven by Ng et al.’s paper “Map-Reduce for Machine Learning on Multicore”. Development of Mahout started as a Lucene sub-project, and it became an Apache top-level project (TLP) in April 2010.

Topics Covered

Introduction to Machine Learning and Mahout
Machine Learning – Types
Machine Learning – Applications
Machine Learning – Tools
Mahout – Recommendation Example
Mahout – Use Cases
Mahout Live Example
Mahout – Other Recommender Algos

Machine Learning with Mahout Presentation

Machine Learning with Mahout Video: https://www.youtube.com/embed/PZsTLIlSZhI

Streaming Twitter Data using Flume

In this blog post, we will learn how to stream Twitter data using Flume on CloudxLab.

To download tweets from Twitter, we first have to configure a Twitter app.

Create Twitter App

Step 1

Navigate to the Twitter app URL and sign in with your Twitter account.

Step 2

Click on “Create New App”
