What computing did to traditional industries, machine learning is now doing to traditional rule-based computing: it is eating into its market. Earlier, organizations had separate groups for image processing, audio processing, analytics, and prediction. Now these groups are merging because machine learning overlaps with almost every domain of computing. Let us discuss how machine learning is impacting e-commerce in particular.
One of the first use cases of machine learning to become really popular was Amazon's product recommendations. Later, Netflix launched a movie-recommendation challenge, which helped popularize machine learning competitions of the kind now hosted on Kaggle, an online platform for machine learning challenges.
Before diving deeper into the details, let's quickly define the terms that people often find confusing. AI stands for Artificial Intelligence, which means being able to display human-like intelligence. AI is essentially an objective. Machine learning means making computers learn from historical or empirical data instead of explicitly writing the rules. Artificial neural networks are computing constructs whose structure is loosely modeled on the animal brain. Deep learning is a branch of machine learning in which we use complex, many-layered artificial neural networks for predictions.
Recently, a friend whose company is working on a large-scale project reached out to us for a solution to a seemingly simple problem: finding a list of phrases (approximately 80,000) in a huge set of rich-text documents (approximately 6 million).
At first, the problem looked simple. The engineers had solved it by loading the two datasets into Apache Spark DataFrames and joining them using "like". Something along these lines:
select phrases.id, docs.id from phrases, docs where docs.txt like concat('%', phrases.phrase, '%')
But it took a very long time even on a small subset of the data, even though the processing was distributed. Any guesses why?
They had also tried Apache Spark's broadcast mechanism on the smaller dataset, but even a small task still took a long time to finish.
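One likely reason, offered here as an assumption rather than the post's own answer: a "like" join with wildcards on both sides has no equality key, so every phrase must be tested against every document, which is roughly 80,000 × 6,000,000 = 480 billion substring checks. The sketch below shows one possible alternative in plain Python: index the phrases by their word tokens, then scan each document once and look up its word n-grams in that index. The sample data and the helper names (`tokenize`, `build_index`, `find_matches`) are hypothetical, and matching is on whole words, which is stricter than the substring semantics of "like".

```python
import re
from collections import defaultdict

# Hypothetical sample data; the real datasets are not shown in the post.
phrases = {1: "machine learning", 2: "apache spark", 3: "flume"}
docs = {
    10: "Apache Spark ships with a machine learning library.",
    11: "Flume moves log data into HDFS.",
    12: "Nothing relevant here.",
}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(phrases):
    """Map each phrase's token tuple to the ids of the phrases with those tokens."""
    index = defaultdict(list)
    for phrase_id, phrase in phrases.items():
        tokens = tuple(tokenize(phrase))
        if tokens:
            index[tokens].append(phrase_id)
    return index

def find_matches(docs, index):
    """Return sorted (phrase_id, doc_id) pairs where a phrase occurs as a word n-gram."""
    lengths = sorted({len(key) for key in index})  # only n-gram sizes that exist
    matches = set()
    for doc_id, text in docs.items():
        tokens = tokenize(text)
        for n in lengths:
            for start in range(len(tokens) - n + 1):
                ngram = tuple(tokens[start:start + n])
                for phrase_id in index.get(ngram, ()):
                    matches.add((phrase_id, doc_id))
    return sorted(matches)

index = build_index(phrases)
matches = find_matches(docs, index)
print(matches)
```

With this layout, the work per document depends on the document's length and the number of distinct phrase lengths, not on the number of phrases, and the phrase index is small enough to be shared with every worker.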
Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating, and moving large amounts of data from many different sources to a centralized data store.
Flume supports a large variety of sources, including:
tail (like Unix tail -f)
log4j – allowing Java applications to write logs to HDFS via Flume
Flume nodes can be arranged in arbitrary topologies. Typically, there is a node running on each source machine, with tiers of aggregating nodes that the data flows through on its way to HDFS.
Introduction to Flume Presentation
Slides: What is Flume, Use Case, Use Case – Agents, Multiple Agents, Delivery Reliability
[This blog post is from KnowBigData.com. It is quite old, and many things have changed since then. People have moved to MLlib. We have also moved to CloudxLab.com.]
What is Machine Learning?
Machine learning is programming computers to optimize a performance criterion using example data or past experience. It is a branch of Artificial Intelligence.
Types of Machine Learning
Machine learning is broadly categorized into three buckets:
Supervised Learning – Using labeled training data to create a model, such as a classifier, that can predict the output for unseen inputs.
Unsupervised Learning – Using unlabeled training data to discover structure in the data, such as clusters of similar items, without known outputs to predict.
Semi-Supervised Learning – Using unlabeled data alongside labeled data for training, typically a small amount of labeled data with a large amount of unlabeled data.
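To make the first two buckets concrete, here is a minimal sketch in plain Python. The data (heights in cm) and the function names `predict_nearest` and `two_means` are made up for illustration. The supervised part uses a one-nearest-neighbour rule on labeled points; the unsupervised part runs a tiny k-means with k = 2 on the same values with the labels thrown away. It assumes the input has at least two distinct values.

```python
# Toy 1-D data (hypothetical): heights in cm with a label for each.
labeled = [(150, "short"), (155, "short"), (160, "short"),
           (180, "tall"), (185, "tall"), (190, "tall")]

def predict_nearest(train, x):
    """Supervised: return the label of the training point closest to x."""
    _, label = min(train, key=lambda pair: abs(pair[0] - x))
    return label

def two_means(values, iterations=10):
    """Unsupervised: split unlabeled values into two clusters with k-means (k=2)."""
    low, high = min(values), max(values)  # initial centroids
    for _ in range(iterations):
        # Assign every value to the nearer centroid.
        cluster_low = [v for v in values if abs(v - low) <= abs(v - high)]
        cluster_high = [v for v in values if abs(v - low) > abs(v - high)]
        # Move each centroid to the mean of its cluster.
        low = sum(cluster_low) / len(cluster_low)
        high = sum(cluster_high) / len(cluster_high)
    return sorted(cluster_low), sorted(cluster_high)

prediction = predict_nearest(labeled, 182)
unlabeled = [value for value, _ in labeled]  # same data, labels removed
clusters = two_means(unlabeled)
print(prediction)
print(clusters)
```

The supervised function needs the labels to answer "short" or "tall"; the unsupervised one only finds that the values fall into two groups and leaves naming those groups to a human.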
Machine Learning Applications
Recommending friends, dates, and products to end users.
Classifying content into predefined groups.
Finding similar content based on object properties.
Identifying key topics in large collections of text.
Detecting anomalies within given data.
Ranking search results by learning from user feedback.
Classifying DNA sequences.
Sentiment analysis and opinion mining.
Natural language processing.
Speech and handwriting recognition.
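The first application in the list, recommending products, can be sketched with user-based collaborative filtering. This is a toy illustration in plain Python, not the API of any particular library: the ratings, the user names, and the helpers `cosine` and `recommend` are all hypothetical. Each item the user has not rated is scored by the ratings other users gave it, weighted by how similar those users are to the target user.

```python
from math import sqrt

# Hypothetical ratings: user -> {item: rating on a 1-5 scale}.
ratings = {
    "alice": {"book": 5, "lamp": 3, "pen": 4},
    "bob": {"book": 5, "lamp": 3, "pen": 4, "desk": 5},
    "carol": {"book": 1, "lamp": 5, "chair": 5},
}

def cosine(a, b):
    """Cosine similarity between two users over the items both have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[item] * b[item] for item in shared)
    norm_a = sqrt(sum(a[item] ** 2 for item in shared))
    norm_b = sqrt(sum(b[item] ** 2 for item in shared))
    return dot / (norm_a * norm_b)

def recommend(user, ratings):
    """Rank items the user has not rated by similarity-weighted ratings of other users."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        similarity = cosine(ratings[user], other_ratings)
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + similarity * rating
    return sorted(scores, key=scores.get, reverse=True)

recommendations = recommend("alice", ratings)
print(recommendations)
```

Here bob's tastes match alice's closely, so the item only he has rated ranks above the item that only carol has rated.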
Mahout means a keeper or driver of elephants. Apache Mahout is a scalable machine learning library built on Hadoop and written in Java. Its design was driven by Ng et al.'s paper "MapReduce for Machine Learning on Multicore". Mahout started as a Lucene sub-project and became an Apache top-level project (TLP) in April 2010.
Machine Learning with Mahout Presentation
Slides: Introduction to Machine Learning and Mahout, Machine Learning – Types, Machine Learning – Applications, Machine Learning – Tools, Mahout – Recommendation Example, Mahout – Use Cases, Mahout Live Example, Mahout – Other Recommender Algos
Machine Learning with Mahout Video: https://www.youtube.com/embed/PZsTLIlSZhI