Every day the world advances to a new level of industrialization, and this has resulted in the production of a vast amount of data. In the early stages, people considered this data a bane, but later they found that it is a boon and started using it productively. Big data and machine learning are both terms built on the idea of analyzing and using this data. Let’s get into more details.
What computing once did to traditional industry, machine learning is now doing to traditional rule-based computing: it is eating into its market. Earlier, organizations had separate groups for image processing, audio processing, analytics, and predictions. Now these groups are merging, because machine learning overlaps with nearly every domain of computing. Let us discuss how machine learning is impacting e-commerce in particular.
One of the first use cases of machine learning to become really popular was Amazon’s recommendations. Afterwards, Netflix launched a movie recommendation challenge, the Netflix Prize, which helped popularize the kind of machine learning competitions now hosted on online platforms such as Kaggle.
Before diving deeper into the details, let’s briefly define the terms that are often confused. AI stands for artificial intelligence, which means being able to display human-like intelligence. AI is basically an objective. Machine learning means making computers learn from historical or empirical data instead of explicitly writing the rules. Artificial neural networks are computing constructs whose structure is loosely modeled on the animal brain. Deep learning is a branch of machine learning in which we use complex artificial neural networks for predictions.
At CloudxLab, we keep getting a lot of questions, online and sometimes offline, such as:
“I want to learn big data, but I don’t know whether I am eligible or not.”
“I am so and so, can I learn big data?”
We have compiled the most common questions here. And, we will answer each one of them.
So, here we go.
What are those questions?
- I am from a non-technical background. Can I learn big data?
- Do I need to know programming languages such as Java, Python, PHP, etc.?
- Since it is big data, do I need to know relational databases such as Oracle, or be well versed in SQL in general?
- Do I also need to know Unix or Linux?
Recently, a friend whose company is working on a large-scale project reached out to us for a solution to a seemingly simple problem: finding a list of phrases (approximately 80,000) in a huge set of rich text documents (approximately 6 million).
At first the problem looked simple. The engineers had solved it by loading the two datasets into Apache Spark DataFrames and joining them using “like”. Something along these lines:
select phrases.id, docs.id from phrases, docs where docs.txt like concat('%', phrases.phrase, '%')
But it was taking a huge amount of time even on a small subset of the data, even though the processing was done in a distributed fashion. Any guesses why?
They had also tried Apache Spark’s broadcast mechanism on the smaller dataset, but even a small task was still taking a long time to finish.
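To get a feel for the scale involved, here is a minimal Python sketch of what a “like” join amounts to: every phrase is checked against every document. The function name `naive_match` and the tiny sample data are made up for illustration, and this is not the engineers’ actual Spark code. The last line only multiplies the two sizes quoted above.

```python
def naive_match(phrases, docs):
    """Check every phrase against every document, as a LIKE join would."""
    matches = []
    comparisons = 0
    for doc_id, text in docs.items():
        for phrase_id, phrase in phrases.items():
            comparisons += 1  # one substring search per (phrase, document) pair
            if phrase in text:
                matches.append((phrase_id, doc_id))
    return matches, comparisons


# Hypothetical sample data, standing in for the real phrases and documents.
phrases = {1: "big data", 2: "machine learning", 3: "apache flume"}
docs = {
    10: "an intro to big data and machine learning",
    20: "streaming logs with apache flume",
}

matches, comparisons = naive_match(phrases, docs)

# Number of (phrase, document) pairs at the sizes quoted in the post.
full_scale_pairs = 80_000 * 6_000_000
```

With 3 phrases and 2 documents the loop performs 6 substring searches. At 80,000 phrases and 6 million documents the same approach needs 480 billion of them, each one scanning a document’s text, which is a lot of work even when it is spread over a cluster.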
What is Apache Flume?
Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating, and moving large amounts of data from many different sources to a centralized data store.
Flume supports a large variety of sources, including:
- tail (like Unix tail -f)
- log4j – allowing Java applications to write logs to HDFS via Flume
Flume nodes can be arranged in arbitrary topologies. Typically there is a node running on each source machine, with tiers of aggregating nodes that the data flows through on its way to HDFS.
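The tiered layout can be pictured with a small toy model in Python. This is only a sketch of the data flow and not Flume’s API or configuration format: the classes `Node` and `CentralStore`, the machine names, and the sample events are all hypothetical.

```python
class CentralStore:
    """Stands in for the centralized data store (HDFS in the text)."""

    def __init__(self):
        self.events = []

    def receive(self, event):
        self.events.append(event)


class Node:
    """A toy node that forwards every event it receives to the next hop."""

    def __init__(self, name, downstream):
        self.name = name
        self.downstream = downstream
        self.forwarded = 0

    def receive(self, event):
        self.forwarded += 1
        self.downstream.receive(event)


# One node per source machine, one aggregating tier, one central store.
store = CentralStore()
aggregator = Node("aggregator", store)
web1 = Node("web1", aggregator)
web2 = Node("web2", aggregator)

web1.receive("web1: GET /index.html")
web2.receive("web2: GET /login")
web1.receive("web1: GET /cart")
```

Every event takes two hops here (source node, then aggregator) before it lands in the store. A real deployment can add more tiers or fan out differently, since the topology is arbitrary.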
- What is Flume
- Flume: Use Case
- Flume: Agents
- Flume: Use Case – Agents
- Flume: Multiple Agents
- Flume: Sources
- Flume: Delivery Reliability
- Flume: Hands-on
Introduction to Flume Presentation
Please feel free to leave your comments in the comment box so that we can improve the guide and serve you better. Also, follow CloudxLab on Twitter to get updates on new blogs and videos.
If you wish to learn Hadoop and Spark technologies such as MapReduce, Hive, HBase, Sqoop, Flume, Oozie, Spark RDD, Spark Streaming, Kafka, DataFrames, Spark SQL, SparkR, MLlib, and GraphX, and build a career in the big data and Spark domain, then check out our signature course on Big Data with Apache Spark and Hadoop, which comes with:
- Online instructor-led training by professionals having years of experience in building world-class BigData products
- High-quality learning content including videos and quizzes
- Automated hands-on assessments
- 90 days of lab access so that you can learn by doing
- 24×7 support and forum access to answer all your queries throughout your learning journey
- Real-world projects
- A certificate which you can share on LinkedIn
[This blog post is from KnowBigData.com and is quite old. Many things have changed since then: people have moved to MLlib, and we have moved to CloudxLab.com.]
What is Machine Learning?
Machine learning is programming computers to optimize a performance criterion using example data or past experience. It is a branch of artificial intelligence.
Types of Machine Learning
Machine learning is broadly categorized into three buckets:
- Supervised Learning – Using labeled training data to create a model, such as a classifier, that can predict the output for unseen inputs.
- Unsupervised Learning – Using unlabeled training data to find structure in the data, such as clusters of similar items, without being told the expected output.
- Semi-Supervised Learning – Using unlabeled data alongside labeled data for training, typically a small amount of labeled data with a large amount of unlabeled data.
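The difference between the first two buckets can be sketched in a few lines of plain Python. This is a toy illustration on made-up one-dimensional data, not how Mahout or any particular library implements these ideas. The function names, the sample points, and the choice of a nearest-centroid classifier and a two-cluster k-means are all assumptions made for the example. Semi-supervised learning is not shown.

```python
def train_nearest_centroid(points, labels):
    """Supervised: use the labels to compute one mean (centroid) per class."""
    sums, counts = {}, {}
    for x, label in zip(points, labels):
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def predict(centroids, x):
    """Predict the label whose centroid is closest to the unseen input x."""
    return min(centroids, key=lambda label: abs(centroids[label] - x))


def two_means(points, iterations=10):
    """Unsupervised: split unlabeled 1-D points into two clusters (k-means, k=2)."""
    low, high = min(points), max(points)
    for _ in range(iterations):
        near_low = [x for x in points if abs(x - low) <= abs(x - high)]
        near_high = [x for x in points if abs(x - low) > abs(x - high)]
        if not near_high:  # all points identical: nothing to split
            break
        low = sum(near_low) / len(near_low)
        high = sum(near_high) / len(near_high)
    return low, high


# Hypothetical sample data.
points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
labels = ["small", "small", "small", "large", "large", "large"]

centroids = train_nearest_centroid(points, labels)  # needs the labels
guess = predict(centroids, 4.0)                     # label for an unseen input
clusters = two_means(points)                        # never sees the labels
```

The supervised model can name its prediction ("small" or "large") because it was given the names during training. The unsupervised one only discovers that the points fall into two groups; what those groups mean is left to whoever reads the result.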
Machine Learning Applications
- Recommending friends, dates, or products to end users.
- Classifying content into predefined groups.
- Finding similar content based on object properties.
- Identifying key topics in large collections of text.
- Detecting anomalies within given data.
- Ranking search results by learning from user feedback.
- Classifying DNA sequences.
- Sentiment analysis and opinion mining.
- Computer vision.
- Natural language processing.
- Speech and handwriting recognition.
Mahout – Keeper/Driver of Elephants. Mahout is a scalable machine learning library built on Hadoop and written in Java. It was inspired by Ng et al.’s paper “MapReduce for Machine Learning on Multicore”. Development of Mahout started as a Lucene sub-project, and it became an Apache top-level project (TLP) in April 2010.
- Introduction to Machine Learning and Mahout
- Machine Learning – Types
- Machine Learning – Applications
- Machine Learning – Tools
- Mahout – Recommendation Example
- Mahout – Use Cases
- Mahout Live Example
- Mahout – Other Recommender Algos
Machine Learning with Mahout Presentation
Machine Learning with Mahout Video: https://www.youtube.com/embed/PZsTLIlSZhI
In this blog post, we will learn how to stream Twitter data using Flume on CloudxLab.
To download tweets from Twitter, we first have to configure a Twitter app.
Create Twitter App
Navigate to the Twitter app URL and sign in with your Twitter account.
Click on “Create New App”.