Introduction to Pig & Pig Latin

What is PIG?

Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs.

PIG is an engine for executing data flows in parallel on Hadoop. It runs on top of Hadoop, using HDFS for storage and MapReduce for processing.

PIG Philosophy

  • Pigs eat anything
    • Data: Relational, nested, or unstructured.
  • Pigs live anywhere
    • A language for parallel data processing, not tied only to Hadoop
  • Pigs are domestic animals
    • Controllable: supports user-defined functions in Java or Python
    • Custom load and store methods
  • Pigs fly
    • Designed for performance

PIG Data Types

  • Scalar
    • int (4 bytes), long (8 bytes), float (4 bytes), double (8 bytes), chararray, bytearray
  • Complex
    • Map – a set of key/value pairs, e.g. ['name'#'bob','age'#55]. Keys are chararrays; values can be any scalar or complex type.
    • Tuple – a fixed-length, ordered collection of fields and their values, e.g. ('bob',55,12.3)
    • Bag – an unordered collection of tuples, e.g. {('ram',55,12.3),('sally',52,11.2)}
  • Schemas
    • PIG is not strict about schemas: you can declare one up front, or PIG will make its best guess at field types based on the data.
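To make the types above concrete, here is a sketch in Pig Latin (the file path and field names are hypothetical) of declaring a schema with scalar and complex types when loading data:

```pig
-- Hypothetical input: one record per person, with a map of attributes
-- and a bag of (subject, score) tuples.
people = LOAD '/data/people.txt'
         AS (name:chararray,                                        -- scalar
             age:int,                                               -- scalar
             props:map[chararray],                                  -- map, e.g. ['city'#'pune']
             scores:bag{t:tuple(subject:chararray, score:double)}); -- bag of tuples

-- Once a schema is declared, fields can be referenced by name
adults = FILTER people BY age >= 18;
names  = FOREACH adults GENERATE name, props#'city' AS city;
```

Without the AS clause, Pig would still load the data and guess the types from how the fields are used downstream.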

Topics Covered

  • Introduction to PIG
  • Use Cases
  • Installation
  • Using PIG from CloudxLab and in local mode
  • Schema
  • Filter & Joins
  • Stream
  • Non-linear Execution
For the last 12 years, Sandeep has been building products and churning through large amounts of data for various product firms. He has all-around experience in software development and big data analysis.
Apart from digging data and technologies, Sandeep enjoys conducting interviews and explaining difficult concepts in simple ways.
A few resources from Know Big Data
Introduction to Apache PIG & PIG Latin Presentation

Introduction to Apache PIG & PIG Latin Video

10 things to look for when choosing a Big Data course / Institute

Every now and then, I see a new company coming up with Hadoop classes and courses, and my friends keep asking me which of these courses is worth taking. I give them a few tips for choosing the course best suited to them. Here are those tips to help you decide which course to attend:


  • Does the instructor have domain expertise?


Know your instructor: find out about his or her background. Has he or she done any big-data-related work? I have seen many people who simply attend a course somewhere and then become instructors.

If the instructor never worked in the domain, do not take such classes. Also, avoid training institutes that do not tell you details about the instructor.


  • Is the instructor hands on? When did she/he code last time?


In the domain of technology, there is a humongous difference between an instructor who is hands-on with code and one who teaches from theoretical knowledge alone. Find out when the instructor last wrote code. If the instructor has never coded, do not attend the class.


  • Does the instructor encourage & answer your questions?


There are many recorded free videos available across the internet. The only reason you would go for live classes would be to get your questions answered and doubts cleared immediately.

If the instructor does not encourage questions and answers, such classes are fairly useless.


  • Do they provide a cloud-based lab with multiple computer setups?


A cloud is basically a computer setup at someone else’s place. When I say my data is in the cloud, it means my data is on a computer that is remotely available.

In the old days, people used to have physical computer laboratories for learning basic computer skills. Today, while learning advanced technologies, we need a similar setup but on the cloud, i.e. at a remote location. A cloud-based lab provides the following benefits:

  1. Instantaneously available – you do not have to wait for your computer to boot or install something.
  2. Accessible from everywhere – whether you want to work on problems from your office or from home, they should be accessible from everywhere.
  3. Easy to get your code debugged through the instructors – While working on assignments, you might get stuck and need to show the assignments to your instructor and seek review. If your environment, code, error log and history of commands are available to the instructor immediately, the instructor will be able to test and debug your program right away.

Why multiple computer setups on cloud-based labs?

Big Data technologies are all about distributed computing, i.e. tools that run on multiple computers simultaneously.

If you go through the following list of tools related to big data, you would understand that Big Data is all about multiple computers working together to solve a problem:

  1. Hadoop Distributed File System (HDFS) – a file system that pools the disk space and disk I/O of multiple computers to provide very high performance and huge capacity.
  2. Hadoop YARN / MapReduce – a compute engine that uses the processors and disk I/O of multiple computers to solve computing problems without involving too much network transfer of data.
  3. NoSQL databases (HBase, Cassandra, MongoDB, etc.) – databases that run on multiple computers (nodes) simultaneously to handle a huge number of reads and writes per second, providing very large storage by combining the storage space of many machines.
  4. Apache Spark – uses the memory (RAM) and CPUs of multiple computers to provide very high throughput.

So, it is very important to have a setup that has multiple computers. It does not make any sense to have a setup with only one computer.


  • Do they avoid promising jobs?


If you find an institute promising jobs or providing job guarantees, stay away from it. An institute can at most try to connect you with the job market; it cannot guarantee you a job. If an institute you are considering is promising you a job, enquire thoroughly before joining the course.


  • What is the refund policy?


What if, after attending the first few classes, you find that the course is not up to your expectations and you want a refund? Check that they have a proper refund policy in place.


  • Is it online?


Finding instructors in advanced technologies is difficult, and it is even harder to find good instructors in your own city. So the chances of getting a good instructor for classroom training are very low, while getting a great instructor for online training is much easier.

So, always prefer online training over offline training in the case of Big Data. By online, I am referring to live online training, not recorded videos.


  • Are the founders of the institute from a technology background?


In a good institution or university, even an administrator or a PR person is a professor or a lecturer.

Therefore, an institute providing Big Data or Hadoop training, whether online or offline, big or small, cannot sustain itself if the founders are not from a technology background. Technologically challenged founders may hire sub-par instructors and may be unable to address the real problems that students face.

So, always go for an institute where the founders have a good background in technology.


  • Has this institute published something useful in the big data world?


If the institute has a strong technology foundation, it will produce innovations and publish articles or research papers from time to time, whether as blog posts or in venues such as the ACM. Such institutes are worth considering.

If the institute's blog is filled with marketing material only, and not any substantially useful information, the institute is not putting enough effort into having good instructors or good subject matter experts. Such institutes are more focused on marketing themselves than on adding value in their domain.


  • Are they asking for a direct transfer?


If an institute accepts payments through net banking, it must have signed up with a payment gateway such as PayPal, and payment gateways generally ensure there is a refund policy. If, however, the institute asks you to pay directly rather than through a payment gateway, you should stay away from it.

Introduction to Apache Flume in 30 minutes

What is Apache Flume?

Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating, and moving large amounts of data from many different sources to a centralized data store.

Flume supports a large variety of sources, including:

  • tail (like unix tail -f),
  • syslog,
  • log4j – allowing Java applications to write logs to HDFS via Flume

Flume Nodes

Flume nodes can be arranged in arbitrary topologies. Typically, there is a node running on each source machine, with tiers of aggregating nodes that the data flows through on its way to HDFS.
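As a sketch, a minimal single-node Flume agent that tails a log file into HDFS might be configured as below. The agent name, file paths, and capacity values are hypothetical, and exact property support can vary across Flume versions:

```properties
# Name the components of agent "a1"
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: tail a hypothetical application log
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app/app.log
a1.sources.r1.channels = c1

# Channel: buffer events in memory between source and sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Sink: write events to a hypothetical HDFS directory, bucketed by day
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs:///flume/events/%Y-%m-%d
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.channel = c1
```

In a tiered topology, the HDFS sink above would be replaced by an Avro sink pointing at an aggregating agent, which in turn writes to HDFS.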

Topics Covered

  • What is Flume
  • Flume: Use Case
  • Flume: Agents
  • Flume: Use Case – Agents
  • Flume: Multiple Agents
  • Flume: Sources
  • Flume: Delivery Reliability
  • Flume: Hands-on

Introduction to Flume Presentation


Please feel free to leave your comments in the comment box so that we can improve the guide and serve you better. Also, follow CloudxLab on Twitter to get updates on new blogs and videos.

If you wish to learn Hadoop and Spark technologies such as MapReduce, Hive, HBase, Sqoop, Flume, Oozie, Spark RDD, Spark Streaming, Kafka, DataFrames, SparkSQL, SparkR, MLlib, and GraphX, and build a career in the Big Data and Spark domain, then check out our signature course on Big Data with Apache Spark and Hadoop, which comes with:

  • Online instructor-led training by professionals having years of experience in building world-class BigData products
  • High-quality learning content including videos and quizzes
  • Automated hands-on assessments
  • 90 days of lab access so that you can learn by doing
  • 24×7 support and forum access to answer all your queries throughout your learning journey
  • Real-world projects
  • A certificate which you can share on LinkedIn

Machine Learning with Mahout

[This blog is from KnowBigData.com. It is pretty old. Many things have changed since then. People have moved to MLLib. We have also moved to CloudxLab.com.]

What is Machine Learning?

Machine Learning is programming computers to optimize a performance criterion using example data or past experience. It is a branch of Artificial Intelligence.

Types of Machine Learning

Machine learning is broadly categorized into three buckets:

  • Supervised Learning – using labeled training data to build a classifier that can predict the output for unseen inputs.
  • Unsupervised Learning – using unlabeled training data to discover structure in the data (e.g. clusters), rather than predicting a known output.
  • Semi-Supervised Learning – using both: typically a small amount of labeled data combined with a large amount of unlabeled data.
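To make the supervised case concrete, here is a minimal sketch in plain Python (not Mahout): a one-nearest-neighbour classifier that predicts a label for an unseen input from labeled training data. The data points and labels are made up for illustration.

```python
# Labeled training data: (feature vector, label). Values are illustrative.
labeled = [
    ((1.0, 1.0), "small"),
    ((1.2, 0.8), "small"),
    ((8.0, 9.0), "large"),
    ((9.5, 8.5), "large"),
]

def predict(x):
    """Predict the label of x as the label of its nearest training point."""
    def sq_dist(a, b):
        # Squared Euclidean distance between two feature vectors
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    _, label = min(labeled, key=lambda pair: sq_dist(pair[0], x))
    return label

print(predict((1.1, 0.9)))  # → small (nearest to the "small" cluster)
print(predict((9.0, 9.0)))  # → large (nearest to the "large" cluster)
```

An unsupervised algorithm, by contrast, would be given only the feature vectors (no labels) and asked to group them into clusters on its own.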

Machine Learning Applications

  • Recommend friends, dates, or products to end users.
  • Classify content into pre-defined groups.
  • Find similar content based on object properties.
  • Identify key topics in large collections of text.
  • Detect anomalies within given data.
  • Rank search results using user-feedback learning.
  • Classify DNA sequences.
  • Sentiment analysis / opinion mining.
  • Computer vision.
  • Natural language processing.
  • Bioinformatics.
  • Speech and handwriting recognition.

Mahout

A "mahout" is a keeper or driver of elephants. Apache Mahout is a scalable machine learning library built on Hadoop and written in Java, inspired by Ng et al.'s paper "Map-Reduce for Machine Learning on Multicore". Development of Mahout started as a Lucene sub-project, and it became an Apache top-level project (TLP) in April 2010.

Topics Covered

  • Introduction to Machine Learning and Mahout
  • Machine Learning- Types
  • Machine Learning- Applications
  • Machine Learning- Tools
  • Mahout – Recommendation Example
  • Mahout – Use Cases
  • Mahout Live Example
  • Mahout – Other Recommender Algos

Machine Learning with Mahout Presentation

Machine Learning with Mahout Video

6 Reasons Why Big Data Career is a Smart Choice

Confused about whether to take up a career in Big Data? Planning to invest your time in getting certified and acquiring expertise in related frameworks like Hadoop and Spark, but worried that you are making a huge mistake? Just spend a few minutes reading this blog and you will find six reasons why choosing a career in big data is a smart choice.

Why Big Data?

Many people believe that Big Data is the next big thing, one that will help companies rise above others and position themselves as best in class in their respective sectors.

Companies these days generate gigantic amounts of information, irrespective of which industry they belong to, and this data needs to be stored and processed so that important information, which could lead to new breakthroughs in their sectors, is not missed. Atul Butte, of the Stanford School of Medicine, has stressed the importance of data by saying, "Hiding within those mounds of data is the knowledge that could change the life of a patient, or change the world". And this is where Big Data analytics plays a very crucial role.

With Big Data platforms, gigantic amounts of data can be brought together and processed to uncover patterns that help companies make better decisions, grow, increase their productivity, and create value in their products and services.

Continue reading “6 Reasons Why Big Data Career is a Smart Choice”

One Day Machine Learning Bootcamp at IITB – CloudxLab

Our past two Machine Learning Bootcamps, at the National University of Singapore and R.V. College of Engineering, were very interesting, and all the attendees found them very useful. This feedback prompted us to hold more Bootcamps like these.

Thanks to Prof. Alankar, who invited us to conduct yet another Machine Learning Bootcamp at the Indian Institute of Technology, Bombay. Before we move on to the details of the Bootcamp, let us give you a brief introduction to Prof. Alankar. He is an Assistant Professor in the Mechanical Engineering Department at IIT Bombay and works in the area of multiscale modeling of deformation. He is a graduate of IIT Roorkee, holds a master's degree from the University of British Columbia (Canada) and a doctoral degree from Washington State University (USA), and has previously worked at the Max Planck Institute (Germany), Los Alamos National Laboratory (USA), and Modumetal, Inc. (USA).

Machine Learning Bootcamp

It all happened on March 17, when Machine Learning enthusiasts, including professors and students from every branch of IIT Bombay, gathered to attend the one-day workshop on Machine Learning. The presenter was none other than Mr. Sandeep Giri, who has over 15 years of experience in the Machine Learning and Big Data domains and has worked at companies like Amazon, InMobi, and D. E. Shaw.

Continue reading “One Day Machine Learning Bootcamp at IITB – CloudxLab”

A Successful Machine Learning Bootcamp by CloudxLab – Singapore

CloudxLab has conducted many successful online events on Machine Learning and Big Data, since it is relatively easy to host many attendees simultaneously. Furthermore, an online event eliminates the need for a tiring trip to the venue: one can simply log in from the comfort of one's home and start learning.

Sure, online events have their own perks, but that hasn't stopped us from conducting offline events. Our Machine Learning session at R.V. College of Engineering was one such success.

This time, we wanted to conduct a slightly bigger event, so CloudxLab joined hands with IOTSG and NUS Enterprise at the National University of Singapore to organize another successful Machine Learning Bootcamp.

The Venue

CloudxLab was organizing the Machine Learning Bootcamp in Singapore for the first time. To be frank, we were a little nervous, as we did not know how welcoming the country would be. But all our minor doubts were cleared once we experienced the warm welcome from everyone there, so much so that we would like to do another Bootcamp in Singapore in the near future.

The National University of Singapore was very cooperative in helping us organize the Bootcamp on their campus.

Continue reading “A Successful Machine Learning Bootcamp by CloudxLab – Singapore”

Machine Learning & IoT Bootcamp – Singapore

Have you ever wondered how you can apply various Machine Learning and IoT techniques to everyday business problems? Or are you someone who has heard of Machine Learning but never had the chance to dig a little deeper? If your answer is yes, then you've come to the right place.

CloudxLab is conducting a Machine Learning & IoT Bootcamp in Singapore.

  • Date: Saturday, Feb 10, 2018
  • Place: NUS Enterprise, #02-01, 71 Ayer Rajah Crescent, Singapore
  • Time: 9:30 AM to 5:00 PM

What will be covered?

You will get hands-on exposure to Machine Learning using Python: analyzing real-world datasets, drawing intelligence from them, and building powerful models. You will also gain the insight needed to apply data processing and Machine Learning techniques in real time.

After completing this workshop, you will be able to build and optimize your own automated classifier to extract insights from real-world data sets.

Continue reading “Machine Learning & IoT Bootcamp – Singapore”

The Pursuit of Education – A Story of Strength

Today, we will not talk tech or discuss our regular tutorials. Instead, we will take you on a different journey – a journey about strength, a journey about hope, and a journey on life.

It was a regular working day for us when an email caught our attention. It was from an individual who faced unimaginable hardships in his life but still hopes for a better future by executing his passion for learning.

His message was rather long, and it clearly showed that he was in desperate need of higher education. We thought he was a student and offered him the student discount on one of our self-paced courses on Big Data. But much to our surprise, he was not in a position to pay even the discounted price.

We were not sure why he was requesting a free course. Then we came to know about the hardship he had recently gone through, and about his real mission: to move back to his native place and help poor and needy students by providing free education.

He was a Rohingya refugee and had lost his entire family in the recent clashes of Myanmar. He managed to survive the traumatic ordeal but thinking of a new life was more of an impossible dream for him. However, he stepped up and decided to move on with his life.

He wanted to continue his education, and therefore started looking for a Big Data course that he could take for free, given his dire financial situation. He came across CloudxLab and got in touch with us, explaining his circumstances. He also mentioned that he wanted to help the needy back in his country, for which he needed to go through the course.

We were much in awe of this person's strength of mind. He came across as an epitome of strength, relentlessly pursuing his dream despite all odds and ordeals.

We offered him our course at no cost, but we did not know how much it meant to him until he sent his reply:

I can’t explain my feelings in words how much happy I am now. You are an angel for me who help me to stand on my feet. Sir thank you for believing in me and giving me a chance to continue my dream. I promise I will do my best and complete the course as fast as I can. Thank you.

This is probably our biggest achievement as a team.

We salute this individual for his unthinkable strength in facing such a catastrophe in his life while nurturing a selfless desire to help others. We wish him good days ahead and hope that he completes his education and embarks on the journey to help his people.

Streaming Twitter Data using Flume

In this blog post, we will learn how to stream Twitter data using Flume on CloudxLab.

To download tweets from Twitter, we first have to configure a Twitter app.

Create Twitter App

Step 1

Navigate to Twitter app URL and sign in with your Twitter account

Step 2

Click on “Create New App”

Create New App

Continue reading “Streaming Twitter Data using Flume”