Pig & Pig Latin


Pig - More Operators





[Pig - Relational Operators - GROUP]

The GROUP operator groups data by key, much like GROUP BY in SQL.

Let's say we have a relation "A" containing the user name, age, and GPA. When we group relation A by age, users of the same age, such as "John" and "Joe" in the example, end up together in the same group.
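The grouping described above can be sketched as follows. This is a minimal illustration, not the script from the video: the input path 'students' and the field names and types are assumptions made for the example.

```pig
-- Hypothetical input: one row per user with name, age, and GPA.
A = LOAD 'students' AS (name: chararray, age: int, gpa: float);

-- Group the rows of A by age.
-- B has two fields: "group" (the age) and "A" (a bag holding every row of A with that age).
B = GROUP A BY age;

-- Print the schema of B, then its contents.
DESCRIBE B;
DUMP B;
```

Unlike GROUP BY in SQL, GROUP in Pig does not require an aggregate function. It simply collects the matching rows into a bag, which a later FOREACH can aggregate.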

[Pig - First Pig Script - Average Value]

Let's do a hands-on exercise. We will calculate the average dividend of each stock listed on the NYSE (the New York Stock Exchange) using Pig. The dataset is located at /data/NYSE_dividends on HDFS. It contains four columns: exchange name, stock symbol, date, and dividend.

Let's load the data from HDFS, group it by stock symbol, and dump the result. We can see that the rows are now grouped by stock symbol. Now, for each stock, let's calculate the average dividend and store the result in a file named "avged" in our home directory on HDFS. We can view the file's contents from the command line. We have successfully calculated the average dividend of each stock.
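The steps of this exercise could look roughly like the script below. It is a sketch rather than the exact script shown in the video: the schema is copied from the Code section of this page, and the alias names grped and avged and the output field names are assumptions.

```pig
-- Load the dividends data (schema taken from the Code section of this page).
divs = LOAD '/data/NYSE_dividends' AS (exchange: chararray, symbol: chararray, date: datetime, dividends: float);

-- One tuple per stock symbol: (group, bag of that symbol's rows).
grped = GROUP divs BY symbol;
DUMP grped;

-- For each group, emit the symbol and the mean of its dividends column.
avged = FOREACH grped GENERATE group AS symbol, AVG(divs.dividends) AS avg_dividend;

-- A relative path resolves against the user's HDFS home directory.
STORE avged INTO 'avged';
```

STORE writes a directory of part files rather than a single file, and it fails if the output path already exists. Assuming the default output layout, the result can be viewed from the shell with hadoop fs -cat avged/part*.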

[Pig - Relational Operators - FILTER]

The FILTER operator selects the rows of a relation that satisfy the specified condition.

Let's say we want to keep only the stocks whose symbol starts with 'CM'. The command displayed on the screen selects those stocks. Let's run the command. We have filtered the data down to the stocks starting with 'CM'.

[Pig - More Operators]

For more details on Pig operators, please go to the link displayed on the screen.

Code

Filter Operator

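divs = LOAD '/data/NYSE_dividends' AS (exchange: chararray, symbol: chararray, date: datetime, dividends: float);
-- Keep only the rows whose symbol starts with 'CM'; matches takes a Java regular expression.
startswithcm = FILTER divs BY symbol matches 'CM.*';
DUMP startswithcm;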
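
The matches operator tests the whole string against the regular expression, which is why the pattern ends in .* instead of being just 'CM'. To exclude these symbols instead of keeping them, the condition can be negated with NOT. A minimal sketch, where the alias notcm is an assumption made for the example:

```pig
-- Same load statement as in the FILTER example above.
divs = LOAD '/data/NYSE_dividends' AS (exchange: chararray, symbol: chararray, date: datetime, dividends: float);

-- Keep every row whose symbol does NOT start with 'CM'.
notcm = FILTER divs BY NOT symbol matches 'CM.*';
DUMP notcm;
```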
