Flume - Hands-on Demo on CloudxLab

Flume - Hands-On Steps

# Get a copy of the sample Flume conf from the common data
hadoop fs -copyToLocal /data/flume/conf

# Change the port and the HDFS sink location if needed
nano conf/flume.properties

# Launch the Flume agent
flume-ng agent --conf conf --conf-file conf/flume.properties --name a1 -Dflume.root.logger=INFO,console

# Open a new console and connect to the same port that you defined in the config
nc localhost 44443

# Generate some data: type something in the nc console and press Enter

# Open a new console and check the output in HDFS
hadoop fs -ls flume_webdata
hadoop fs -cat 'flume_webdata/FlumeData*'

Flume - Hands-on

Let's do a hands-on exercise in Flume. We will read data from a port and push it to HDFS. Log in to the CloudxLab Linux console on two different terminals. On the first terminal, we will run the Flume agent, which listens on the port. On the second terminal, we will connect to that port and send the data that the agent reads.

Copy flume configuration from HDFS to the Linux console. It is located at /data/flume/conf on HDFS. Open flume.properties. We've defined configurations for agent a1 in this file. We can define configuration for multiple agents a1, a2, a3 in the same file. While running flume, we can specify the name of the agent which we want to run on that machine.
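Because every key in a Flume properties file starts with the agent name, we can list the agents a file defines by taking the first dot-separated field of each key. Here is a minimal sketch; the helper name list_agents and the two-agent sample file are made up for illustration and are not the CloudxLab file.

```shell
# list_agents FILE: print the distinct agent names found in a Flume properties file.
# Hypothetical helper; it relies on every key having the form <agent>.<rest> = value.
list_agents() {
  awk -F. '/^[ \t]*#/ || /^[ \t]*$/ { next } { sub(/^[ \t]+/, "", $1); print $1 }' "$1" | sort -u
}

# Illustrative sample file that defines two agents, a1 and a2.
demo_conf=$(mktemp)
cat > "$demo_conf" <<'EOF'
# two agents in one file
a1.sources = r1
a1.channels = c1
a2.sources = r1
a2.channels = c1
EOF

list_agents "$demo_conf"
```

The --name option of flume-ng then selects which of these agents runs on the machine.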

We have specified the source type as netcat. netcat is a good way to quickly create a server which listens on a specified port. Let's change the port number to 44444. If port 44444 is already used by another user when we run the Flume agent, it will fail with an "Address already in use" error. In that case, please change the port to some other number like 44445 or 44446 in the Flume configuration file.
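If you would rather change the port from the command line than in an editor, something like the following could work. It is only a sketch: the key name a1.sources.r1.port assumes the source is called r1, which may not match your copy of the file, and sed -i as written assumes GNU sed on Linux.

```shell
# set_netcat_port FILE PORT: rewrite the netcat source port in a Flume properties file.
# Hypothetical helper; assumes the key is a1.sources.r1.port (check the name in your file).
set_netcat_port() {
  sed -i "s/^a1\.sources\.r1\.port[[:space:]]*=.*/a1.sources.r1.port = $2/" "$1"
}

# Demo on a throwaway file rather than the real conf/flume.properties.
demo_conf=$(mktemp)
printf 'a1.sources.r1.type = netcat\na1.sources.r1.port = 44444\n' > "$demo_conf"
set_netcat_port "$demo_conf" 44445
cat "$demo_conf"
```

On the real file the call would be set_netcat_port conf/flume.properties 44445, followed by a restart of the agent.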

Sink type is HDFS. Change HDFS sink path to your home directory in HDFS.

Also please note that we are specifying the channel type as memory which will buffer events in memory. Bind the source and sink to the channel.
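Putting the three pieces together, a configuration along these lines would describe a netcat source, a memory channel and an HDFS sink for agent a1. This is a sketch and not the exact file shipped at /data/flume/conf: the component names r1, k1 and c1, the capacity values, the DataStream file type and the YOUR_USERNAME placeholder in the path are all assumptions. It is written to a temporary directory so that it does not overwrite your real configuration.

```shell
# Write an illustrative Flume configuration to a scratch directory.
sketch_dir=$(mktemp -d)
cat > "$sketch_dir/flume.properties" <<'EOF'
# Name the components of agent a1 (r1, k1, c1 are illustrative names)
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Source: netcat server listening on a port
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Sink: write events to HDFS (replace YOUR_USERNAME with your own login)
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /user/YOUR_USERNAME/flume_webdata
a1.sinks.k1.hdfs.fileType = DataStream

# Channel: buffer events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
EOF

echo "Sketch written to $sketch_dir/flume.properties"
```

Note that a source takes a channels property (plural) while a sink takes a single channel.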

Let's run the Flume agent on the first terminal. Please note that we are specifying the agent name as a1. In this demo, port 44444 is already used by another process, so let's change the port to 44445 and run the Flume agent again. Port 44445 is also used by another process, so we change the port to 44443. We run the agent again, and this time it starts successfully.

Now let’s produce some data. Go to the second terminal and type nc localhost 44443. Type in some data and see if it gets pushed to HDFS in the sink path.
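Instead of typing by hand, we could also pipe a few generated lines into nc. The helper below is a hypothetical example for this demo. The nc call is left as a comment because it only works while the agent is listening on 44443, and whether nc exits by itself after the input ends depends on the nc variant installed.

```shell
# generate_events N: print N numbered sample lines (hypothetical helper for this demo).
generate_events() {
  i=1
  while [ "$i" -le "$1" ]; do
    echo "sample event $i"
    i=$((i + 1))
  done
}

# Print three sample lines to the screen.
generate_events 3

# To send them to the agent instead (needs the agent running on port 44443):
#   generate_events 3 | nc localhost 44443
```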

Now open a new console and find the list of files created by flume in hdfs using the command hadoop fs -ls flume_webdata.

Let us view the data in the file. To view the data in the first file, please run the command hadoop fs -cat followed by flume_webdata/ and the filename. The filename will start with FlumeData followed by a dot and then a number. This number is the timestamp.
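To read that number as a date, we can strip everything up to the last dot and convert it. The sketch below assumes the suffix is a Unix timestamp in milliseconds and that in-progress files carry a .tmp suffix. The helper name flume_file_time is made up, and date -d is the GNU form of the command.

```shell
# flume_file_time NAME: print the UTC time encoded in a name like FlumeData.1480000000000.
# Hypothetical helper; assumes the numeric suffix is epoch milliseconds.
flume_file_time() {
  name=${1%.tmp}        # drop the .tmp suffix of a file that is still being written
  millis=${name##*.}    # keep only what follows the last dot
  date -u -d "@$((millis / 1000))" '+%Y-%m-%d %H:%M:%S'
}

flume_file_time FlumeData.1480000000000   # prints 2016-11-24 15:06:40
```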

Here we can see the data which we entered earlier. If you wish to view the rest of the files, you can use -cat command with the other files in similar fashion. Please note that if the data is coming too frequently, please view the last few lines with -tail instead of viewing the entire file with -cat.
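To tail only the most recent file, one option is to pick the highest-numbered FlumeData path from the listing. This is a sketch with a made-up helper name. It assumes the path is the last column of the hadoop fs -ls output and that all numeric suffixes have the same number of digits, so a plain sort puts the newest file last.

```shell
# latest_flume_file: read `hadoop fs -ls` output on stdin and print the newest FlumeData path.
# Hypothetical helper; takes the last column of each line as the path.
latest_flume_file() {
  awk '{ print $NF }' | grep 'FlumeData\.' | sort | tail -n 1
}

# Usage on the cluster (needs HDFS access, so it is left as a comment here):
#   hadoop fs -tail "$(hadoop fs -ls flume_webdata | latest_flume_file)"
```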

In this video, we have covered an introduction to Flume and its use case. We have also discussed Flume agents and demonstrated the steps to use Flume on CloudxLab.

Hope you enjoyed the video. Happy learning!
