Building Real-Time Analytics Dashboard Using Apache Spark

In this blog post, we will learn how to build a real-time analytics dashboard using Apache Spark streaming, Kafka, Node.js, Socket.IO and Highcharts.

Complete the Spark Streaming topic on CloudxLab to refresh your Spark Streaming and Kafka concepts and get the most out of this guide.

Problem Statement

An e-commerce portal (http://www.aaaa.com) wants to build a real-time analytics dashboard to visualize the number of orders getting shipped every minute to improve the performance of their logistics.

Solution

Before working on the solution, let’s take a quick look at all the tools we will be using:

Apache Spark – A fast and general engine for large-scale data processing. It is up to 100x faster than Hadoop MapReduce in memory and 10x faster on disk. Learn more about Apache Spark here

Python – Python is a widely used high-level, general-purpose, interpreted, dynamic programming language. Learn more about Python here

Kafka – A high-throughput, distributed, publish-subscribe messaging system. Learn more about Kafka here

Node.js – Event-driven I/O server-side JavaScript environment based on V8. Learn more about Node.js here

Socket.IO – Socket.IO is a JavaScript library for real-time web applications. It enables real-time, bi-directional communication between web clients and servers. Read more about Socket.IO here

Highcharts – Interactive JavaScript charts for web pages. Read more about Highcharts here

CloudxLab – Provides a real cloud-based environment for practicing and learning various tools. You can start practicing right away by just signing up online.

How To Build A Data Pipeline?

Below is the high-level architecture of the data pipeline:

[Figure: Data pipeline architecture]

Our real-time analytics dashboard will look like this:

[Figure: Real-time analytics dashboard]

Let’s start with the description of each stage in the data pipeline and build the solution.

Stage 1

When a customer buys an item or an order status changes in the order management system, the corresponding order id, along with the order status and time, gets pushed to the Kafka topic.

Dataset

Since we do not have an online e-commerce portal in place, we have prepared a dataset containing CSV files. Let’s have a look at the dataset.

Our dataset contains three columns: ‘DateTime’, ‘OrderId’ and ‘Status’. Each row represents the status of an order at a particular date and time. We have masked the OrderId as “xxxxx-xxx”; since we are only interested in the number of orders getting shipped every minute, we do not need the actual order id.
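For illustration, a few rows might look like the following (the timestamps and status values here are made up; only the column layout comes from the dataset):

```
DateTime,OrderId,Status
2017-04-20 12:50:00,xxxxx-xxx,pending
2017-04-20 12:51:00,xxxxx-xxx,shipped
2017-04-20 12:52:00,xxxxx-xxx,shipped
```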

The entire source code for the solution, along with the dataset, can be cloned from the CloudxLab GitHub repository.

The dataset is located in the spark/projects/real-time-analytics-dashboard/data/order_data directory of the above repository.

Push Dataset to Kafka

A shell script takes each row of these CSV files and pushes it to Kafka. After pushing one CSV file, it waits for one minute before pushing the next, so that we can simulate a real-time e-commerce environment where order statuses change at different time intervals. In a real-world scenario, when an order status changes, the corresponding order details get pushed to Kafka.

Watch the video below to understand how to use Kafka on CloudxLab.

Let’s run our shell script to push the data to the Kafka topic. Log in to the CloudxLab web console and run the commands below.
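The exact script ships with the repository; as a rough sketch of what it does, assuming the standard Kafka CLI tools, with placeholder topic, broker, and file names:

```bash
# Create the input topic (prefix the topic name with your username)
kafka-topics.sh --create --zookeeper localhost:2181 \
    --replication-factor 1 --partitions 1 --topic abhinav9884-order-data

# Push one CSV file at a time, pausing a minute between files to
# simulate order statuses changing over time
for f in data/order_data/*.csv; do
    kafka-console-producer.sh --broker-list localhost:9092 \
        --topic abhinav9884-order-data < "$f"
    sleep 60
done
```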

Stage 2

After stage 1, each message in the Kafka topic looks something like this:
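Since the shell script pushes one CSV row per message, a message is simply a row such as (illustrative values, with the order id masked as in the dataset):

```
2017-04-20 12:51:00,xxxxx-xxx,shipped
```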

Stage 3

Watch the video below to learn about Spark Streaming and Kafka integration.

The Spark Streaming code takes data from the Kafka topic in a window of 60 seconds and processes it so that we have the total count of each unique order status in that window. After processing, the count of each unique order status gets pushed to a new Kafka topic (create a new topic prefixed with your username, say abhinav9884-order-one-min-data).
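Here is a minimal sketch of that logic in PySpark, assuming Spark with the spark-streaming-kafka integration and the kafka-python package installed; the topic names, broker address, and ZooKeeper address are placeholders:

```python
# A minimal sketch of the Stage 3 logic. Topic names, the broker address,
# and the ZooKeeper address below are placeholders.
import json

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
from kafka import KafkaProducer

def push_counts(rdd):
    # Runs on the driver for each 60-second batch: collect the per-status
    # counts and publish them to the one-minute topic as JSON.
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    for status, count in rdd.collect():
        payload = json.dumps({"status": status, "count": count})
        producer.send("abhinav9884-order-one-min-data", payload.encode("utf-8"))
    producer.flush()

sc = SparkContext(appName="OrderStatusCounts")
ssc = StreamingContext(sc, 60)  # 60-second batches act as our one-minute window

# Each Kafka message is one CSV row: DateTime,OrderId,Status
rows = KafkaUtils.createStream(ssc, "localhost:2181", "order-group",
                               {"abhinav9884-order-data": 1}).map(lambda kv: kv[1])

# Count how many times each status appears in the current window
status_counts = rows.map(lambda row: (row.split(",")[2], 1)) \
                    .reduceByKey(lambda a, b: a + b)
status_counts.foreachRDD(push_counts)

ssc.start()
ssc.awaitTermination()
```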

Run the commands below in the web console to create the one-minute Kafka topic and run the Spark Streaming code.
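As before, the exact commands are in the repository; a sketch, assuming the streaming job from the snippet above is saved as order_counts.py (a hypothetical file name):

```bash
# Create the one-minute output topic (prefix the name with your username)
kafka-topics.sh --create --zookeeper localhost:2181 \
    --replication-factor 1 --partitions 1 --topic abhinav9884-order-one-min-data

# Submit the streaming job; the spark-streaming-kafka package version
# must match your Spark and Scala versions
spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.1.0 \
    order_counts.py
```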

Stage 4

In this stage, each message in the new Kafka topic ‘abhinav9884-order-one-min-data’ looks something like the JSON payload below.
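The exact field names depend on how the Spark job serializes its output; following the sketch above, a message for one window might look like:

```json
{"status": "shipped", "count": 67}
```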

Stage 5

Run Node.js server

Now we start a Node.js server to consume messages from the one-minute Kafka topic and push them to the web browser, so that we can display the number of orders getting shipped per minute.

Run the commands below in the web console to start the Node.js server.
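The exact steps are in the repository; a sketch, assuming the Node app's dependencies are declared in its package.json and its entry point is the index.js file mentioned below:

```bash
cd node          # the app's directory in the repository (placeholder path)
npm install      # install kafka-node, socket.io, and other dependencies
node index.js    # start the server; it listens on port 3001
```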

The Node server starts on port 3001. If there is an ‘EADDRINUSE’ error while starting it, edit the index.js file and change the port to 3002, 3003, 3004, and so on until you find an available port. Please use any available port in the 3001-3010 range to run the Node server.

Access from browser

After the Node server has started, go to http://YOUR_WEB_CONSOLE:PORT_NUMBER to access the real-time analytics dashboard. If your web console is f.cloudxlab.com and your Node server is running on port 3002, go to http://f.cloudxlab.com:3002 to access the dashboard.

When we access the above URL, the socket.io-client library gets loaded in the browser, which enables a bi-directional communication channel between the server and the browser.

Stage 6

As soon as a new message is available in the one-minute Kafka topic, the Node process consumes it. The consumed message then gets emitted to the web browser via Socket.IO.
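A minimal sketch of this stage, assuming the kafka-node and socket.io npm packages; the topic name, broker address, and port are placeholders:

```javascript
// Consume the one-minute topic and forward each message to the browser.
var kafka = require('kafka-node');
var io = require('socket.io')(3001);  // standalone Socket.IO server on port 3001

var client = new kafka.KafkaClient({ kafkaHost: 'localhost:9092' });
var consumer = new kafka.Consumer(
    client,
    [{ topic: 'abhinav9884-order-one-min-data' }],
    { autoCommit: true }
);

// Emit every consumed Kafka message to all connected browsers as a
// 'message' event, which the socket.io-client code listens for.
consumer.on('message', function (message) {
    io.sockets.emit('message', message.value);
});
```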

Stage 7

As soon as the socket.io-client in the web browser receives a new ‘message’ event, the data in the event gets processed. If the order status in the received data is “shipped”, it gets added to the Highcharts series and displayed in the browser.
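A minimal sketch of the browser-side code, assuming a Highcharts chart object has already been created on the page and using the hypothetical payload shape from Stage 4:

```javascript
var socket = io();  // connect back to the server that served this page

socket.on('message', function (data) {
    var payload = JSON.parse(data);
    // Plot only the count of shipped orders for this one-minute window;
    // addPoint appends [x, y], redraws, and shifts off the oldest point.
    if (payload.status === 'shipped') {
        chart.series[0].addPoint([Date.now(), payload.count], true, true);
    }
});
```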

Screencast

We’ve also recorded a screencast showing how to run all the above commands and build the real-time analytics dashboard.

We have successfully built the real-time analytics dashboard. This was a basic example to show how we can integrate Spark Streaming, Kafka, Node.js, and Socket.IO. Now that we know the basics, we can build more complex systems using these tools.

Hope this guide was helpful. Please feel free to leave your comments. Follow CloudxLab on Twitter to get updates on new blogs and videos.


About the authors

Abhinav Singh

Seasoned hands-on technical architect with years of experience in building large-scale products for audiences in India and across the globe

Sandeep Giri

Seasoned hands-on technical architect with years of strong experience in building world-class software products