Building Real-Time Analytics Dashboard Using Apache Spark


In this blog post, we will learn how to build a real-time analytics dashboard using Apache Spark Streaming, Kafka, Node.js, Socket.IO and Highcharts.

Problem Statement

An e-commerce portal (http://www.aaaa.com) wants to build a real-time analytics dashboard to visualize the number of orders getting shipped every minute to improve the performance of their logistics.

Solution

Before working on the solution, let’s take a quick look at all the tools we will be using:

Apache Spark – A fast and general engine for large-scale data processing. It can run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. Learn more about Apache Spark here

Python – A widely used high-level, general-purpose, interpreted, dynamic programming language. Learn more about Python here

Kafka – A high-throughput, distributed, publish-subscribe messaging system. Learn more about Kafka here

Node.js – Event-driven I/O server-side JavaScript environment based on V8. Learn more about Node.js here

Socket.IO – A JavaScript library for real-time web applications. It enables real-time, bi-directional communication between web clients and servers. Read more about Socket.IO here

Highcharts – Interactive JavaScript charts for web pages. Read more about Highcharts here

CloudxLab – Provides a real cloud-based environment for practicing and learning various tools. You can start practicing right away just by signing up online.

How To Build A Data Pipeline?

Below is the high-level architecture of the data pipeline.

Data Pipeline

Our real-time analytics dashboard will look like this:

Real-Time Analytics Dashboard

Let’s start with the description of each stage in the data pipeline and build the solution.

Stage 1

When a customer buys an item or an order status changes in the order management system, the corresponding order id, along with the order status and time, will be pushed to a Kafka topic.

Dataset

Since we do not have an online e-commerce portal in place, we have prepared a dataset containing CSV files. Let’s have a look at the dataset.

Our dataset contains three columns: ‘DateTime’, ‘OrderId’ and ‘Status’. Each row in the dataset represents the status of an order at a particular date and time. Here we have masked the OrderId as “xxxxx-xxx”. We are only interested in the number of orders getting shipped every minute, so we do not need the actual order id.
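For illustration, a few rows of the dataset might look like this (the timestamps and the status values other than ‘shipped’ are made up for the example):

DateTime,OrderId,Status
2016-07-26 05:01:03,xxxxx-xxx,pending_payment
2016-07-26 05:01:05,xxxxx-xxx,shipped
2016-07-26 05:01:09,xxxxx-xxx,complete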

The entire source code for the solution and the dataset can be cloned from the CloudxLab GitHub repository. The dataset is located in the spark-streaming/data/order_data directory of that repository.

Push Dataset to Kafka

A shell script will take each row from these CSV files and push it to Kafka. It will wait for one minute before pushing the next CSV file to Kafka, so that we can simulate a real-time e-commerce portal environment where order statuses change at different time intervals. In a real-world scenario, when an order status changes, the corresponding order details will be pushed to Kafka.
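Purely for illustration, here is a minimal Python sketch of what that script does. It is not the repository’s script; the kafka-python package, the broker address localhost:9092 and the CSV layout are all assumptions:

import csv
import glob
import time

from kafka import KafkaProducer  # assumption: kafka-python is installed

# Assumption: local broker; the topic name 'order-data' is from this post
producer = KafkaProducer(bootstrap_servers="localhost:9092")

# Push every row of each CSV file, waiting a minute between files
for path in sorted(glob.glob("spark-streaming/data/order_data/*.csv")):
    with open(path) as f:
        for row in csv.reader(f):
            # row = [DateTime, OrderId, Status]; send it as one CSV line
            producer.send("order-data", ",".join(row).encode("utf-8"))
    producer.flush()
    time.sleep(60)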

Let’s run our shell script to push the data to the Kafka topic. Log in to the CloudxLab web console and run the commands below.

Stage 2

After stage 1, each message in the Kafka topic ‘order-data’ will look something like this.
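Since the script forwards the CSV rows as-is, a message is simply one such row, for example (values made up):

2016-07-26 05:01:05,xxxxx-xxx,shipped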

Stage 3

The Spark Streaming code will take data from the ‘order-data’ Kafka topic in a window of 60 seconds and process it so that we have the total count of each unique order status in that 60-second window. After processing, the total count of each unique order status gets pushed to the ‘order-one-min-data’ Kafka topic.
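The actual code lives in the repository; as a rough sketch of the idea only (assuming PySpark with the spark-streaming-kafka integration, the kafka-python package, and a broker at localhost:9092, none of which is taken from the repository), the core logic could look like this:

import json

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
from kafka import KafkaProducer

sc = SparkContext(appName="OrderStatusCounts")
ssc = StreamingContext(sc, 60)  # one batch every 60 seconds

# Read CSV lines (DateTime,OrderId,Status) from the input topic
stream = KafkaUtils.createDirectStream(
    ssc, ["order-data"], {"metadata.broker.list": "localhost:9092"})

# Count each unique status within the 60-second batch
counts = (stream.map(lambda kv: kv[1].split(",")[2])
                .map(lambda status: (status, 1))
                .reduceByKey(lambda a, b: a + b))

def push_to_kafka(rdd):
    # Publish the per-minute counts as a single JSON message;
    # a producer is created per batch here only for simplicity
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send("order-one-min-data",
                  json.dumps(dict(rdd.collect())).encode("utf-8"))
    producer.flush()

counts.foreachRDD(push_to_kafka)

ssc.start()
ssc.awaitTermination()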

Please run these commands in the web console to run the Spark Streaming code.

Stage 4

In this stage, each message in the Kafka topic ‘order-one-min-data’ will look something like the JSON payload below.
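The exact field names depend on the repository’s code; matching the sketch in stage 3, an illustrative payload (all status names and numbers invented) might be:

{"shipped": 657, "complete": 312, "pending_payment": 128}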

Stage 5

Run Node.js server

Now we will run a Node.js server to consume messages from the ‘order-one-min-data’ Kafka topic and push them to the web browser, so that we can display the number of orders getting shipped per minute in the web browser.

Please run the commands below in the web console to run the Node.js server.

The Node server will now run on port 3001. If there is an ‘EADDRINUSE’ error while starting the Node server, please edit the index.js file and change the port to 3002, 3003, 3004 and so on. Please use any available port in the 3001-3010 range to run the Node server.

Access from browser

After the Node server has started, go to http://YOUR_WEB_CONSOLE:PORT_NUMBER to access the real-time analytics dashboard. If your web console is f.cloudxlab.com and your Node server is running on port 3002, go to http://f.cloudxlab.com:3002 to access the dashboard.

When we access the above URL, the socket.io-client library gets loaded in the browser, which enables a bi-directional communication channel between the server and the browser.

Stage 6

As soon as a new message is available in the Kafka ‘order-one-min-data’ topic, the Node process will consume it. The consumed message will be emitted to the web browser via Socket.IO.
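The repository implements this consume-and-emit loop in Node.js; purely to illustrate the idea in this post’s language, here is a rough Python analogue using the kafka-python and python-socketio packages (both assumptions, not the actual server code):

import eventlet
import socketio
from kafka import KafkaConsumer

sio = socketio.Server(cors_allowed_origins="*")
app = socketio.WSGIApp(sio)

def consume_and_emit():
    # Assumption: local broker; the topic name comes from this post
    consumer = KafkaConsumer("order-one-min-data",
                             bootstrap_servers="localhost:9092")
    for message in consumer:
        # Forward each per-minute counts message to all connected browsers
        sio.emit("message", message.value.decode("utf-8"))

eventlet.spawn(consume_and_emit)
eventlet.wsgi.server(eventlet.listen(("", 3001)), app)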

Stage 7

As soon as the socket.io-client in the web browser receives a new ‘message’ event, the data in the event gets processed. If the order status in the received data is “shipped”, it gets added to a Highcharts series and displayed in the browser.

Screencast

We’ve also recorded a screencast showing how to run all the above commands and build the real-time analytics dashboard.

We have successfully built the real-time analytics dashboard. This was a basic example to show how we can integrate Spark Streaming, Kafka, Node.js and Socket.IO to build a real-time analytics dashboard. Now that we know the basics, we can build more complex systems using the above tools.

Hope this guide was helpful. Please feel free to leave your comments. Follow CloudxLab on Twitter to get updates on new blogs and videos.


About authors

Abhinav Singh

Seasoned hands-on technical architect with years of experience in building large-scale products for global and Indian audiences.

Sandeep Giri

Seasoned hands-on technical architect with years of strong experience in building world-class software products.
