Building a Real-Time Analytics Dashboard Using Apache Spark

In this blog post, we will learn how to build a real-time analytics dashboard using Apache Spark Streaming, Kafka, Node.js, Socket.IO and Highcharts.

Problem Statement

An e-commerce portal (http://www.aaaa.com) wants to build a real-time analytics dashboard to visualize the number of orders getting shipped every minute to improve the performance of their logistics.

Solution

Before working on the solution, let’s take a quick look at all the tools we will be using:

Apache Spark – A fast and general engine for large-scale data processing. The project reports speeds of up to 100 times faster than Hadoop MapReduce in memory and up to 10 times faster on disk. Learn more about Apache Spark here

Python – Python is a widely used high-level, general-purpose, interpreted, dynamic programming language. Learn more about Python here

Kafka – A high-throughput, distributed, publish-subscribe messaging system. Learn more about Kafka here

Node.js – Event-driven I/O server-side JavaScript environment based on V8. Learn more about Node.js here

Socket.IO – A JavaScript library for real-time web applications. It enables real-time, bi-directional communication between web clients and servers. Read more about Socket.IO here

Highcharts – Interactive JavaScript charts for web pages. Read more about Highcharts here

CloudxLab – Provides a real cloud-based environment for practicing and learning various tools. You can start practicing right away by just signing up online.

How To Build A Data Pipeline?

Below is the high-level architecture of the data pipeline.

Figure: Data Pipeline

Our real-time analytics dashboard will look like this.

Figure: Real-Time Analytics Dashboard

Let’s start with the description of each stage in the data pipeline and build the solution.

Stage 1

When a customer buys an item or an order status changes in the order management system, the corresponding order ID, along with the order status and time, is pushed to a Kafka topic.

Dataset

Since we do not have an online e-commerce portal in place, we have prepared a dataset of CSV files. Let's have a look at the dataset.

Our dataset contains three columns: ‘DateTime’, ‘OrderId’ and ‘Status’. Each row represents the status of an order at a particular date and time. We have masked OrderId with “xxxxx-xxx”. We are only interested in the number of orders getting shipped every minute, so we do not need the actual order ID.
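To make the row layout concrete, here is a minimal Python sketch that parses rows with these three columns. The column names come from the description above; the header row, the timestamp format and the status values in the sample are assumptions made for illustration, not a copy of the real dataset.

```python
import csv
import io
from typing import Dict, List

# Hypothetical sample: only the column names come from the post. The
# header row, timestamp format and status values are assumptions.
SAMPLE_CSV = """DateTime,OrderId,Status
2016-07-13 14:20:33,xxxxx-xxx,processing
2016-07-13 14:20:34,xxxxx-xxx,shipped
2016-07-13 14:20:36,xxxxx-xxx,delivered
"""


def read_orders(text: str) -> List[Dict[str, str]]:
    """Parse CSV text with a header row into one dict per order event."""
    return list(csv.DictReader(io.StringIO(text)))


orders = read_orders(SAMPLE_CSV)
# Keep only the events we ultimately chart on the dashboard.
shipped = [row for row in orders if row["Status"] == "shipped"]
print(len(orders), len(shipped))
```

Because the order ID is masked, only the DateTime and Status columns carry information we use later.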

The entire source code for the solution and the dataset can be cloned from the CloudxLab GitHub repository. The dataset is located in the spark/projects/real-time-analytics-dashboard/data/order_data directory of that repository.

Push Dataset to Kafka

A shell script takes each row from these CSV files and pushes it to Kafka. It waits one minute before pushing the next CSV file, so that we can simulate a real-time e-commerce environment where order statuses change at different intervals. In a real-world scenario, the corresponding order details would be pushed to Kafka whenever an order status changes.
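The replay logic can be sketched in Python as follows. This is not the shell script from the repository; it only illustrates the "send every row, then pause before the next file" pattern. The function name `replay_batches` and its parameters are hypothetical, and `send` stands in for whatever Kafka producer call you use, so the sketch runs without a broker.

```python
import time
from typing import Callable, Iterable, List


def replay_batches(
    batches: Iterable[List[str]],
    send: Callable[[str], None],
    pause_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Send every row of each batch, pausing between consecutive batches.

    Each batch plays the role of one CSV file. Returns the number of rows sent.
    """
    sent = 0
    for index, rows in enumerate(batches):
        if index > 0:
            sleep(pause_seconds)  # wait before starting the next file
        for row in rows:
            send(row)  # in the real pipeline this would publish to Kafka
            sent += 1
    return sent


# Demo with an in-memory list as the "topic" and a fake sleep that only
# records the requested pause, so nothing actually waits.
topic: List[str] = []
pauses: List[float] = []
total = replay_batches(
    [["row-1", "row-2"], ["row-3"]],
    send=topic.append,
    sleep=pauses.append,
)
```

Passing `send` and `sleep` as arguments keeps the timing logic separate from the messaging client, which also makes it easy to try out without waiting a full minute.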

Let’s run our shell script to push data to the Kafka topic. Log in to the CloudxLab web console and run the commands below.

Stage 2

After stage 1, each message in the Kafka topic ‘order-data’ will look something like this:

Stage 3

The Spark Streaming code reads data from the ‘order-data’ Kafka topic in 60-second windows and computes the total count of each unique order status within each window. After processing, these counts are pushed to the ‘order-one-min-data’ Kafka topic.
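The per-window aggregation can be illustrated in plain Python. This is a sketch of the counting logic only, not the Spark Streaming job from the repository: Spark Streaming groups records by the batch interval in which they arrive, whereas this sketch assumes each event carries its own timestamp in epoch seconds. The function name and the output field names are hypothetical.

```python
import json
from collections import Counter
from typing import Dict, Iterable, Tuple

WINDOW_SECONDS = 60


def count_statuses_per_window(
    events: Iterable[Tuple[int, str]],
) -> Dict[int, Counter]:
    """Group (epoch_seconds, status) events into 60-second tumbling windows
    and count how often each status occurs in each window."""
    windows: Dict[int, Counter] = {}
    for ts, status in events:
        window_start = ts - ts % WINDOW_SECONDS  # floor to the window boundary
        windows.setdefault(window_start, Counter())[status] += 1
    return windows


events = [(0, "shipped"), (10, "shipped"), (59, "delivered"), (60, "shipped")]
windows = count_statuses_per_window(events)
for start in sorted(windows):
    # One message per window, similar in spirit to what gets published
    # to the second Kafka topic (the field names here are assumptions).
    print(json.dumps({"window_start": start, "counts": dict(windows[start])}))
```

The event at second 60 falls into a new window, which is why it is counted separately from the two shipped orders in the first minute.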

Please run these commands in the web console to start the Spark Streaming code.

Stage 4

In this stage, each message in the Kafka topic ‘order-one-min-data’ will look something like the below JSON payload

Stage 5

Run Node.js server

Now we will run a Node.js server that consumes messages from the ‘order-one-min-data’ Kafka topic and pushes them to the web browser, so that we can display the number of orders getting shipped per minute.

Please run the commands below in the web console to start the Node.js server.

The Node server will now run on port 3001. If you see an ‘EADDRINUSE’ error while starting the Node server, edit the index.js file and change the port to 3002, 3003, 3004 and so on. Use any available port in the 3001-3010 range to run the Node server.
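If you would rather check which port is free before editing index.js, a small Python sketch like the one below can probe the range. The helper `first_free_port` is hypothetical and not part of the project; it simply tries to bind each port in turn and reports the first one that succeeds.

```python
import socket
from typing import Optional


def first_free_port(start: int, end: int, host: str = "127.0.0.1") -> Optional[int]:
    """Return the first port in [start, end] that can be bound, or None."""
    for port in range(start, end + 1):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
            try:
                probe.bind((host, port))
            except OSError:
                continue  # port is in use (or not permitted); try the next one
            return port  # the probe socket is closed on leaving the with block
    return None


port = first_free_port(3001, 3010)
print(port if port is not None else "no free port in 3001-3010")
```

The check is only a hint: another process could take the port between the probe and the moment the Node server starts, in which case the ‘EADDRINUSE’ error would still appear.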

Access from browser

After the Node server has started, go to http://YOUR_WEB_CONSOLE:PORT_NUMBER to access the real-time analytics dashboard. For example, if your web console is f.cloudxlab.com and your Node server is running on port 3002, go to http://f.cloudxlab.com:3002.

When we access the above URL, the socket.io-client library is loaded in the browser, which opens a bi-directional communication channel between the server and the browser.

Stage 6

As soon as a new message is available in the ‘order-one-min-data’ Kafka topic, the Node process consumes it and emits it to the web browser via Socket.IO.

Stage 7

As soon as socket.io-client in the web browser receives a new ‘message’ event, the data in the event is processed. If the order status in the received data is “shipped”, it is added to the Highcharts series and displayed in the browser.
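The filtering step can be sketched as follows. In the actual dashboard this logic runs as JavaScript in the browser; the Python version below only illustrates the idea. The payload field names (`status`, `count`, `timestamp`) and the function `add_shipped_point` are hypothetical, chosen for illustration rather than taken from the project's message format.

```python
import json
from typing import List, Tuple

Point = Tuple[int, int]  # (timestamp in milliseconds, shipped-order count)


def add_shipped_point(series: List[Point], raw_message: str) -> bool:
    """Append a point to the chart series if the message is about shipped
    orders. Returns True when a point was added, False otherwise."""
    data = json.loads(raw_message)
    if data.get("status") != "shipped":  # ignore every other order status
        return False
    series.append((int(data["timestamp"]), int(data["count"])))
    return True


# Hypothetical messages: the field names and values are assumptions.
series: List[Point] = []
add_shipped_point(
    series, '{"status": "shipped", "count": 7, "timestamp": 1468419600000}'
)
add_shipped_point(
    series, '{"status": "delivered", "count": 3, "timestamp": 1468419600000}'
)
print(series)
```

Only the first message ends up in the series, since the second one reports a different status.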

Screencast

We’ve also recorded a screencast showing how to run all the above commands and build the real-time analytics dashboard.

We have successfully built the real-time analytics dashboard. This was a basic example showing how we can integrate Spark Streaming, Kafka, Node.js and Socket.IO to build a real-time analytics dashboard. Now that we know the basics, we can build more complex systems using these tools.

Hope this guide was helpful. Please feel free to leave your comments. Follow CloudxLab on Twitter to get updates on new blogs and videos.


About authors

Abhinav Singh

Seasoned hands-on technical architect with years of experience in building large scale products for a global audience and India

Sandeep Giri

Seasoned hands-on technical architect with years of strong experience in building world-class software products

  • Interesting read 🙂

    • abhinav singh

      Glad to know that you liked it 🙂

  • Ramcharan Teja

    Hi, I am working on Kafka and Spark Streaming using Node.js, similar to this article.
    Can you please provide the code in more detail?

    • abhinav singh

      Hi Ramcharan,

      Just curious, which part would you like more detail on?

      You can send me an email at abhinav at cloudxlab dot com and we can discuss the improvements.

      • Ramcharan Teja

        I'm able to produce and consume data from Kafka, but I want to process the data stream with Spark using a Node.js client.
        I'm completely new to Spark Streaming, so I have not been able to find a way to integrate Kafka and Spark using a Node.js client.

        • abhinav singh

          Hi Ramcharan,

          We can connect over email. Later I will update the post with a summary of our discussion so that it will be useful to other users as well.

          • Dinesh

            Hi Abhinav,

            Could you please share the completed post, so that new users like us can refer to it?

          • abhinav singh

            Hi Dinesh,

            This is the complete blog post. Just curious whether you have found anything missing in the post. Feel free to share your thoughts so that I can improve it.

  • kaylawei

    One suggestion: try a topic name other than ‘order-data’, such as ‘order-data-1000’, or the push will not succeed. However, I got an events.js:74 error after running ‘node index.js’, and I do not know why.

  • sachin arora

    Hello Team,

    I tried to start the Node client for viewing the orders, but every time it fails with the error below:

    Running on port 3008
    express deprecated res.sendfile: Use res.sendFile instead index.js:16:9
    a user connected
    events.js:74
    throw TypeError('Uncaught, unspecified "error" event.');
    ^
    TypeError: Uncaught, unspecified "error" event.
    at TypeError ()
    at emit (events.js:74:15)
    at /home/svaheguru2242/cloudxlab/spark/projects/real-time-analytics-dashboard/node/node_modules/kafka-node/lib/highLevelConsumer.js:178:36
    at null. (/home/svaheguru2242/cloudxlab/spark/projects/real-time-analytics-dashboard/node/node_modules/kafka-node/lib/client.js:364:28)
    at Client.handleReceivedData (/home/svaheguru2242/cloudxlab/spark/projects/real-time-analytics-dashboard/node/node_modules/kafka-node/lib/client.js:587:18)
    at Socket. (/home/svaheguru2242/cloudxlab/spark/projects/real-time-analytics-dashboard/node/node_modules/kafka-node/lib/client.js:550:14)
    at Socket.emit (events.js:95:17)
    at Socket. (_stream_readable.js:765:14)
    at Socket.emit (events.js:92:17)
    at emitReadable_ (_stream_readable.js:427:10)
