Building Real-Time Analytics Dashboard Using Apache Spark


In this blog post, we will learn how to build a real-time analytics dashboard using Apache Spark Streaming, Kafka, Node.js, Socket.IO, and Highcharts.

Complete the Spark Streaming topic on CloudxLab to refresh your Spark Streaming and Kafka concepts and get the most out of this guide.

Problem Statement

An e-commerce portal (http://www.aaaa.com) wants to build a real-time analytics dashboard to visualize the number of orders getting shipped every minute to improve the performance of their logistics.

Solution

Before working on the solution, let’s take a quick look at all the tools we will be using:

Apache Spark – A fast and general engine for large-scale data processing. It is up to 100x faster than Hadoop MapReduce in memory and 10x faster on disk. Learn more about Apache Spark here

Python – Python is a widely used high-level, general-purpose, interpreted, dynamic programming language. Learn more about Python here

Kafka – A high-throughput, distributed, publish-subscribe messaging system. Learn more about Kafka here

Node.js – Event-driven I/O server-side JavaScript environment based on V8. Learn more about Node.js here

Socket.IO – Socket.IO is a JavaScript library for real-time web applications. It enables real-time, bi-directional communication between web clients and servers. Read more about Socket.IO here

Highcharts – Interactive JavaScript charts for web pages. Read more about Highcharts here

CloudxLab – Provides a real cloud-based environment for practicing and learning various tools. You can start practicing right away just by signing up online.

How To Build A Data Pipeline?

Below is the high-level architecture of the data pipeline.

Data Pipeline

Our real-time analytics dashboard will look like this:

Real-Time Analytics Dashboard

Let’s start with the description of each stage in the data pipeline and build the solution.

Stage 1

When a customer buys an item or an order status changes in the order management system, the corresponding order ID, along with the order status and timestamp, gets pushed to the Kafka topic.

Dataset

Since we do not have an online e-commerce portal in place, we have prepared a dataset containing CSV files. Let’s have a look at the dataset.

Our dataset contains three columns: ‘DateTime’, ‘OrderId’, and ‘Status’. Each row in the dataset represents the order status at a particular date and time. Here we have masked the OrderId with “xxxxx-xxx”. We are only interested in the number of orders getting shipped every minute, so we do not need the actual order ID.
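
For illustration, a few rows of the dataset might look like the following sketch. The timestamps and statuses here are made up; only the column names and the masked order IDs come from the post.

```python
import csv
import io

# Illustrative sample of the dataset (values are invented for this sketch;
# order IDs are masked as in the real data). Columns: DateTime, OrderId, Status.
sample_csv = """DateTime,OrderId,Status
2017-01-02 10:25:00,xxxxx-xxx,shipped
2017-01-02 10:25:12,xxxxx-xxx,delivered
2017-01-02 10:25:45,xxxxx-xxx,shipped
"""

# Parse the rows into dictionaries keyed by column name.
rows = list(csv.DictReader(io.StringIO(sample_csv)))
for row in rows:
    print(row["DateTime"], row["Status"])
```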

The entire source code for the solution, along with the dataset, can be cloned from the CloudxLab GitHub repository.

The dataset is located in the spark/projects/real-time-analytics-dashboard/data/order_data directory in the above repository.

Push Dataset to Kafka

A shell script takes each row of these CSV files and pushes it to Kafka. It waits for one minute before pushing the next CSV file, so that we can simulate a real-time e-commerce environment where order statuses change at different time intervals. In a real-world scenario, when an order status changes, the corresponding order details get pushed to Kafka.
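
The shell script in the repository does the actual pushing; a rough Python sketch of the same idea might look like this. It assumes the kafka-python package, a locally reachable broker, and a placeholder topic name you would replace with your own username-prefixed topic.

```python
import glob
import time

def csv_data_rows(path):
    """Return the data rows of a CSV file, skipping the header line."""
    with open(path) as f:
        lines = [line.rstrip("\n") for line in f]
    return [line for line in lines[1:] if line]

def main():
    # Assumes kafka-python is installed and a broker is reachable;
    # the topic name is a placeholder, not the exact one from the post.
    from kafka import KafkaProducer
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    topic = "abhinav9884-order-data"

    # Push one CSV file per minute to simulate orders arriving over time.
    for path in sorted(glob.glob("data/order_data/*.csv")):
        for row in csv_data_rows(path):
            producer.send(topic, row.encode("utf-8"))
        producer.flush()
        time.sleep(60)  # wait a minute before the next file

if __name__ == "__main__":
    main()
```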

Watch the video below to understand how to use Kafka on CloudxLab.

Let’s run our shell script to push data to the Kafka topic. Log in to the CloudxLab web console and run the commands below.

Stage 2

After Stage 1, each message in the Kafka topic looks something like this:
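
Since the script pushes the CSV rows verbatim, each message is presumably one raw row of the file. A hedged sketch of parsing such a message (the field names come from the dataset; the sample values are made up):

```python
def parse_order_message(message):
    """Split a raw Kafka message (one CSV row) into its three fields."""
    date_time, order_id, status = message.split(",")
    return {"DateTime": date_time, "OrderId": order_id, "Status": status}

# Illustrative message; the real timestamp and status will differ.
msg = "2017-01-02 10:25:00,xxxxx-xxx,shipped"
parsed = parse_order_message(msg)
print(parsed["Status"])  # shipped
```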

Stage 3

Watch the video below to learn about Spark Streaming and Kafka integration.

The Spark Streaming code takes data from the Kafka topic in a window of 60 seconds and processes it so that we have the total count of each unique order status in that 60-second window. After processing, the total count of each unique order status gets pushed to a new Kafka topic (create a new topic with your username, e.g. abhinav9884-order-one-min-data).
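
A rough sketch of this stage, not the exact code from the repository: it assumes PySpark with the spark-streaming-kafka integration (KafkaUtils) plus kafka-python for the producer, and uses placeholder topic names. The per-batch aggregation is pulled out into a small helper for clarity.

```python
import json

def count_statuses(rows):
    """Count occurrences of each order status in a batch of CSV rows."""
    counts = {}
    for row in rows:
        status = row.split(",")[2]
        counts[status] = counts.get(status, 0) + 1
    return counts

def main():
    # Assumes a Spark 1.x/2.x deployment with the Kafka streaming package;
    # broker address and topic names are placeholders.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils
    from kafka import KafkaProducer

    sc = SparkContext(appName="OrderStatusCounter")
    ssc = StreamingContext(sc, 60)  # 60-second batch window

    stream = KafkaUtils.createDirectStream(
        ssc, ["abhinav9884-order-data"],
        {"metadata.broker.list": "localhost:9092"})

    # Each Kafka record is a (key, value) pair; keep the CSV row value,
    # map it to (status, 1), and sum the counts within the window.
    counts = (stream.map(lambda kv: kv[1])
                    .map(lambda row: (row.split(",")[2], 1))
                    .reduceByKey(lambda a, b: a + b))

    def push_to_kafka(rdd):
        # Publish the per-minute summary to the new topic as JSON.
        producer = KafkaProducer(bootstrap_servers="localhost:9092")
        producer.send("abhinav9884-order-one-min-data",
                      json.dumps(dict(rdd.collect())).encode("utf-8"))
        producer.flush()

    counts.foreachRDD(push_to_kafka)
    ssc.start()
    ssc.awaitTermination()

if __name__ == "__main__":
    main()
```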

Run the commands below in the web console to create the one-minute Kafka topic and run the Spark Streaming code.

Stage 4

In this stage, each message in the new Kafka topic ‘abhinav9884-order-one-min-data’ looks something like the JSON payload below.
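
An illustrative sketch of what such a summary payload might look like; the field names and counts here are assumptions, not the exact payload from the post.

```python
import json

# Hypothetical one-minute summary: status names mapped to counts.
payload = json.loads('{"shipped": 657, "delivered": 312, "cancelled": 14}')
print(payload["shipped"])
```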

Stage 5

Run Node.js server

Now we start a Node.js server to consume messages from the one-minute Kafka topic and push them to the web browser, so that we can display the number of orders getting shipped per minute.

Run the commands below in the web console to start the Node.js server.

The Node server starts on port 3001. If there is an ‘EADDRINUSE’ error while starting it, edit the index.js file and change the port to 3002, 3003, 3004, and so on until you find an available port. Please use any available port in the 3001–3010 range to run the Node server.

Access from browser

After the Node server starts, go to http://YOUR_WEB_CONSOLE:PORT_NUMBER to access the real-time analytics dashboard. For example, if your web console is f.cloudxlab.com and your Node server is running on port 3002, go to http://f.cloudxlab.com:3002 to access the dashboard.

When we access the above URL, the socket.io-client library is loaded in the browser, which enables a bi-directional communication channel between the server and the browser.

Stage 6

As soon as a new message is available in the one-minute Kafka topic, the Node process consumes it. The consumed message is then emitted to the web browser via Socket.IO.

Stage 7

As soon as the socket.io-client in the web browser receives a new ‘message’ event, the data in the event gets processed. If the order status in the received data is “shipped”, it gets added to the Highcharts series and displayed on the browser.

Screencast

We’ve also recorded a screencast showing how to run all the above commands and build the real-time analytics dashboard.

We have successfully built the real-time analytics dashboard. This was a basic example to show how we can integrate Spark Streaming, Kafka, Node.js, and Socket.IO to build a real-time analytics dashboard. Now that we know the basics, we can build more complex systems using the above tools.

Hope this guide was helpful. Please feel free to leave your comments. Follow CloudxLab on Twitter to get updates on new blogs and videos.


About authors

Abhinav Singh

Seasoned hands-on technical architect with years of experience in building large-scale products for global and Indian audiences
Sandeep Giri

Seasoned hands-on technical architect with years of strong experience in building world-class software products

  • Interesting read 🙂

    • abhinav singh

      Glad to know that you liked it 🙂

  • Ramcharan Teja

    Hi i am working on Kafka spark streaming using node js same related to this article
    can you please provide code in more detail

    • abhinav singh

      Hi Ramacharan,

      Just curious, in which part you are looking more detail.

      You can send me an email at abhinav at cloudxlab dot com and we can discuss the improvements.

      • Ramcharan Teja

        Im able to produce and consume data from kafka but i want to process data streaming using spark using node js client
        Im completely new to spark streaming so am not able to find a solution to integrate kafka and spark using node js client

        • abhinav singh

          Hi Ramcharan,

          We can connect over email. Later I will update the post with summary of our discussion so that it will be useful to other users also

          • Dinesh

            Hi Abinav,

            Could you please share the completed post, so that new users like us will refer into it?

          • abhinav singh

            Hi Dinesh,

            This is the complete blog post. Just curious if you have found any thing missing in the post. Feel free to share your thoughts so that I can improve it.

        • abhinav singh

          Hi Ramcharan,

          I have updated the code and guide.

          Please check it now and keep me updated.

  • kaylawei

    One suggestion – try to write one name other than ‘order-data’, like ‘order-data-1000’ … or you cannot push successfully. But I got events.js:74 after I run ‘node index.js’, I don not know why

    • abhinav singh

      Hi,

      Sorry I missed your comment. Let me check it out.

      • abhinav singh

        Hi,

        I have updated the code and guide.

        Can you please check now? It should work.

        Please keep me updated.

  • sachin arora

    Hello Team,

    I tried to start the node client for Viewing the orders but everytime it fails with below error

    Running on port 3008
    express deprecated res.sendfile: Use res.sendFile instead index.js:16:9
    a user connected
    events.js:74
    throw TypeError(‘Uncaught, unspecified “error” event.’);
    ^
    TypeError: Uncaught, unspecified “error” event.
    at TypeError ()
    at emit (events.js:74:15)
    at /home/svaheguru2242/cloudxlab/spark/projects/real-time-analytics-dashboard/node/node_modules/kafka-node/lib/highLevelConsumer.js:178:36
    at null. (/home/svaheguru2242/cloudxlab/spark/projects/real-time-analytics-dashboard/node/node_modules/kafka-node/lib/client.js
    :364:28)
    at Client.handleReceivedData (/home/svaheguru2242/cloudxlab/spark/projects/real-time-analytics-dashboard/node/node_modules/kafka-node/lib/
    client.js:587:18)
    at Socket. (/home/svaheguru2242/cloudxlab/spark/projects/real-time-analytics-dashboard/node/node_modules/kafka-node/lib/client.
    js:550:14)
    at Socket.emit (events.js:95:17)
    at Socket. (_stream_readable.js:765:14)
    at Socket.emit (events.js:92:17)
    at emitReadable_ (_stream_readable.js:427:10)

    • abhinav singh

      Hi Sachin,

      Let me check it. Can you please let me know your topic name in kafka?

      • sachin arora

        Hi Abhinav,

        Topic name is order-data11 while pushing data into kafka from csv and rest is the same as explained in above blog.

        Regards,
        Sachin

      • abhinav singh

        Hi Sachin,

        I have updated the code and guide.

        Can you please check now? It should work.

        Please keep me updated.