Learn from industry experts how to load and save data with Spark, apply compression, and handle various file formats.
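To make this concrete, here is a minimal PySpark sketch of loading and saving data in a couple of formats with compression; the file paths are placeholders, not part of the course material.

```python
# A minimal PySpark sketch of loading and saving data in different
# formats with compression; the file paths are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("load-save-demo").getOrCreate()

# Read a CSV file with a header row, letting Spark infer column types.
df = spark.read.csv("data/input.csv", header=True, inferSchema=True)

# Save the same data as JSON, compressed with gzip.
df.write.mode("overwrite").option("compression", "gzip").json("out/json")

# Save as Parquet, a columnar format compressed with snappy by default.
df.write.mode("overwrite").parquet("out/parquet")

# Read the Parquet data back; the schema is stored with the files.
parquet_df = spark.read.parquet("out/parquet")
parquet_df.show(5)
```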
Whenever you request a page from a web server, the server records that request in a file called a log.
The logs of a web server are a gold mine for insights into user behaviour. Data scientists usually look at the logs first to understand how users behave. But because logs are humongous in size, processing them takes a distributed framework like Hadoop or Spark.
As part of this project, you will learn to parse the text data stored in the logs of a web server using Apache Spark.
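As a taste of what the project covers, here is a minimal PySpark sketch of parsing access logs; the log path and the Common Log Format regex are assumptions about how the server writes its logs.

```python
# A minimal PySpark sketch of parsing web server access logs; the log
# path and the Common Log Format regex below are assumptions.
import re
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("log-parser").getOrCreate()

# One regex group per field of the Common Log Format:
# host, identity, user, timestamp, request, status, size.
LOG_PATTERN = re.compile(
    r'^(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+)')

def parse_line(line):
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    host, _, _, timestamp, request, status, size = match.groups()
    return (host, timestamp, request, int(status))

logs = spark.sparkContext.textFile("hdfs:///data/access.log")
parsed = logs.map(parse_line).filter(lambda rec: rec is not None)

# Count requests per HTTP status code, a typical first question to ask.
status_counts = parsed.map(lambda rec: (rec[3], 1)) \
                      .reduceByKey(lambda a, b: a + b)
print(status_counts.collect())
```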
In this project, we will learn how to build a real-time analytics dashboard using Apache Spark Streaming, Kafka, Node.js, Socket.IO, and Highcharts.
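The Spark side of such a pipeline can be sketched with Structured Streaming reading from Kafka; the broker address and topic name below are assumptions, and the job needs the spark-sql-kafka-0-10 package on its classpath. The Node.js, Socket.IO, and Highcharts layers, which consume the aggregated counts, are omitted here.

```python
# A minimal sketch of the Spark side of the dashboard pipeline, using
# Structured Streaming to read from Kafka; the broker address and the
# topic name "page_views" are assumptions for illustration.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window

spark = SparkSession.builder.appName("dashboard-feed").getOrCreate()

# Subscribe to the Kafka topic the web application writes events to.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "page_views")
          .load())

# Count events per 10-second window; a downstream process would push
# these counts to the Node.js/Socket.IO layer for the Highcharts UI.
counts = (events
          .select(col("timestamp"))
          .groupBy(window(col("timestamp"), "10 seconds"))
          .count())

# Write the running counts to the console for demonstration purposes.
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```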
There are many big data solution stacks.
The first and most powerful stack is Apache Hadoop and Spark together: Hadoop provides storage for structured and unstructured data, and Spark provides the computational capability on top of it (see the sketch after this comparison).
The second option is to use a NoSQL database such as Cassandra or MongoDB. The third is to use a cloud platform such as Google Compute Engine or Microsoft Azure; in that case you would have to upload your data to Google or Microsoft, which may not always be acceptable to your organization.
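To make the first stack concrete, here is a minimal sketch of Spark computing over data stored in HDFS; the HDFS path is a placeholder.

```python
# A minimal sketch of the Hadoop + Spark stack: HDFS stores the data
# and Spark computes over it; the HDFS path is a placeholder.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-demo").getOrCreate()

# Read a text file directly from HDFS and run a distributed word count.
lines = spark.sparkContext.textFile("hdfs:///data/corpus.txt")
word_counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
print(word_counts.take(10))
```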
In this post, we will understand the basics of: