Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant processing of live data streams.
Spark Streaming processes a continuous stream of input data in three stages. It reads data from sources such as Kafka, Flume, Twitter, and TCP sockets. It then applies user-defined transformations to that data. Finally, it pushes the processed results to file systems such as HDFS, to databases, or to live dashboards.
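The read, transform, and output stages can be sketched with PySpark's DStream API. This is a minimal illustration and not a production job: it assumes `pyspark` is installed and that some process is writing lines of text to a TCP socket. The function names `split_words` and `run_word_count`, the application name, and the default host, port, and batch interval are all made up for this example.

```python
def split_words(line):
    """User-defined transformation: split one line of text into words."""
    return line.split()


def run_word_count(host="localhost", port=9999, batch_seconds=20):
    """Count words arriving on a TCP socket, one result per batch."""
    # Imported inside the function so split_words can be used without Spark.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    # local[2]: one thread receives data, the other processes it.
    sc = SparkContext("local[2]", "SocketWordCount")  # hypothetical app name
    ssc = StreamingContext(sc, batch_seconds)         # batch interval in seconds

    # Stage 1: read from a source (here a TCP socket).
    lines = ssc.socketTextStream(host, port)

    # Stage 2: apply user-defined transformations.
    counts = (lines.flatMap(split_words)
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))

    # Stage 3: push results out (here printed to the console).
    counts.pprint()

    ssc.start()             # start receiving and processing
    ssc.awaitTermination()  # block until the job is stopped
```

Calling `run_word_count()` while a tool such as `nc -lk 9999` feeds text to the socket should print word counts for each batch. Swapping `pprint()` for `saveAsTextFiles(...)` would write each batch to a file system instead.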
Spark Streaming receives live input data streams and divides the data into batches. The batch interval is set in the Spark Streaming program according to the application's requirements. For example, with a batch interval of 20 seconds, Spark Streaming creates batches that each contain 20 seconds of data from the input stream. As soon as one batch has finished collecting its 20 seconds of data, a new batch begins. The Spark engine processes these batches to produce the final stream of results, which is also emitted in batches.
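To make the batching concrete, here is a small pure-Python sketch of how records fall into 20-second batches. It is a simplified model, not Spark's implementation: it assumes each record carries an arrival time in seconds measured from the start of the stream, and the function name `assign_batches` is hypothetical.

```python
from collections import defaultdict


def assign_batches(records, batch_seconds=20):
    """Group (arrival_time_seconds, value) pairs by batch index.

    Batch 0 covers [0, batch_seconds), batch 1 covers
    [batch_seconds, 2 * batch_seconds), and so on.
    """
    batches = defaultdict(list)
    for arrival_time, value in records:
        index = int(arrival_time // batch_seconds)
        batches[index].append(value)
    # Return a plain dict ordered by batch index.
    return dict(sorted(batches.items()))


records = [(0, "a"), (5, "b"), (19.9, "c"), (20, "d"), (41, "e")]
print(assign_batches(records))
# {0: ['a', 'b', 'c'], 1: ['d'], 2: ['e']}
```

A record arriving at exactly 20 seconds lands in the second batch, because the first batch closes once its 20 seconds have elapsed.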