DataFrames, Spark SQL, R

Spark SQL - Introduction

Spark SQL is the Apache Spark module for handling structured data. It lets you process structured data through a SQL-like interface. If your data can be represented in tabular form, or already lives in a structured source such as a SQL database, you can use Spark SQL to process it.

Spark SQL provides the DataFrame API, which lets you mix SQL queries, R-like dataframe manipulation, and the usual RDD transformations and actions in the same program. The pieces are tightly integrated: the result of a SQL query is itself a DataFrame, and every DataFrame exposes its underlying RDD.

Whether your data lives in HDFS, Hive, or a relational database, and whether it is stored as Avro, Parquet, ORC, or JSON, you can access and process it in the same way.

With Spark SQL, you can typically run your existing Hive queries without modification. You can also connect your existing BI tools to big data through Spark SQL's JDBC/ODBC server.

Moreover, you can even join data across different formats and different data sources.

This is how Spark SQL provides uniform data access.