DataFrames, Spark SQL, R

3 / 18

Spark SQL - Getting Started

So, let us get started with spark SQL. Since Spark SQL module was stabilized in spark 2.0 onwards, we would need to use spark2. The spark SQL module can be accessed using the usual spark shell, using usual spark application, PySpark or Java. So, we don't need to create any new application for using spark SQL.

And to use the spark with Hadoop, as we have discussed earlier, we do not need to install spark on Hadoop cluster.

We would just have to export two variables YARN_CONF_DIR and HADOOP_CONF_DIR both having the location of configuration directory of Hadoop. Out of these two variable YARN_CONF_DIR is the new one while HADOOP_CONF_DIR is for legacy reasons.

On CloudxLab, the version of default spark installation as of now is Spark 1.5.4. It may get upgraded in future. For now, we have installed more version of spark in /usr directory. Please note that you can also download a newer version of spark in your home directory using wget and use it if you want a different version.

Once we have set the variables, you can launch the spark-shell from any of the installations in /usr folder.

The spark shell and any other binaries would be available in the bin folder of the installation. Here we are launching spark-shell with sparks' 2.0.2 version by using /usr/spark2.0.2/bin/spark-shell

Once spark shell has finished launching you would see the spark banner with the version displayed by the side and you would be provided a scala prompt. This is where you would write your scala code.

Had you used sparkR instead of spark-shell command, you would see the R prompt where you would write code in R.

It is essentially scala prompt with few extra variables such as sc and spark.

So, we are essentially at scala prompt. How is it different from usual scala prompt?

There are two extra variables sc and spark which are our gateway to spark. For the spark sql or data frames API, we will be using Spark session which available as simply spark object highlighted in yellow.

For usual RDD or dataset interactions, we use the sc variable.

In case, if you want to create spark-application which you can submit as a batch job, instead of interactive way, you would need to create an object of spark session in your main method of application.

Here, using builder method on SparkSession object we are creating spark session builder. The method appname called on builder object here sets the name of the application which will be displayed in the logs and context web UI.

Also, we can set various configuration properties on builder by the way of calling config method on the builder.

Then using the getOrCreate method on builder we can create a sparksession object. Here the name we have given to the session is "spark". It could be anything.

The spark-shell or pyspark or any other spark consoles are internally doing the same before launching respective interactive interfaces.

Please note that it is also useful to import the spark implicits using import spark.implicts dot underscore.

When the Scala compiler finds a variable or expression of the wrong type, it will look for an implicit function, expression or class to provide the correct type. The implicit function (or class) needs to be in the current scope for the compiler to do its work. This is typically accomplished by importing a Scala object that contains the implicit definition(s).