DataFrames, Spark SQL, R

Spark SQL - Getting Started

So, let us get started with Spark SQL. Since the Spark SQL module was stabilized in Spark 2.0, we need to use Spark 2 or later. The Spark SQL module can be accessed from the usual spark-shell, from a usual Spark application, from PySpark or from Java, so we do not need to create any new kind of application to use Spark SQL.

And to use Spark with Hadoop, as we discussed earlier, we do not need to install Spark on the Hadoop cluster.

We just have to export two variables, YARN_CONF_DIR and HADOOP_CONF_DIR, both pointing to the location of Hadoop's configuration directory. Of these two variables, YARN_CONF_DIR is the newer one, while HADOOP_CONF_DIR is kept for legacy reasons.

On CloudxLab, the default Spark installation as of now is Spark 1.5.4. It may get upgraded in the future. For now, we have installed more versions of Spark in the /usr directory. Please note that you can also download a newer version of Spark into your home directory using wget and use it if you want a different version.

Once we have set these variables, we can launch spark-shell from any of the installations in the /usr folder.

The spark-shell and the other binaries are available in the bin folder of each installation. Here we are launching spark-shell from the Spark 2.0.2 installation by using /usr/spark2.0.2/bin/spark-shell.

Once the spark shell has finished launching, you will see the Spark banner with the version displayed alongside it, and you will be given a Scala prompt. This is where you write your Scala code.

Had you used the sparkR command instead of spark-shell, you would see an R prompt where you would write your code in R.

The spark-shell prompt is essentially a Scala prompt with a few extra variables such as sc and spark.

So, we are essentially at a Scala prompt. How is it different from the usual Scala prompt?

There are two extra variables, sc and spark, which are our gateway to Spark. For the Spark SQL or DataFrames API, we will be using the SparkSession, which is available simply as the spark object.

For the usual RDD interactions, we use the sc variable, which is the SparkContext.
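
For example, at the prompt you can quickly check that both variables are already defined; this is just a small illustrative snippet, and the numbers are arbitrary:

    scala> spark.version                        // SparkSession - entry point for DataFrames and Spark SQL
    scala> sc.parallelize(1 to 5).count()       // SparkContext - entry point for RDDs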

In case you want to create a Spark application that you can submit as a batch job, instead of working interactively, you need to create a SparkSession object in the main method of your application.

Here, using the builder method on the SparkSession object, we create a SparkSession builder. The appName method called on the builder sets the name of the application, which will be displayed in the logs and in the context web UI.

We can also set various configuration properties on the builder by calling its config method.

Then, using the getOrCreate method on the builder, we create a SparkSession object. Here the name we have given to the variable holding the session is "spark". It could be anything.
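
Putting these steps together, a minimal sketch of what such a main method could contain (the object name, the application name and the configuration key are just placeholder values):

    import org.apache.spark.sql.SparkSession

    object MyApp {
      def main(args: Array[String]): Unit = {
        // Build (or reuse) the SparkSession; appName shows up in the logs and the web UI,
        // and config() can set any Spark configuration property.
        val spark = SparkSession
            .builder()
            .appName("Spark SQL basic example")
            .config("spark.some.config.option", "some-value")
            .getOrCreate()

        // ... use spark here for DataFrames and Spark SQL ...

        spark.stop()   // release the resources when the job is done
      }
    }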

The spark-shell, pyspark and the other Spark consoles internally do the same thing before launching their respective interactive interfaces.

Please note that it is also useful to import the Spark implicits using import spark.implicits._

When the Scala compiler finds a variable or expression of the wrong type, it will look for an implicit function, expression or class to provide the correct type. The implicit function (or class) needs to be in the current scope for the compiler to do its work. This is typically accomplished by importing a Scala object that contains the implicit definition(s).
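
For instance, once the implicits are in scope, ordinary Scala collections can be turned into DataFrames directly; a small illustrative example, where the data and the column names are arbitrary:

    import spark.implicits._

    // The implicit conversions add toDF()/toDS() to local Scala collections and RDDs
    val people = Seq(("Alice", 29), ("Bob", 31)).toDF("name", "age")
    people.show()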

Note - If you are facing an error while importing the implicits, then please run the below command first:

val spark = SparkSession.builder().appName("Spark SQL basic example").config("spark.some.config.option", "some-value").getOrCreate()

Spark - Dataframes & Spark SQL (Part1)


19 Comments

Getting error for implicits and read command

scala> import spark.implicits._
<console>:26: error: value implicits is not a member of object org.apache.spark.sql.SparkSession
       import spark.implicits._

scala> var df = spark.read.json("/data/spark/people.json")
<console>:26: error: value read is not a member of object org.apache.spark.sql.SparkSession
       var df = spark.read.json("/data/spark/people.json")
                      ^

Please help at the earliest.

Hi Punit,

I believe the error is self-explanatory. It needs a stable identifier.

Below commands should fix it.

val spark2 = spark
import spark2.implicits._

 

Or we can initialize spark with "val". The below command should work:

val spark = SparkSession.builder().appName("Spark SQL basic example").config("spark.some.config.option", "some-value").getOrCreate()

Thanks Abhinav. This is resolved with the help from Sandeep.

This is resolved

Hi, I have a general doubt. I can access the /usr directory through the cd /usr command. But when I use ls -la in my home directory, I can't find the usr directory in the list displayed on the terminal. Why is that so? And how can I see such hidden folders through the terminal?

Hi Pranav,

The "usr" directory is in "/" (the root directory) and your home is in "/home/<your_lab_username>", so when you're in your home directory you're actually two directories below the root directory.

To get a better understanding, please execute "pwd" in your home directory and also in "/usr"; then you'll be able to see the difference.

Hope it helps.

Thanks a lot sir. I understood my silly mistake! :)

Hi Pranav,

Questions are never silly and mistakes make us learn.

Happy learning.

Very true...

When I tried importing the implicits I got this error. Please help.

scala> import spark.implicits._
<console>:26: error: value implicits is not a member of object org.apache.spark.sql.SparkSession
       import spark.implicits._

The same thing occurs for me, sir. Is there any method to fix it?

 

Initialize the SparkSession before using the implicits:

 

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession
        .builder()
        .appName("Spark SQL basic example")
        .config("spark.some.config.option", "some-value")
        .getOrCreate()

import spark.implicits._

I am getting the same error. I have followed all the steps as provided. Please check the snapshot. Moreover, the session hangs once the error is thrown by "import spark.implicits._". I tried 3 times and observed the same pattern.

This is resolved.

If we can convert an RDD into a DataFrame, we can efficiently churn data sets. In the previous examples where we were parsing weblogs to find the top 10 IPs, can we achieve the same using DataFrames?
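
One possible way to do that with DataFrames, as a rough sketch only (the log path and the assumption that each line starts with the client IP are illustrative, not taken from the lesson):

    import org.apache.spark.sql.functions._

    // Read the raw log lines, take the first whitespace-separated token as the IP,
    // then group, count and keep the 10 most frequent IPs.
    val logs = spark.read.text("/data/spark/access.log")    // hypothetical path
    val top10 = logs
        .select(split(col("value"), " ").getItem(0).as("ip"))
        .groupBy("ip")
        .count()
        .orderBy(desc("count"))
        .limit(10)
    top10.show()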

Hi,

Where is Spark located? I cannot find Spark in the /usr folder.

By default it is located at /usr/hdp/current/spark-client/, and the extra Spark versions we have installed are at /usr now. Please check again.
Also, please note that you can start using Spark by way of any of these commands:
spark-shell
pyspark
sparkR
spark-submit
spark-sql

Great!
