DataFrames, Spark SQL, R


Spark SQL - SQL Queries On Dataframes

If you are more comfortable with SQL, you can use SQL for processing a dataframe in the following way.

First, you need to create a temporary view of the dataframe by calling the createOrReplaceTempView method on it, passing a name for the view as an argument. Here we are registering the df dataframe as a view named people.

Afterwards, you can call the sql method on the Spark session object with whatever SQL query you want. The dataframe will be available to your queries as a table view. Here we have created the temp view named people from df and then run select * from people as the SQL query.

The result of the sql method is another dataframe, on which you can call the usual dataframe methods.

To see the result, you need to call the show() method on that dataframe. You can see that the SQL query has worked successfully.

We had df, which was loaded from a JSON file. We registered df as people, then created another dataframe sqlDF using a SQL query, and displayed it using the show() method.
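For reference, here is a minimal sketch of that flow in the Scala shell; the JSON path below is a placeholder, not one given in the lesson:

// Load a dataframe from a JSON file (path is a placeholder)
val df = spark.read.json("/path/to/people.json")

// Register the dataframe as a temporary view named "people"
df.createOrReplaceTempView("people")

// Run a SQL query against the view; the result is another dataframe
val sqlDF = spark.sql("SELECT * FROM people")

// Display the result
sqlDF.show()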

We can further register some other dataframe as another view and join it with the people view using SQL, as sketched below.
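As a rough sketch, assuming a second dataframe df2 with a matching name column (df2, the departments view, and its columns are hypothetical and not part of the lesson):

// Register a hypothetical second dataframe as another view
df2.createOrReplaceTempView("departments")

// Join the two views with plain SQL; the result is again a dataframe
val joinedDF = spark.sql(
  "SELECT p.name, d.dept FROM people p JOIN departments d ON p.name = d.name")
joinedDF.show()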

Spark - Dataframes & Spark SQL (Part1)





8 Comments

df.write.format("hive").option("compression","uncompressed").saveAsTable("meta.categories_replica")Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/usr/hdp/current/spark2-client/python/pyspark/sql/readwriter.py", line 588, in saveAsTable self._jwrite.saveAsTable(name) File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__ File "/usr/hdp/current/spark2-client/python/pyspark/sql/utils.py", line 69, in deco raise AnalysisException(s.split(': ', 1)[1], stackTrace)pyspark.sql.utils.AnalysisException: u'Cannot create hive serde table with saveAsTable API;'


How do I resolve this issue?


Hi,

Sorry for the late reply. In case this issue was not resolved, could you please share a screenshot of your code and the error you are getting?

Thanks.


How can we define a GlobalTempView instead of a TempView?

df.createOrReplaceGlobalTempView("globalpeople")
<console>:32: error: value createOrReplaceGlobalTempView is not a member of org.apache.spark.sql.DataFrame
       df.createOrReplaceGlobalTempView("globalpeople")

It's showing the error mentioned above. Can anyone help me?

df1.createGlobalTempView("df2")

This is available.
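A global temp view is registered in the system database global_temp and has to be queried with that prefix. A minimal sketch based on the names above (note that the createOrReplaceGlobalTempView variant only exists in newer Spark versions, which may explain the error):

// Register a global temporary view named "df2"
df1.createGlobalTempView("df2")

// Query it through the global_temp database; the view remains visible
// across Spark sessions within the same application
spark.sql("SELECT * FROM global_temp.df2").show()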


How do we run SQL queries and Hive queries in spark-sql and spark-shell, and what is the difference between them?



Hi, Pavan.

Running queries in spark-sql or the Spark shell is essentially the same as running them in the Scala shell; you just need to use the provided API instance "sqlContext".

Use the provided HiveContext instance sqlContext to run a HiveQL query by calling the sql method on the sqlContext object.

scala> val results = sqlContext.sql("SELECT * FROM my_keyspace.my_table")
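On Spark 2.x the same call also works through the SparkSession object spark that spark-shell creates for you (the table name below is just the placeholder reused from the line above):

scala> val results = spark.sql("SELECT * FROM my_keyspace.my_table")
scala> results.show()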

Below are some of the advantages of running queries in Spark rather than in Hive.

1) Faster Execution - Spark SQL is faster than Hive.

2) No Migration Hurdles - Though both HiveQL and Spark SQL follow the SQL way of writing queries, their syntax differs in places; still, developers can continue to write queries in HiveQL, and during execution they will internally get converted to Spark SQL.

3) Supports Real-Time Processing - Spark SQL supports real-time querying of data by using the metastore services of Hive.

All the best.

--
Best,
Satyajit Das


When I try to use SQL queries on dataframes and show the results, I am getting an error. Please help. (Please note that myDF.show() gives proper results.)

scala> myDf.createOrReplaceTempView("mylog")
scala> val test=spark.sql("SELECT * from mylog")
test: org.apache.spark.sql.DataFrame = [host: string, timeStamp: string ... 2 more fields]

scala> test.show()
20/02/08 11:41:13 WARN BlockManager: Putting block rdd_7_0 failed due to an exception
20/02/08 11:41:13 WARN BlockManager: Block rdd_7_0 could not be removed as it was not found on disk or in memory
20/02/08 11:41:13 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 3)
java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:347)
at scala.None$.get(Option.scala:345)
at $line27.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.parseLogLine(<console>:32)
at $line29.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:35)
at $line29.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:35)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.next(InMemoryRelation.scala:99)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.next(InMemoryRelation.scala:91)
at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1016)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1007)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:947)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1007)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:711)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler....(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
