If you are more comfortable with SQL, you can use SQL to process a DataFrame in the following way.
First, you need to create a temporary view of this DataFrame by calling the method createOrReplaceTempView on it, providing a name for the view as the argument. Here we are registering the df DataFrame as a view with the name people.
Afterward, you can call the sql method on the SparkSession object with whatever SQL query you want. The DataFrame will be available to your queries as a table (view). Here we have created the temp view named people from df and then run SELECT * FROM people as the SQL query.
The result of the sql method is another DataFrame on which you can call the usual DataFrame methods.
To see the result, you need to call the show() method on that DataFrame. You can see that the SQL query has worked successfully.
So: we had df, which was loaded from a JSON file; we registered df as the view people; we then created another DataFrame, sqlDF, using an SQL query; and finally we displayed it using the show() method.
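A minimal sketch of this flow in the Scala shell (the file path people.json is only an illustration, not necessarily the dataset used in this lesson):

// Load a DataFrame from a JSON file (path is illustrative).
val df = spark.read.json("/data/spark/people.json")

// Register the DataFrame as a temporary view named "people".
df.createOrReplaceTempView("people")

// Run SQL against the view; the result is another DataFrame.
val sqlDF = spark.sql("SELECT * FROM people")

// Display the result.
sqlDF.show()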
We can also register some other DataFrame as another view and join it with the people view using SQL, as sketched below.
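For example, assuming a second DataFrame deptDf with columns id and dept_name, and a dept_id column in people (these names are illustrative, not part of the lesson's data), the join could look like this:

// Register another DataFrame as a second view.
deptDf.createOrReplaceTempView("departments")

// Join the two views with plain SQL; the result is again a DataFrame.
val joinedDF = spark.sql(
  """SELECT p.name, d.dept_name
     FROM people p
     JOIN departments d ON p.dept_id = d.id""")

joinedDF.show()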
8 Comments
df.write.format("hive").option("compression","uncompressed").saveAsTable("meta.categories_replica")

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/hdp/current/spark2-client/python/pyspark/sql/readwriter.py", line 588, in saveAsTable
    self._jwrite.saveAsTable(name)
  File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/usr/hdp/current/spark2-client/python/pyspark/sql/utils.py", line 69, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u'Cannot create hive serde table with saveAsTable API;'
How do I resolve this issue?
Hi,
Sorry for the late reply. In case this issue was not resolved, could you please share a screenshot of your code and the error you are getting?
Thanks.
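In case it helps in the meantime: the message "Cannot create hive serde table with saveAsTable API" usually appears on Spark versions where saveAsTable cannot write Hive serde tables directly. A sketch of two common workarounds (shown here in the Scala shell; the same calls exist in PySpark, and the table name meta.categories_replica is taken from the question above):

// Option 1: let saveAsTable write a data-source table (e.g. Parquet)
// instead of requesting a Hive serde table.
df.write.format("parquet").saveAsTable("meta.categories_replica")

// Option 2: register a temp view and create the table through SQL.
df.createOrReplaceTempView("categories_tmp")
spark.sql("CREATE TABLE meta.categories_replica AS SELECT * FROM categories_tmp")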
How can we define a GlobalTempView instead of a TempView?
df.createOrReplaceGlobalTempView("globalpeople")
<console>:32: error: value createOrReplaceGlobalTempView is not a member of org.apache.spark.sql.DataFrame
       df.createOrReplaceGlobalTempView("globalpeople")
It is showing the error mentioned above. Can anyone help me?
This is available.
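A small sketch of how a global temp view is used (the error above typically just means the installed Spark version does not yet have createOrReplaceGlobalTempView; createGlobalTempView should still work, and global temp views are queried through the reserved global_temp database):

// Register a global temp view; it is shared across SparkSessions.
df.createGlobalTempView("globalpeople")

// Query it with the global_temp prefix.
spark.sql("SELECT * FROM global_temp.globalpeople").show()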
How do we run SQL queries and Hive queries in spark-sql and spark-shell, and what is the difference between them?
Hi, Pavan.
Running queries in spark-sql or in the Spark shell is much the same as running them in the Scala shell; you just need to use the API instance sqlContext.
Use the provided HiveContext instance sqlContext to run a HiveQL query by calling the sql method on the sqlContext object.
scala> val results = sqlContext.sql("SELECT * FROM my_keyspace.my_table")
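To illustrate the difference between the two (a small sketch; the table name is taken from the example above): in spark-shell you run queries through the session object from Scala code, whereas the spark-sql CLI accepts the query text directly.

// In spark-shell (a Scala REPL), SQL/HiveQL goes through the session object:
val results = spark.sql("SELECT * FROM my_keyspace.my_table")  // Spark 2.x (or sqlContext.sql in Spark 1.x)
results.show()

// In the spark-sql CLI there is no Scala code; the query is typed directly:
//   spark-sql> SELECT * FROM my_keyspace.my_table;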
Below are some of the advantages of running in Spark than in Hive.
1) Faster Execution - Spark SQL is faster than Hive.
2) No Migration Hurdles - Both HiveQL and Spark SQL follow the SQL way of writing queries, so even though the two syntaxes differ in places, developers can continue to write queries in HiveQL, and during execution they are internally converted to Spark SQL.
3) Supports Real-Time Processing – Spark SQL supports real-time querying of data by using the metastore services of Hive.
All the best.
Best,
Satyajit Das
When I try to use SQL queries on DataFrames and show the results, I get an error. Please help. (Note that myDf.show() gives proper results.)
scala> myDf.createOrReplaceTempView("mylog")
scala> val test=spark.sql("SELECT * from mylog")
test: org.apache.spark.sql.DataFrame = [host: string, timeStamp: string ... 2 more fields]
scala> test.show()
20/02/08 11:41:13 WARN BlockManager: Putting block rdd_7_0 failed due to an exception
20/02/08 11:41:13 WARN BlockManager: Block rdd_7_0 could not be removed as it was not found on disk or in memory
20/02/08 11:41:13 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 3)
java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:347)
at scala.None$.get(Option.scala:345)
at $line27.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.parseLogLine(<console>:32)
at $line29.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:35)
at $line29.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:35)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.next(InMemoryRelation.scala:99)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.next(InMemoryRelation.scala:91)
at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1016)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1007)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:947)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1007)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:711)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler....(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
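For what it's worth, the stack trace points at parseLogLine and a java.util.NoSuchElementException: None.get, which usually means an Option (for example a regex match) was unwrapped with .get even though it was None for some input line; myDf.show() may still look fine because it only evaluates a few rows, while this query materializes more of the data and hits the bad line. A minimal defensive sketch, assuming a regex-based parser (the LogRecord fields and the pattern are illustrative, not the course's exact code):

// Hypothetical case class and pattern; adjust to the real log format.
case class LogRecord(host: String, timeStamp: String, url: String, status: String)

val logPattern = """^(\S+) \S+ \S+ \[([^\]]+)\] "\S+ (\S+) \S+" (\d{3}).*""".r

def parseLogLine(line: String): Option[LogRecord] = line match {
  case logPattern(host, ts, url, status) => Some(LogRecord(host, ts, url, status))
  case _ => None  // skip unparsable lines instead of calling .get on None
}

// Keep only the lines that parsed successfully before creating the DataFrame.
// val myDf = rawLines.flatMap(parseLogLine).toDF()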