DataFrames, Spark SQL, R


Spark SQL - Understanding Datasets

Datasets are similar to RDDs; however, instead of using Java serialization or Kryo, they use a specialized Encoder to serialize objects for processing or for transmission over the network.

While both encoders and standard serialization are responsible for turning an object into bytes, encoders are code-generated dynamically and use a format that allows Spark to perform many operations, such as filtering, sorting, and hashing, without deserializing the bytes back into an object.
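As a rough illustration, and assuming the Person case class used later in this section, you can obtain an encoder explicitly and inspect the schema Spark derives from it:

import org.apache.spark.sql.Encoders

case class Person(name: String, age: Long)

// Encoders.product derives an encoder from the case class fields.
val personEncoder = Encoders.product[Person]

// The encoder carries the schema and binary layout, which is what lets
// Spark filter, sort and hash without deserializing full objects.
println(personEncoder.schema)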

The simplest way to create a Dataset is to call the .toDS method on a Scala collection. Note that although Seq is an ordinary Scala collection, the toDS method becomes available only by importing Spark's implicits. In the interactive shell (spark-shell), you don't have to import anything, since the implicits are already in scope.
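For example, in spark-shell, a minimal sketch:

// toDS is available on ordinary Scala collections via the pre-imported implicits.
val numbersDS = Seq(1, 2, 3).toDS()

numbersDS.show()
// +-----+
// |value|
// +-----+
// |    1|
// |    2|
// |    3|
// +-----+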

If you are building a standalone Spark application, however, you need to import the Spark implicits yourself after creating the SparkSession.
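A minimal sketch of that boilerplate in a standalone application (the application name and local master below are placeholders):

import org.apache.spark.sql.SparkSession

object DatasetExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DatasetExample")   // placeholder name
      .master("local[*]")          // for local testing only
      .getOrCreate()

    // The implicits live on the SparkSession instance, so this import
    // must come after the session has been created.
    import spark.implicits._

    val ds = Seq(1, 2, 3).toDS()
    ds.show()

    spark.stop()
  }
}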

The other way is to create a collection of objects and then call toDS on it. The objects can be instances of case classes or custom classes.

Here we first define a Person case class with name and age fields, and then create a collection of Person objects using Seq. In this example there is only one object.

Afterward, we convert the collection to a Dataset using the toDS method. The result, caseClassDS, is a Dataset; to see its contents, we can use the show method.
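Putting those steps together, a minimal sketch with placeholder values (in spark-shell, where the implicits are already in scope):

case class Person(name: String, age: Long)

// A Seq containing a single Person object, converted to a typed Dataset.
val caseClassDS = Seq(Person("Andy", 32)).toDS()

caseClassDS.show()
// +----+---+
// |name|age|
// +----+---+
// |Andy| 32|
// +----+---+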

We can also load data from files such as JSON into a Dataset of Person using the as method.
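For example, assuming a JSON file in which each line is an object with name and age fields (the path below is a placeholder):

case class Person(name: String, age: Long)

// Each line of the file is a JSON object, e.g. {"name":"Andy","age":32}
val path = "people.json"  // placeholder path

val peopleDS = spark.read.json(path).as[Person]
peopleDS.show()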

A DataFrame can be converted to a Dataset by providing a class to the as method. The mapping is done based on attribute names.
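The same as method converts an existing DataFrame; the columns are matched to the case class fields by name, not by position (the path below is a placeholder):

case class Person(name: String, age: Long)

// An untyped DataFrame (Dataset[Row])...
val peopleDF = spark.read.json("people.json")  // placeholder path

// ...becomes a typed Dataset[Person]; the "name" and "age" columns are
// mapped onto the case class fields by name.
val peopleDS = peopleDF.as[Person]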


12 Comments

Hi

I believe the Dataset API is available in Scala and Java. Because Python and R are dynamically typed, the Dataset API is not available (and not really needed) for Python/R.

Please correct me if my understanding is wrong.

Thanks
Sairam Srinivas. V


Your understanding is absolutely correct.


Can someone tell me the exact difference between a DataFrame and a Dataset?

Can we use Datasets in Python?


I did not find any difference in the execution of the two:

val peopleDS = spark.read.json(path)

val peopleDS = spark.read.json(path).as[Person]

Both outputs are the same.


Hi,

As mentioned, the displayed output is the same. The difference is in the type: spark.read.json(path) returns an untyped DataFrame (a Dataset[Row]), whereas adding .as[Person] converts it into a typed Dataset[Person]. A minimal sketch below, assuming spark.implicits._ is in scope and the Person case class from the lesson:
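val peopleDF = spark.read.json(path)            // untyped DataFrame (Dataset[Row])
val peopleDS = spark.read.json(path).as[Person] // typed Dataset[Person]

// Typed, compile-checked access works only on the Dataset:
peopleDS.map(p => p.age + 1).show()
// peopleDF.map(row => row.age)  // would not compile: Row has no 'age' member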

Thanks.


In the first slide, Encoders section, 2nd bullet: shouldn't it be "without deserializing"?


Hi, Swapnik.

Yes, you are right.

It should read: "Encoders will perform these operations without deserializing the bytes back into an object."
It is written correctly in our transcripts.

Thanks
All the best.


What is the need of .as[Person] in the command below?

val peopleDS = spark.read.json(path).as[Person]


It does not give the DataFrame a name. .as[Person] converts the untyped DataFrame returned by spark.read.json into a typed Dataset[Person]; the columns are mapped onto the Person case class fields by name, so you can then work with Person objects and their fields directly.
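A short sketch, assuming spark.implicits._ is in scope and JSON records with non-null name and age fields:

case class Person(name: String, age: Long)

val peopleDS = spark.read.json(path).as[Person]

// Field access is now type-checked against the case class:
peopleDS.filter(p => p.age >= 18).show()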
