DataFrames, Spark SQL, R


Spark SQL - Understanding Datasets

Datasets are similar to RDDs; however, instead of using Java serialization or Kryo, they use a specialized Encoder to serialize objects for processing or for transmission over the network.

While both encoders and standard serialization are responsible for turning an object into bytes, encoders are code-generated dynamically and use a format that allows Spark to perform many operations, such as filtering, sorting, and hashing, without deserializing the bytes back into an object.
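As a rough illustration, and assuming the Person case class used later in this section, you can obtain an encoder explicitly and inspect the schema Spark derives from it:

import org.apache.spark.sql.Encoders

case class Person(name: String, age: Long)

// Encoders.product derives an encoder from the case class fields.
val personEncoder = Encoders.product[Person]

// The encoder carries the schema and binary layout, which is what lets
// Spark filter, sort and hash without deserializing full objects.
println(personEncoder.schema)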

The simplest way to create a Dataset is to call the .toDS method on a Scala collection. Note that although Seq is an ordinary Scala collection, the toDS method becomes available only by importing Spark's implicits. In the interactive shell (spark-shell), you don't have to import anything, since the implicits are already in scope.
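For example, in spark-shell, a minimal sketch:

// toDS is available on ordinary Scala collections via the pre-imported implicits.
val numbersDS = Seq(1, 2, 3).toDS()

numbersDS.show()
// +-----+
// |value|
// +-----+
// |    1|
// |    2|
// |    3|
// +-----+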

If you are building a standalone Spark application, however, you need to import the Spark implicits yourself after creating the SparkSession.
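A minimal sketch of that boilerplate in a standalone application (the application name and local master below are placeholders):

import org.apache.spark.sql.SparkSession

object DatasetExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DatasetExample")   // placeholder name
      .master("local[*]")          // for local testing only
      .getOrCreate()

    // The implicits live on the SparkSession instance, so this import
    // must come after the session has been created.
    import spark.implicits._

    val ds = Seq(1, 2, 3).toDS()
    ds.show()

    spark.stop()
  }
}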

The other way is to create a collection of objects and then call toDS on it. The objects can be instances of case classes or custom classes.

Here we first define a Person case class with name and age fields, and then create a collection of Person objects using Seq. In this example there is only one object.

Afterward, we convert the collection to a Dataset using the toDS method. The result, caseClassDS, is a Dataset; to see its contents, we can use the show method.
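Putting those steps together, a minimal sketch with placeholder values (in spark-shell, where the implicits are already in scope):

case class Person(name: String, age: Long)

// A Seq containing a single Person object, converted to a typed Dataset.
val caseClassDS = Seq(Person("Andy", 32)).toDS()

caseClassDS.show()
// +----+---+
// |name|age|
// +----+---+
// |Andy| 32|
// +----+---+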

We can also load data from files such as JSON into a Dataset of Person using the as method.
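For example, assuming a JSON file in which each line is an object with name and age fields (the path below is a placeholder):

case class Person(name: String, age: Long)

// Each line of the file is a JSON object, e.g. {"name":"Andy","age":32}
val path = "people.json"  // placeholder path

val peopleDS = spark.read.json(path).as[Person]
peopleDS.show()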

A DataFrame can be converted to a Dataset by providing a class to the as method. The mapping is done based on attribute names.
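The same as method converts an existing DataFrame; the columns are matched to the case class fields by name, not by position (the path below is a placeholder):

case class Person(name: String, age: Long)

// An untyped DataFrame (Dataset[Row])...
val peopleDF = spark.read.json("people.json")  // placeholder path

// ...becomes a typed Dataset[Person]; the "name" and "age" columns are
// mapped onto the case class fields by name.
val peopleDS = peopleDF.as[Person]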


12 Comments

Hi

I believe the Dataset API is available in Scala and Java. Because Python and R are dynamically typed, the Dataset API is not available (and not really needed) for Python/R.

Please correct me if my understanding is wrong.

Thanks
Sairam Srinivas. V


Your understanding is absolutely correct.


Can someone tell me the exact difference between a DataFrame and a Dataset?

Can we use Datasets in Python?


I did not find any difference in the execution of the two:

val peopleDS = spark.read.json(path)

val peopleDS = spark.read.json(path).as[Person]

Both outputs are the same.


Hi,

As mentioned, the displayed output is the same. The difference is in the type: spark.read.json(path) returns an untyped DataFrame (a Dataset[Row]), whereas adding .as[Person] converts it into a typed Dataset[Person]. A minimal sketch below, assuming spark.implicits._ is in scope and the Person case class from the lesson:
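val peopleDF = spark.read.json(path)            // untyped DataFrame (Dataset[Row])
val peopleDS = spark.read.json(path).as[Person] // typed Dataset[Person]

// Typed, compile-checked access works only on the Dataset:
peopleDS.map(p => p.age + 1).show()
// peopleDF.map(row => row.age)  // would not compile: Row has no 'age' member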

Thanks.


In the first slide, Encoders section, 2nd bullet: shouldn't it be "without deserializing"?


Hi, Swapnik.

Yes, you are right.

It should read: "Encoders will perform these operations without deserializing the bytes back into an object."
It is written correctly in our transcripts.

Thanks
All the best.


What is the need of .as[Person] in the command below?

val peopleDS = spark.read.json(path).as[Person]


It does not give the DataFrame a name. .as[Person] converts the untyped DataFrame returned by spark.read.json into a typed Dataset[Person]; the columns are mapped onto the Person case class fields by name, so you can then work with Person objects and their fields directly.
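A short sketch, assuming spark.implicits._ is in scope and JSON records with non-null name and age fields:

case class Person(name: String, age: Long)

val peopleDS = spark.read.json(path).as[Person]

// Field access is now type-checked against the case class:
peopleDS.filter(p => p.age >= 18).show()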
