Spark SQL - Understanding Datasets

Datasets are similar to RDDs. However, instead of using Java serialization or Kryo, they use a specialized Encoder to serialize objects for processing or for transmission over the network.

While both encoders and standard serialization are responsible for turning an object into bytes, encoders are code generated dynamically and use a format that allows Spark to perform many operations like filtering, sorting and hashing without deserializing the bytes back into an object.
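To make the idea of an encoder a little more concrete, here is a minimal sketch that asks Spark for the encoder of a case class and prints the schema it derives. The Person case class and the object name EncoderSchemaExample are assumptions made for illustration. The sketch only shows that an encoder knows the structure of the object; it does not demonstrate the generated code itself.

```scala
import org.apache.spark.sql.{Encoder, Encoders}

// Hypothetical case class used only for illustration.
case class Person(name: String, age: Long)

object EncoderSchemaExample {
  // Encoders.product builds an encoder for case classes (Scala Product types).
  val personEncoder: Encoder[Person] = Encoders.product[Person]

  def main(args: Array[String]): Unit = {
    // The encoder carries a schema describing the fields of Person.
    personEncoder.schema.printTreeString()
  }
}
```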

The simplest way to create a Dataset is to call the toDS method on a Scala collection. Please note that although Seq is an ordinary Scala collection, the toDS method is made available on it by importing the Spark implicits. In the interactive shell, you don't have to import them yourself.

If you are building a Spark application, you need to import the Spark implicits in addition to creating the Spark session.
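As a rough sketch of what this looks like in a standalone application, the example below creates a Spark session, imports the implicits from it, and then calls toDS on a Seq. The object name, the application name and the local master setting are assumptions chosen so that the sketch can run on a single machine.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

object PrimitiveDatasetExample {
  // Builds a Dataset[Int] from an ordinary Scala Seq.
  def build(spark: SparkSession): Dataset[Int] = {
    // This import is what makes toDS available on Seq.
    import spark.implicits._
    Seq(1, 2, 3).toDS()
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master is used here only to keep the sketch runnable.
    val spark = SparkSession.builder()
      .appName("PrimitiveDatasetExample")
      .master("local[*]")
      .getOrCreate()

    build(spark).show()
    spark.stop()
  }
}
```

In the interactive shell, the session already exists as spark and the implicits are already imported, so only the Seq(1, 2, 3).toDS() line would be needed.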

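The other way is to create a collection of objects and then call toDS on it. The objects can be instances of case classes, for which the Spark implicits provide an encoder automatically, or of custom classes, which generally need an explicitly supplied encoder.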

Here we first define a Person case class with name and age, and then create a collection of Person objects using Seq. In this case, there is only one object.

Afterward, we convert the collection to a Dataset using the toDS method. The resulting caseClassDS is a Dataset. To see its contents, we can use the show method.
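The steps just described can be sketched as follows. The name caseClassDS comes from the text; the field types, the sample values "Andy" and 32, the object name and the local master setting are assumptions made for illustration.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Person with name and age, as described in the text. The field types are an assumption.
case class Person(name: String, age: Long)

object CaseClassDatasetExample {
  // Creates a one-element collection of Person and converts it to a Dataset.
  def build(spark: SparkSession): Dataset[Person] = {
    import spark.implicits._
    Seq(Person("Andy", 32)).toDS()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CaseClassDatasetExample")
      .master("local[*]")
      .getOrCreate()

    val caseClassDS = build(spark)
    // show prints the rows of the Dataset in tabular form.
    caseClassDS.show()
    spark.stop()
  }
}
```

Note that the case class is defined outside the method that uses it, so that Spark can derive its encoder.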

We can also load data from files such as JSON as a Dataset of Person using the as method.
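A minimal sketch of loading JSON as a Dataset of Person might look like this. To keep it self-contained, it first writes a small temporary JSON file; that file, its contents and the object name are assumptions made for illustration, and in practice the path would point to your own data.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.{Dataset, SparkSession}

// age is a Long because Spark infers JSON integers as long values.
case class Person(name: String, age: Long)

object JsonDatasetExample {
  // Reads a JSON file (one object per line) and converts it to a Dataset[Person].
  def load(spark: SparkSession, path: String): Dataset[Person] = {
    import spark.implicits._
    spark.read.json(path).as[Person]
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("JsonDatasetExample")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical sample data written to a temporary file.
    val file = Files.createTempFile("people", ".json")
    Files.write(file, """{"name":"Andy","age":30}""".getBytes(StandardCharsets.UTF_8))

    load(spark, file.toUri.toString).show()
    spark.stop()
  }
}
```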

A DataFrame can be converted to a Dataset by providing a class to the as method. The mapping is done based on the names of the attributes: each field of the class is matched to the column with the same name.
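To illustrate the name-based mapping, the sketch below builds a DataFrame whose columns are deliberately in a different order from the fields of Person and then converts it with the as method. The sample values, the object name and the local master setting are assumptions made for illustration.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

case class Person(name: String, age: Long)

object DataFrameToDatasetExample {
  // Converts a DataFrame to Dataset[Person]; columns are matched to fields by name.
  def convert(spark: SparkSession, df: DataFrame): Dataset[Person] = {
    import spark.implicits._
    df.as[Person]
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DataFrameToDatasetExample")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Column order is (age, name), while Person declares (name, age).
    val df = Seq((30L, "Andy")).toDF("age", "name")

    val people = convert(spark, df)
    people.collect().foreach(p => println(s"${p.name} is ${p.age}"))
    spark.stop()
  }
}
```

If a column required by the class is missing from the DataFrame, the conversion is expected to fail with an analysis error, so the attribute names need to line up.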