Datasets are similar to RDDs; however, instead of using Java serialization or Kryo, they use a specialized Encoder to serialize the objects for processing or for transmission over the network.
While both encoders and standard serialization are responsible for turning an object into bytes, encoders are code-generated dynamically and use a format that allows Spark to perform many operations, such as filtering, sorting, and hashing, without deserializing the bytes back into an object.
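For instance, here is a minimal sketch showing that the encoder for a case class is code-generated automatically and carries a schema Spark can operate on directly:

import org.apache.spark.sql.Encoders

case class Person(name: String, age: Long)

val personEncoder = Encoders.product[Person]  // code-generated encoder for the case class
println(personEncoder.schema)                 // the schema the encoded binary format carries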
The simplest way to create a Dataset is by calling the .toDS method on a Scala collection. Note that although Seq is an ordinary Scala collection, the toDS method becomes available only by importing implicits. In the interactive spark-shell you don't have to import them; the shell does it for you.
If you are building a Spark application, you need to import the Spark implicits yourself, in addition to creating the SparkSession.
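A minimal sketch of the application setup (the application name and local master below are assumptions for illustration):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("DatasetExample")   // hypothetical application name
  .master("local[*]")          // assumption: running locally
  .getOrCreate()

import spark.implicits._       // makes toDS available on Scala collections

val ds = Seq(1, 2, 3).toDS()
ds.map(_ + 1).collect()        // returns Array(2, 3, 4)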
The other way is to create a collection of objects and then call toDS. The objects can be created with case classes or custom classes.
Here we first define a Person case class with name and age fields, and then create a collection of Person objects using Seq; in this example the collection contains only one object.
Afterward, we convert it to a Dataset using the toDS method. The resulting caseClassDS is a Dataset; to see its contents, we can use the show method.
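Putting these steps together (assuming spark.implicits._ is already in scope):

case class Person(name: String, age: Long)

val caseClassDS = Seq(Person("Andy", 32)).toDS()  // a one-element collection of Person
caseClassDS.show()
// +----+---+
// |name|age|
// +----+---+
// |Andy| 32|
// +----+---+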
We can also load data from files such as JSON directly as a Dataset of Person using the as method.
Similarly, a DataFrame can be converted to a Dataset by providing a class to the as method; the mapping is done based on the attribute names.
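As a sketch, assuming spark.implicits._ is in scope and a file people.json whose records carry name and age fields matching the Person case class:

val path = "people.json"              // hypothetical path to the JSON file
val peopleDF = spark.read.json(path)  // untyped DataFrame, i.e. Dataset[Row]
val peopleDS = peopleDF.as[Person]    // typed Dataset[Person]; columns are matched to fields by name
peopleDS.show()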
Comments
What's the issue over here?
Hi,
Try this fix:
https://stackoverflow.com/questions/49900232/value-todf-is-not-a-member-of-seqint-string
Thanks.
Hi,
I believe the Datasets API is available in Scala/Java. Due to the dynamic nature of Python/R, the Datasets API is not available (and not required) for Python/R.
Please correct me if my understanding is wrong.
Thanks
Sairam Srinivas. V
Your understanding is absolutely correct.
Can someone tell me the exact difference between a DataFrame and a Dataset?
Can we use Datasets in Python?
Hi Chriannjivi, please visit the below URL; it describes this beautifully.
https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
I did not find any difference in the execution of the two:
val peopleDS = spark.read.json(path)
val peopleDS = spark.read.json(path).as[Person]
Both outputs are the same.
Hi,
As mentioned, the displayed output is the same; however, the two are not identical: the first line gives an untyped DataFrame (Dataset[Row]), while the second gives a typed Dataset[Person], which enables compile-time type checking and object-style access to the fields.
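For example, a quick sketch (assuming spark.implicits._ is in scope and a Person case class with name and age fields):

val peopleDF = spark.read.json(path)             // DataFrame, i.e. Dataset[Row]
val peopleDS = spark.read.json(path).as[Person]  // typed Dataset[Person]

peopleDF.select("name")  // column named by a string, validated only at runtime
peopleDS.map(_.name)     // field access checked at compile time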
Thanks.
In the first slide, Encoders section, 2nd bullet, shouldn't it be "without deserializing"?
Hi Swapnik,
Yes, you are right.
It should be: Encoders perform the operations without deserializing the bytes back into the object.
It is written correctly in our transcripts.
Thanks
All the best.
What is the need of .as[Person]?
In the below command:
val peopleDS = spark.read.json(path).as[Person]
.as[Person] converts the untyped DataFrame into a typed Dataset[Person], mapping the JSON attributes to the fields of the Person case class by name, so subsequent operations can work directly with Person objects.