Once you have created the dataframe, you can operate on it. There are many operations available on a dataframe.
To see the schema of a dataframe we can call printSchema method and it would show you the details of each of the columns. Here, we can see that it has automatically figured out the data type of age column as long and name column as String.
If we need to just project a single column, we could use the select method with the name of the column as an argument and then call show method on it.
You can see that it has displayed the values of the first column. This is very much like dataframe operations of R programming.
If we want to do complex projections on data such as adding 1 to the age and displaying it, we can simply use $age + 1. Please note that the evaluation is lazy in Spark. So, you would have to use show() or other action in order to start the computation.
The select method basically generates another dataframe but it does not hold actual data else it could cause memory overflow. So, in order to avoid memory overflows and optimize the computing, spark uses the lazy evaluation model.
It does not do the computation unless we really ask for it. It just keeps on making notes.
Further, if we would like to filter our data based on various conditions, you can use a method filter on a dataframe. This method takes an expression and in this expression, you can refer the column value using the dollar sign.
Here you can see that Andy is the only one having age above 21. Please note that this filter is not the same method as it was in RDD. In RDD, filter method was accepting a method as an argument while here it is accepting an expression in the argument.
A data frame also provides group by operation. GroupBy basically returns grouped dataset on which we execute aggregates such as count. This basically computes the counts of people of each age.
This operation is essentially equivalent to SQL query: Select age, count(*) from df group by age