GraphFrames on CloudxLab

GraphFrames is quite a useful library of spark which helps in bringing Dataframes and GraphX package together.

From the website of Graphframes:

GraphFrames is a package for Apache Spark which provides DataFrame-based Graphs. It provides high-level APIs in Scala, Java, and Python. It aims to provide both the functionality of GraphX and extended functionality taking advantage of Spark DataFrames. This extended functionality includes motif finding, DataFrame-based serialization, and highly expressive graph queries.

You can use graph frames very easily with spark-shell at CloudxLab by using —package option in the following way.

For spark-shell:

/usr/spark2.0.1/bin/spark-shell --packages graphframes:graphframes:0.4.0-spark2.0-s_2.11

For python spark shell:

/usr/spark2.0.1/bin/pyspark --packages graphframes:graphframes:0.4.0-spark2.0-s_2.11

When you launch the shell with the –packages argument, it is going to download graphframes and make available in the shell. Now, lets create a graph frame. Here is some example code (scala):

//Lets import all classes of graphframes package
import org.graphframes._
import spark.sqlContext

//Create a DataFrame representing Vertices of the graph from a list of tuples using toDF function
//This dataframe has unique ID "id" column and other details.
val v = sqlContext.createDataFrame(List(
  ("x", "Jack", 34),
  ("y", "Jill", 36),
  ("z", "Maggie", 30)
)).toDF("id", "name", "age")

//Now we would create a dataframe representing the edges of the graph
// Create an Edge DataFrame with "src" and "dst" columns
val e = sqlContext.createDataFrame(List(
  ("x", "y", "follow"), //Jack follows Jill
  ("y", "z", "friend"),
  ("z", "y", "follow")
)).toDF("src", "dst", "relationship")

// Now, with these two Dataframes, we can create a GraphFrame
import org.graphframes.GraphFrame
val g = GraphFrame(v, e)

// Query: Get in-degree of each vertex.

This would display the total in degrees of each vertex:

| id|inDegree|
|  z|       1|
|  y|       2|

Now, lets try to filter. The following code would display the counts of edges that have follow relationship which 2.

// Query: Count the number of "follow" connections in the graph.
g.edges.filter("relationship = 'follow'").count()

Now, lets try to run the an algorithm such as pagerank on the graph.

// Run PageRank algorithm, and show results.
val results = g.pageRank.resetProbability(0.01).maxIter(20).run()"id", "pagerank").show()

After few iterations, it should display the page rank of each element as follows:

| id|           pagerank|
|  x|               0.01|
|  z|0.27995525261339177|
|  y| 0.2808611427228327|