Graph Processing With Spark

1 / 2

GraphX Quick Walkthrough

Welcome to a session on GraphX package of Apache Spark.

Before understand graphx, let us first understand what is a graph? A graph is data structure having vertices and edges. The vertices are connected via edge. We can represent complex relationships using graph. Here in the example, the vertices represent the people. Each vertex has the id and properties. Rxin is a student, jgonzal is a postdoctoral candidate and so on.

The edges represent the relationship between these people. For example, student marked with 3 id is collaboarator of 7 post doctoral candidate. Similarly, 5 franklin is the advisor of 3 student.

So, the graph is defined by two tables vertex table and edge table.

Most of the computational problems do have graph in one way or another. Many real life problems are solved using the graph data structures. Here are few examples of graph problems at large scale.

First one is finding common friends in a social graph such as facebook. The second example could be to find the page rank.

Most of the computational problems do have graph in one way or another. Many real life problems are solved using the graph data structures. Here are few examples of graph problems at large scale.

First one is finding common friends in a social graph such as facebook.

Second example is pagerank. Pagerank is the algorithm which made google the most successful search engine. A search engine basically keeps downloading various pages from the internet. After downloading it creates indexes and tries to understand the pages. Index is like a list of keywords with details about where those were found.

When we search for some keywords, the search engine returns the results from its index. These results are ordered by relevence. Google computed the relevance based on how many pages are linking to a page. The more number of pages would link to your website, the higher would be the rank of your website. Also, the more important pages link to your website, the rank of your page will be further higher.

This kind of computation required keeping the relationships between the various pages as a graph. And the graph is generally repsented as a table of vertices and edge. In case of webpages, the vertices would repsent the pages and the edges would repsent the hyperlink. Since there are billions of webpages, google needed a map-reduce kind of infrastructure to do the pagerank computation.

The GraphX package of Spark unifies the ETL - extract transform and load functionality of data into the graph structure. GraphX also provides exploratory analysis. GraphX is faster than Giraph and slightly slower than graphlab.

GraphX makes it possible to view the same data as graph as well as the collections. It can convert the RDD into graph and vice-versa. GraphX can Transform and join graphs with RDDs efficiently. GraphX at its core basically Extends the Spark RDD by introducing a new Graph abstraction.

The graphX provide a very rich library of algorithms that involve graph computations.

GraphX provides pagerank algorithms. Pagerank means if important pages are linking to your website, you will be more important. This algorithm can be applied in situations other than websites too. For example, we can use pagerank algorithm to find the rank of various twitter handles.

The connected components algorithms provided by GraphX can be used to identify the clusters amongst your facebook friends.

The another important algorithm on Graphs is Triangle Counting. With this algorithms you can count the total triangles passing through each vertex. The counts of triangles is also a measure of clustering.

The other algorithms such as Label Propagation, SVD++ and strongly connect components are also provided by graphx.

It provides the fundamental operators on graph such as subgraph, joinVertices, aggregateMessages and many more. Take a look at the API and programming guide for more details.

With these operators, we can create our own algorithms.

Let us try to use pagerank of spark graphx. Say we have the following graph of people. Just beiber is following barack obama and barack obama and lady gaga are following each other. Similarly there john resig, martin, matel following one another.

Here, out of Barack Obama, Justin Beiber and lady gaga who is least important? It would be justin beiber because he is not followed up anyone. While out of barack obama and lady gaga, barack obama is more important because he is being folowing by two persons. This is what pagerank algorithm is.

As more and more people are added to this graph, the computation of ranks will become complex. It would need a decent amount of computation to find out the ranks of each node.

Let us try to understand how to compute the pagerank using graphx. First take a look at the data. The data is located in /data/spark/graphx.

This file is located in in HDFS so we can use hadoop fs cat command to see the contents of file. Here it is a very small data. Each line contains two numbers which are space separated. The first number represent the follower and the second number represents who is being followed. So, from first two line we can say that 2 and 4 are following 1. Also, these are just the edges. For graphx to compute pagerank, it only needs the edges or relations.


Now, lets start spark-shell. We would need to export the HADOOP_CONF_DIR and YARN_CONF_DIR environment variables. Now, launch the newer version of spark-shell using /usr/spark2.0.2/bin/spark-shell

Wait for scala prompt to appear. On the scala prompt we will be writing the code. This code is available in our github repository inside spark / examples / graphx /pagerank.scala the url is also displayed below.

First, let's load the graph using GraphLoader.edgeListFile with spark context and the file name. This creates a graph object from plain text file having information about edges.

Now, we can simply call graph.pageRank method. And select the vertices after the pagerank has been executed. That's it. We have calculated the pagerank of each vertex. To take a look at result, lets collect ranks object. You can see that each vertex id has a rank.

We can further replace the id with the name of the person. This can be done by joining this vertices id with users. Lets create the RDD of users using textFile followed by the map transformation. Next join the users with ranks. Afterwards, we cleanup the data structure using map method.

You can see that it has properly computed the pageranks of users and also replaced id with the name of user.

After the graph computation, we can see the final numbers were as displayed in the graph.

That completes our quick introduction to graphX. To know more about the graphx please visit the graphx api documentation on their own website as mentioned in the references below.