Let us understand an important machine learning use case of big data called Collaborative filtering. It is commonly used for recommender systems. These techniques aim to fill in the missing entries of a user-item association matrix. Mllib supports model-based collaborative filtering meaning it would give-out the model instead of directly generating the recommendations.
In MLlib, Users and products are described by a small set of latent factors that can be used to predict missing entries. MLlib uses the alternating least squares (ALS) algorithm to learn these latent factors.
Let's take an example of how to generate recommendations. It is also known as collaborative filtering.
Let us first look at the data ratings.dat. This file is in HDFS at the location /data/ml-1m/ratings.dat. You can open Hue and click on file browser. Click the first slash in the path to go to the top level and then click on ml-1m. Inside this folder, you will find a file ratings.dat. Click to open it. This file contains textual data.
Each line represents a rating. The values in each line are separated by a double colon. The first number is the user id, the second number is movie id and the third one is the ratings given by this user to this movie id and the fourth number is the time stamp.
In the first line of the file, 1 is user id, 1193 is the movie id and 5 is the rating.
To know more about this data you can take a look at README file in the same folder.
Now, let's use the spark MLlib to generate recommendations. We are going to divide this movies rating dataset into two parts - training and test. Training being 80% and test being remaining 20%. Using training data we will train the model and we will generate recommendations for the test data.
Let's open the WebConsole and start a version spark two. We are going to use spark2. Using the command /usr/spark2.0.1/bin/spark-shell lets launch the scala spark shell. Wait for it to start. Once the scala prompt appears we can start working with it.
Let us first import the needed libraries. Here we are importing ALS which is the algorithm Alternating Least Squares that we are going to use. The data we saw earlier we are going to create a dataframe out of it. Since we already know the schema, we can use the reflections way of creating dataframe. First we would create an rdd having objects of a case class and then use toDF to convert this rdd into dataframe.
So, let us first define the case class Rating with the four members userid, movieid, rating and timestamp. Next we define a method that would convert each line of the text into the object of Rating. Let us test this function individually with the first row of text. You can see that it has successfully converted the string into the object.
Next, we create an RDD with the name "raw" using sc.textFile with the path name. Let us check if the rdd is okay by calling take function with 1 as an argument. It should display the first line of text. For us, it is working fine. If it does not correctly, double check the location of the file you passed in sc.textFile.
Now, let's apply our parseRating function on each record of this raw rdd by using the map transformation. And call toDF method on it to convert it into dataframe. Now, we have a dataframe ratings. Let us check if this dataframe is correctly formed using show method. You can see that it is in tabular form.
Now, let's break this dataframe into two parts training and test using randomSplit function. With training dataframe we are going to train model and with test dataframe we are going to test the model.
Now, let's first create the instance of ALS algorithm and then set various parameters. Afterward, set the name of user column, item column and ratings column. We are basically mapping our dataframe's columns to various fields that the algorithm needs.
We can simply call als.fit on our "training" dataframe. It might take some time. Once done, it would return the model. This model can be used to generate the recommendations. You can save this model on a file which can be transferred to a production environment and generate the recommendations real time.
Now, let's prepare the recommendations using the function "transform" on the model with test dataframe as the argument. And using take, you can quickly check if it has generated the predictions. It if is working fine. You can save the results to a file using prediction.write.format() followed by save.