Machine Learning With Spark

MLlib Overview

Many tools are available for machine learning. Here we group them by the size of the data they can comfortably handle.

If our data is just a few lines, we do not need any tool as such; a whiteboard or pen and paper is enough for both analysis and visualization.

Once the data goes beyond a few records, we can use MATLAB, Octave, or R for analysis and visualization, up to a data size of a few megabytes.

If the data is larger than a few megabytes but within tens of gigabytes, the analysis can be done using NumPy, SciPy, or Weka, and the visualization using Flare, amCharts, etc.

But if the data grows beyond that range, we have to use distributed computing libraries such as MLlib for machine learning, SparkR for analysis, and GraphX for complex graph processing. Similar libraries from the Hadoop ecosystem are Mahout and Giraph.

Also, notice that no visualization tool is listed for big data. Why? Because plotting such a huge volume of data directly is difficult, and the result would not make much sense to the human eye. Instead, we first process the big data to compute a summary and then plot the summary.
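The "summarize first, then plot" idea can be sketched in plain Python. This is only an illustration under stated assumptions: the function names summarize and text_histogram are hypothetical, the random numbers stand in for a dataset too large to plot point by point, and on real big data the aggregation step would run in a distributed engine such as Spark rather than in a single process.

```python
import random
from collections import Counter


def summarize(values, bin_width):
    """Count how many values fall into each fixed-width bin, in a single pass."""
    counts = Counter()
    for v in values:
        bin_start = int(v // bin_width) * bin_width
        counts[bin_start] += 1
    return dict(sorted(counts.items()))


def text_histogram(summary, width=40):
    """Render the summary (not the raw data) as a simple text bar chart."""
    peak = max(summary.values())
    lines = []
    for bin_start, count in summary.items():
        bar = "#" * max(1, round(width * count / peak))
        lines.append(f"{bin_start:>4} | {bar} {count}")
    return "\n".join(lines)


random.seed(0)
# A generator, so the raw records are never held in memory all at once.
records = (random.uniform(0, 100) for _ in range(100_000))

summary = summarize(records, bin_width=10)  # 100,000 records reduced to 10 counts
print(text_histogram(summary))
```

Only the ten bin counts reach the plotting step, which is why the approach still works when the raw data is far too large to draw.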

Let us now understand MLlib. What is MLlib? It is the machine learning library of Apache Spark. Its goal is to make machine learning scalable and easy.

This library provides common machine learning algorithms and utilities, including:

  • Classification

  • Regression

  • Clustering

  • Collaborative filtering

  • Dimensionality reduction

It also provides lower-level optimization primitives for creating your own algorithms.

It has the concept of pipelines to help you create machine learning workflows.

The functionality of MLlib is roughly classified into five groups of tools.

The first is ML Algorithms, which contains common machine learning algorithms for classification, regression, clustering, and collaborative filtering.

The second is Featurization, which contains tools for feature extraction, transformation, dimensionality reduction, and selection.

The third is Pipelines, which contains tools for constructing, evaluating, and tuning machine learning pipelines.

The fourth is Persistence, which provides the ability to save and load algorithms, models, and pipelines. The saved models and pipelines can be transmitted over the wire for running in production.

The fifth is Utilities, which covers the rest: linear algebra, statistics, data handling, etc.