Machine Learning With Spark

6 / 7

MLlib Data Types And Libraries

Not able to play video? Try with youtube

Spark MLlib is not only a library of algorithms, it also provides various useful data types such as Local vector, Labeled Point, and matrices.

Local vector is an integer-typed and 0-based indices and double-typed values. For example dv2 = [1.0, 0.0, 3.0]

Labeled point is a local vector, either dense or sparse, associated with a label/response. Example: pos = LabeledPoint(1.0, [1.0, 0.0, 3.0])

In Mllib, Matrices can be represented in many ways such as:

  • Local matrix

  • Distributed matrix

  • RowMatrix

  • IndexedRowMatrix

  • CoordinateMatrix

  • BlockMatrix

Spark provides a robust pipeline system to make the life of data scientist easier. It is partially inspired by sci-kit learn.

DataFrame: This ML API uses DataFrame from Spark SQL as an ML dataset, which can hold a variety of data types. E.g., a DataFrame could have different columns storing text, feature vectors, true labels, and predictions.

Transformer: A Transformer is an algorithm which can transform one DataFrame into another DataFrame. E.g., an ML model is a Transformer which transforms a DataFrame with features into a DataFrame with predictions.

Estimator: An Estimator is an algorithm which can be fit on a DataFrame to produce a Transformer. E.g., a learning algorithm is an Estimator which trains on a DataFrame and produces a model.

Pipeline: A Pipeline chains multiple Transformers and Estimators together to specify an ML workflow.

Parameter: All Transformers and Estimators now share a common API for specifying parameters.

Here, we have illustrated pipeline for the simple text document workflow. The diagram is for the training time usage of a Pipeline.

The top row represents a Pipeline with three stages. The first two (Tokenizer and HashingTF) are Transformers (blue), and the third (LogisticRegression) is an Estimator (red). The bottom row represents data flowing through the pipeline, where cylinders indicate DataFrames.

The method is called on the original DataFrame, which has raw text documents and labels.

The Tokenizer.transform() method splits the raw text documents into words, adding a new column with words to the DataFrame.

The HashingTF.transform() method converts the words column into feature vectors, adding a new column with those vectors to the DataFrame.

Now, since LogisticRegression is an Estimator, the Pipeline first calls to produce a LogisticRegressionModel. If the Pipeline had more Estimators, it would call the LogisticRegressionModel’s transform() method on the DataFrame before passing the DataFrame to the next stage.

A Pipeline is an Estimator. Thus, after a Pipeline’s fit() method runs, it produces a PipelineModel, which is aTransformer. This PipelineModel is used at test time; the figure below illustrates this usage.

Spark Mllib also provides various basic statistics functions such as: Correlations, Stratified sampling, Hypothesis testing, Random data generation, Kernel density estimation

MLlib supports various methods:

Binary Classification

linear SVMs, logistic regression, decision trees, random forests, gradient-boosted trees, naive Bayes

Multiclass Classification

logistic regression, decision trees, random forests, naive Bayes


linear least squares, Lasso, ridge regression, decision trees, random forests, gradient-boosted trees, isotonic regression

You can understand more about each algorithm by following these links.

That completes our quick introduction of MlLib. Thank you.

Loading comments...