Machine Learning with Spark MLlib

Apache Spark 2.4

@jaceklaskowski / StackOverflow / GitHub
Books: Mastering Apache Spark / Mastering Spark SQL / Spark Structured Streaming

Spark MLlib

  1. Spark library for distributed machine learning
  2. Simplifies the development and usage of large-scale machine learning
  3. Uses Spark SQL for data access

Features of Spark MLlib

  1. Machine learning algorithms
    • Classification, regression, clustering, collaborative filtering
  2. Featurization
    • Feature extraction, transformation, dimensionality reduction, selection
  3. Pipelines
    • Constructing, evaluating, and tuning machine learning pipelines
  4. Persistence
    • Saving and loading algorithms, models, and pipelines
  5. Utilities
    • Linear algebra, statistics, data handling
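A minimal sketch of the persistence feature, assuming a local SparkSession and a temporary directory as the save location (both are illustrative choices, as are the object name and column names). It saves an unfitted two-stage Pipeline and loads it back:

```scala
import java.nio.file.Files

import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

// Hypothetical object name, for illustration only
object PersistenceSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session is good enough for a demo
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("persistence-sketch")
      .getOrCreate()

    // Two example transformers; the column names are made up
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val pipeline = new Pipeline().setStages(Array[PipelineStage](tokenizer, hashingTF))

    // Assumption: a fresh temporary directory stands in for real storage
    val path = Files.createTempDirectory("mllib-pipeline").resolve("pipeline").toString

    // Save the pipeline definition, then read it back
    pipeline.write.overwrite().save(path)
    val restored = Pipeline.load(path)

    // Print the simple class names of the restored stages
    println(restored.getStages.map(_.getClass.getSimpleName).mkString(" -> "))

    spark.stop()
  }
}
```

Fitted pipelines (PipelineModel) can be written and loaded in the same way.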

Motivation

Predictive analytics workflow

  1. A machine learning algorithm is only one component of a predictive analytics workflow
  2. Pre-processing steps are often required before the machine learning algorithm can work
  3. Expectations of data scientists and data engineers

Typical machine learning workflow

  1. Loading data (aka data ingestion)
  2. Preparing data (aka data cleanup)
  3. Extracting features (aka feature extraction)
  4. Fitting model (aka model training)
  5. Scoring (aka making predictions)

Before Going To Production

  1. Testing model (aka model testing)
  2. Selecting the best model (aka model selection or model tuning)
  3. Deploying model (aka model deployment and integration)

ML Pipeline

Goal of ML Pipeline

Assemble and configure
practical distributed machine learning pipelines
as easy-to-use pieces
to compose more complex ones with ease (like Lego™ blocks)

Features of ML Pipeline

  1. DataFrame as a dataset format
  2. ML Pipelines API is similar to scikit-learn
  3. Easy debugging (via inspecting columns added during execution)
  4. Parameter tuning
  5. Compositions (to build more complex pipelines out of existing ones)
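The debugging point above can be illustrated with a short sketch. It assumes a local SparkSession and two made-up text rows, and uses Tokenizer and HashingTF as example transformers. Each transform call returns a new DataFrame with one extra column, which can be inspected directly:

```scala
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

// Hypothetical object name, for illustration only
object InspectColumnsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("inspect-columns-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data
    val emails = Seq(
      (0L, "win a free prize now"),
      (1L, "meeting notes attached")
    ).toDF("id", "text")

    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF()
      .setInputCol("words")
      .setOutputCol("features")
      .setNumFeatures(32) // small on purpose, to keep the output readable

    val tokenized = tokenizer.transform(emails)     // adds the "words" column
    val featurized = hashingTF.transform(tokenized) // adds the "features" column

    println(emails.columns.mkString(", "))     // id, text
    println(tokenized.columns.mkString(", "))  // id, text, words
    println(featurized.columns.mkString(", ")) // id, text, words, features

    // Look at the intermediate values themselves
    featurized.show(truncate = false)

    spark.stop()
  }
}
```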

Components of ML Pipeline

  1. Pipelines and PipelineStages
  2. Transformers
  3. Estimators
  4. Models
  5. Evaluators
  6. Cross Validators
  7. Params and ParamMaps

Figure: Pipeline with Transformers, Estimator, and Model
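A sketch of how Evaluators, Cross Validators, and ParamMaps can fit together, assuming a local SparkSession and synthetic two-class text data (the data, grid values, and object name are illustrative only). A ParamGridBuilder produces the ParamMaps, and a CrossValidator fits the pipeline once per ParamMap and fold, scoring each with the evaluator:

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.sql.SparkSession

// Hypothetical object name, for illustration only
object TuningSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("tuning-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up data: 40 rows alternating between the two classes
    val training = (0 until 40).map { i =>
      if (i % 2 == 0) (i.toLong, "free prize click now", 1.0)
      else (i.toLong, "project meeting agenda", 0.0)
    }.toDF("id", "text", "label")

    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)
    val pipeline = new Pipeline().setStages(Array[PipelineStage](tokenizer, hashingTF, lr))

    // Each combination of values becomes one ParamMap (2 x 2 = 4 here)
    val paramGrid = new ParamGridBuilder()
      .addGrid(hashingTF.numFeatures, Array(16, 64))
      .addGrid(lr.regParam, Array(0.1, 0.01))
      .build()

    val cv = new CrossValidator()
      .setEstimator(pipeline)
      .setEvaluator(new BinaryClassificationEvaluator())
      .setEstimatorParamMaps(paramGrid)
      .setNumFolds(2) // assumption: 2 folds keeps the demo fast

    val cvModel = cv.fit(training)

    // Average metric per ParamMap, in the same order as the grid
    paramGrid.zip(cvModel.avgMetrics).foreach { case (params, metric) =>
      println(s"metric = $metric for $params")
    }

    spark.stop()
  }
}
```

After fitting, cvModel.bestModel holds the model refitted with the best-scoring ParamMap.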

ML Pipeline Design

  1. Choose Transformers
  2. Select Estimator (to produce a Model)
  3. Create Pipeline
  4. Fit the pipeline to a training Dataset
  5. Use the Model
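The five design steps above can be sketched end to end. This is a minimal example under stated assumptions: a local SparkSession, made-up text data, and Tokenizer, HashingTF, and LogisticRegression picked as example stages. It is not necessarily what the presentation's own steps showed.

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel, PipelineStage}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

// Hypothetical object name, for illustration only
object PipelineDesignSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("pipeline-design-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up training data with a text column and a binary label
    val training = Seq(
      (0L, "free prize click now", 1.0),
      (1L, "project meeting agenda", 0.0),
      (2L, "win a free prize", 1.0),
      (3L, "agenda for the next meeting", 0.0)
    ).toDF("id", "text", "label")

    // Step 1: choose Transformers
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")

    // Step 2: select an Estimator (it produces a Model when fitted)
    val lr = new LogisticRegression().setMaxIter(10)

    // Step 3: create the Pipeline
    val pipeline = new Pipeline().setStages(Array[PipelineStage](tokenizer, hashingTF, lr))

    // Step 4: fit the pipeline to the training Dataset
    val model: PipelineModel = pipeline.fit(training)

    // Step 5: use the model on data it has not seen
    val unseen = Seq(
      (4L, "claim your free prize"),
      (5L, "meeting moved to monday")
    ).toDF("id", "text")

    model.transform(unseen)
      .select("id", "text", "probability", "prediction")
      .show(truncate = false)

    spark.stop()
  }
}
```

Note that fit is called on the Pipeline (an Estimator), while transform is called on the resulting PipelineModel (a Transformer).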

ML Pipeline Design Applied - Step 1

ML Pipeline Design Applied - Step 2

ML Pipeline Design Applied - Step 3

  • Creating Pipeline
    1. new Pipeline()
    2. setStages

ML Pipeline Design Applied - Step 4

ML Pipeline Design Applied - Step 5

  • Using trained Model (to generate predictions)
    1. Requires a real Dataset to score (typically new, unseen data)
    2. PipelineModel.transform

Demo

Email Classification

Using Logistic Regression