ML Pipelines

Apache Spark 2.4 / Spark MLlib



@jaceklaskowski / StackOverflow / GitHub / LinkedIn
The "Internals" Books: Apache Spark / Spark SQL / Spark Structured Streaming / Delta Lake

ML Pipelines (spark.ml)

  1. DataFrame-based API under spark.ml package.
    import org.apache.spark.ml._
    • the RDD-based spark.mllib package is in maintenance mode (as of Spark 2.0)
  2. Switch to The Internals Of Apache Spark
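A minimal sketch of the DataFrame-based API, mirroring the text-document pipeline from the official Spark documentation: a Pipeline chains Transformers and Estimators into a single workflow.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// The classic text-document pipeline from the official docs:
// tokenize text, hash terms into feature vectors, fit logistic regression.
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
val hashingTF = new HashingTF()
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")
val lr = new LogisticRegression()
  .setMaxIter(10)

// A Pipeline is itself an Estimator over its sequence of stages
val pipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, lr))
```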

Figure: ML Pipeline (text document pipeline), from the official documentation of Apache Spark

Transformers

  1. A Transformer transforms a DataFrame into an "enhanced" DataFrame, typically by appending one or more columns.
    transformer: DataFrame =[transform]=> DataFrame
  2. Switch to The Internals Of Apache Spark
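For example, Tokenizer is a Transformer that splits a text column into words (a sketch, assuming an active SparkSession named spark, as in spark-shell; the column names are arbitrary):

```scala
import org.apache.spark.ml.feature.Tokenizer
import spark.implicits._  // assumes an active SparkSession named spark

val df = Seq(
  (0L, "spark ml pipelines"),
  (1L, "transformers and estimators")
).toDF("id", "text")

val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")

// transform appends a "words" column; the input columns are preserved
val transformed = tokenizer.transform(df)
```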

Estimators

  1. An Estimator is fit to a training DataFrame and produces a Model (a Transformer)
    estimator: DataFrame =[fit]=> Model
  2. Switch to The Internals Of Apache Spark
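The fit step in code (a sketch, assuming an active SparkSession named spark; the toy training data is made up for illustration):

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors
import spark.implicits._  // assumes an active SparkSession named spark

// toy training data: (features, label)
val training = Seq(
  (Vectors.dense(0.0, 1.1), 0.0),
  (Vectors.dense(2.0, 1.0), 1.0)
).toDF("features", "label")

val lr = new LogisticRegression().setMaxIter(10)

// fit is the estimator step: DataFrame =[fit]=> Model
val model = lr.fit(training)
```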

Models

  1. Model - a Transformer that generates predictions for a DataFrame
    model: DataFrame =[predict]=> DataFrame (with predictions)
  2. Switch to The Internals Of Apache Spark
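Since a Model is a Transformer, prediction is just transform (a sketch, assuming an active SparkSession named spark; the toy data is made up):

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors
import spark.implicits._  // assumes an active SparkSession named spark

val training = Seq(
  (Vectors.dense(0.0, 1.1), 0.0),
  (Vectors.dense(2.0, 1.0), 1.0)
).toDF("features", "label")

val model = new LogisticRegression().setMaxIter(10).fit(training)

// A Model is a Transformer: transform appends prediction columns
// (rawPrediction, probability, prediction) to the input DataFrame
val predictions = model.transform(training)
```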

Figure: ML Pipeline Model (text document pipeline model), from the official documentation of Apache Spark

Evaluators

  1. Evaluator - a transformation that measures the effectiveness of a Model, i.e. how good a model is, as a single metric
    evaluator: DataFrame =[evaluate]=> Double
  2. Switch to The Internals Of Apache Spark
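A sketch with BinaryClassificationEvaluator (assuming an active SparkSession named spark; the scored rows stand in for a model's output and are made up for illustration):

```scala
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.linalg.Vectors
import spark.implicits._  // assumes an active SparkSession named spark

// rows shaped like a binary classifier's output: (rawPrediction, label)
val scored = Seq(
  (Vectors.dense(-1.0, 1.0), 1.0),
  (Vectors.dense(1.0, -1.0), 0.0)
).toDF("rawPrediction", "label")

val evaluator = new BinaryClassificationEvaluator()
  .setMetricName("areaUnderROC")

// evaluate reduces a DataFrame of predictions to a single Double
val auc: Double = evaluator.evaluate(scored)
```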

CrossValidator

  1. CrossValidator - an Estimator that selects the best Model over a grid of parameters using k-fold cross-validation
    import org.apache.spark.ml.tuning.CrossValidator
  2. Switch to The Internals Of Apache Spark
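A sketch of wiring a CrossValidator together with ParamGridBuilder; the choice of LogisticRegression, BinaryClassificationEvaluator, and the particular parameter values is illustrative:

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val lr = new LogisticRegression()

// the grid of parameter combinations to search (2 x 2 = 4 candidates)
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.01, 0.1))
  .addGrid(lr.maxIter, Array(10, 50))
  .build()

val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(new BinaryClassificationEvaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

// cv.fit(training) returns a CrossValidatorModel wrapping the best Model
```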

Persistence — MLWriter and MLReader

  1. MLWriter and MLReader allow saving and loading fitted models and pipelines
    model.write
      .overwrite()
      .save("/path/where/to/save/model")
    val model =
      PipelineModel.load("/path/with/model")
  2. Switch to The Internals Of Apache Spark

Example

  1. Switch to The Internals Of Apache Spark
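Putting the pieces together, an end-to-end sketch: fit a pipeline, predict with the resulting PipelineModel, and persist it (assumes an active SparkSession named spark; the training rows and the save path are made up for illustration):

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import spark.implicits._  // assumes an active SparkSession named spark

// toy labeled documents: (id, text, label)
val training = Seq(
  (0L, "spark ml pipelines are great", 1.0),
  (1L, "hadoop mapreduce batch jobs", 0.0)
).toDF("id", "text", "label")

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

// fit the whole pipeline, predict, and persist the fitted model
val model = pipeline.fit(training)
val predictions = model.transform(training)

val path = "/tmp/spark-ml-pipeline-model"  // hypothetical path
model.write.overwrite().save(path)
val reloaded = PipelineModel.load(path)
```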