Solutions Review

HackOn(Data) / Toronto / Sept. 10-11

https://bit.ly/mastering-apache-spark

Agenda

  1. Loading and Saving Datasets
  2. Standard Functions
  3. Aggregation - Typed and Untyped Grouping
  4. Window Aggregates (Windows)
  5. User-Defined Functions (UDFs)
  6. "Explaining" Query Plans of Windows
  7. Transformers
  8. ML Pipeline Persistence
  9. Tooling - Git & GitHub & Source Code

Loading Datasets (1 of 3)

Please DON'T do this!


            sc.textFile("...CSV") <-- HERE
              .filter(lambda l: "ADD_NUM" not in l)
              .map(parseCultureLoc)
              

            sc.textFile("...TXT") <-- HERE
              .map(parseTwitter)
              .filter(lambda t: "toronto" in t['place'].lower())
              

            sc.parallelize(grid)
              

Loading Datasets (2 of 3)

Please DON'T do this!


            wdat <- read.df(sqlContext,
              "./ParkingTicket/weather/201*",
              source = "com.databricks.spark.csv", <-- this
              inferSchema = "true",
              header="true", skiprows=16)
              

            sqlContext.jsonFile(file_path1)
              

Loading Datasets (3 of 3)

Please DON'T do this!

jsonFile Warning

Saving Datasets

Please DON'T do this!


parking_2015_withTrial_df
   .coalesce(1) <------------------------- HERE
   .write
   .format("com.databricks.spark.csv") <-- HERE
   .options(header="true")
   .save("/mnt/%s/Parking/Parking_Tags_2015_and_other_data" % MOUNT_NAME)
              

Type Mismatch?

Please DON'T do this!

Type Mismatch?

Standard Functions

Nothing here...you're on your own :-)

Transformers

  1. Tokenizer
  2. HashingTF
  3. VectorAssembler
  4. OneHotEncoder
  5. StringIndexer

Tooling - Git & GitHub & Source Code

  1. Very impressed by the variety of tools
  2. GitHub
  3. GitHub Pages for website
  4. Databricks Cloud

Resolution

I did learn a few things from YOU! Thanks!

Tweet on Databricks Cloud

Questions?