spark-workshop

Exercise: Email Classification

Develop a Spark MLlib application that uses Logistic Regression to classify emails as spam or not spam.

Module: Spark MLlib

Duration: 45 mins

Steps

Use Spark MLlib’s LogisticRegression and the transformers: Tokenizer and HashingTF
Use Spark MLlib’s Pipeline
Load two CSV datasets: one for training and one with the raw data to classify
Use Spark MLlib’s CrossValidator for model selection
Persist the (best) model
Calculate predictions
1. Display the predictions as a status, where 0 is OK and 1 is SPAM, using the when standard function:
   val status = when('prediction === 0, "OK").otherwise("SPAM").as("status")
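The steps above can be sketched end to end as follows. This is one possible solution outline, not the reference answer: the file names (emails-training.csv, emails-raw.csv), the model path (email-classifier-model), the parameter grid values, and the number of folds are all assumptions chosen for illustration. It uses col("prediction") instead of the 'prediction symbol syntax so that it does not depend on spark.implicits._.

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, when}

object EmailClassification {

  // Tokenizer -> HashingTF -> LogisticRegression, wrapped in a CrossValidator.
  // Grid values and fold count are illustrative assumptions.
  def crossValidator(): CrossValidator = {
    val tokenizer = new Tokenizer().setInputCol("body").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)
    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

    val paramGrid = new ParamGridBuilder()
      .addGrid(hashingTF.numFeatures, Array(100, 1000))
      .addGrid(lr.regParam, Array(0.1, 0.01))
      .build()

    new CrossValidator()
      .setEstimator(pipeline)
      .setEvaluator(new BinaryClassificationEvaluator())
      .setEstimatorParamMaps(paramGrid)
      .setNumFolds(3)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("EmailClassification")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical file names; both files have a header row.
    def load(path: String) =
      spark.read.option("header", "true").option("inferSchema", "true").csv(path)

    // The label column is cast to double for the classifier and evaluator.
    val training = load("emails-training.csv")
      .withColumn("label", col("label").cast("double"))
    val raw = load("emails-raw.csv")

    // Model selection: fit every parameter combination and keep the best one.
    val cvModel = crossValidator().fit(training)

    // Persist the best pipeline model (hypothetical output path).
    val best = cvModel.bestModel.asInstanceOf[PipelineModel]
    best.write.overwrite().save("email-classifier-model")

    // Map the numeric prediction to a readable status.
    val status = when(col("prediction") === 0, "OK").otherwise("SPAM").as("status")
    best.transform(raw).select(col("id"), col("body"), status).show(truncate = false)

    spark.stop()
  }
}
```

With randomly generated email bodies the model has nothing meaningful to learn, so the predicted statuses only demonstrate the mechanics of the pipeline.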

Input Dataset

Use Online Generate Test Data to generate a CSV dataset with fake emails and the columns: id, body, and label.

id,body,label
1,Zushad zam fawo gur licidtug zar honepru zolor muahada lep pired ciuvi.,0
2,Elfi ez lirde vizavbak depmapav us piwojaw sihhib novo luzkut de teb apemimi hezotce rubumzer mowja jowte.,1
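
A dataset in this shape could be loaded with an explicit schema rather than schema inference, which keeps the label column numeric regardless of what the generator emits. This is a minimal sketch: the file name emails.csv and the chosen column types are assumptions, not part of the exercise.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DoubleType, LongType, StringType, StructField, StructType}

object LoadEmails {

  // Column names follow the header row above; the types are an assumption.
  val emailSchema: StructType = StructType(Seq(
    StructField("id", LongType),
    StructField("body", StringType),
    StructField("label", DoubleType)))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("LoadEmails")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical file name; the header row is skipped, not parsed as data.
    val emails = spark.read
      .option("header", "true")
      .schema(emailSchema)
      .csv("emails.csv")

    emails.printSchema()
    // Quick sanity check: how many emails carry each label.
    emails.groupBy("label").count().show()

    spark.stop()
  }
}
```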
