Introduction to

Apache Spark

RDD is the main abstraction of Apache Spark
Resilient
Distributed
Dataset
In-Memory, Immutable, Lazy Evaluated, Partitioned, Cacheable, Parallel, Typed
CAUTION: Don't use directly. Too low-level. Leave it alone until you really really want it. Even then, think twice.

Large-Scale Distributed Machine Learning
"Making practical machine learning scalable and easy"
DataFrame-based API
Pipeline API for designing, evaluating, and tuning machine learning pipelines
Classification, Regression, Clustering, Recommendation, Collaborative Filtering, ...
Model Import / Export

spark-shell - an interactive shell for Apache Spark
spark-submit - a tool to manage Spark applications (i.e. submit, kill, status)
spark-sql - an interactive shell for Spark SQL and Hive queries
web UI - a web interface to monitor Spark computations
Scala API for Apache Spark
Spark Python API Docs