
Day 1 / May 9 (Mon)

Introduction to Apache Hadoop 3.3.2

Read the following documents. Get familiar with the basics.

  1. Apache Hadoop
  2. Release 3.3.2 available
  3. Hadoop Commands Guide
  4. FileSystem Shell
  5. The Hadoop FileSystem API Definition

Exercise: Setting Up Hadoop Cluster

Follow Hadoop: Setting up a Single Node Cluster, which shows you how to set up a single-node Hadoop installation.

We are interested in Pseudo-Distributed Mode.

Please note that you should download a binary distribution (e.g., hadoop-3.3.2.tar.gz).
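
The configuration part of the pseudo-distributed setup can be sketched as a short shell script. This is a minimal sketch, not the guide's procedure verbatim: the HADOOP_HOME default is an assumption (point it at wherever you extracted the tarball), while the two properties, fs.defaultFS and dfs.replication, are the ones the single-node guide asks you to set.

```shell
#!/usr/bin/env bash
# Sketch: write the two minimal config files for Pseudo-Distributed Mode.
# ASSUMPTION: HADOOP_HOME defaults to ./hadoop-3.3.2; adjust to your install.
set -eu

HADOOP_HOME="${HADOOP_HOME:-$PWD/hadoop-3.3.2}"
CONF_DIR="${HADOOP_CONF_DIR:-$HADOOP_HOME/etc/hadoop}"
mkdir -p "$CONF_DIR"

# NameNode address; this is the host:port that hdfs:// URIs refer to.
cat > "$CONF_DIR/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF

# Only one DataNode runs locally, so keep a single replica per block.
cat > "$CONF_DIR/hdfs-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF

echo "Wrote core-site.xml and hdfs-site.xml to $CONF_DIR"
```

With the files in place, the guide's next steps are to format the NameNode (bin/hdfs namenode -format) and start HDFS (sbin/start-dfs.sh). It also expects passphraseless ssh to localhost and JAVA_HOME set in etc/hadoop/hadoop-env.sh.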

Code Review

  1. https://github.com/szczepanja/file-listing
  2. https://github.com/1Gize/list-files-in-folder

Introduction to HDFS

Read the following documents:

  1. Architecture
  2. Users Guide
  3. Commands Guide

Exercise: Spark SQL and HDFS

Create a Spark SQL application that loads CSV files from an HDFS directory:

  1. Use hdfs:// URI
  2. Review "Load Spark data locally Incomplete HDFS URI" et al.
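
A complete hdfs:// URI has three parts: the scheme, the NameNode authority (host:port), and an absolute path. The sketch below assembles one so that no part is left out by accident. The helper name hdfs_uri is hypothetical, and localhost:9000 assumes the NameNode address used elsewhere in this agenda.

```shell
# Sketch: build a complete hdfs:// URI from an authority and a path.
# hdfs_uri is a hypothetical helper, not part of Hadoop or Spark.
hdfs_uri() {
  authority="$1"   # NameNode host:port, e.g. localhost:9000
  path="$2"        # directory or file inside HDFS

  # Make the path absolute so its first segment is not read as the host.
  case "$path" in
    /*) ;;
    *) path="/$path" ;;
  esac

  printf 'hdfs://%s%s\n' "$authority" "$path"
}

hdfs_uri localhost:9000 /files
```

The printed string is what you would pass to Spark, for example spark.read.option("header", "true").csv("hdfs://localhost:9000/files/") in spark-shell (the header option is an assumption about your CSV files). If Hadoop is on your PATH, hdfs getconf -confKey fs.defaultFS shows the authority your cluster is actually configured with.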

Tips

./sbin/start-dfs.sh
./bin/hdfs dfs -mkdir /files
./bin/hdfs dfs -put README.txt /files/
./bin/hdfs dfs -ls /files
spark.read.text("hdfs://localhost:9000/files/").show