Using Hadoop and Spark

This document describes how to run jobs that use Hadoop and Spark on the Savio high-performance computing cluster at the University of California, Berkeley, using auxiliary scripts provided on the cluster.

Running Hadoop Jobs on Savio

The Hadoop framework and an auxiliary script are provided to help users run Hadoop jobs on the HPC clusters in Hadoop On Demand (HOD) fashion. The auxiliary script "hadoop_helper.sh" is located at /global/home/groups/allhands/bin/hadoop_helper.sh and can be used interactively or from a job script. Note that this script only defines helper functions for setting up a Hadoop environment, so it should never be executed directly. Instead, source it into your current environment by running "source /global/home/groups/allhands/bin/hadoop_helper.sh" (only bash is supported right now). After sourcing it, run "hadoop-usage" to see how to run Hadoop jobs. You will need to run "hadoop-start" to initialize an HOD environment and "hadoop-stop" to tear down the HOD environment after your Hadoop job completes.

The example below shows how to use it interactively:

[user@ln000 ~]$ srun -p savio -A ac_abc --qos=savio_debug -N 4 -t 10:0 --pty bash
[user@n0000 ~]$ module load java hadoop
[user@n0000 ~]$ source /global/home/groups/allhands/bin/hadoop_helper.sh
[user@n0000 ~]$ hadoop-start
starting jobtracker, ...
[user@n0000 bash.738294]$ hadoop jar $HADOOP_DIR/hadoop-examples-1.2.1.jar pi 4 10000
Number of Maps = 4
...
Estimated value of Pi is 3.14140000000000000000
[user@n0000 bash.738294]$ hadoop-stop
stopping jobtracker
...

The example below shows how to use it in a job script:

#!/bin/bash
#SBATCH --job-name=test
#SBATCH --partition=savio
#SBATCH --account=ac_abc
#SBATCH --qos=savio_debug
#SBATCH --nodes=4
#SBATCH --time=00:10:00

module load java hadoop
source /global/home/groups/allhands/bin/hadoop_helper.sh

# Start Hadoop On Demand
hadoop-start

# Example 1
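# Estimate pi using 4 map tasks and 10000 samples per map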
hadoop jar $HADOOP_DIR/hadoop-examples-1.2.1.jar pi 4 10000

# Example 2
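# Replace /foo/bar with the path to your own input file; wordcount reads from in/ and writes its results to the out/ directory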
mkdir in
cp /foo/bar in/
hadoop jar $HADOOP_DIR/hadoop-examples-1.2.1.jar wordcount in out

# Stop Hadoop On Demand
hadoop-stop
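
Save this script to a file and submit it with sbatch as you would any other batch job (the file name below is just an example):

sbatch hadoop_example.sh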

Running Spark Jobs on Savio

The Spark framework and an auxiliary script are provided to help users run Spark jobs on the HPC clusters in Spark On Demand (SOD) fashion. The auxiliary script "spark_helper.sh" is located at /global/home/groups/allhands/bin/spark_helper.sh and can be used interactively or from a job script. Note that this script only defines helper functions for setting up a Spark environment, so it should never be executed directly. Instead, source it into your current environment by running "source /global/home/groups/allhands/bin/spark_helper.sh" (only bash is supported right now). After sourcing it, run "spark-usage" to see how to run Spark jobs. You will need to run "spark-start" to initialize an SOD environment and "spark-stop" to tear down the SOD environment after your Spark job completes.

The example below shows how to use it interactively:

[user@ln000 ~]$ srun -p savio -A ac_abc --qos=savio_debug -N 4 -t 10:0 --pty bash
[user@n0000 ~]$ module load java spark
[user@n0000 ~]$ source /global/home/groups/allhands/bin/spark_helper.sh
[user@n0000 ~]$ spark-start
starting org.apache.spark.deploy.master.Master, ...

[user@n0000 bash.738307]$ spark-submit --master $SPARK_URL $SPARK_DIR/examples/src/main/python/pi.py
Spark assembly has been built with Hive
...
Pi is roughly 3.147280
...

[user@n0000 bash.738307]$ pyspark $SPARK_DIR/examples/src/main/python/pi.py
WARNING: Running python applications through ./bin/pyspark is deprecated as of Spark 1.0.
...
Pi is roughly 3.143360
...

[user@n0000 bash.738307]$ spark-stop
...

The example below shows how to use it in a job script:

#!/bin/bash
#SBATCH --job-name=test
#SBATCH --partition=savio
#SBATCH --account=ac_abc
#SBATCH --qos=savio_debug
#SBATCH --nodes=4
#SBATCH --time=00:10:00

module load java spark
source /global/home/groups/allhands/bin/spark_helper.sh

# Start Spark On Demand
spark-start

# Example 1
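# Estimate pi with the Monte Carlo example script shipped with Spark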
spark-submit --master $SPARK_URL $SPARK_DIR/examples/src/main/python/pi.py

# Example 2
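# Count words in a file; replace /foo/bar with the path to your own input file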
spark-submit --master $SPARK_URL $SPARK_DIR/examples/src/main/python/wordcount.py /foo/bar

# PySpark Example
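# Note: running application scripts through pyspark is deprecated (see the warning in the interactive example above); prefer spark-submit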
pyspark $SPARK_DIR/examples/src/main/python/pi.py

# Stop Spark On Demand
spark-stop
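
Both job script examples above run programs that ship with Spark under $SPARK_DIR/examples. To submit your own application, you can point spark-submit at your own Python file instead. The sketch below is a minimal example, assuming the SparkContext API provided by the Spark module on the cluster; the file name and input path are placeholders:

# my_app.py -- a minimal PySpark application (sketch)
from pyspark import SparkContext

# The master URL is supplied on the command line via --master $SPARK_URL,
# so only the application name is set here.
sc = SparkContext(appName="LineCount")

# Read a text file from the shared filesystem (placeholder path) and count its lines.
lines = sc.textFile("/path/to/input.txt")
print("Number of lines: %d" % lines.count())

sc.stop()

Within an allocation where spark-start has been run, this would be submitted the same way as the examples above:

spark-submit --master $SPARK_URL my_app.py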