Using Hadoop and Spark

This document describes how to run jobs that use Hadoop and Spark on the Savio high-performance computing cluster at the University of California, Berkeley, using auxiliary scripts provided on the cluster.

Running Hadoop Jobs on Savio

The Hadoop framework and an auxiliary script are provided to help users run Hadoop jobs on the HPC clusters in Hadoop On Demand (HOD) fashion. The auxiliary script "hadoop_helper.sh" is located at /global/home/groups/allhands/bin/hadoop_helper.sh and can be used interactively or from a job script. Note that this script only defines helper functions for setting up a Hadoop environment, so it should never be executed directly. Instead, source it into your current environment by running "source /global/home/groups/allhands/bin/hadoop_helper.sh" (only bash is supported right now). After sourcing it, run "hadoop-usage" to see how to run Hadoop jobs. You will need to run "hadoop-start" to initialize an HOD environment and "hadoop-stop" to tear down the HOD environment after your Hadoop job completes.

The example below shows how to use it interactively:

[user@ln000 ~]$ srun -p savio -A ac_abc --qos=savio_debug -N 4 -t 10:0 --pty bash
[user@n0000 ~]$ module load java hadoop
[user@n0000 ~]$ source /global/home/groups/allhands/bin/hadoop_helper.sh
[user@n0000 ~]$ hadoop-start
starting jobtracker, ...
[user@n0000 bash.738294]$ hadoop jar $HADOOP_DIR/hadoop-examples-1.2.1.jar pi 4 10000
Number of Maps = 4
...
Estimated value of Pi is 3.14140000000000000000
[user@n0000 bash.738294]$ hadoop-stop
stopping jobtracker
...

The example below shows how to use it in a job script:

#!/bin/bash
#SBATCH --job-name=test
#SBATCH --partition=savio
#SBATCH --account=ac_abc
#SBATCH --qos=savio_debug
#SBATCH --nodes=4
#SBATCH --time=00:10:00

module load java hadoop
source /global/home/groups/allhands/bin/hadoop_helper.sh

# Start Hadoop On Demand
hadoop-start

# Example 1
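# Estimate pi using 4 map tasks and 10000 samples per map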
hadoop jar $HADOOP_DIR/hadoop-examples-1.2.1.jar pi 4 10000

# Example 2
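# Replace /foo/bar with the path to your own input file; wordcount reads from in/ and writes its results to the out/ directory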
mkdir in
cp /foo/bar in/
hadoop jar $HADOOP_DIR/hadoop-examples-1.2.1.jar wordcount in out

# Stop Hadoop On Demand
hadoop-stop
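
Save this script to a file and submit it with sbatch as you would any other batch job (the file name below is just an example):

sbatch hadoop_example.sh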

Running Spark Jobs on Savio

The Spark framework and an auxiliary script are provided to help users run Spark jobs on the HPC clusters in Spark On Demand (SOD) fashion. The auxiliary script "spark_helper.sh" is located at /global/home/groups/allhands/bin/spark_helper.sh and can be used interactively or from a job script. Note that this script only defines helper functions for setting up a Spark environment, so it should never be executed directly. Instead, source it into your current environment by running "source /global/home/groups/allhands/bin/spark_helper.sh" (only bash is supported right now). After sourcing it, run "spark-usage" to see how to run Spark jobs. You will need to run "spark-start" to initialize an SOD environment and "spark-stop" to tear down the SOD environment after your Spark job completes.

The example below shows how to use it interactively:

[user@ln000 ~]$ srun -p savio -A ac_abc --qos=savio_debug -N 4 -t 10:0 --pty bash
[user@n0000 ~]$ module load java spark
[user@n0000 ~]$ source /global/home/groups/allhands/bin/spark_helper.sh
[user@n0000 ~]$ spark-start
starting org.apache.spark.deploy.master.Master, ...

[user@n0000 bash.738307]$ spark-submit --master $SPARK_URL $SPARK_DIR/examples/src/main/python/pi.py
Spark assembly has been built with Hive
...
Pi is roughly 3.147280
...

[user@n0000 bash.738307]$ pyspark $SPARK_DIR/examples/src/main/python/pi.py
WARNING: Running python applications through ./bin/pyspark is deprecated as of Spark 1.0.
...
Pi is roughly 3.143360
...

[user@n0000 bash.738307]$ spark-stop
...

The example below shows how to use it in a job script:

#!/bin/bash
#SBATCH --job-name=test
#SBATCH --partition=savio
#SBATCH --account=ac_abc
#SBATCH --qos=savio_debug
#SBATCH --nodes=4
#SBATCH --time=00:10:00

module load java spark
source /global/home/groups/allhands/bin/spark_helper.sh

# Start Spark On Demand
spark-start

# Example 1
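# Estimate pi with the Monte Carlo example script shipped with Spark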
spark-submit --master $SPARK_URL $SPARK_DIR/examples/src/main/python/pi.py

# Example 2
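# Count words in a file; replace /foo/bar with the path to your own input file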
spark-submit --master $SPARK_URL $SPARK_DIR/examples/src/main/python/wordcount.py /foo/bar

# PySpark Example
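# Note: running application scripts through pyspark is deprecated (see the warning in the interactive example above); prefer spark-submit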
pyspark $SPARK_DIR/examples/src/main/python/pi.py

# Stop Spark On Demand
spark-stop
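
Both job script examples above run programs that ship with Spark under $SPARK_DIR/examples. To submit your own application, you can point spark-submit at your own Python file instead. The sketch below is a minimal example, assuming the SparkContext API provided by the Spark module on the cluster; the file name and input path are placeholders:

# my_app.py -- a minimal PySpark application (sketch)
from pyspark import SparkContext

# The master URL is supplied on the command line via --master $SPARK_URL,
# so only the application name is set here.
sc = SparkContext(appName="LineCount")

# Read a text file from the shared filesystem (placeholder path) and count its lines.
lines = sc.textFile("/path/to/input.txt")
print("Number of lines: %d" % lines.count())

sc.stop()

Within an allocation where spark-start has been run, this would be submitted the same way as the examples above:

spark-submit --master $SPARK_URL my_app.py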