Big Data

There are many frameworks in Data Science that build on Apache Spark™, for instance ADAM for genomic analysis, GeoTrellis for geospatial data analysis, or Koalas (distributed Pandas).

Let us calculate the value of pi with the following Python script pi.py (from the Spark distribution):

from __future__ import print_function

import sys
from random import random
from operator import add

from pyspark.sql import SparkSession

if __name__ == "__main__":
    """
        Usage: pi [partitions]
    """
    spark = SparkSession\
        .builder\
        .appName("PythonPi")\
        .getOrCreate()

    partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
    n = 100000 * partitions

    def f(_):
        x = random() * 2 - 1
        y = random() * 2 - 1
        return 1 if x ** 2 + y ** 2 <= 1 else 0

    count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
    print("Pi is roughly %f" % (4.0 * count / n))

    spark.stop()

The idea of the calculation is to check, many times over, whether a random point in the square [-1,1] x [-1,1] lies within the unit circle (i.e. has a distance of less than 1 from the origin). The fraction of points that fall inside the circle approximates the ratio of the circle's area (pi) to the square's area (4), so pi is estimated as 4 times the number of hits divided by the number of samples.
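
For comparison, the same Monte Carlo idea can be written as a minimal sketch in plain Python, without Spark (only an illustration, not part of the Spark distribution):

from random import random

def estimate_pi(samples):
    """Estimate pi by sampling random points in the square [-1,1] x [-1,1]."""
    hits = 0
    for _ in range(samples):
        x = random() * 2 - 1
        y = random() * 2 - 1
        if x ** 2 + y ** 2 <= 1:  # the point lies inside the unit circle
            hits += 1
    # hits/samples approximates the area ratio pi/4
    return 4.0 * hits / samples

print("Pi is roughly %f" % estimate_pi(1000000))

The Spark version above distributes exactly this sampling loop over the partitions of an RDD, so the work is spread across the executors of the cluster.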

In order to submit a batch job to the queuing system we use the SLURM script pi.slrm:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --job-name=spark-yarn-pi
#SBATCH --time=00:10:00
#SBATCH --error=err_spark-yarn-pi-%A
#SBATCH --output=out_spark-yarn-pi-%A

module purge
module load python/3.8.0-gcc-9.1.0-wkjbtaa
module load openjdk/11.0.2-gcc-9.1.0-ayy5f5t
module load hadoop
module load spark

prolog_create_key.sh
. vsc_start_hadoop.sh
. vsc_start_spark.sh

export SPARK_EXAMPLES=${HOME}/BigData/SparkDistributionExamples/

spark-submit --master yarn --deploy-mode client --num-executors 140 --executor-memory 2G $SPARK_EXAMPLES/src/main/python/pi.py 1000

epilog_discard_key.sh

In this SLURM script we have

  • SLURM directives, starting with #SBATCH, which set the number of nodes, the job name, the maximum execution time, and the names of the output and error files,
  • module commands, loading all the modules which are required by our job: Python, Java (Hadoop is written in Java), Hadoop, and Spark,
  • setup scripts which create temporary ssh keys and start the Hadoop and Spark services in user context on the nodes of the job,
  • the command to start our job with 'spark-submit' (submitting the SLURM script itself is shown below), and
  • a script to discard the temporary ssh keys.
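
Assuming the script above is saved as pi.slrm, the job can be submitted and inspected with the usual SLURM commands (the exact output file name contains the job ID):

sbatch pi.slrm                              # submit the job to the queuing system
squeue -u $USER                             # check whether the job is pending or running
grep "Pi is roughly" out_spark-yarn-pi-*    # inspect the result once the job has finished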

References

At the Technische Universität Wien there is a Big Data cluster, intended mainly for teaching purposes, which is also used by researchers.
