Big Data

There are many frameworks in Data Science that build on Apache Spark™, for instance ADAM for genomic analysis, GeoTrellis for geospatial data analysis, or Koalas (distributed Pandas).

Let us calculate the value of pi with the following Python script pi.py (from the Spark distribution):

from __future__ import print_function

import sys
from random import random
from operator import add

from pyspark.sql import SparkSession

if __name__ == "__main__":
    """
        Usage: pi [partitions]
    """
    spark = SparkSession\
        .builder\
        .appName("PythonPi")\
        .getOrCreate()

    partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
    n = 100000 * partitions

    def f(_):
        x = random() * 2 - 1
        y = random() * 2 - 1
        return 1 if x ** 2 + y ** 2 <= 1 else 0

    count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
    print("Pi is roughly %f" % (4.0 * count / n))

    spark.stop()

The idea of the calculation is to check, many times over, whether a random point in the square [-1,1] x [-1,1] lies within the unit circle (i.e. has a distance of less than 1 from the origin). The fraction of points that fall inside the circle approximates the ratio of the circle's area (pi) to the square's area (4), so pi is estimated as 4 times the number of hits divided by the number of samples.
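
For comparison, the same Monte Carlo idea can be written as a minimal sketch in plain Python, without Spark (only an illustration, not part of the Spark distribution):

from random import random

def estimate_pi(samples):
    """Estimate pi by sampling random points in the square [-1,1] x [-1,1]."""
    hits = 0
    for _ in range(samples):
        x = random() * 2 - 1
        y = random() * 2 - 1
        if x ** 2 + y ** 2 <= 1:  # the point lies inside the unit circle
            hits += 1
    # hits/samples approximates the area ratio pi/4
    return 4.0 * hits / samples

print("Pi is roughly %f" % estimate_pi(1000000))

The Spark version above distributes exactly this sampling loop over the partitions of an RDD, so the work is spread across the executors of the cluster.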

In order to submit a batch job to the queuing system we use the SLURM script pi.slrm:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --job-name=spark-yarn-pi
#SBATCH --time=00:10:00
#SBATCH --error=err_spark-yarn-pi-%A
#SBATCH --output=out_spark-yarn-pi-%A

module purge
module load python/3.8.0-gcc-9.1.0-wkjbtaa
module load openjdk/11.0.2-gcc-9.1.0-ayy5f5t
module load hadoop
module load spark

prolog_create_key.sh
. vsc_start_hadoop.sh
. vsc_start_spark.sh

export SPARK_EXAMPLES=${HOME}/BigData/SparkDistributionExamples/

spark-submit --master yarn --deploy-mode client --num-executors 140 --executor-memory 2G $SPARK_EXAMPLES/src/main/python/pi.py 1000

epilog_discard_key.sh

In this SLURM script we have

  • SLURM directives, starting with #SBATCH, which set the number of nodes, the job name, the maximum execution time, and the names of the output and error files,
  • module commands, loading all the modules which are required by our job: Python, Java (Hadoop is written in Java), Hadoop, and Spark,
  • setup scripts which create temporary ssh keys and start the Hadoop and Spark services in user context on the nodes of the job,
  • the command to start our job with 'spark-submit' (submitting the SLURM script itself is shown below), and
  • a script to discard the temporary ssh keys.
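
Assuming the script above is saved as pi.slrm, the job can be submitted and inspected with the usual SLURM commands (the exact output file name contains the job ID):

sbatch pi.slrm                              # submit the job to the queuing system
squeue -u $USER                             # check whether the job is pending or running
grep "Pi is roughly" out_spark-yarn-pi-*    # inspect the result once the job has finished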

References

At the Technische Universität Wien there is a Big Data cluster, intended mainly for teaching purposes, which is also used by researchers.
