Big Data

There are many frameworks in Data Science that use Apache Spark™, for instance ADAM for genomic analysis, GeoTrellis for geospatial data analysis, or Koalas (distributed Pandas).

Let us calculate the value of pi by means of the following Python script pi.py (from the Spark distribution):

import sys
from random import random
from operator import add
from pyspark.sql import SparkSession
if __name__ == "__main__":
    spark = SparkSession\
        .builder\
        .appName("PythonPi")\
        .getOrCreate()

    partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
    n = 100000 * partitions

    def f(_):
        x = random() * 2 - 1
        y = random() * 2 - 1
        return 1 if x ** 2 + y ** 2 <= 1 else 0

    count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
    print("Pi is roughly %f" % (4.0 * count / n))

    spark.stop()

The idea of the calculation is to check, for a large number of random points in the square [-1,1] × [-1,1], whether each point lies inside the unit circle (i.e. has a distance of less than 1 from the origin). Since the area of the square is 4 and the area of the circle is pi, the fraction of points that fall inside the circle approximates pi/4, so pi ≈ 4 * count / n.
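
The same ratio argument can be illustrated without Spark; the following is a minimal pure-Python sketch (the sample size is an illustrative choice, not part of the original example):

from random import random

n = 1000000                        # number of random points (illustrative)
hits = 0
for _ in range(n):
    x = random() * 2 - 1           # x uniform in [-1, 1]
    y = random() * 2 - 1           # y uniform in [-1, 1]
    if x ** 2 + y ** 2 <= 1:       # does the point fall inside the unit circle?
        hits += 1

# area(circle) / area(square) = pi / 4, hence pi is roughly 4 * hits / n
print("Pi is roughly %f" % (4.0 * hits / n))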

In order to submit a batch job to the queuing system we use the SLURM script pi.slrm:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --job-name=spark-yarn-pi
#SBATCH --time=00:10:00
#SBATCH --error=err_spark-yarn-pi-%A
#SBATCH --output=out_spark-yarn-pi-%A

module purge
module load python/3.8.0-gcc-9.1.0-wkjbtaa
module load openjdk/11.0.2-gcc-9.1.0-ayy5f5t
module load hadoop
module load spark

prolog_create_key.sh
. vsc_start_hadoop.sh
. vsc_start_spark.sh

export SPARK_EXAMPLES=${HOME}/BigData/SparkDistributionExamples/

spark-submit --master yarn --deploy-mode client --num-executors 140 --executor-memory 2G $SPARK_EXAMPLES/src/main/python/pi.py 1000

epilog_discard_key.sh

In this SLURM script we have

  • SLURM directives, starting with #SBATCH, which set the job name, the maximum execution time, and where the output and error files go,
  • module commands, loading all the modules required by our job: Python, Java (Hadoop is written in Java), Hadoop, and Spark,
  • setup scripts which create temporary ssh keys and start the Hadoop and Spark services in user context on the nodes of the job,
  • the command that starts our job with 'spark-submit' (a submission example follows below), and
  • a script to discard the temporary ssh keys.
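
To run the example, the script is submitted to the queuing system in the usual way; the following is a sketch, where <jobid> stands for the job ID assigned by SLURM and the output file name follows from the #SBATCH directives above:

sbatch pi.slrm                     # submit the batch job
squeue -u $USER                    # check whether the job is pending or running
cat out_spark-yarn-pi-<jobid>      # inspect the results once the job has finished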

References

At the Technische Universität Wien there is a Big Data cluster, mainly for teaching purposes, which is also used by researchers.
