Big Data
Apache Spark on VSC
There are many Data Science frameworks that use Apache Spark™, for instance ADAM for genomic analysis, GeoTrellis for geospatial data analysis, or Koalas (distributed pandas).
Example SLURM & Python scripts
Let us calculate the value of pi by means of the following Python script pi.py (from the Spark distribution):
import sys
from random import random
from operator import add

from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession\
        .builder\
        .appName("PythonPi")\
        .getOrCreate()

    partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
    n = 100000 * partitions

    def f(_):
        x = random() * 2 - 1
        y = random() * 2 - 1
        return 1 if x ** 2 + y ** 2 <= 1 else 0

    count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
    print("Pi is roughly %f" % (4.0 * count / n))

    spark.stop()
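For a quick sanity check, the same script can also be run in Spark's local mode, without YARN and without the batch system. This is only a sketch and not part of the original example; it assumes that pi.py is in the current directory and that the spark module (as loaded in the job script below) provides spark-submit on the node you are working on:

module load spark
# run locally with 4 cores and 10 partitions; prints "Pi is roughly ..."
spark-submit --master local[4] pi.py 10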
The idea of the calculation is to check repeatedly whether a random point in the square [-1,1] x [-1,1] lies within the unit circle (i.e. has a distance of less than 1 from the origin). Since we know the formulas for the areas of both the square and the circle, the fraction of points that fall inside the circle lets us estimate pi.
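Written as a formula, with count hits out of n sampled points as computed in pi.py (the circle has radius 1, the square has side length 2):

\frac{\mathrm{count}}{n} \;\approx\; \frac{A_{\mathrm{circle}}}{A_{\mathrm{square}}} \;=\; \frac{\pi \cdot 1^2}{2 \cdot 2} \;=\; \frac{\pi}{4} \qquad\Longrightarrow\qquad \pi \;\approx\; 4 \cdot \frac{\mathrm{count}}{n}

which is exactly the expression 4.0 * count / n printed at the end of pi.py.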
In order to submit a batch job to the queuing system we use the SLURM script pi.slrm:
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --job-name=spark-yarn-pi
#SBATCH --time=00:10:00
#SBATCH --error=err_spark-yarn-pi-%A
#SBATCH --output=out_spark-yarn-pi-%A

module purge
module load python/3.8.0-gcc-9.1.0-wkjbtaa
module load openjdk/11.0.2-gcc-9.1.0-ayy5f5t
module load hadoop
module load spark

prolog_create_key.sh
. vsc_start_hadoop.sh
. vsc_start_spark.sh

export SPARK_EXAMPLES=${HOME}/BigData/SparkDistributionExamples/

spark-submit --master yarn --deploy-mode client --num-executors 140 --executor-memory 2G $SPARK_EXAMPLES/src/main/python/pi.py 1000

epilog_discard_key.sh
In this SLURM script we have
- SLURM directives, starting with #SBATCH, which set the job name, the maximum execution time, and where the output files go,
- module commands, loading all the modules which are required by our job: python, Java (Hadoop is written in Java), Hadoop, and Spark,
- setup scripts which create temporary ssh keys, and start Hadoop and Spark services in user context on the nodes of the job,
- the command to start our job with spark-submit, and
- a script to discard the temporary ssh keys (submitting the job is shown below).
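The job itself is submitted with sbatch. The following lines are a generic sketch; the output file name follows the #SBATCH --output directive above, with %A replaced by the actual job ID:

sbatch pi.slrm                                  # submit the job to the queue
squeue -u $USER                                 # check whether it is pending or running
grep "Pi is roughly" out_spark-yarn-pi-<jobid>  # inspect the result after the job has finished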
References
- Apache Spark is here to stay: Talk at Austrian HPC Meeting 2020
- Hadoop for HPC Users: Overview and first steps: Tutorial at Austrian HPC Meeting 2018 by Elmar Kiesling
- Big Data on VSC: VSC training course, including course material
Hadoop Ecosystem
At the Technische Universität Wien there is a Big Data cluster which is mainly used for teaching purposes, but is also used by researchers.