=== Example slurm & python scripts ===

Let us calculate the value of pi by means of the following Python script ''pi.py'':

<code python>
from operator import add
from random import random
import sys

from pyspark.sql import SparkSession

if __name__ == "__main__":
    """
        Usage: pi [partitions]
    """
    spark = SparkSession\
        .builder\
        .appName("PythonPi")\
        .getOrCreate()

    partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
    n = 100000 * partitions

    def f(_):
        x = random() * 2 - 1
        y = random() * 2 - 1
        return 1 if x ** 2 + y ** 2 <= 1 else 0

    count = spark.sparkContext\
        .parallelize(range(1, n + 1), partitions)\
        .map(f)\
        .reduce(add)

    print(f"Pi is roughly {4.0 * count / n}")

    spark.stop()
</code>
The idea of the calculation is to check, very many times, whether a random point in the square [-1, 1] x [-1, 1] lies within the unit circle (i.e. has a distance of less than 1 from the origin). Since we know the formulas for the areas of both the square and the circle, the fraction of points landing inside the circle gives us an estimate of pi.
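To see the estimator without any Spark machinery, here is a minimal serial sketch of the same Monte Carlo idea in plain Python (for illustration only; it is not part of the VSC example). The square has area 4 and the unit circle area pi, so the hit ratio approaches pi/4:

<code python>
from random import random

# Serial Monte Carlo estimate of pi: the fraction of random points in the
# square [-1, 1] x [-1, 1] that fall inside the unit circle approaches
# (area of circle) / (area of square) = pi / 4.
n = 1_000_000
hits = 0
for _ in range(n):
    x = random() * 2 - 1
    y = random() * 2 - 1
    if x ** 2 + y ** 2 <= 1:
        hits += 1

print(f"Pi is roughly {4.0 * hits / n}")
</code>

Since every sample is independent, Spark can simply distribute them across executors, which is why this estimate parallelizes so well.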
Assuming that the script is saved to ''pi.py'', it can be submitted to the queuing system by means of the following slurm script ''pi.slrm'':

<code bash>
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --job-name=spark-yarn-pi
#SBATCH --error=err_spark-yarn-pi-%A
#SBATCH --output=out_spark-yarn-pi-%A
#SBATCH --partition=skylake_0096
#SBATCH --qos=skylake_0096

module purge

# you can choose one of the following python versions
# with the current hadoop/spark modules
# note: the current hadoop version does NOT work with python 3.11
module load python/3.10.7-gcc-12.2.0-5a2kkeu
# module load python/3.9.13-gcc-12.2.0-ctxezzj
# module load python/...

module load openjdk
module load hadoop
module load spark

export PDSH_RCMD_TYPE=ssh

prolog_create_key.sh
. vsc_start_hadoop.sh
. vsc_start_spark.sh

spark-submit --master yarn --deploy-mode client --num-executors 140 \
    --executor-memory 2G $HOME/pi.py 1000

. vsc_stop_spark.sh
. vsc_stop_hadoop.sh

epilog_discard_key.sh
</code>
In this slurm script we have
  * slurm commands, starting with #SBATCH, which set the job name, the maximum execution time and where the output files will go, as well as the slurm qos and partition that should be used,
  * module commands, loading all the modules which are required by our job: python, Java (Hadoop is written in Java), Hadoop, and Spark,
  * setup scripts which create temporary ssh keys, and start Hadoop and Spark services in user context on the nodes of the job,
  * the actual payload of the job, i.e. the call to ''spark-submit'', followed by scripts which stop the Spark and Hadoop services again, and
  * a script to discard the temporary ssh keys.
The script is submitted to the slurm scheduler by executing:

<code>
$ sbatch pi.slrm
</code>
==== MapReduce: Concepts and example ====
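MapReduce processes data in two phases: a //map// step turns each input record into key/value pairs, and a //reduce// step combines all values that share a key. As a minimal illustration of this idea, here is a sketch in plain Python (independent of Hadoop; the word-count data is made up):

<code python>
from itertools import groupby
from operator import itemgetter

lines = ["to be or not to be", "to do is to be"]

# Map phase: emit a (word, 1) pair for every word of the input.
pairs = [(word, 1) for line in lines for word in line.split()]

# Shuffle phase: bring pairs with the same key together (here: by sorting).
pairs.sort(key=itemgetter(0))

# Reduce phase: sum the counts of each distinct word.
counts = {word: sum(c for _, c in group)
          for word, group in groupby(pairs, key=itemgetter(0))}

print(counts)  # {'be': 3, 'do': 1, 'is': 1, 'not': 1, 'or': 1, 'to': 4}
</code>

In Hadoop the same pattern runs distributed: mappers and reducers are separate processes on the nodes of the job, and the framework performs the shuffle between them.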
An example is:

<code bash>
#!/bin/bash
#SBATCH --job-name=simple_mapreduce

module purge
module load openjdk
module load hadoop

export PDSH_RCMD_TYPE=ssh

prolog_create_key.sh
# ... (the MapReduce job itself and the cleanup via epilog_discard_key.sh
# follow here, analogous to the Spark example above)
</code>
To use Big Data on VSC, use these modules:

  * python/3.10.7-gcc-12.2.0-5a2kkeu
  * python/3.9.13-gcc-12.2.0-ctxezzj
  * python/...
  * openjdk/...
  * hadoop
  * spark (spark/...)
  * r

Depending on the application, choose a compatible combination of these modules; as noted above, the current hadoop version does not work with python 3.11. A matching set of ''module load'' commands is sketched below.
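For instance, the combination used by the Spark example above would be loaded like this (a sketch; the version strings are the ones listed here and will change as modules are updated):

<code bash>
# load one python version that works with the current hadoop/spark modules
module purge
module load python/3.10.7-gcc-12.2.0-5a2kkeu
module load openjdk
module load hadoop
module load spark
</code>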