=== Example slurm & python scripts ===

Let us calculate the value of pi by means of the following Python script ''pi.py'':

<code python>
from operator import add
from random import random
import sys

from pyspark.sql import SparkSession

if __name__ == "__main__":
    """
        Usage: pi [partitions]
    """
    spark = SparkSession\
        .builder\
        .appName("PythonPi")\
        .getOrCreate()

    partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
    n = 100000 * partitions

    def f(_):
        x = random() * 2 - 1
        y = random() * 2 - 1
        return 1 if x ** 2 + y ** 2 <= 1 else 0

    count = spark.sparkContext\
        .parallelize(range(1, n + 1), partitions)\
        .map(f)\
        .reduce(add)

    print(f"Pi is roughly {4.0 * count / n}")

    spark.stop()
</code>
The idea of the calculation is to check, very many times, whether a random point in the square [-1, 1] x [-1, 1] lies within the unit circle (i.e. has a distance of less than 1 from the origin). Since we know the formulas for the areas of both the square and the circle, the fraction of points landing inside the circle gives us an estimate of pi.
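To see the estimator without any Spark machinery, here is a minimal serial sketch of the same Monte Carlo idea in plain Python (for illustration only; it is not part of the VSC example). The square has area 4 and the unit circle area pi, so the hit ratio approaches pi/4:

<code python>
from random import random

# Serial Monte Carlo estimate of pi: the fraction of random points in the
# square [-1, 1] x [-1, 1] that fall inside the unit circle approaches
# (area of circle) / (area of square) = pi / 4.
n = 1_000_000
hits = 0
for _ in range(n):
    x = random() * 2 - 1
    y = random() * 2 - 1
    if x ** 2 + y ** 2 <= 1:
        hits += 1

print(f"Pi is roughly {4.0 * hits / n}")
</code>

Since every sample is independent, Spark can simply distribute them across executors, which is why this estimate parallelizes so well.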
Assuming that the script is saved to ''pi.py'', it can be submitted to the queuing system by means of the following slurm script ''pi.slrm'':

<code bash>
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --job-name=spark-yarn-pi
#SBATCH --error=err_spark-yarn-pi-%A
#SBATCH --output=out_spark-yarn-pi-%A
#SBATCH --partition=skylake_0096
#SBATCH --qos=skylake_0096

module purge

# you can choose one of the following python versions
# with the current hadoop/spark modules
# note: the current hadoop version does NOT work with python 3.11
module load python/3.10.7-gcc-12.2.0-5a2kkeu
# module load python/3.9.13-gcc-12.2.0-ctxezzj
# module load python/...

module load openjdk
module load hadoop
module load spark

export PDSH_RCMD_TYPE=ssh

prolog_create_key.sh
. vsc_start_hadoop.sh
. vsc_start_spark.sh

spark-submit --master yarn --deploy-mode client --num-executors 140 \
    --executor-memory 2G $HOME/pi.py 1000

. vsc_stop_spark.sh
. vsc_stop_hadoop.sh

epilog_discard_key.sh
</code>
In this slurm script we have
  * slurm commands, starting with #SBATCH, which set the job name, the maximum execution time and where the output files will go, as well as the slurm qos and partition that should be used,
  * module commands, loading all the modules which are required by our job: python, Java (Hadoop is written in Java), Hadoop, and Spark,
  * setup scripts which create temporary ssh keys, and start Hadoop and Spark services in user context on the nodes of the job,
  * the actual payload of the job, i.e. the call to ''spark-submit'', followed by scripts which stop the Spark and Hadoop services again, and
  * a script to discard the temporary ssh keys.
The script is submitted to the slurm scheduler by executing:

<code>
$ sbatch pi.slrm
</code>
==== MapReduce: Concepts and example ====
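MapReduce processes data in two phases: a //map// step turns each input record into key/value pairs, and a //reduce// step combines all values that share a key. As a minimal illustration of this idea, here is a sketch in plain Python (independent of Hadoop; the word-count data is made up):

<code python>
from itertools import groupby
from operator import itemgetter

lines = ["to be or not to be", "to do is to be"]

# Map phase: emit a (word, 1) pair for every word of the input.
pairs = [(word, 1) for line in lines for word in line.split()]

# Shuffle phase: bring pairs with the same key together (here: by sorting).
pairs.sort(key=itemgetter(0))

# Reduce phase: sum the counts of each distinct word.
counts = {word: sum(c for _, c in group)
          for word, group in groupby(pairs, key=itemgetter(0))}

print(counts)  # {'be': 3, 'do': 1, 'is': 1, 'not': 1, 'or': 1, 'to': 4}
</code>

In Hadoop the same pattern runs distributed: mappers and reducers are separate processes on the nodes of the job, and the framework performs the shuffle between them.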
An example is:

<code bash>
#!/bin/bash
#SBATCH --job-name=simple_mapreduce

module purge
module load openjdk
module load hadoop

export PDSH_RCMD_TYPE=ssh

prolog_create_key.sh
# ... (the MapReduce job itself and the cleanup via epilog_discard_key.sh
# follow here, analogous to the Spark example above)
</code>
To use Big Data on VSC, use these modules:

  * python/3.10.7-gcc-12.2.0-5a2kkeu
  * python/3.9.13-gcc-12.2.0-ctxezzj
  * python/...
  * openjdk/...
  * hadoop
  * spark (spark/...)
  * r

Depending on the application, choose a compatible combination of these modules; as noted above, the current hadoop version does not work with python 3.11. A matching set of ''module load'' commands is sketched below.
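For instance, the combination used by the Spark example above would be loaded like this (a sketch; the version strings are the ones listed here and will change as modules are updated):

<code bash>
# load one python version that works with the current hadoop/spark modules
module purge
module load python/3.10.7-gcc-12.2.0-5a2kkeu
module load openjdk
module load hadoop
module load spark
</code>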