Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Both sides previous revision Previous revision
Next revision
Previous revision
doku:bigdata [2021/06/23 09:25]
dieter [Big Data]
doku:bigdata [2023/05/11 08:52] (current)
katrin
Line 1: Line 1:
 ====== Big Data ====== ====== Big Data ======
  
-VSC has Big Data modules for running +VSC-4 has Big Data modules for running 
   * MapReduce jobs   * MapReduce jobs
     * in several programming languages, including Java, and      * in several programming languages, including Java, and 
Line 40: Line 40:
 === Example slurm & python scripts === === Example slurm & python scripts ===
  
-Let us calculate the value of pi by means of the following Python script <html><span style="color:#cc3300;font-size:100%;">&dzigrarr;</span> </html> ''pi.py'' (from the Spark distribution): +Let us calculate the value of pi by means of the following Python script ''pi.py'' (example originally taken from the Spark distribution):
-<code> +
-from __future__ import print_function+
  
 +<code python>
 +from operator import add
 +from random import random
 import sys import sys
-from random import random + 
-from operator import add +
 from pyspark.sql import SparkSession from pyspark.sql import SparkSession
 + 
 if __name__ == "__main__": if __name__ == "__main__":
     """     """
Line 58: Line 57:
         .appName("PythonPi")\         .appName("PythonPi")\
         .getOrCreate()         .getOrCreate()
 + 
     partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2     partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
     n = 100000 * partitions     n = 100000 * partitions
 + 
     def f(_):     def f(_):
         x = random() * 2 - 1         x = random() * 2 - 1
         y = random() * 2 - 1         y = random() * 2 - 1
         return 1 if x ** 2 + y ** 2 <= 1 else 0         return 1 if x ** 2 + y ** 2 <= 1 else 0
- +  
-    count = spark.sparkContext.parallelize(range(1, n + 1), partitions). +    count = spark.sparkContext
-            map(f).reduce(add) +        .parallelize(range(1, n + 1), partitions)\ 
-    print("Pi is roughly %f" % (4.0 * count / n)+        .map(f)
- +        .reduce(add) 
-    spark.stop()</code>+    pi = 4.0 * count / n 
 +    print(f"Pi is roughly {pi}"
 +  
 +    spark.stop() 
 +</code>
 The idea of the calculation is to check (very often) whether a random point in the area [-1,-1; 1,1] is within a unit circle (i.e. has a distance less than 1 from origin). Since we know formulas for the area of the square as well as the circle we can estimate pi. The idea of the calculation is to check (very often) whether a random point in the area [-1,-1; 1,1] is within a unit circle (i.e. has a distance less than 1 from origin). Since we know formulas for the area of the square as well as the circle we can estimate pi.
  
-In order to submit a batch job to the queuing system we use the SLURM script <html><span style="color:#cc3300;font-size:100%;">&dzigrarr;</span> </html> ''pi.slrm'':+Assuming that the script is saved to ''$HOME/pi.py''' we can use the following SLURM script ''pi.slrm'' to run the code on VSC-4:
  
-<code>+<code bash>
 #!/bin/bash #!/bin/bash
 #SBATCH --nodes=1 #SBATCH --nodes=1
Line 83: Line 86:
 #SBATCH --error=err_spark-yarn-pi-%A #SBATCH --error=err_spark-yarn-pi-%A
 #SBATCH --output=out_spark-yarn-pi-%A #SBATCH --output=out_spark-yarn-pi-%A
 +#SBATCH --partition=skylake_0096
 +#SBATCH --qos=skylake_0096
  
 module purge module purge
-module load python/3.8.0-gcc-9.1.0-wkjbtaa + 
-module load openjdk/11.0.2-gcc-9.1.0-ayy5f5t+# you can choose one of the following python versions 
 +# with the current hadoop/spark installation on VSC-4 
 +# note: the current spark version does NOT work with python 3.11 
 +module load python/3.10.7-gcc-12.2.0-5a2kkeu 
 +module load python/3.9.13-gcc-12.2.0-ctxezzj 
 +# module load python/3.8.12-gcc-12.2.0-tr7w5qy 
 + 
 +module load openjdk
 module load hadoop module load hadoop
 module load spark module load spark
 +
 +export PDSH_RCMD_TYPE=ssh
  
 prolog_create_key.sh prolog_create_key.sh
 . vsc_start_hadoop.sh . vsc_start_hadoop.sh
 . vsc_start_spark.sh . vsc_start_spark.sh
- 
-export SPARK_EXAMPLES=${HOME}/BigData/SparkDistributionExamples/ 
  
 spark-submit --master yarn --deploy-mode client --num-executors 140 \ spark-submit --master yarn --deploy-mode client --num-executors 140 \
-      --executor-memory 2G $SPARK_EXAMPLES/src/main/python/pi.py 1000+      --executor-memory 2G $HOME/pi.py 1000 
 + 
 +. vsc_stop_spark.sh 
 +. vsc_stop_hadoop.sh
  
 epilog_discard_key.sh epilog_discard_key.sh
Line 103: Line 118:
  
 In this slurm script we have  In this slurm script we have 
-  * slurm commands, starting with #SBATCH to set the job name, the maximum execution time and where the output files will go,+  * slurm commands, starting with #SBATCH to set the job name, the maximum execution time and where the output files will go, as well as the slurm qos and partition that should be used,
   * module commands, loading all the modules which are required by our job: python, Java (Hadoop is written in Java), Hadoop, and Spark,   * module commands, loading all the modules which are required by our job: python, Java (Hadoop is written in Java), Hadoop, and Spark,
   * setup scripts which create temporary ssh keys, and start Hadoop and Spark services in user context on the nodes of the job,   * setup scripts which create temporary ssh keys, and start Hadoop and Spark services in user context on the nodes of the job,
Line 109: Line 124:
   * a script to discard the temporary ssh keys.   * a script to discard the temporary ssh keys.
  
-The script is submitted to the slurm scheduler by  +The script is submitted to the slurm scheduler by executing: 
- +<code> 
-  sbatch pi.slrm+sbatch pi.slrm 
 +</code>
  
 ==== MapReduce: Concepts and example ==== ==== MapReduce: Concepts and example ====
Line 122: Line 138:
 An example is  An example is 
  
-<code>+<code bash>
 #!/bin/bash #!/bin/bash
 #SBATCH --job-name=simple_mapreduce #SBATCH --job-name=simple_mapreduce
Line 130: Line 146:
  
 module purge module purge
-module load openjdk/11.0.2-gcc-9.1.0-ayy5f5t+module load openjdk
 module load hadoop module load hadoop
 +
 +export PDSH_RCMD_TYPE=ssh
  
 prolog_create_key.sh prolog_create_key.sh
Line 161: Line 179:
 To use Big Data on VSC use those modules: To use Big Data on VSC use those modules:
  
-  * python/3.8.0-gcc-9.1.0-wkjbtaa +  * python (note: the current spark version does NOT work with python 3.11) 
-  * openjdk/11.0.2-gcc-9.1.0-ayy5f5t +    * python/3.10.7-gcc-12.2.0-5a2kkeu 
-  * hadoop +    * python/3.9.13-gcc-12.2.0-ctxezzj 
-  * spark+    * python/3.8.12-gcc-12.2.0-tr7w5qy 
 +  * openjdk/11.0.15_10-gcc-11.3.0-ih4nksq 
 +  * hadoop (hadoop/3.3.2-gcc-11.3.0-ypaxzwt) 
 +  * spark (spark/3.1.1-gcc-11.3.0-47oqksq)
   * r   * r
 +
  
 Depending on the application, choose the right combination: Depending on the application, choose the right combination:
Line 177: Line 199:
  
 What is VSC actually doing to run a Big Data job using Hadoop and/or Spark? What is VSC actually doing to run a Big Data job using Hadoop and/or Spark?
 +
 +== HPC ==
  
 User jobs on VSC are submitted using Slurm, thus making sure that a job gets a well defined execution environment which is similar in terms of resources and performance for each similar job execution. User jobs on VSC are submitted using Slurm, thus making sure that a job gets a well defined execution environment which is similar in terms of resources and performance for each similar job execution.
 +
 +== Big Data ==
  
 Big Data jobs on the other hand are often executed on Big Data clusters, which make sure that  Big Data jobs on the other hand are often executed on Big Data clusters, which make sure that 
Line 187: Line 213:
 Much of this work is done by Yarn (Yet another resource negotiator), which is a scheduler (and runs usually at a similar level on Big Data clusters as Slurm is running on VSC). Much of this work is done by Yarn (Yet another resource negotiator), which is a scheduler (and runs usually at a similar level on Big Data clusters as Slurm is running on VSC).
  
-In our setup on VSC we combine those two worlds of HPC and Big data by starting Yarn and other services in user context within a job on the nodes belonging to the job. The other services are HDFS (Hadoop distributed file systemand Spark.+== Combination of HPC and Big Data == 
 + 
 +In our setup on VSC we combine those two worlds of HPC and Big data by starting Yarn, HDFS and spark in user context within a job (on the nodes belonging to the job).
 User data can be accessed  User data can be accessed 
   * from the standard VSC paths using Spectrum Scale (formerly GPFS), or   * from the standard VSC paths using Spectrum Scale (formerly GPFS), or
-  * from HDFS, which is started locally on the local SSDs of the compute nodes. In this case the data has to be copied in and/or out before using it. After a job run, the local HDFS instance is destroyed and remaining data is lost.+  * from HDFS, which is started locally on the local SSDs of the compute nodes. In this case the data has to be copied in and/or out at the beginning/end of the job. After a job run, the local HDFS instance is destroyed and remaining data is lost.
  
-VSC has several scriptsmost are optionalwhich are added to the path using '''module load hadoop''' and '''module load spark''':+VSC has several scripts most are optional which are added to the path using '''module load hadoop''' and '''module load spark''':
   * prolog_create_key.sh and epilog_discard_key.sh, which are required to start services on other nodes, but also to access the local (master) node.   * prolog_create_key.sh and epilog_discard_key.sh, which are required to start services on other nodes, but also to access the local (master) node.
   * vsc_start_hadoop.sh, which starts Yarn (scheduler) and HDFS (distributed file system) on the nodes of the job. (If only one node is used this is optional, since Spark can also run with its own scheduler and without HDFS.)   * vsc_start_hadoop.sh, which starts Yarn (scheduler) and HDFS (distributed file system) on the nodes of the job. (If only one node is used this is optional, since Spark can also run with its own scheduler and without HDFS.)
  • doku/bigdata.1624440309.txt.gz
  • Last modified: 2021/06/23 09:25
  • by dieter