====== Big Data ======
VSC-4 has Big Data modules for running
  * MapReduce jobs
    * in several programming languages, including Java, and
    * using executables which can stream data from standard input to standard output (e.g. grep, cut, tr, sed, ...), and
  * Spark jobs using
    * Python,
    * R,
    * Scala, and
    * SQL.
Data can be stored either on standard file systems (using Spectrum Scale, formerly GPFS) or on HDFS (Hadoop distributed file system) locally on the nodes of a job.
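In Spark code this choice is visible only in the path of the input or output data. A minimal sketch, assuming hypothetical files on both file systems:

<code python>
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("StorageBackends").getOrCreate()

# read from the shared parallel file system (Spectrum Scale/GPFS);
# the path is a made-up example
gpfs_lines = spark.sparkContext.textFile("file:///home/user/data.txt")

# read from the job-local HDFS instance (made-up path as well)
hdfs_lines = spark.sparkContext.textFile("hdfs:///user/data.txt")

print(gpfs_lines.count(), hdfs_lines.count())
spark.stop()
</code>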
==== Big Data and HPC ====
Among the advantages of Big Data, also in HPC environments like VSC, we want to mention
  * very easy coding, in many languages, including data centric ones like R or SQL,
  * automatic parallelization without any changes in the user program, and
  * extremely good scaling behaviour.
Users have to be aware of the difficulties of running Big Data jobs on HPC clusters, namely shared resources and granularity.
  * File systems are shared by all users. Users with high file system demand should use the local HDFS if the same data is read more than once in a job. Please do not submit many Big Data jobs at the same time, e.g. as Slurm array jobs!
  * Login nodes and internet connections are shared by all users. Bringing huge amounts of data to VSC takes some time. Please make sure that other users of VSC are not affected!
  * Only very coarse-grained parallelization will work with Big Data frameworks. Typically, parallelization is done by partitioning the input data (automatically) and doing the majority of the calculation independently in separate processes (see the sketch below).
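In Spark, for instance, the user only chooses the number of partitions; distributing the partitions and the work on them is left to the framework. A minimal sketch, with arbitrarily chosen numbers:

<code python>
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Granularity").getOrCreate()

# distribute one million numbers over 140 coarse-grained partitions;
# each partition is then processed independently by one task
rdd = spark.sparkContext.parallelize(range(1000000), 140)
print(rdd.getNumPartitions())           # 140
print(rdd.map(lambda x: x * x).sum())   # computed partition by partition

spark.stop()
</code>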
In the context of high performance computing a typical application of Big Data methods is pre- and/or postprocessing of large amounts of data (see the sketch after this list), e.g.
  * filtering,
  * data cleaning,
  * discarding input fields which are not required,
  * counting,
  * ...
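A preprocessing pipeline of this kind could look as follows in PySpark. This is only a sketch: the input file ''input.csv'', the comment character, and the selected fields are made-up examples:

<code python>
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Preprocessing").getOrCreate()

lines = spark.sparkContext.textFile("input.csv")            # hypothetical input

cleaned = (lines
           .filter(lambda l: l and not l.startswith("#"))   # filtering/cleaning
           .map(lambda l: ",".join(l.split(",")[:2])))      # keep first two fields

print(cleaned.count())                  # counting
cleaned.saveAsTextFile("cleaned_out")   # hypothetical output directory
spark.stop()
</code>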
Frameworks and applications using Big Data methods are shared among many scientific communities.
==== Apache Spark on VSC ====
There are many frameworks in Data Science that are using ''Apache Spark''.
=== Example slurm & python scripts ===
Let us calculate the value of pi by means of the following Python script:
<code python>
from __future__ import print_function

import sys
from random import random
from operator import add

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PythonPi").getOrCreate()

partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
n = 100000 * partitions

def f(_):
    # draw a random point in [-1,1] x [-1,1]; 1 if it is inside the unit circle
    x = random() * 2 - 1
    y = random() * 2 - 1
    return 1 if x ** 2 + y ** 2 <= 1 else 0

count = spark.sparkContext.parallelize(range(1, n + 1), partitions) \
            .map(f).reduce(add)
print("Pi is roughly %f" % (4.0 * count / n))

spark.stop()
</code>
The idea of the calculation is to check (very often) whether a random point in the area [-1,-1; 1,1] is within the unit circle (i.e. has a distance of less than 1 from the origin). Since we know the formulas for the areas of the square and the circle, we can estimate pi: the fraction of points inside the circle approaches pi/4.
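The same idea, without Spark, as a minimal plain-Python sketch (the sample size is an arbitrary choice):

<code python>
from random import random

n = 1000000   # number of random points; arbitrary choice
inside = 0
for _ in range(n):
    x = random() * 2 - 1    # random point in [-1,1] x [-1,1]
    y = random() * 2 - 1
    if x * x + y * y <= 1:  # inside the unit circle?
        inside += 1

# the ratio of the areas (circle : square) is pi/4
print("Pi is roughly %f" % (4.0 * inside / n))
</code>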
In order to submit this script to the queueing system, we use the following Slurm script ''pi.slrm'':
<code bash>
#!/bin/bash
#SBATCH --nodes=1

export SPARK_EXAMPLES=${HOME}/

spark-submit --master yarn --deploy-mode client --num-executors 140 \
    --executor-memory 2G $SPARK_EXAMPLES/
epilog_discard_key.sh
</code>

The job is then submitted with ''sbatch pi.slrm''.
==== MapReduce: Concepts and example ====
MapReduce splits work into the phases
  * map: a user-supplied function processes each part of the input data independently and in parallel,
  * shuffle: intermediate results are grouped by key, and
  * reduce: a user-supplied function combines the grouped intermediate results into the final output.
An example is
<code bash>
#!/bin/bash
#SBATCH --job-name=simple_mapreduce
# note: the time limit below is an assumed value
#SBATCH --nodes=1 --time=00:10:00
#SBATCH --error=simple_mapreduce_err
#SBATCH --output=simple_mapreduce_out

module purge
module load hadoop   # exact module names: see "Available modules" below

# run a Hadoop streaming job; the input path tmp_in and the reducer
# command are assumptions in this sketch
mapred streaming -input tmp_in \
    -output tmp_out -mapper /bin/cat -reducer '/usr/bin/wc -l'

# check output in simple_mapreduce_out and simple_mapreduce_err
# check job output on HDFS
hdfs dfs -ls tmp_out
</code>
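Instead of existing executables like ''/bin/cat'' or ''wc'', user-written scripts can serve as mapper and reducer. As a hypothetical sketch, a word-count mapper in Python could look like this; it would be passed via ''-mapper'', together with a reducer that sums the counts per word:

<code python>
#!/usr/bin/env python3
# mapper: read text from standard input and emit "word<TAB>1" per word,
# following the Hadoop streaming convention of tab-separated key/value pairs
import sys

for line in sys.stdin:
    for word in line.split():
        print(word + "\t1")
</code>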
==== Further examples ====
For examples in Scala, R, and SQL, including slurm scripts to run on VSC, we want to refer to the course material in [[ https:// ]].
==== Available modules ====
To use Big Data on VSC, use these modules:
  * R => openjdk + hadoop + spark + r
==== Background ====
What is VSC actually doing to run a Big Data job using Hadoop and/or Spark?

== HPC ==

User jobs on VSC are submitted using Slurm, thus making sure that a job gets a well-defined execution environment, which is similar in terms of resources and performance for each similar job execution.
== Big Data ==

Big Data jobs, on the other hand, are often executed on Big Data clusters, which make sure that
  * programs are started on the nodes where the data is stored, thus reducing communication overhead, and
  * load balancing is done automatically.
Much of this work is done by Yarn (Yet Another Resource Negotiator), which is part of Hadoop.
== Combination of HPC and Big Data ==

In our setup on VSC we combine those two worlds of HPC and Big Data by starting Yarn, HDFS and Spark in user context within a job (on the nodes belonging to the job).
User data can be accessed
  * from the standard VSC paths using Spectrum Scale (formerly GPFS), or
  * from HDFS, which is started on the local SSDs of the compute nodes. In this case the data has to be copied in and/or out (e.g. with ''hdfs dfs -put'' and ''hdfs dfs -get'') at the beginning/end of the job.
VSC has several scripts, most of them optional, which are added to the path when loading the modules:
  * prolog_create_key.sh and epilog_discard_key.sh,
  * vsc_start_hadoop.sh,
  * vsc_start_spark.sh,
  * vsc_stop_hadoop.sh, and
  * vsc_stop_spark.sh.
==== Hadoop ====
At the Technische Universität Wien there is a [[doku:lbd | Big Data cluster]], mainly for teaching purposes, which is also used by researchers.