There are many frameworks in Data Science that use Apache Spark™, for instance ADAM for genomic analysis, GeoTrellis for geospatial data analysis, or Koalas (distributed pandas).
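As a minimal sketch of the last of these (assuming the databricks-koalas package is installed in the Python environment alongside PySpark; the names and values below are only illustrative), Koalas exposes a pandas-like API whose operations are executed as distributed Spark jobs:

```python
# Hypothetical illustration: requires the databricks-koalas package next to PySpark.
import databricks.koalas as ks

# A Koalas DataFrame looks like a pandas DataFrame but is backed by a Spark DataFrame.
kdf = ks.DataFrame({"x": [1, 2, 3, 4], "y": [10.0, 20.0, 30.0, 40.0]})

print(kdf.describe())         # summary statistics, computed by Spark
print(kdf[kdf.x > 2].head())  # pandas-style boolean indexing
```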
Let us calculate the value of pi by means of the following Python script pi.py. It uses a Monte Carlo estimate: points are sampled uniformly in the square [-1, 1] × [-1, 1], and since the fraction landing inside the unit circle approximates pi/4, four times that fraction approximates pi:
```python
import sys
from random import random
from operator import add

from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession\
        .builder\
        .appName("PythonPi")\
        .getOrCreate()

    partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
    n = 100000 * partitions

    def f(_):
        x = random() * 2 - 1
        y = random() * 2 - 1
        return 1 if x ** 2 + y ** 2 <= 1 else 0

    count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
    print("Pi is roughly %f" % (4.0 * count / n))

    spark.stop()
```
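For a quick interactive test, the same script can also be run without the queuing system by using a local master, for example `$SPARK_HOME/bin/spark-submit --master local[4] pi.py 100` (the core count `4` and the partition count `100` are only example values).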
In order to submit a batch job to the queuing system, we use the SLURM script pi.slrm shown below:
```bash
#!/bin/bash
#SBATCH --job-name=spark-wordcount
#SBATCH --nodes=4
#SBATCH --time=00:10:00
#SBATCH --partition=binf

module purge
module load anaconda3 hadoop/2.7 apache_spark/2.4.4

. vsc_start_spark.sh   # runs on all nodes

command="$SPARK_HOME/bin/spark-submit --master spark://${SPARK_MASTER_HOST}:${SPARK_MASTER_PORT} --total-executor-cores $NUMCORES --executor-memory 2G $WDIR/pi.py 10000"
echo $command
$command
```
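The job is submitted with `sbatch pi.slrm` and can be monitored with `squeue -u $USER`; by default, the output of the `echo` and `print` statements ends up in `slurm-<jobid>.out` in the submission directory.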
LBD has the following hardware setup:
The namenode c100 is also called lbd and is reachable from within the TU Wien domain at lbd.zserv.tuwien.ac.at. Each of the 20 nodes (c100-c118, h1) has
Apart from two extra Ethernet devices for external connections on h1 and c100, all nodes have the same hardware configuration. All Ethernet connections (external and inter-node) support a speed of 10 Gb/s.
Name | Description | Status |
---|---|---|
CentOS 7 | Operating system | OK |
XCAT | Deployment environment | OK |
Cloudera Manager | Big Data Deployment | OK |
Cloudera HDFS | Hadoop distributed file system | OK |
Cloudera Accumulo | Key/value store | OK |
Cloudera HBase | Database on top of HDFS | OK |
Cloudera Hive | Data warehouse using SQL | OK |
Cloudera Hue | Hadoop user experience, web gui, SQL analytics workbench | OK |
Cloudera Impala | SQL query engine, used by Hue | OK |
Oozie | Workflow scheduler system to manage Apache Hadoop jobs, used by Hue | OK |
Cloudera Solr | Open-source enterprise search platform, used by Hue and by the Key-Value Store Indexer | OK |
Cloudera Key-Value Store Indexer | Uses the Lily HBase NRT Indexer to index the stream of records being added to HBase tables; indexing allows querying data stored in HBase with the Solr service | OK |
Cloudera Spark (Spark 2) | Cluster-computing framework with Scala 2.10 (2.11) | OK |
Cloudera YARN (MR2 Included) | Yet Another Resource Negotiator (cluster management) | OK |
Cloudera ZooKeeper | ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. | OK |
Java 1.8 | Programming language | OK |
Python 3.6.3 (python3.6), Python 3.4.5 (python3.4), Python 2.7.5 (python2) | Programming language | OK |
Anaconda Python (python) | `export PATH=/home/anaconda3/bin/:$PATH` | OK |
Jupyter | Notebook, requires Anaconda | OK |
MongoDB | Requires disk space, not on all nodes | Beta testing |
Kafka | Processing of data streams | Beta testing |
Cassandra | Requires disk space, not on all nodes | TODO |
Storm | Rather Spark Streaming? | On further request |
Drill | - | |
Flume | - | |
Kudu | - | |
Zeppelin | - | |
Giraph | - | TODO |
To use a Jupyter notebook, connect to https://lbd.zserv.tuwien.ac.at:8000 and log in with your user credentials.
Start a new notebook, e.g. Python3, PySpark3, a terminal, …
A short example: new → PySpark3
```python
import pyspark
import random

sc = pyspark.SparkContext(appName="Pi")
num_samples = 10000

def inside(p):
    x, y = random.random(), random.random()
    return x*x + y*y < 1

count = sc.parallelize(range(0, num_samples)).filter(inside).count()
pi = 4 * count / num_samples
print(pi)

sc.stop()
```