==== The Big Data Cluster "LBD" ====

The Big Data Cluster (LBD) of the TU Wien is used for teaching and research and runs an (extended) **Hadoop** software stack. It is designed to exploit easily available parallelism through **automatic parallelization** of programs written in **Python**, **Java**, **Scala**, and **R**. Typical programs use either

  * MapReduce, or
  * Spark

for parallelization.

==== Available software ====

  * Rocky Linux
  * Openstack
  * Jupyter
  * RStudio Server
  * Parallel file systems: HDFS, Ceph
  * Scheduler: YARN
  * HBase
  * Hive
  * Hue
  * Kafka
  * Oozie
  * Solr
  * Spark
  * Zookeeper

=== Access ===

  * **Usage on request:** hadoop@tuwien.ac.at
  * **Support:** hadoop-support@tuwien.ac.at

=== Hardware ===

LBD consists of

  * 1 namenode
  * 18 datanodes
  * login nodes
  * support nodes

The login nodes are reachable from TUnet, the internal network of the TU Wien, via [[https://lbd.tuwien.ac.at]] or [[ssh://login.tuwien.ac.at]].

The hardware -- which is then virtualized using Openstack -- consists of

  * two Xeon E5-2650 v4 CPUs with 24 virtual cores each (48 cores per node, 864 worker cores in total)
  * 256 GB RAM per node (4.5 TB of memory available to the worker nodes in total)
  * four hard disks per node, each with a capacity of 4 TB (16 TB per node, 288 TB for the worker nodes in total)

=== HDFS configuration ===

  * current version: Hadoop 3
  * block size: 128 MiB
  * default replication factor: 3

=== Jupyter Notebook ===

Most users access the LBD cluster via Jupyter Notebooks.

=== Example code ===

A short example using Spark and Python, estimating pi by Monte Carlo sampling:

<code python>
import pyspark
import random

sc = pyspark.SparkContext(appName="Pi")
num_samples = 10000

def inside(p):
    # The element p is ignored; each call draws a random point in the
    # unit square and tests whether it lies inside the quarter circle.
    x, y = random.random(), random.random()
    return x*x + y*y < 1

# Distribute the samples across the cluster and count the hits.
count = sc.parallelize(range(0, num_samples)).filter(inside).count()

# The fraction of hits approximates pi/4.
pi = 4 * count / num_samples
print(pi)

sc.stop()
</code>
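Data stored on HDFS can be read directly from Spark in the same way. The following is a minimal sketch using the Spark DataFrame API; the application name and the path ''/user/myuser/data.csv'' are placeholders for illustration, not files that exist on the cluster.

<code python>
# Minimal sketch: reading a CSV file from HDFS with the Spark DataFrame API.
# The path /user/myuser/data.csv is a hypothetical example; replace it with
# a file that actually exists in your HDFS home directory.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("HDFSExample").getOrCreate()

# On a Hadoop cluster with HDFS as the default file system, a plain path
# typically resolves to hdfs://... automatically.
df = spark.read.csv("/user/myuser/data.csv", header=True, inferSchema=True)

df.printSchema()
print(df.count())

spark.stop()
</code>

The same code runs unchanged in a Jupyter Notebook cell or when submitted as a batch job; only the input path and the resource requests differ.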