==== The Big Data Cluster "LBD" ====

The Big Data Cluster (LBD) of the TU Wien is used for teaching and research and runs an (extended) **Hadoop** software stack. It is designed to exploit easily available parallelism through **automatic parallelization** of programs written in **Python**, **Java**, **Scala**, and **R**. Typical programs use either

  * MapReduce, or
  * Spark

for parallelization.

==== Available software ====

  * Rocky Linux
  * Openstack
  * Jupyter
  * RStudio Server
  * Parallel file systems: HDFS, Ceph
  * Scheduler: YARN
  * HBase
  * Hive
  * Hue
  * Kafka
  * Oozie
  * Solr
  * Spark
  * Zookeeper

=== Access ===

  * **Usage on request:** hadoop@tuwien.ac.at
  * **Support:** hadoop-support@tuwien.ac.at

=== Hardware ===

LBD consists of

  * 1 namenode
  * 18 datanodes
  * login nodes
  * support nodes

The login nodes are reachable from TUnet, the internal network of the TU Wien, via [[https://lbd.tuwien.ac.at]] or [[ssh://login.tuwien.ac.at]].

The hardware -- which is then virtualized using Openstack -- consists of

  * two Xeon E5-2650 v4 CPUs with 24 virtual cores each (48 cores per node, 864 worker cores in total)
  * 256 GB RAM per node (4.5 TB of memory available to the worker nodes in total)
  * four hard disks per node, each with a capacity of 4 TB (16 TB per node, 288 TB for the worker nodes in total)

=== HDFS configuration ===

  * current version: Hadoop 3
  * block size: 128 MiB
  * default replication factor: 3

=== Jupyter Notebook ===

Most users access the LBD cluster via Jupyter Notebooks.

=== Example code ===

A short example using Spark and Python, estimating pi by Monte Carlo sampling:

<code python>
import pyspark
import random

sc = pyspark.SparkContext(appName="Pi")
num_samples = 10000

def inside(p):
    # The element p is ignored; each call draws a random point in the
    # unit square and tests whether it lies inside the quarter circle.
    x, y = random.random(), random.random()
    return x*x + y*y < 1

# Distribute the samples across the cluster and count the hits.
count = sc.parallelize(range(0, num_samples)).filter(inside).count()

# The fraction of hits approximates pi/4.
pi = 4 * count / num_samples
print(pi)

sc.stop()
</code>
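Data stored on HDFS can be read directly from Spark in the same way. The following is a minimal sketch using the Spark DataFrame API; the application name and the path ''/user/myuser/data.csv'' are placeholders for illustration, not files that exist on the cluster.

<code python>
# Minimal sketch: reading a CSV file from HDFS with the Spark DataFrame API.
# The path /user/myuser/data.csv is a hypothetical example; replace it with
# a file that actually exists in your HDFS home directory.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("HDFSExample").getOrCreate()

# On a Hadoop cluster with HDFS as the default file system, a plain path
# typically resolves to hdfs://... automatically.
df = spark.read.csv("/user/myuser/data.csv", header=True, inferSchema=True)

df.printSchema()
print(df.count())

spark.stop()
</code>

The same code runs unchanged in a Jupyter Notebook cell or when submitted as a batch job; only the input path and the resource requests differ.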