"Little" Big Data (LBD) Cluster


  • usage on request:
  • support:



LBD has the following hardware setup:

  • 2 namenodes
  • 18 datanodes
  • an administrative server, which serves as
    • Cloudera Manager server
    • backup for administrative data
  • a ZFS file server for /home with 300TB of storage space

The login node is reachable from within the tuwien domain under

Each of the nodes has:

  • two Xeon E5-2650 v4 CPUs with 24 virtual cores each (48 virtual cores per node, 864 worker cores in total)
  • 256 GB RAM (4.5 TB of memory in total across the worker nodes)
  • four hard disks with a capacity of 4 TB each (16 TB per node, 288 TB in total across the worker nodes)
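The totals in parentheses follow from multiplying the per-node figures by the 18 datanodes. A minimal sketch of that arithmetic, assuming 1 TB = 1024 GB for the memory total (the variable names are illustrative only):

```python
# Per-node figures taken from the hardware list; names are illustrative.
DATANODES = 18
CORES_PER_NODE = 2 * 24       # two CPUs with 24 virtual cores each
RAM_GB_PER_NODE = 256
DISK_TB_PER_NODE = 4 * 4      # four disks of 4 TB each

total_cores = DATANODES * CORES_PER_NODE
total_ram_tb = DATANODES * RAM_GB_PER_NODE / 1024   # assumes 1 TB = 1024 GB
total_disk_tb = DATANODES * DISK_TB_PER_NODE

print(total_cores, total_ram_tb, total_disk_tb)     # 864 4.5 288
```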

Apart from two extra Ethernet devices for external connections, all nodes have the same hardware configuration. All Ethernet connections (external and inter-node) support a speed of 10 Gb/s.

HDFS configuration

  • current version: Hadoop 3
  • block size: 128 MiB
  • default replication factor: 3
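With these settings, HDFS splits every file into blocks of at most 128 MiB and stores each block three times. The following sketch estimates the block count and the raw disk space a file of a given size would occupy. The function name `hdfs_footprint` is hypothetical and not part of Hadoop, and the estimate ignores metadata and checksum overhead.

```python
import math

BLOCK_SIZE = 128 * 1024**2   # 128 MiB, as configured on LBD
REPLICATION = 3              # default replication factor

def hdfs_footprint(file_size_bytes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return (number of blocks, raw bytes stored across all replicas)."""
    blocks = math.ceil(file_size_bytes / block_size)
    # HDFS does not pad the last block, so raw usage is size * replication.
    raw_bytes = file_size_bytes * replication
    return blocks, raw_bytes

blocks, raw = hdfs_footprint(1024**3)   # a 1 GiB file
print(blocks, raw / 1024**3)            # 8 blocks, 3.0 GiB of raw storage
```

By the same reasoning, the 288 TB of raw datanode disk can hold at most about 96 TB of user data at replication factor 3, before any space used outside HDFS is subtracted.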

Available software

Name | Comment | Status
CentOS 7 | Operating system | OK
xCAT | Deployment environment | OK
Cloudera Manager | Big Data deployment | OK
Cloudera HDFS | Hadoop distributed file system | OK
Cloudera Accumulo | Key/value store | OK
Cloudera HBase | Database on top of HDFS | OK
Cloudera Hive | Data warehouse using SQL | OK
Cloudera Hue | Hadoop user experience, web GUI, SQL analytics workbench | OK
Cloudera Impala | SQL query engine, used by Hue | OK
Oozie | Workflow scheduler system to manage Apache Hadoop jobs, used by Hue | OK
Cloudera Solr | Open source enterprise search platform, used by Hue and by the Key-Value Store Indexer | OK
Cloudera Key-Value Store Indexer | Uses the Lily HBase NRT Indexer to index the stream of records being added to HBase tables. Indexing allows you to query data stored in HBase with the Solr service. | OK
Cloudera Spark (Spark 2) | Cluster-computing framework with Scala 2.10 (2.11) | OK
Cloudera YARN (MR2 included) | Yet Another Resource Negotiator (cluster management) | OK
Cloudera ZooKeeper | Centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services | OK
Java 1.8 | Programming language | OK
Python 3.6.3 (python3.6), Python 3.4.5 (python3.4), Python 2.7.5 (python2) | Programming language | OK
Anaconda Python (python) | export PATH=/home/anaconda3/bin/:$PATH | OK
Jupyter Notebook | Requires Anaconda | OK
MongoDB | Requires disk space, not on all nodes | Beta testing
Kafka | Data stream processing | Beta testing
Cassandra | Requires disk space, not on all nodes | TODO
Storm | Spark Streaming instead? | On further request
Drill | | -
Flume | | -
Kudu | | -
Zeppelin | | -
Giraph | | TODO

Jupyter Notebook

Most users use the LBD cluster via Jupyter Notebooks.

To use a Jupyter notebook, connect to, and log in with your user credentials.

Start a new notebook, e.g. Python3, PySpark3, a terminal, …

A short example: new → PySpark3

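import random
import pyspark

# Estimate pi by Monte Carlo sampling: the fraction of random points in the
# unit square that fall inside the quarter circle approaches pi/4.
sc = pyspark.SparkContext(appName="Pi")
num_samples = 10000

def inside(_):
    # The element value is ignored; each call draws one random point.
    x, y = random.random(), random.random()
    return x*x + y*y < 1

count = sc.parallelize(range(num_samples)).filter(inside).count()
pi = 4 * count / num_samples
print(pi)
sc.stop()  # release the context so the cell can be run again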
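To sanity-check the sampling logic without a Spark context, the same Monte Carlo estimate can be run in plain Python. This is only a sketch for local testing and not part of the cluster setup; the function name `estimate_pi` and the fixed seed are illustrative choices.

```python
import random

def estimate_pi(num_samples, seed=None):
    """Estimate pi from the share of random points inside the quarter circle."""
    rng = random.Random(seed)   # a private generator keeps runs reproducible
    hits = sum(
        1 for _ in range(num_samples)
        if rng.random() ** 2 + rng.random() ** 2 < 1
    )
    return 4 * hits / num_samples

print(estimate_pi(100_000, seed=42))
```

The result should lie close to 3.14; the statistical error shrinks roughly with the square root of the number of samples.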
doku/lbd.txt · Last modified: 2021/05/27 07:05 by dieter