- | ==== "Little" Big Data (LBD) Cluster | + | ==== The Big Data Cluster |
+ | |||
The Big Data Cluster of TU Wien is used for teaching and research and runs an (extended) **Hadoop** software stack.
It is designed to exploit readily available parallelism through **automatic parallelization** of programs written in **Python**, **Java**, **Scala** and **R**.

Typical programs use either
  * MapReduce, or
  * Spark
for parallelization (see the MapReduce sketch below and the Spark example at the end of this page).
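
As an illustration of the MapReduce style, the following is a minimal word-count mapper and reducer written for Hadoop Streaming. This is only a sketch: the file names ''mapper.py'' and ''reducer.py'', the input data, and the way the job is submitted (e.g. via the Hadoop Streaming jar) are assumptions, not cluster-specific instructions.

<code python>
# mapper.py - reads text lines from stdin and emits "word<TAB>1" pairs
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
</code>

<code python>
# reducer.py - sums the counts per word; Hadoop sorts the mapper output by key,
# so all lines for the same word arrive consecutively.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
</code>

In practice the same computation is usually written far more compactly with Spark, as in the example at the end of this page.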
+ | |||
==== Available software ====

  * Rocky Linux
  * OpenStack
  * Jupyter
  * RStudio Server
  * Parallel file system: HDFS, Ceph
  * Scheduler: YARN
  * HBase
  * Hive
  * Hue
  * Kafka
  * Oozie
  * Solr
  * Spark
  * ZooKeeper
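
Since data on the cluster normally lives in HDFS and is processed with Spark, a typical first step is reading a file from HDFS into a Spark DataFrame. The snippet below is a sketch only; the HDFS paths and the application name are placeholders, not actual files on the cluster.

<code python>
from pyspark.sql import SparkSession

# Start (or attach to) a Spark session; on the cluster, Spark jobs run on YARN.
spark = SparkSession.builder.appName("hdfs-example").getOrCreate()

# Read a CSV file from HDFS; the path is a placeholder, not a real dataset.
df = spark.read.csv("hdfs:///user/<username>/example.csv",
                    header=True, inferSchema=True)
df.printSchema()
print(df.count())

# Write the result back to HDFS in Parquet format (again a placeholder path).
df.write.mode("overwrite").parquet("hdfs:///user/<username>/example_parquet")

spark.stop()
</code>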
=== Access ===
  * **usage on request:** hadoop@tuwien.ac.at
  * **support:**
- | |||
- | {{: | ||
=== Hardware ===
LBD consists of
  * 1 namenode
  * 18 datanodes
  * login nodes
  * support nodes

The login nodes are reachable from TUnet.

Each worker node has:
  * two Xeon E5-2650 v4 CPUs with 24 virtual cores each (total of 48 cores per node, 864 total worker cores)
  * 256GB RAM (total of 4.5TB memory available to worker nodes)
  * four hard disks, each with a capacity of 4TB (total of 16TB per node, 288TB total for worker nodes)
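
As an illustration of how these numbers translate into job configuration, the sketch below sizes Spark executors for worker nodes with 48 cores and 256GB of RAM. The concrete values (5 cores and 16GB per executor, 18 executors) are illustrative assumptions, not cluster policy; the actual limits are set by the YARN configuration.

<code python>
from pyspark.sql import SparkSession

# Illustrative executor sizing for 48-core / 256GB worker nodes:
# 5 cores and 16GB per executor leave headroom for the OS and YARN daemons.
# These values are assumptions for this sketch, not recommended settings.
spark = (
    SparkSession.builder
    .appName("sizing-example")
    .config("spark.executor.cores", "5")
    .config("spark.executor.memory", "16g")
    .config("spark.executor.instances", "18")
    .getOrCreate()
)

# Shows how many tasks Spark would run in parallel with this configuration.
print(spark.sparkContext.defaultParallelism)
spark.stop()
</code>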
=== HDFS configuration ===
  * default replication factor: 3
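
With a replication factor of 3, every HDFS block is stored on three different datanodes, so the 288TB of raw datanode disk space corresponds to roughly 288TB / 3 ≈ 96TB of usable HDFS capacity (before any non-HDFS overhead).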
=== Jupyter Notebook ===
Most users access the LBD cluster via Jupyter notebooks.
=== Example code ===
A short example using Spark and Python, estimating pi with a Monte Carlo simulation:
<code python>
import pyspark
import random

# Create a Spark context; the application name is only a label.
sc = pyspark.SparkContext(appName="Pi")

num_samples = 10000

# A point (x, y) drawn uniformly from the unit square lies inside the
# quarter circle of radius 1 with probability pi/4.
def inside(p):
    x, y = random.random(), random.random()
    return x*x + y*y < 1

# Count the samples that fall inside the quarter circle, in parallel.
count = sc.parallelize(range(0, num_samples)).filter(inside).count()
pi = 4 * count / num_samples
print(pi)
sc.stop()
</code>
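
With num_samples = 10000 the printed estimate is typically close to 3.14; since this is a Monte Carlo estimate, its error shrinks roughly with the square root of the number of samples, so larger values of num_samples give a more accurate (and more parallel) result.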