Notes on Hadoop
Jun. 18th, 2015 09:37 pm

Much data processing is done on relational DBs: data -> SQL DB -> reporting.
Limitations:
 - the data must fit a relational structure
 - I/O-heavy to ingest
 - data warehousing and reporting often don't need ACID guarantees
 - storage is expensive
 - scaling is vertical: to get faster, buy faster storage or CPUs
 - slow for large amounts of data
 
Hadoop pieces:
 - HDFS file system: files are split into blocks scattered over multiple servers
 - MapReduce: a job tracker farms out map and reduce tasks to nodes in the cluster
 - Flume: a service for collecting logs and ingesting them into HDFS
 - Sqoop: bulk import from relational databases (over JDBC) into HDFS
 - the hadoop command has various subcommands, e.g. for storing and retrieving HDFS files
 - Hive: SQL-like query language that is compiled to MapReduce jobs
 - Pig: scripting language for dataflow
 - Impala: SQL-like query engine, but its daemons run at the same level as MapReduce rather than compiling to it
 - HBase: key/value store built atop HDFS
 - Spark: in-memory Hadoop; it keeps intermediate results in memory to avoid hitting the disk
 - Oozie: workflow manager/scheduler; define a workflow as a DAG of jobs
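The HDFS idea above can be sketched in a few lines of plain Python. This is a toy illustration, not the real HDFS API: the block size, node names, and round-robin placement are all made up for the example (real HDFS defaults to 128 MB blocks and places replicas with rack awareness).

```python
# Toy sketch of HDFS-style storage: cut a file into fixed-size blocks and
# assign each block to several data nodes. All names here are hypothetical.
BLOCK_SIZE = 4          # bytes, for demonstration; real HDFS uses ~128 MB
REPLICATION = 3         # HDFS's default replication factor
NODES = ["node1", "node2", "node3", "node4", "node5"]

def place_blocks(data: bytes):
    """Split data into blocks and pick REPLICATION nodes for each block."""
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    placement = {}
    for n, _block in enumerate(blocks):
        # naive round-robin placement; real HDFS also considers racks and load
        placement[n] = [NODES[(n + r) % len(NODES)] for r in range(REPLICATION)]
    return blocks, placement

blocks, placement = place_blocks(b"hello hadoop world")
```

Losing one node loses only some replicas of some blocks, which is why cheap commodity disks are enough.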
 
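The MapReduce model that Hive and Pig compile down to can also be sketched without Hadoop: a map phase emits key/value pairs, a shuffle groups them by key, and a reduce phase aggregates each group. This is the classic word-count example in pure Python, not the Hadoop Java API.

```python
# Word count in the MapReduce style (pure-Python sketch, not the Hadoop API).
from collections import defaultdict

def map_phase(line: str):
    # map: emit a (word, 1) pair for every word in the input line
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # shuffle: group all values by key, as the framework does between phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # reduce: aggregate each key's values; here, sum the counts
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["the quick fox", "the lazy fox"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
# counts == {"the": 2, "quick": 1, "fox": 2, "lazy": 1}
```

On a real cluster the map and reduce calls run on the nodes holding the data blocks, so the computation moves to the data instead of the data moving to the computation.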
Our cluster: 10 nodes; 100 TB raw space -> 30 TB of HDFS space.
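The roughly 3:1 raw-to-usable ratio is about what HDFS's default replication factor of 3 predicts; a quick check (the remaining gap presumably goes to overhead and non-HDFS use, which is an assumption on my part):

```python
# Back-of-the-envelope check of the cluster numbers: every block is stored
# 3 times by default, so raw capacity divides by ~3.
raw_tb = 100
replication = 3
usable_tb = raw_tb / replication
print(f"{usable_tb:.1f} TB usable")  # ~33.3 TB vs. the quoted 30 TB
```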