Notes on Hadoop
Jun. 18th, 2015 09:37 pm

Much processing is done on relational DBs: data -> SQL DB -> reporting
Limitations:
- the structure must be relational
- I/O heavy to ingest
- data warehousing and reporting often don't need full ACID guarantees
- storage is expensive
- vertical scaling only: to go faster, buy faster storage or CPUs
- slow for large amounts of data
The Hadoop pieces:
- HDFS file system: files are split into blocks scattered over multiple servers
- MapReduce: job tracker farms out work to nodes in a cluster
- Flume: a custom ingester for logs
- Sqoop: import and ingestion from relational databases (via JDBC)
- the hadoop command has various subcommands: can store/retrieve HDFS files
- Hive: SQL-like query language that is compiled into MapReduce jobs
- Pig: scripting language for dataflow
- Impala: SQL-like query engine; runs its own daemons alongside MapReduce rather than compiling to it, for lower latency
- HBase: key/value store built atop HDFS
- Spark: in-memory processing on Hadoop; keeps working data in RAM to avoid hitting the disk
- Oozie: workflow manager/scheduler; define a DAG for workflow
Our cluster: 10 nodes; 100 TB raw disk -> ~30 TB of usable HDFS space.