Notes on Hadoop
Jun. 18th, 2015 09:37 pm

Much data processing is done on relational DBs: data -> SQL DB -> reporting.
Limitations:
 - the data must fit a relational structure
 - I/O-heavy to ingest
 - data warehousing and reporting often don't need ACID guarantees
 - storage is expensive
 - scaling is vertical: to get faster, buy faster storage or CPUs
 - slow for large amounts of data
 
Hadoop pieces:
 - HDFS file system: files are split into blocks scattered over multiple servers
 - MapReduce: a job tracker farms out map and reduce tasks to nodes in the cluster
 - Flume: a service for collecting logs and ingesting them into HDFS
 - Sqoop: bulk import from relational databases (over JDBC) into HDFS
 - the hadoop command has various subcommands, e.g. for storing and retrieving HDFS files
 - Hive: SQL-like query language that is compiled to MapReduce jobs
 - Pig: scripting language for dataflow
 - Impala: SQL-like query engine, but its daemons run at the same level as MapReduce rather than compiling to it
 - HBase: key/value store built atop HDFS
 - Spark: in-memory Hadoop; it keeps intermediate results in memory to avoid hitting the disk
 - Oozie: workflow manager/scheduler; define a workflow as a DAG of jobs
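The HDFS idea above can be sketched in a few lines of plain Python. This is a toy illustration, not the real HDFS API: the block size, node names, and round-robin placement are all made up for the example (real HDFS defaults to 128 MB blocks and places replicas with rack awareness).

```python
# Toy sketch of HDFS-style storage: cut a file into fixed-size blocks and
# assign each block to several data nodes. All names here are hypothetical.
BLOCK_SIZE = 4          # bytes, for demonstration; real HDFS uses ~128 MB
REPLICATION = 3         # HDFS's default replication factor
NODES = ["node1", "node2", "node3", "node4", "node5"]

def place_blocks(data: bytes):
    """Split data into blocks and pick REPLICATION nodes for each block."""
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    placement = {}
    for n, _block in enumerate(blocks):
        # naive round-robin placement; real HDFS also considers racks and load
        placement[n] = [NODES[(n + r) % len(NODES)] for r in range(REPLICATION)]
    return blocks, placement

blocks, placement = place_blocks(b"hello hadoop world")
```

Losing one node loses only some replicas of some blocks, which is why cheap commodity disks are enough.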
 
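The MapReduce model that Hive and Pig compile down to can also be sketched without Hadoop: a map phase emits key/value pairs, a shuffle groups them by key, and a reduce phase aggregates each group. This is the classic word-count example in pure Python, not the Hadoop Java API.

```python
# Word count in the MapReduce style (pure-Python sketch, not the Hadoop API).
from collections import defaultdict

def map_phase(line: str):
    # map: emit a (word, 1) pair for every word in the input line
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # shuffle: group all values by key, as the framework does between phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # reduce: aggregate each key's values; here, sum the counts
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["the quick fox", "the lazy fox"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
# counts == {"the": 2, "quick": 1, "fox": 2, "lazy": 1}
```

On a real cluster the map and reduce calls run on the nodes holding the data blocks, so the computation moves to the data instead of the data moving to the computation.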
Our cluster: 10 nodes; 100 TB raw space -> 30 TB of HDFS space.
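The roughly 3:1 raw-to-usable ratio is about what HDFS's default replication factor of 3 predicts; a quick check (the remaining gap presumably goes to overhead and non-HDFS use, which is an assumption on my part):

```python
# Back-of-the-envelope check of the cluster numbers: every block is stored
# 3 times by default, so raw capacity divides by ~3.
raw_tb = 100
replication = 3
usable_tb = raw_tb / replication
print(f"{usable_tb:.1f} TB usable")  # ~33.3 TB vs. the quoted 30 TB
```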