akuchling: Sherlock Hemlock (Default)

Traditionally, much data processing is done on relational DBs: data -> SQL DB -> reporting


Limitations:
  • the data must fit a relational structure
  • ingestion is I/O-heavy
  • data warehousing and reporting often don't need ACID guarantees
  • storage is expensive
  • scaling is vertical: to get faster, buy faster storage or CPUs
  • slow for large amounts of data
Hadoop components:
  • HDFS file system: files are split into blocks scattered across multiple servers
  • MapReduce: job tracker farms out work to nodes in a cluster
  • Flume: a custom ingester for logs
  • Sqoop: imports and ingests data from relational databases over JDBC
  • the hadoop command has various subcommands, e.g. for storing and retrieving HDFS files
  • Hive: SQL-like query language that is compiled into MapReduce jobs
  • Pig: scripting language for dataflow
  • Impala: SQL-like query language, but runs as an agent at the same level as MapReduce
  • HBase: key/value store built atop HDFS
  • Spark: in-memory Hadoop; it tries to avoid hitting the disk
  • Oozie: workflow manager/scheduler; define a DAG for workflow
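The MapReduce model above can be sketched in a few lines. This is a toy in-process simulation (a hypothetical word-count example, not actual Hadoop code): the mapper emits (key, value) pairs, a shuffle step groups values by key, and the reducer aggregates each group — on a real cluster the job tracker distributes these phases across nodes.

```python
from collections import defaultdict

def map_phase(records):
    # Mapper: emit a (word, 1) pair for every word in every input line.
    for line in records:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group values by key (Hadoop does this between map and reduce).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["hadoop stores blocks", "hadoop schedules map and reduce tasks"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["hadoop"])  # → 2
```

Hive and Pig are, in effect, front-ends that compile higher-level queries down to chains of map/shuffle/reduce steps like these.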
Cloudera is a distro for Hadoop. 

Our cluster: 10 nodes; 100 TB raw space -> 30 TB of usable HDFS space (HDFS replicates each block 3×).
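A back-of-the-envelope check of those numbers (assuming HDFS's default replication factor of 3; the gap down to 30 TB would be other overhead or reserved space):

```python
RAW_TB = 100
REPLICATION = 3  # HDFS default: each block is stored on three nodes

# Usable capacity is raw capacity divided by the replication factor.
usable_tb = RAW_TB / REPLICATION
print(round(usable_tb))  # ~33 TB before other overhead; ~30 TB observed
```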


September 2016

S M T W T F S
    123
45678910
11121314151617
18192021222324
2526272829 30 

Syndicate

RSS Atom

Most Popular Tags

Page Summary

Style Credit

Expand Cut Tags

No cut tags
Page generated Jun. 22nd, 2017 02:05 pm
Powered by Dreamwidth Studios