Much processing is done on relational DBs: data -> SQL DB -> reporting


Limitations:
  • the structure must be relational
  • I/O heavy to ingest
  • data warehousing and reporting often don't care about ACID
  • storage is expensive
  • vertical scaling only: to get faster, buy faster storage or CPUs
  • slow for large amounts of data

Hadoop components:
  • HDFS file system: files are blocks scattered over multiple servers
  • MapReduce: a job tracker farms out work to nodes in the cluster (see the word-count sketch after this list)
  • Flume: service for ingesting streams of log data
  • Sqoop: bulk import from relational databases (via JDBC)
  • the hadoop command has various subcommands; e.g. hadoop fs stores/retrieves HDFS files
  • Hive: SQL-like query language that is compiled into MapReduce jobs
  • Pig: scripting language for dataflow
  • Impala: SQL-like query language, but runs as its own daemons on the cluster nodes, alongside MapReduce rather than on top of it
  • HBase: key/value store built atop HDFS
  • Spark: in-memory processing engine; it keeps working data in memory to avoid hitting the disk
  • Oozie: workflow manager/scheduler; define a DAG for workflow
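
As a concrete illustration of the MapReduce model, here's a minimal word-count sketch using Hadoop Streaming, which lets any program that reads stdin and writes stdout act as the mapper or reducer. The HDFS paths and the location of the streaming jar below are hypothetical; they vary by installation.

mapper.py -- emit a (word, 1) pair for every word seen:

    #!/usr/bin/env python
    import sys

    for line in sys.stdin:
        for word in line.split():
            # Streaming jobs exchange tab-separated key/value lines.
            print('%s\t%d' % (word, 1))

reducer.py -- Hadoop sorts the mapper output by key before the reduce
phase, so identical words arrive as contiguous runs that groupby can sum:

    #!/usr/bin/env python
    import sys
    from itertools import groupby

    def pairs(stream):
        # Parse "word<TAB>count" lines back into (word, int) tuples.
        for line in stream:
            word, count = line.rstrip('\n').split('\t', 1)
            yield word, int(count)

    for word, group in groupby(pairs(sys.stdin), key=lambda kv: kv[0]):
        print('%s\t%d' % (word, sum(count for _, count in group)))

To run it (the streaming jar's path depends on the distro):

    hadoop jar /path/to/hadoop-streaming.jar \
        -input /user/me/input -output /user/me/wordcount \
        -mapper mapper.py -reducer reducer.py \
        -file mapper.py -file reducer.py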
Cloudera provides a packaged distribution of Hadoop.

Our cluster: 10 nodes; 100 TB of raw space -> 30 TB of usable HDFS space.
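
Quick sanity check on those numbers (a sketch; the 3x factor is HDFS's default block replication, and the remaining gap is presumably reserved for non-HDFS overhead):

    raw_tb = 100          # raw disk across the 10 nodes
    replication = 3       # HDFS stores each block on 3 nodes by default
    usable_tb = raw_tb / replication
    print(usable_tb)      # ~33 TB; the quoted 30 TB leaves some headroom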

