[personal profile] akuchling

Much processing is done on relational DBs: data -> SQL DB -> reporting


Limitations:
  • the structure must be relational
  • I/O heavy to ingest
  • data warehousing and reporting often don't need ACID guarantees, so that overhead is wasted
  • storage is expensive
  • vertical scaling: to get faster, use faster storage or CPU
  • slow for large amounts of data
Hadoop components:
  • HDFS: distributed file system; files are split into blocks that are scattered over multiple servers
  • MapReduce: job tracker farms out work to nodes in a cluster
  • Flume: a custom ingester for logs
  • Sqoop: bulk import/export between relational databases (via JDBC) and HDFS
  • the hadoop command has various subcommands, e.g. hadoop fs -put / -get to store and retrieve HDFS files
  • Hive: SQL-like query language that is compiled to MapReduce jobs
  • Pig: scripting language for dataflow
  • Impala: SQL-like query engine; it runs its own daemons on the cluster nodes, alongside MapReduce rather than on top of it
  • HBase: key/value store built atop HDFS
  • Spark: in-memory Hadoop; it tries to avoid hitting the disk
  • Oozie: workflow manager/scheduler; define a DAG for workflow
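
To make the MapReduce item above concrete, here is a minimal single-process sketch of the map / shuffle / reduce flow, using word counting as the example. It is plain Python, not the Hadoop API: the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration, and a real cluster would run the map and reduce steps on different nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in order over an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    sample = ["hadoop stores blocks", "hadoop runs jobs"]
    print(word_count(sample))
```

The point of the split is that each mapper only needs its own slice of the input and each reducer only needs the values for its own keys, which is what lets the job tracker farm the work out.
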
Cloudera is a distro for Hadoop. 

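Our cluster: 10 nodes; 100 TB of raw disk -> 30 TB of usable HDFS space, roughly what you'd expect if every block is stored three times (the HDFS default replication factor).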
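
The raw-to-usable ratio can be sanity-checked with a little arithmetic. This is a rough sketch, not how the cluster was actually sized: the replication factor of 3 is the HDFS default, but the 10% reserve for non-HDFS use is purely an assumption picked for illustration, and usable_hdfs_tb is a hypothetical helper name.

```python
def usable_hdfs_tb(raw_tb, replication=3, reserved_fraction=0.10):
    """Estimate usable HDFS capacity in TB.

    raw_tb: total raw disk across all nodes.
    replication: number of copies HDFS keeps of each block (default is 3).
    reserved_fraction: assumed share of raw disk kept out of HDFS
        (OS, logs, scratch space); a guess, not a measured value.
    """
    if replication < 1:
        raise ValueError("replication must be at least 1")
    if not 0 <= reserved_fraction < 1:
        raise ValueError("reserved_fraction must be in [0, 1)")
    return raw_tb * (1 - reserved_fraction) / replication


if __name__ == "__main__":
    # Replication alone, then replication plus the assumed reserve.
    print(round(usable_hdfs_tb(100, reserved_fraction=0.0), 1))
    print(round(usable_hdfs_tb(100), 1))
```

Replication alone takes 100 TB down to about a third; the remaining gap depends on how much disk is held back, which the notes don't say.
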

