akuchling: Sherlock Hemlock (Default)
2016-09-30 04:18 pm

Notes on Apache Zookeeper

 Apache Zookeeper (http://zookeeper.apache.org/) is a specialized filesystem-like database.
Database features
The database consists of znodes, which are identified by a filesystem-like path, e.g. "/prod", "/solr/status".  Each znode can contain a chunk of content, represented as a string.  It's expected that this content is relatively small, e.g. a few KB and not 600MB.   Unlike a filesystem, a znode can have content and still have children, i.e. /solr can have the value 'active', yet /solr/status, /solr/nodes, etc. can also exist.
It's possible to create an ACL controlling who can read/write a particular node, and also to create a watch that's triggered when a particular node is modified.  This can be used to build data structures inside Zookeeper such as locks or queues.  For example, to create a barrier you might pick a znode path such as /barrier/barrier-id and issue an exists(path) call that also sets a watch.  If exists() returns False, the barrier isn't held: create the znode and go do your computation.  If exists() returns True, wait for a watch event telling you the barrier has been deleted, and then try the exists() again.
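The barrier recipe above boils down to exists-with-a-watch plus create/delete. Here's a runnable sketch of that control flow against a minimal in-memory stand-in for the client — FakeZK is made up for illustration; the real kazoo/ZooKeeper API differs in details:

```python
import threading

class FakeZK:
    """Minimal in-memory stand-in for a ZooKeeper client (hypothetical):
    znodes live in a dict, and exists() can register a one-shot watch
    that fires when the node is deleted."""
    def __init__(self):
        self.nodes = {}
        self.watches = {}              # path -> list of threading.Event

    def exists(self, path, watch=None):
        if watch is not None:
            self.watches.setdefault(path, []).append(watch)
        return path in self.nodes

    def create(self, path, value=b""):
        self.nodes[path] = value

    def delete(self, path):
        del self.nodes[path]
        for event in self.watches.pop(path, []):
            event.set()                # fire the deletion watches

def enter_barrier(zk, path):
    """The loop described above: if the barrier znode is absent, create
    it and proceed; otherwise wait for the deletion watch and retry."""
    while True:
        deleted = threading.Event()
        if not zk.exists(path, watch=deleted):
            zk.create(path)
            return
        deleted.wait()

def leave_barrier(zk, path):
    zk.delete(path)
```

One caveat the sketch glosses over: two clients can both see the node as absent and race to create it. Real ZooKeeper resolves this because create() fails for the loser, who then goes back to waiting.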
The database is kept in memory while the server is running.  Changes are logged to disk, so the server can be stopped and restarted as necessary.
Distributed operation
Zookeeper supports distributed operation.  You can configure N zookeeper nodes into a set. Updates made on one node will become visible to clients of all the other nodes in the set.  This is coordinated by a leader node that's elected when the nodes are started.  If the leader node suddenly dies, a new leader is automatically elected.
It was straightforward to set up a 3-node set.  Each node has its own data directory and its own config file.  All the config files have a bunch of settings describing the nodes within the set: 
server.1 = node1:2888:3888
server.2 = node2:2888:3888
server.3 = node3:2888:3888
Each data directory contains a "myid" file that gives the index (1, 2, 3) of which node this is.  Once you start all the servers, they'll elect a leader that will propagate write operations to the other servers.  Clients can connect to any of the servers and should see the same database.
There's a command-line tool, zkCli.sh, that allows running commands on a Zookeeper database.  Example invocation:

zkCli.sh -server zknodeN:2181 ls /
You could use Zookeeper's database to store configuration info, storing data in znodes instead of a set of nested dictionaries as you would in JSON.  This would be primarily a read-only database, but Zookeeper can also be used for applications that need frequent updates.  For example, I've seen references to  using Zookeeper as a status board for a set of servers.  You might run 'apachectl status' on server prod1 every minute and store the output in the znode /apache/prod1.  A monitoring process could then set a watch on /apache and its children, get triggered every time a server updates its status, and then react when one changes from 'running' to 'stopped'.
Zookeeper comes with Java and C APIs.  There's a Python wrapper for the C API, but a more complete package seems to be kazoo (http://kazoo.readthedocs.org/en/latest/).  The README for kazoo says memory management for the C API is difficult, so kazoo is pure Python and supports several different event loops such as gevent and Twisted.  kazoo also has implementations of some common usage patterns such as locks, queues, and elections.
2015-06-18 09:37 pm

Notes on Hadoop

Much processing is done on relational DBs: data -> SQL DB -> reporting

  • the structure must be relational
  • I/O heavy to ingest
  • data warehousing and reporting often don't need full ACID guarantees
  • storage is expensive
  • vertical scaling: to get faster, use faster storage or CPU
  • slow for large amounts of data
Hadoop components:
  • HDFS file system: files are blocks scattered over multiple servers
  • MapReduce: job tracker farms out work to nodes in a cluster
  • Flume: a custom ingester for logs
  • Sqoop: JDBC-based import/export between relational databases and HDFS
  • the hadoop command has various subcommands: can store/retrieve HDFS files
  • Hive: SQL-like query language that is compiled into MapReduce jobs
  • Pig: scripting language for dataflow
  • Impala: SQL-like query engine; runs as its own daemons at the same level as MapReduce rather than compiling to it
  • HBase: key/value store built atop HDFS
  • Spark: in-memory Hadoop; it tries to avoid hitting the disk
  • Oozie: workflow manager/scheduler; define a DAG for workflow
Cloudera is a distro for Hadoop. 

Our cluster: 10 nodes; 100 TB raw space -> 30 TB of usable HDFS space.

2015-02-03 03:08 pm

Notes on Redis

Work has resolved the question of what NoSQL database I should learn: I'm being given a project that will involve Redis as well as PostgreSQL.

Available operations:

  • get the value corresponding to a string-valued key
  • MGET key1 key2 key3: get multiple values, returning them as an array
  • set the value corresponding to a key. Variants: set the value only if the key is already present (XX) or only if it isn't (NX).
  • DEL key: delete the value for a key.
  • EXPIRE key delay: cause the key to expire and disappear after 'delay' seconds. Resolution is 1 second; PEXPIRE takes milliseconds. Can also specify the delay when setting the key: SET key value EX delay.
  • TTL key: returns the # of seconds the key has left before it expires.
  • TYPE key: returns the type of the corresponding value
  • EXISTS key: returns 1/0 based on whether the key exists.
  • Increment/decrement counters atomically. INCR key increments the value by 1; INCRBY key n increments the value by n.
  • GETSET key value: stores a new value and returns the previous value.

Data types for values:

Linked lists:

  • An ordered sequence of values, written as 1,2,3,4.
  • LLEN key returns the length.
  • LPUSH key value adds a value at the head of the list; RPUSH key value adds it at the tail. Both run in O(1) time, since the list is a doubly linked list.
  • There are corresponding LPOP and RPOP commands.
  • LRANGE key start end returns an array containing the corresponding values; this is O(n) in the number of elements returned.
  • Capped lists: LTRIM key start end keeps only the specified range and discards everything else. To keep the 10 most recent items, do LTRIM key 0 9.
  • Blocking pops: BLPOP and BRPOP block if the list is empty, or can take a timeout. Useful for implementing queues. There's even BRPOPLPUSH srckey destkey, which pops the rightmost item from srckey and pushes it onto destkey.
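The capped-list pattern (LPUSH followed by LTRIM) maps directly onto a bounded deque. This is a pure-Python model of the semantics, not a Redis client; the CappedList name is made up:

```python
from collections import deque

class CappedList:
    """Models LPUSH + 'LTRIM key 0 (n-1)': keep only the n most
    recently pushed items, newest first."""
    def __init__(self, n):
        self.items = deque(maxlen=n)   # oldest items fall off the right

    def lpush(self, value):
        self.items.appendleft(value)

    def lrange(self, start, end):
        # Redis's LRANGE end index is inclusive.
        return list(self.items)[start:end + 1]

recent = CappedList(10)
for i in range(25):
    recent.lpush(i)
print(recent.lrange(0, 9))   # the 10 most recent values, newest first
```

deque(maxlen=n) silently discards from the opposite end on overflow, which is exactly what the LTRIM step accomplishes in Redis.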


Hashes: maps of field names to values, stored under a single key.

  • HSET key attr1 value1: set a single field in a hash. For multiple fields, use HMSET key attr1 value1 attr2 value2 ...
  • HGET key attrname: retrieve a key within a hash. HGETALL key returns a flat array of [attr1, value1, attr2, ...].
  • Other commands also have hash variants: HINCRBY, HEXISTS.


Sets: unordered collections of items.

  • Commands include SADD key elem to add an element, SISMEMBER key elem to check membership, SPOP key to remove and return a random element, and SCARD key to return the number of elements ('cardinality').
  • Intersection: SINTER key1 key2 ....
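These commands map directly onto Python's built-in set type; a quick analogue of the ones above (not talking to an actual Redis server):

```python
online = {"alice", "bob", "carol"}    # SADD online alice bob carol
paying = {"bob", "dave"}              # SADD paying bob dave

print("bob" in online)                # SISMEMBER online bob -> True
print(len(online))                    # SCARD online -> 3
print(online & paying)                # SINTER online paying -> {'bob'}
popped = online.pop()                 # SPOP-like: remove & return some element
```

One difference worth noting: Python's set.pop() removes an arbitrary element, while Redis's SPOP is explicitly random.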

Sorted sets: each element has a floating-point value called its score, and the set is kept sorted by score in ascending order.

  • ZADD key score1 elem1 score2 elem2 .... Updates are O(log N).
  • ZRANGE key start end and ZREVRANGE key start end return slices in sorted or reversed order, returning just the elements. Add "WITHSCORES" to also return the scores.
  • To search by score, ZRANGEBYSCORE key min max. min and max are inclusive; you can say -inf or +inf. ZREMRANGEBYSCORE would remove the matching elements.
  • Ranking: ZRANK key elem returns the position of the element in the sorted order.
  • Redis 2.8 can also search by the elements themselves (meaningful when elements share the same score): ZRANGEBYLEX key [B [P returns elements between B and P.
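The sorted-set commands above can be modeled in a few lines of pure Python (a sketch, not a Redis client; SortedSetModel is a made-up name). Ties are broken lexicographically, as Redis does for equal scores:

```python
class SortedSetModel:
    """Pure-Python model of a Redis sorted set: a dict of
    element -> float score, viewed in ascending score order."""
    def __init__(self):
        self.scores = {}

    def zadd(self, *pairs):
        # zadd(score1, elem1, score2, elem2, ...), like ZADD key s1 e1 s2 e2
        for score, elem in zip(pairs[::2], pairs[1::2]):
            self.scores[elem] = float(score)

    def _sorted(self):
        # Sort by score, then lexicographically for equal scores.
        return sorted(self.scores, key=lambda e: (self.scores[e], e))

    def zrange(self, start, end):          # end index inclusive, like Redis
        return self._sorted()[start:end + 1]

    def zrangebyscore(self, lo, hi):       # min and max inclusive
        return [e for e in self._sorted() if lo <= self.scores[e] <= hi]

    def zrank(self, elem):
        return self._sorted().index(elem)

z = SortedSetModel()
z.zadd(300, "carol", 100, "alice", 200, "bob")
print(z.zrange(0, 2))             # ['alice', 'bob', 'carol']
print(z.zrangebyscore(100, 200))  # ['alice', 'bob']
print(z.zrank("bob"))             # 1
```

The real thing uses a skip list plus hash table so updates are O(log N); this model re-sorts on every read, which is fine for illustration only.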

Bitmaps: bit operations are defined on strings. Size limit is 512 MB = 2**32 bits.

  • SETBIT key bitnum 0/1 sets a bit, and GETBIT key bitnum returns it.
  • Bit operations over entire strings: BITOP AND|OR|XOR|NOT destkey srckey1 srckey2 ... - for NOT, only one srckey can be supplied.
  • BITCOUNT key counts set bits, and BITPOS key 1/0 returns the index of the first 1 or 0 bit (you can also constrain this to a range).
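The bit commands are easy to model over a growable byte buffer; a pure-Python sketch of the semantics (Bitmap is a made-up name, and bit 0 is the most significant bit of byte 0, matching Redis):

```python
class Bitmap:
    """Pure-Python model of SETBIT / GETBIT / BITCOUNT / BITPOS."""
    def __init__(self):
        self.buf = bytearray()

    def setbit(self, bitnum, value):
        byte, bit = divmod(bitnum, 8)
        if byte >= len(self.buf):              # grow on demand, like Redis
            self.buf.extend(b"\x00" * (byte + 1 - len(self.buf)))
        mask = 0x80 >> bit                     # bit 0 = MSB of the byte
        if value:
            self.buf[byte] |= mask
        else:
            self.buf[byte] &= ~mask

    def getbit(self, bitnum):
        byte, bit = divmod(bitnum, 8)
        if byte >= len(self.buf):
            return 0
        return (self.buf[byte] >> (7 - bit)) & 1

    def bitcount(self):
        return sum(bin(b).count("1") for b in self.buf)

    def bitpos(self, value):
        for i in range(len(self.buf) * 8):
            if self.getbit(i) == value:
                return i
        return -1

bm = Bitmap()
bm.setbit(7, 1)
bm.setbit(100, 1)
print(bm.bitcount())   # 2
print(bm.bitpos(1))    # 7
```

A classic use is one bit per user ID for "active today", with BITCOUNT giving daily actives and BITOP AND of two days' bitmaps giving users active on both days.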

HyperLogLogs: probabilistic data structure for counting # of unique items seen. Has a certain amount of error on the resulting value, but is compact and doesn't grow unboundedly.

  • PFADD key elem1 elem2 elem3 ... stores the elements.
  • PFCOUNT key returns the # of unique elements seen.
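To make the idea concrete, here's a toy HyperLogLog — not Redis's implementation, just the core trick: each of m registers records the longest run of leading zero bits seen among the hashes routed to it, and the registers are combined into an estimate (the alpha constant and small-range correction come from the HyperLogLog paper):

```python
import hashlib
import math

class TinyHLL:
    """Toy HyperLogLog with m = 256 registers, for illustration only."""
    M = 256

    def __init__(self):
        self.reg = [0] * self.M

    def pfadd(self, item):
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        idx = h & (self.M - 1)          # low 8 bits choose a register
        rest = h >> 8                   # remaining 56 bits
        rank = 56 - rest.bit_length() + 1   # leading zeros + 1
        self.reg[idx] = max(self.reg[idx], rank)

    def pfcount(self):
        alpha = 0.7213 / (1 + 1.079 / self.M)
        est = alpha * self.M * self.M / sum(2.0 ** -r for r in self.reg)
        zeros = self.reg.count(0)
        if est <= 2.5 * self.M and zeros:   # small-range correction
            est = self.M * math.log(self.M / zeros)
        return int(est)

hll = TinyHLL()
for i in range(10000):
    hll.pfadd("item-%d" % i)
    hll.pfadd("item-%d" % i)   # duplicates don't inflate the count
print(hll.pfcount())           # roughly 10000
```

Note the memory story: 256 small registers regardless of how many items are added, versus an exact set that grows with every unique item. The price is the error band on the estimate.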

2015-01-01 01:28 pm

Notes on Backbone.js

There are many JS frameworks out there, and I want to learn one. I chose Backbone.js. (Angular is probably more popular, but also facing a Python 3-ish transition, so I let it slide.)

Backbone.js: one of a bunch of MVC-ish frameworks. Others include Angular, Spine, JavaScript MVC, Ember, Knockout.

TodoMVC is a to-do list app built in several different frameworks so you can compare them.

Backbone provides:

Models: represent domain objects.

  Wine = Backbone.Model.extend();   // Creates a model
  w = new Wine({prop1: val1, prop2: val2, ...})
Collections: a bunch of Model objects. Will apparently infer a REST-ish backend automatically.
 WineList = Backbone.Collection.extend({
   model: Wine,         // note: lowercase 'model'
   url: "http:..."      // URL for the REST back-end
 });
  wines = new WineList([list of objects used to create Wine instances]);
  wines.each(function (w) {...});

Views: render a model or collection into the DOM.

  HomeView = Backbone.View.extend();
  view = new HomeView();

Templates: Backbone uses Underscore for templating by default.

Routes: recording app state in the URL, and moving between different states.

  wineApp = new AppRouter();
  wineApp.routes = dict of string to method name


View.events = dict of event name to method name called when event is triggered.

2014-11-24 09:19 pm

Notes on Go

Every Go file starts with "package <pkgname>".

package main

func main() {
    ...
}
QUESTION: are braces ever optional?  (Answer: no - 'if', 'for', and function bodies always require braces.)

Variables are declared as: var <name> <type> = <value>, e.g. var a string = "foo".

Or, automatically infer the type: a := "foo"

Basic types: int, uint.  int{8,16,32,64}, uint{8,16,32,64}.  uintptr.  float32, float64, complex64, complex128.  string.  There are no implicit numeric conversions; convert explicitly with T(v), e.g. float64(i).  (QUESTION: is there a library for a decimal type?  math/big provides arbitrary-precision big.Int and big.Rat, but there's no decimal type in the stdlib.)

Data structures: array (fixed-size; like C arrays; rarely seen in APIs); slice (views into arrays; used like Python lists; widely used); map (like a dict); struct; pointer; function; interface; channel.

Type aliases:  type Prefix string

Structures: type Prefix struct { content string; n int }  (fields are separated by newlines or semicolons, not commas)

Creating a structure: s := &Prefix{"s", 1} or Prefix{content: "s", n: 1}.

Types also follow the function name:

func <name>(<params>) <type> { ... }

Defining methods: 
func (self *Prefix) Fix(...) { ... }

Invoked as prefix.Fix(...).  Go convention isn't to use 'self', but to use something relevant to the type, e.g. 'pref' for 'Prefix'.  Can attach methods to any user-defined type, even ones that are just alias for 'int32'.

Visibility: identifiers beginning w/ capital letters (e.g. Printf) are exported (public); lowercase means package-private.  This applies not just to methods but to functions, struct fields, and any other top-level name.

OO-like features: there is no idea of inheritance.  Instead there's a short-cut for doing composition (embedding):

type User3 struct {
  uid Uid       // regular field
  PrivilegeSet  // anonymous (embedded) field
}

All of the methods of the anonymous PrivilegeSet are promoted to User3: Go automatically forwards calls to the embedded value.

Interface definitions:

type PrivilegeSet interface {
  HasPrivilege(name string) bool
  AddPrivilege(name string)
}

Control flow

C-like statements: if, for, switch.

No 'while'; write 'for cond { ... }' (or 'for { ... }' for an infinite loop) instead.  No exceptions: use multiple return values (see below).

Functions are first-class and support closures.

No try...finally: instead, use 'defer'.  e.g.  f, err := os.Open(...) ; defer f.Close() .  f.Close() will be executed on leaving the function.

Panic/recover: more like exception-raising, but only to be used for serious problems (e.g. out of memory).  panic(value) triggers a panic; calls deferred functions up the call stack.  These functions can call recover() and will get the value passed to panic().

Error handling: functions can return multiple values.  (Arbitrary N-tuples, not just pairs; the (result, error) pair is just the common convention.)

passwd, err := ioutil.ReadFile(...)
if err != nil {
  return false, fmt.Errorf("Impossible error: %s", err)
}
return true, nil  // No error

Can use '_' as the don't-care value, e.g. _, err = something().

Channels: act like typed pipes:

func Flying(res chan bool) {
  res <- true     // Sends the value 'true' down the channel.
  res <- false
}

pipe := make(chan bool, 5)   // channel of bool; 5 is the buffer size
go Flying(pipe)              // Starts a goroutine running the function Flying().
result := <-pipe             // Receives a value from the channel; blocks if no value is ready.

Data structures: maps:  

  m := make(map[int]string)  // Maps integers to strings
  m[1] = "a"
  m[2] = "b"
  m[3] = "c"
  // Checking for an element
  s, ok := m[4]
  if !ok { ... record not found ... }
  // Looping over the content
  for k,v := range m { ... }


Testing

There's a 'testing' package that provides a testing.T type with logging methods.  The "go test" subcommand will run tests.  Example:

package s_test

import (
    "strings"
    "testing"
)

func TestIndex(t *testing.T) {
    value := strings.Index("bogus", "foo")
    if value != -1 {
        t.Errorf("Problem with index")
    }
}
Other methods: t.Fatalf("...") ; t.Logf("...") to print a message with '-v'.  t.Skip("Doesn't work on ARM")

t.Parallel() marks a test as safe for parallel execution.

"go test" reports pass/fail.  "go test -cover" reports coverage.  "go test -coverprofile=cover.out" writes output to a file, and then "go tool cover -func=cover.out" will report on it.  "go tool cover -html=cover.out" generates an HTML visualization.

"go test -race <pkgname>" adds instrumentation for checking race conditions, and will report on problems.  "go run -race source.go" will execute a file.  "go build -race" and "go install -race" also work.

"go vet <pkgname>" is a static analyzer that reports problems.

Testing HTTP servers

There's a net/http/httptest package in the stdlib.  Usage is something like:

ts := httptest.NewServer(handler)   // handler is an http.Handler
defer ts.Close()
res, err := http.Get(ts.URL)    // Accesses this test server