Review Board 1.7.22


HBASE-6573: Generic Three-Phase Commit framework

Review Request #6592 - Created Aug. 13, 2012 and updated

Jesse Yates
trunk
HBASE-6573
Reviewers
hbase
jmhsieh, lhofhansl, mbertozzi, tedyu
hbase-git
Distributed 3-Phase Commit framework that leverages ZK to coordinate between the commit coordinator (e.g. master) and the cohort members.

The HBase 3PC is based on using 2PC on either side of the channel (both on the coordinator and cohort members) along with an operation specific error listener to keep track of operation failure.  Implementation-wise, there are really just a couple of main classes: the coordinator, cohortMember and controllers for each. Currently, there controllers are hidden behind an interface with a ZooKeeper implementation. One of the nice things that is adds is the ability to pass an arbitrary exception from one of the cohort members back to the the coordinator and each of the other cohort members (which gets you meaningful error messages on the coordinator and, in the long view, back to the client).

This abstracts out a lot of the work and complexity in snapshots (HBASE-6055) into a small, (relatively) easily tested change. 

3PC is nice because "The basic observation is that in 2PC, while one site is in the “prepared to commit” state, the other may be in either the “commit” or the “abort” state. From this analysis, they developed 3PC to avoid such states and it is thus resilient to such failures." - Wikipedia (http://en.wikipedia.org/wiki/Three-phase_commit_protocol). 
Ran added test 10x locally without failure.
Total:
47
Open:
25
Resolved:
17
Dropped:
5
Status:
From:
Description From Last Updated Status
Why do we need this one line method ? Ted Yu Aug. 14, 2012, 5:38 p.m. Open
@InterfaceAudience.Public @InterfaceStability.Evolving ? Jonathan Hsieh Aug. 17, 2012, 9:59 p.m. Open
Can you add docs summary at this level where you say specify the what the implied order of method calls ... Jonathan Hsieh Aug. 17, 2012, 9:59 p.m. Open
@InterfaceAudience.Public @InterfaceStability.Evolving ? Jonathan Hsieh Aug. 17, 2012, 9:59 p.m. Open
Feels like prepare/commit should manage the latches instead of exposing them. This seems to leave a lot of work for ... Jonathan Hsieh Aug. 17, 2012, 9:59 p.m. Open
should others have timeouts as well? Jonathan Hsieh Aug. 17, 2012, 9:59 p.m. Open
Finsih -> Finish Jonathan Hsieh Aug. 17, 2012, 9:59 p.m. Open
why generics? Why not just an instance? Jonathan Hsieh Oct. 1, 2012, 10:19 p.m. Open
why have this extra layer if only the "3pc" is used? Jonathan Hsieh Oct. 1, 2012, 10:19 p.m. Open
I feel that ExceptionCheckable does not belong in this class. Util classes should be at the bottom of the package ... Jonathan Hsieh Oct. 1, 2012, 10:19 p.m. Open
ExceptionCheckable does not belong in this class. Jonathan Hsieh Oct. 1, 2012, 10:19 p.m. Open
Why not fold in the "TwoPhaseCommit" class? Jonathan Hsieh Oct. 1, 2012, 11:53 p.m. Open
this is not 3pc. There are prepare, pre-commit and commit messages in 3pc and here we only have prepare, commit, ... Jonathan Hsieh Oct. 1, 2012, 11:53 p.m. Open
Why is this a Listen*able* when extending a Listen*er*? Also merge with D3PCErrListener interface? Jonathan Hsieh Oct. 1, 2012, 11:53 p.m. Open
Why not fold in TwoPhaseCommitable? (which already extends Callable and Runnable). Jonathan Hsieh Oct. 1, 2012, 11:53 p.m. Open
Do we need generics here? Jonathan Hsieh Oct. 1, 2012, 11:53 p.m. Open
Merge with 3PCErrListenable? Jonathan Hsieh Oct. 1, 2012, 11:53 p.m. Open
why not merge them into one? Jonathan Hsieh Oct. 1, 2012, 11:53 p.m. Open
again, do we need these generics? Jonathan Hsieh Oct. 1, 2012, 11:53 p.m. Open
You don't return the L anywhere so you should be able to get away with normal polymorphism. No need for ... Jonathan Hsieh Oct. 1, 2012, 11:53 p.m. Open
fold this in to D3PCErrorListener Jonathan Hsieh Oct. 1, 2012, 11:53 p.m. Open
We might be able to merge D3PCCoordinator, coordinator task, coordinator task builder Jonathan Hsieh Oct. 1, 2012, 11:53 p.m. Open
Why the diamond shaped inheritance graph? ZK2PCCohortMemberController implements DCCohortMemberController which implements DCController ZK2PCCohortMemberController extends ZK2PCController whihc implements DCController Does the ... Jonathan Hsieh Oct. 1, 2012, 11:53 p.m. Open
Why do you need a generic here? Doesn't normal polymorphism handle this? Jonathan Hsieh Oct. 1, 2012, 11:53 p.m. Open
Diamond inheritance: ZK2PCCoordinatorController -> ZK2PCCommitController -> ZK2PCController -> DistributedCommitController ZK2PCCoodrinatorController -> DistributedCommitCoordinatorController -> DistributedCommitController Jonathan Hsieh Oct. 1, 2012, 11:53 p.m. Open
Review request changed
Updated (Sept. 22, 2012, 12:23 a.m.)
Rebasing onto the commmitted (to snapshots) error handling framework. Also, includes some simplifications from snapshots that hopefully makes the inheritance a bit simpler (though by no means great).

This is still a bit in flux, but can be considered a stable first cut.
Posted (Oct. 1, 2012, 10:19 p.m.)
Two big points from skimming before going into the details:

1) There again is over engineering in here -- 25 classes and multiple implementations in one patch implementing on algorithm.  I can tell when I have to spend more than an hour drawing out and piecing together the class hierarchy. Inheritance and interfaces where it is not necessary (why have the 3pc extending 2pc?, 2pcErrListenable->3pcErrListenable, etc?).  Please just start by including the concrete implementations that will be used.  I'm not sure what is client and server and see no hooks or docs for where RPC is.  I'm fairly confident that we could have the same function less "distributed" across all these classes.

2) I've spent some time really understanding 2pc and I've concluded that this doesn't really implement 2pc or its common variants (as described here http://www.cs.cmu.edu/~natassa/courses/15-823/F02/papers/p378-mohan.pdf).  The fact that we are using zookeeper as a single place of truth seems fundamentally different than standard 2pc where each cohort/resourcemanager and controller/transactionmanager has its own local wal).  While the flow of the "success" path is similar to 2pc, this has different assumptions and lacks the logging and recovery paths normally associated with traditional 2pc.   Since this isn't 2pc/3pc, lets call it something else so we don't confuse folks.

I'll do another pass after I piece together the hierarchy with more actionable comments.
  1. (1) Agree that its too complex - working on bringing down the level a bit. 
    
    (2) I'd disagree on the single source of truth - the Coordinator keeps track of who should be in/out. ZK is just used as an RPC mechanism and is nice for later doing master-failover and continuing the operation.
  2. (2) I haven't spent a lot of time understanding the use of ZK yet, so I'll get to that when I do my next pass.
hbase-common/src/main/java/org/apache/hadoop/hbase/util/Bytes.java (Diff revision 2)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
note: this is only used in one place... 
why generics?  Why not just an instance?
why have this extra layer if only the "3pc" is used?
I feel that ExceptionCheckable does not belong in this class.  Util classes should be at the bottom of the package dependency list.
ExceptionCheckable does not belong in this class.
Posted (Oct. 1, 2012, 11:53 p.m.)
I haven't gotten to the functional code yet, but worked out a rough class inheritance/structure diagram.  

Ideally, an interface is something that has a suite of related functions.  Here the tendency is to have a plethora of interfaces with single methods.  I consider this an antipattern.  It is better to have a only few interface and leave the implementations empty or throwing something like a NotImplementedException.

Some high level suggestions:
* Interfaces that are just extended by another interface should be merged. 
* Generics that could be replaced with plain polymorphism should be.
* Favor composition over inheritance.  Things that have diamond inheritance relations seem like possible candidates.  

I'm going to take a thwack at trimming this down tomorrow.  
  1. Generally agree with the above. There was some consideration to the difference between 2PC and 3PC though - you don't want to have to deal with the 3PC error handling just for a local 2PC operation. 
    
    This comes into play for snapshots with local operations; they are orchestrated via a local cohort member that is getting notifications from the coordiantor, but running a handful of local 2PC operations that are progressed when all local ops have reached each barrier. However, there are clearly cases where just using a local 2PC would be preferable anyways.
Why not fold in the "TwoPhaseCommit" class?
this is not 3pc.  There are prepare, pre-commit and commit messages in 3pc and here we only have prepare, commit, with a timeout.
Why is this a Listen*able* when extending a Listen*er*?

Also merge with D3PCErrListener interface?
Why not fold in TwoPhaseCommitable? (which already extends Callable and Runnable).
Do we need generics here? 
note. this looks like cohort/resource manager side, not the controller side.  Hopefully I got this right.
Merge with 3PCErrListenable?
why not merge them into one?
again, do we need these generics?
You don't return the L anywhere so you should be able to get away with normal polymorphism.  No need for generics here.
fold this in to D3PCErrorListener
We might be able to merge D3PCCoordinator, coordinator task, coordinator task builder
Why the diamond shaped inheritance graph?

ZK2PCCohortMemberController implements DCCohortMemberController which implements DCController
ZK2PCCohortMemberController extends ZK2PCController whihc implements DCController

Does the 2PCCohortMemberController having a Distributed3PCCohortMember as a generic seems strange to you?
Why do you need a generic here?  Doesn't normal polymorphism handle this?
Diamond inheritance:

ZK2PCCoordinatorController -> ZK2PCCommitController -> ZK2PCController -> DistributedCommitController

ZK2PCCoodrinatorController -> DistributedCommitCoordinatorController -> DistributedCommitController
Posted (Oct. 2, 2012, 5:06 p.m.)
I've started folding things together here: https://github.com/jmhsieh/hbase/tree/3pc  

Posted (Oct. 3, 2012, 12:59 a.m.)
Commenting on bug (and fix) in implementation of CoordinatorTask, found while working on snapshotting. 

Hoping to look at Jon's comments embedded comments the rest of the code soon.
This is missing:
+      // wait for all servers to finish the prepare phase
+      try {
+        this.waitForLatch(this.allServersPreparedOperation, "cohort prepared");
+      } catch (InterruptedException e) {
+        Thread.currentThread().interrupt();
+        throw new DistributedCommitException(e);
+      }

Found it while working through some snapshot stuff and realized that the prepare phase on the coordinator just writes out the prepare node, but then doesn't actually wait for the nodes to prepare. When nodes prepare, prepared(String) counts down the allServersPreparedOperationLatch, so I think this was lost in a rebase.
  1. Actually, I'm wrong there - its just a bit confusing (and worrisome that after a couple weeks of not looking at it, I was confused). Instead the prepare phase is really just there for writing the prepare node, which then counts down the 'allow commit' latch when all the nodes have prepared. When that happens we trigger the commit phase, which posts the commit node in zk, and then waits for all the nodes to finish. Its just the hiding of the latch that got a little funky - probably better naming would resolve this.
Posted (Oct. 4, 2012, 12:08 a.m.)
Google Spanner, a globally-distributed database uses two phase commit.
Two phase commit should suffice in designing this framework, right ?
  1. Yes, for the general case. However, considering the case of globally consistent snapshots, we need to provide an SLA for the regionservers being unavailable, which brings the timers into play. To support the timers, we end up with 3PC.
  2. Spanner I believe uses many instances of paxos for cross data center synchronization consensus, and then uses 2pc to do transactions on top of different paxos decision groups.  
    
    Timers are actually part of normal 2pc's presumed abort optimization.  I've started a deck that almost exhaustively goes through the error handling scenarios, will try to post tomorrow.  I've gotten to the meat of this code now, will hopefully post another review.  
    
    I haven't reviewed the 3pc algorithm/paper with the same rigor, but I'm thinking that the wikipedia entry isn't complete (my bad, may have lead astray), the code labeled 3pc currently is actually closer to the 2pc presumed abort optimization in spirit.  3pc has 3 messaging phases (at first glance the 3pc code above doesn't seem to have) and has a non-blocking property while 2pc may end up in a blocking situation.  
    
    At first glance what is in this patch is essentially using ZK as rpc (seems reasonable) and log (which is a little goofy since ZK is remote), and is thus different from traditional 2pc where each participant (the coordinator/tx mgr and the cohorts/resource mgrs) has its own log.  
    
    Things I going to look for in this implementation is how this infrastructure provide the hooks for triggering cleanup on the cohort abort/no vote cases, coordinator failure scenarios.  It is interesting that ZK is a centralized place for the commit protocols state.  any failed process could recover by looking there, but i'm not sure if we have recovery (I believe we "cleanup"/restart instead)
    
  3. The use of ZK here was with the expectation that in the future we could do master failover and maintain the state of a running operation. All the new master needs to do is check progress in ZK and rebuild any data structures. ZK was/is a bit of a pain to deal with for just straight RPC (both in terms of complexity and latency), but once setup did provide some nice potential for the recovery described above. The current implementation is far more of an RPC mechanism than logging - there is in fact no other logging going on anywhere of progress. 
    
    However, the idea was to put it all behind an interface so people could implement a direct RPC mechanism rather than going through zookeeper. 
    
    It is definitely more similar to 2PC with the abort optimization. I think that 3PC's initial phase would have been way overkill for this, but generally people take 2PC to not have the timeout and its a much larger mouthful as a class name.
    
    WRT recovery, there really isn't any, though you could add some logging, etc. to make it potentially recoverable. The general presumption was that if something crashes it takes a long time to recover, so the global operation should fail if any of the commit members fails; that doesn't take into account cleanup of local changes (resource commitment, etc) that might be necessary in terms of recovering from a crash.
    
    This is more correctly described then as "Distributed, In-Memory Two Phase Commit with a Timeout Optimization".