HBASE-6060: Regions in OPENING state from failed regionservers take a long time to recover
Review Request #5796 - Created July 6, 2012 and updated
This patch works on the issue where SSH (ServerShutdownHandler) and the single region assign can clash; in particular, single region assign retries running alongside SSH could result in a double-assign.

The basic idea is that a region in the OFFLINE state can be retried by single-assign and is not to be assigned by SSH as part of its bulk assign set. To that end, on open region, in the RS, we set the znode to OPENING before returning to the master. On the master side, PENDING_OPEN is now just the narrow window after the return from open region, while we are waiting on the znode callback to set RegionState to OPENING.

We synchronize all of RegionState and add at least one conditional state setting: i.e. don't set a RegionState to PENDING_OPEN if the state is currently OPENING or OPENED. Would like to add more, so that RegionState is where we decide what state can follow from the current one. Also tried to remove and rename methods on RegionState. Did some cleanup of its use throughout AssignmentManager.

TODO: More tests, and corralling of our setting regions to OFFLINE; it happens too often and from too many different angles, which makes it hard to follow what's going on. I am also afraid that a region could be OFFLINE and not be in a state of being assigned (I suppose the timeout monitor will find it, but I need to spend more time looking at the OFFLINE setters to see if any OFFLINEs are being left aside). Also, need to work on bulk assign. It should do the same setting of the znode on open. Need to study socket timeout more, especially around bulk assign, where it's more likely.

M hbase-server/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
Rename of getRegionState, a RegionState-creating method, to createRegionState. This is probably the bulk of the change in this class.
Move setting of PENDING_OPEN to after we call open on the regionserver.
(processServerShutdown): Clears out all state and returns a list of RegionStates that pertain to the dead server, whether they were being carried by the server at the time of expiration or are regions in RIT that were assigned to it and were in the process of being opened.

Synchronized all setting in RegionState, in prep for checking state for legal transitions all in here rather than all over the place.
Added RegionState#setPendingOpen, which will set PENDING_OPEN IFF the region is in OFFLINE state.

M hbase-server/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java
Fixed up some formatting. Moved the bulk of the process method out into private methods that each do one thing, rather than have process do it all.

M hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
Before queuing an OpenRegionHandler in the executor, set its state up in the znode as OPENING. Will fail faster if we can't set the znode.
(transitionZookeeperOfflineToOpening) Added; moved here from OpenRegionHandler.

M hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java
(transitionZookeeperOfflineToOpening) Moved out of here.

M hbase-server/src/test/java/org/apache/hadoop/hbase/master/Mocking.java
Changed what we wait on, now that PENDING_OPEN no longer means the same thing.

M hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestAssignmentManager.java
(TestSSHWhenSourceRSandDestRSInRegionPlanGoneDown) Add Rajesh's test.
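
To illustrate the conditional state setting described for RegionState above, here is a minimal sketch of a synchronized state holder whose setPendingOpen only succeeds from OFFLINE. It is not the patch's code: the class name RegionStateSketch, the reduced State enum, the boolean return values, and the rules chosen for setOpening and setOpened are all assumptions made for illustration. Only the setPendingOpen rule (set IFF OFFLINE, so never over OPENING or OPENED) comes from the description.

```java
/**
 * Hypothetical, simplified stand-in for the master-side RegionState.
 * Not the HBase class; names and transition rules other than
 * setPendingOpen are assumptions for illustration.
 */
final class RegionStateSketch {
  /** Reduced set of states; the real class has more. */
  enum State { OFFLINE, PENDING_OPEN, OPENING, OPENED }

  private State state = State.OFFLINE;

  synchronized State getState() {
    return state;
  }

  /** Unconditional reset, e.g. before a retry of single-assign. */
  synchronized void setOffline() {
    state = State.OFFLINE;
  }

  /**
   * Sets PENDING_OPEN only if the region is currently OFFLINE.
   * Returns false (and changes nothing) if the znode callback has
   * already moved the region to OPENING or OPENED.
   */
  synchronized boolean setPendingOpen() {
    if (state != State.OFFLINE) {
      return false;
    }
    state = State.PENDING_OPEN;
    return true;
  }

  /**
   * Assumed rule: OPENING may follow OFFLINE or PENDING_OPEN, because
   * the regionserver sets the znode before the open call returns, so
   * the callback can arrive before the master sets PENDING_OPEN.
   */
  synchronized boolean setOpening() {
    if (state != State.OFFLINE && state != State.PENDING_OPEN) {
      return false;
    }
    state = State.OPENING;
    return true;
  }

  /** Assumed rule: OPENED may only follow OPENING. */
  synchronized boolean setOpened() {
    if (state != State.OPENING) {
      return false;
    }
    state = State.OPENED;
    return true;
  }

  public static void main(String[] args) {
    RegionStateSketch region = new RegionStateSketch();
    // The znode callback wins the race with the return from open region.
    region.setOpening();
    // The late PENDING_OPEN is rejected instead of clobbering OPENING.
    System.out.println("setPendingOpen accepted: " + region.setPendingOpen());
    System.out.println("state: " + region.getState());
  }
}
```

Keeping every check inside synchronized methods of one class is what lets the legality of a transition be decided in a single place, which is the direction the description says it would like to take further.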