HBASE-5081 Distributed log splitting deleteNode races against splitLog retry

Review Request #3292 - Created Dec. 21, 2011 and discarded

Jimmy Xiang
Branch: 0.92
Bugs: HBASE-5081
Reviewers
Groups: hbase
People: lhofhansl, stack, tedyu
Repository: hbase-git
In this patch, after a task is done, we don't delete the node if the task failed, so that when it's retried later on, there won't be a race problem.

It used to always delete the node.
Testing done: mvn -Dtest=TestDistributedLogSplitting clean test
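A minimal sketch of the approach described above (the class and method names here are illustrative, not the actual patch): the znode is only scheduled for deletion when the task succeeded, so a failed task's node is still present when splitLog is retried.

  class DeleteOnSuccessSketch {
    void taskFinished(String path, boolean succeeded) {
      if (succeeded) {
        deleteNode(path);  // asynchronous delete with retries, as before
      }
      // on failure the znode is left in place so that a splitLog retry
      // reuses it instead of racing against a pending delete
    }

    void deleteNode(String path) {
      // stand-in for the asynchronous zookeeper delete
    }
  }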
Review request changed
Updated (Dec. 22, 2011, 12:31 a.m.)
Removed whitespace.
Posted (Dec. 22, 2011, 12:58 a.m.)
Did the first approach not work?
  1. The first approach will not hang the splitLog method, but the failed tasks won't actually be retried since the state stays in TaskState.TASK_ERR.  We do need to delete those nodes unless we put different data in zookeeper, as Stack suggested.
It seems we can do this unconditionally now (no need for the safeToDeleteNodeAsync flag). The worst-case scenario is trying to remove a node that has already been removed.
  1. If the failed node is not removed at the beginning, we could run into the same race issue again.
Posted (Dec. 22, 2011, 2:23 a.m.)
The deletion is immediate. Should this counter be incremented?
  1. Yes, so that we can track how many deletions succeed and how many fail.
I think we should be more cautious because RecoverableZooKeeper has already attempted retries.
ke should be rethrown.
  1. Actually, in this case we should not re-throw it.  The corresponding asynchronous deleteNode method doesn't throw KeeperException either; it just logs the failure.
Posted (Dec. 22, 2011, 3:13 a.m.)
Looking at current code:
  private void deleteNodeFailure(String path) {
    LOG.fatal("logic failure, failing to delete a node should never happen " +
        "because delete has infinite retries");
  }

Since the above failure should never happen, the correct action would be to rethrow the exception - for both the synchronous and asynchronous cases.
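One possible reading of that suggestion (a hypothetical sketch, not code from the patch): have deleteNodeFailure() surface the problem instead of only logging it, for example by throwing an unchecked exception that the caller or callback can propagate.

  class DeleteFailureSketch {
    void deleteNodeFailure(String path) {
      throw new IllegalStateException("failed to delete " + path +
          "; should never happen because delete has infinite retries");
    }
  }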
Posted (Dec. 22, 2011, 3:08 p.m.)
This patch seems to be raising too many questions.  Should we try going the other route of ensuring the async delete removes the 'right' task?
  1. Due to the race issue, we have to put more than the "filename" in the node and the hashmap so as to remove the right task.
    That's a much bigger change and will raise more questions.
Why check for success up here rather than down inside the synchronize on task.batch?   Why not do this safeToDeleteNodeAsync in there in the else clause where we up the count of errors?  Is it not safe to do the delete of zk node NOW under the synchronize block?
  1. It is safe to do the delete under the synchronized block. The reason is that I don't want to hold the lock on task.batch while deleting the node synchronously.
Are we duplicating the code from deleteNode here?  Should we have sync/async versions?
  1. deleteNode is the async version.  deleteNodeNow is the sync version.  The async version can have unlimited retries.  The sync version can retry up to a certain configured number of times (3 by default).
    So the sync version doesn't guarantee the node will be deleted. Code-wise, it's hard to reuse.
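Rough shape of the two variants being discussed, sketched against the plain ZooKeeper client (the real code goes through ZooKeeperWatcher/RecoverableZooKeeper; the names and details below are assumptions, not the actual patch):

  import org.apache.zookeeper.AsyncCallback;
  import org.apache.zookeeper.KeeperException;
  import org.apache.zookeeper.ZooKeeper;

  class DeleteVariantsSketch {
    private final ZooKeeper zk;

    DeleteVariantsSketch(ZooKeeper zk) {
      this.zk = zk;
    }

    // async flavor: the callback re-submits on failure, so retries are
    // effectively unlimited
    void deleteNode(final String path) {
      zk.delete(path, -1, new AsyncCallback.VoidCallback() {
        public void processResult(int rc, String p, Object ctx) {
          if (rc != KeeperException.Code.OK.intValue()
              && rc != KeeperException.Code.NONODE.intValue()) {
            deleteNode(p);  // retry forever
          }
        }
      }, null);
    }

    // sync flavor: a bounded attempt (RecoverableZooKeeper caps the
    // retries), so deletion is not guaranteed
    boolean deleteNodeNow(String path) {
      try {
        zk.delete(path, -1);
        return true;
      } catch (KeeperException.NoNodeException e) {
        return true;  // already gone, which is fine
      } catch (KeeperException ke) {
        return false;
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
        return false;
      }
    }
  }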
Posted (Dec. 22, 2011, 3:24 p.m.)
Since the value of status won't change, I think it is better to call deleteNodeNow() here.
If we call deleteNodeNow() at line 360, we hold the lock much longer.
  1. Right.
Posted (Dec. 22, 2011, 3:34 p.m.)
DeleteAsyncCallback is only used by deleteNode().
I think we should simplify the logic by removing deleteNode() and DeleteAsyncCallback - deleteNodeNow() uses RecoverableZooKeeper, which has the retry logic.
  1. The difference is that deleteNode has unlimited retries.  RecoverableZooKeeper doesn't; it has only 3 retries by default.
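For reference, the bounded retry count mentioned above comes from configuration; something along these lines (the key name is from my reading of the 0.92-era RecoverableZooKeeper, so treat it as an assumption):

  import org.apache.hadoop.conf.Configuration;

  class RetryConfigSketch {
    // number of attempts before RecoverableZooKeeper gives up
    static int zkRecoveryRetries(Configuration conf) {
      return conf.getInt("zookeeper.recovery.retry", 3);
    }
  }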
Posted (Dec. 22, 2011, 10:55 p.m.)
I feel that the proper fix should go in the method createTaskIfAbsent()

Before attempting to delete a task in zk, task.deleted is set to true. The task is not removed from the tasks array until the task is successfully removed from zk.

In createTaskIfAbsent(), when you find a deleted task, we should do the following (see the sketch at the end of this comment):
* If the task had completed successfully, then return null. (It is as if the task completed right away.)
* If the task had completed unsuccessfully, then block (with timeouts) until the task is removed from the tasks array.

Without fixing anything, the problem, I think, is present only in the following scenario:
- at startup the master acquires orphan tasks listed in zookeeper. One of these orphan tasks fails. Before that orphan task could be deleted, some master thread asks for that task to be completed. As things currently stand, the SplitLogManager will reply with SUCCESS immediately. (This is because of the logic in createTaskIfAbsent().)

The common case where this race happens should work:
- a master thread asks for a log dir to be split. That task fails, but it has not yet been deleted from zk nor removed from tasks. The log-dir split is retried and the retry finds the old, soon-to-be-deleted task. But the retry will also see that task.batch is set, and it will immediately throw an error saying 'someone else is waiting for this task'. And the next time the log-dir split is retried, the tasks map might have been cleared and things will work.
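A rough sketch of the createTaskIfAbsent() behavior proposed in the bullets above (the Task fields and the tasks map are simplified stand-ins for SplitLogManager's internals, not the actual code):

  import java.util.concurrent.ConcurrentHashMap;
  import java.util.concurrent.ConcurrentMap;

  class CreateTaskSketch {
    static class Task {
      volatile boolean deleted;
      volatile boolean success;
    }

    private final ConcurrentMap<String, Task> tasks =
        new ConcurrentHashMap<String, Task>();

    // returns null if the work is already done, otherwise the task the
    // caller should wait on
    Task createTaskIfAbsent(String path, long timeoutMs)
        throws InterruptedException {
      Task fresh = new Task();
      Task existing = tasks.putIfAbsent(path, fresh);
      if (existing == null) {
        return fresh;
      }
      if (existing.deleted) {
        if (existing.success) {
          // completed successfully: treat the request as done right away
          return null;
        }
        // completed unsuccessfully: block (with a timeout) until the old
        // entry is purged from the tasks map, then install a new task
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (tasks.get(path) == existing
            && System.currentTimeMillis() < deadline) {
          Thread.sleep(100);
        }
        tasks.remove(path, existing);
        Task retry = new Task();
        Task raced = tasks.putIfAbsent(path, retry);
        return raced == null ? retry : raced;
      }
      return existing;
    }
  }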
  1. "The task is not removed from tasks array until task is successfully removed from zk."
    
    This does not seem correct.  stopTrackingTasks() will remove all tasks even if the task is not removed from zk.
    That's why createTaskIfAbsent() can put a new task in the set.

    If we remove stopTrackingTasks(), then the task should still be in tasks, and this alternative will work.
    Will removing stopTrackingTasks() cause other issues?  For the second bullet, how long should we block?  If
    the task is still not removed from the tasks array after the timeout, what should we do?

    Can you come up with a patch? I am very open to any fix.
    
The task corresponding to this path has to be removed from the tasks map (as in deleteNodeSuccess())
  1. It is removed in the stopTrackingTasks() method; since this one failed, batch.installed != batch.done.
I guess this should be considered an error that the delete did not go through?