

HBASE-5196 Failure in region split after PONR could cause region hole

Review Request #3488 - Created Jan. 13, 2012 and submitted

Submitter: Jimmy Xiang
Bugs: HBASE-5196
Reviewers: hbase
Repository: hbase-git
When the master starts up, this patch tries to scan all offline split parents and fix up missing daughters as the ServerShutdownHandler does.
I tested the fix on my real cluster and it does fix the problem.

I am working on a unit test now.
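
The startup fix-up described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in the patch: the catalog is modelled as an in-memory map, and `RegionRow`, `fixupMissingDaughters` and the region names are hypothetical stand-ins rather than HBase classes or APIs.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SplitParentFixupSketch {
    private SplitParentFixupSketch() {}

    /** Hypothetical, simplified catalog row; not HBase's HRegionInfo. */
    static final class RegionRow {
        final String name;
        final boolean offline;
        final boolean split;
        final String daughterA; // null when the row is not a split parent
        final String daughterB;

        RegionRow(String name, boolean offline, boolean split,
                  String daughterA, String daughterB) {
            this.name = name;
            this.offline = offline;
            this.split = split;
            this.daughterA = daughterA;
            this.daughterB = daughterB;
        }
    }

    /**
     * Scans the catalog for offline split parents and re-adds any daughter
     * that has no row of its own. Returns the names of the re-added daughters.
     */
    static List<String> fixupMissingDaughters(Map<String, RegionRow> catalog) {
        List<String> added = new ArrayList<>();
        // Iterate over a snapshot so rows can be inserted while scanning.
        for (RegionRow row : new ArrayList<>(catalog.values())) {
            if (!(row.offline && row.split)) {
                continue; // only offline split parents are of interest
            }
            for (String daughter : new String[] {row.daughterA, row.daughterB}) {
                if (daughter != null && !catalog.containsKey(daughter)) {
                    catalog.put(daughter, new RegionRow(daughter, false, false, null, null));
                    added.add(daughter);
                }
            }
        }
        return added;
    }

    public static void main(String[] args) {
        Map<String, RegionRow> catalog = new LinkedHashMap<>();
        // The parent split, but only one daughter made it into the catalog.
        catalog.put("parent", new RegionRow("parent", true, true, "daughterA", "daughterB"));
        catalog.put("daughterA", new RegionRow("daughterA", false, false, null, null));
        System.out.println("re-added: " + fixupMissingDaughters(catalog));
    }
}
```

Running the scan a second time on the same catalog finds nothing left to add, which is the property that makes it safe to repeat on every master startup.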
Posted (Jan. 13, 2012, 7:18 p.m.)
+1 on patch so far.  In the issue, when you say 'if master does not get a chance to fix it', when is that?  Doesn't the master do it when it comes online?  Good stuff, Jimmy.
  1. There are only 3 threads to do the cleanup.  If lots of region servers (most of the cluster) have died, the shutdown handler may be stuck in log splitting for quite some time. During this period,
    if the master dies somehow, it won't be able to finish the cleanup.  In my case, I ran testLoadAndVerify and it brought HDFS to its knees. So I restarted the cluster and
    ended up with lots of holes in the region chain.
  2. Makes sense.
Posted (Jan. 13, 2012, 7:26 p.m.)

   

  
Should read 'parents found. See if we can fix any'
If an enum is returned, we can get three counters which would be used in the log statement below.
I prefer an enum here:
daughter not missing,
daughter missing and fixed,
daughter missing but not fixed
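
For illustration, the enum suggested above might look something like the sketch below, with the three counters derived from it for the log statement. The names `DaughterFixupTally`, `Result`, `tally` and `summary` are hypothetical and are not taken from the patch.

```java
import java.util.Arrays;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

final class DaughterFixupTally {
    private DaughterFixupTally() {}

    /** Hypothetical outcome of checking one daughter of a split parent. */
    enum Result {
        NOT_MISSING,       // daughter not missing
        MISSING_FIXED,     // daughter missing and fixed
        MISSING_NOT_FIXED  // daughter missing but not fixed
    }

    /** Counts how many daughters ended up in each state. */
    static Map<Result, Integer> tally(List<Result> results) {
        Map<Result, Integer> counts = new EnumMap<>(Result.class);
        for (Result state : Result.values()) {
            counts.put(state, 0); // make every counter present, even if zero
        }
        for (Result r : results) {
            counts.merge(r, 1, Integer::sum);
        }
        return counts;
    }

    /** Builds a one-line summary that a log statement could emit. */
    static String summary(Map<Result, Integer> counts) {
        return "daughters not missing=" + counts.get(Result.NOT_MISSING)
            + ", missing and fixed=" + counts.get(Result.MISSING_FIXED)
            + ", missing but not fixed=" + counts.get(Result.MISSING_NOT_FIXED);
    }

    public static void main(String[] args) {
        List<Result> results = Arrays.asList(
            Result.NOT_MISSING,
            Result.MISSING_FIXED,
            Result.MISSING_FIXED,
            Result.MISSING_NOT_FIXED);
        System.out.println(summary(tally(results)));
    }
}
```

Compared with a boolean return value, the enum keeps "missing but not fixed" distinct from "not missing", which is the case an operator would most want to see in the log.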
  1. I'd say that if you are interested, look in the logs?
    
    I think we should get the basic patch in first.  Can do the fancy stuff in another issue?
Posted (Jan. 14, 2012, 4:29 a.m.)
Looks good to me