Speedup LoadIncrementalHFiles by parallelizing HFile splitting

Review Request #704 - Created May 9, 2011 and updated

Ted Yu
trunk
HBASE-3871
Reviewers
hbase
stack
hbase
This JIRA complements HBASE-3721 by parallelizing HFile splitting, which was previously done in the main thread.

From Adam w.r.t. HFile splitting:
There's actually a good number of messages of that type (HFile no longer fits inside a single region), unfortunately I didn't take a timestamp on just when I was running with the patched jars vs the regular ones, however from the logs I can say that this is occurring fairly regularly on this system. The cluster I tested this on is our backup cluster, the mapreduce jobs on our production cluster output HFiles which are copied to the backup and then loaded into HBase on both. Since the regions may be somewhat different on the backup cluster I would expect it to have to split somewhat regularly.
TestHFileOutputFormat and TestLoadIncrementalHFiles passed with this patch.
Review request changed
Updated (July 9, 2011, 1:54 a.m.)
Patch version 2 addresses Andrew's comment.
On entering the loop that starts at line 202, the queue is empty, and the loop waits for HFile splitting to feed item(s) into the queue.
The previous patch could wait inappropriately long for the first HFile to finish splitting.
The second version limits the amount of time waiting for any particular HFile to complete splitting.
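The bounded wait described above could look roughly like the following. This is a minimal sketch, not the code in the patch: it assumes each split is tracked as a Future, and the names BoundedSplitWait, splitHFile, collect and demo are hypothetical. A split that has not finished within the time limit is put back at the end of the pending queue, so a slow HFile does not hold up results that are already available.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch; none of these names come from the patch.
class BoundedSplitWait {

    // Stand-in for splitting one HFile: sleeps, then returns the names of the two halves.
    static List<String> splitHFile(String name, long millis) throws InterruptedException {
        Thread.sleep(millis);
        return Arrays.asList(name + ".bottom", name + ".top");
    }

    // Drains split results, waiting at most waitMillis on any one pending split
    // per visit before moving on to the next one.
    static List<String> collect(Deque<Future<List<String>>> pending, long waitMillis)
            throws InterruptedException, ExecutionException {
        List<String> loadQueue = new ArrayList<>();
        while (!pending.isEmpty()) {
            Future<List<String>> f = pending.pollFirst();
            try {
                loadQueue.addAll(f.get(waitMillis, TimeUnit.MILLISECONDS));
            } catch (TimeoutException e) {
                pending.addLast(f); // not finished yet; revisit after the others
            }
        }
        return loadQueue;
    }

    // Submits one slow and one fast split, then collects both with a 100 ms limit per visit.
    static List<String> demo() throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            Callable<List<String>> slow = () -> splitHFile("slow", 500);
            Callable<List<String>> fast = () -> splitHFile("fast", 10);
            Deque<Future<List<String>>> pending = new ArrayDeque<>();
            pending.add(pool.submit(slow));
            pending.add(pool.submit(fast));
            return collect(pending, 100);
        } finally {
            pool.shutdown();
        }
    }
}
```

In demo(), the slow split is submitted first but times out on its first visit, so the fast split's halves would normally reach the load queue ahead of it.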
Posted (July 12, 2011, 2:51 a.m.)
Request for review.
Posted (July 12, 2011, 4:49 a.m.)
Patch looks fine to me, but are you addressing Andrew's comment that perhaps futures are not needed?  Good stuff.
  1. The CountDownLatch ctor is passed the total number of items (HFiles in our case). tryLoad() decides which HFiles to split, making the number of items dynamic.
    This is why I didn't use CountDownLatch.
    
    With patch v2, we wouldn't spend much time waiting for any HFile to finish splitting.
    
  2. OK.  +1 ship it.
  3. +1 thanks Ted.
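
The reasoning in item 1 can be sketched as follows. java.util.concurrent.CountDownLatch takes its count in the constructor, so the number of splits would have to be known before any work is submitted. Collecting Futures as each decision is made avoids that. This is only an illustration under assumptions: DynamicSplitCount, needsSplit and splitAsNeeded are hypothetical names, and the "-wide" suffix check merely stands in for whatever tryLoad() actually tests.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch; none of these names come from the patch.
class DynamicSplitCount {

    // Assumed stand-in for the decision made in tryLoad(): an HFile "needs
    // splitting" here if its name ends with "-wide".
    static boolean needsSplit(String hfile) {
        return hfile.endsWith("-wide");
    }

    // Submits a split task only for HFiles that need one and returns how many
    // splits ran. The count is known only after the loop, which is why a latch
    // sized up front does not fit.
    static int splitAsNeeded(List<String> hfiles) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<String>> splits = new ArrayList<>();
        try {
            for (String hfile : hfiles) {
                if (needsSplit(hfile)) {
                    Callable<String> task = () -> hfile + ".split";
                    splits.add(pool.submit(task));
                }
            }
            for (Future<String> f : splits) {
                f.get(); // wait for each split that was actually started
            }
            return splits.size();
        } finally {
            pool.shutdown();
        }
    }
}
```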