MultiThreaded Table Mapper analogous to MultiThreaded Mapper in Hadoop

Review Request #3995 - Created Feb. 22, 2012 and updated

Jai Singh
stack, tedyu
HBase currently has no MultiThreadedTableMapper, the counterpart of Hadoop's MultithreadedMapper for IO-bound jobs.
Use case, web crawler: take input (URLs) from an HBase table and put the results (URLs, content) back into HBase.
Running this kind of HBase MapReduce job with the normal TableMapper is quite slow because the job is network-IO bound and does not fully utilize the CPU.
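
To illustrate why more threads per map task help an IO-bound job, here is a minimal sketch in plain Java. It is not the patch under review and uses no HBase or Hadoop classes; the class name ThreadedMapSketch, the mapAll and simulatedFetch helpers, and the 50 ms sleep standing in for a network fetch are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Illustrative only: hypothetical names, no HBase or Hadoop dependencies. */
class ThreadedMapSketch {

    /**
     * Applies an IO-bound function to every input on a fixed pool of threads,
     * so that waits on one record overlap with work on the others.
     * The function must not return null (ConcurrentHashMap rejects null values).
     */
    static Map<String, String> mapAll(List<String> inputs,
                                      Function<String, String> fetch,
                                      int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        Map<String, String> output = new ConcurrentHashMap<>();
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (String input : inputs) {
                // Block-bodied lambda so the Runnable overload of submit is chosen.
                pending.add(pool.submit(() -> {
                    output.put(input, fetch.apply(input));
                }));
            }
            // Wait for every record to finish before returning.
            for (Future<?> f : pending) {
                f.get();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while mapping", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a map call failed", e.getCause());
        } finally {
            pool.shutdown();
        }
        return output;
    }

    /** Stand-in for a network fetch: sleeps briefly, then returns fake content. */
    static String simulatedFetch(String url) {
        try {
            Thread.sleep(50);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "content of " + url;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList("u1", "u2", "u3", "u4", "u5", "u6", "u7", "u8");
        Map<String, String> fetched = mapAll(urls, ThreadedMapSketch::simulatedFetch, 4);
        System.out.println("fetched " + fetched.size() + " pages");
    }
}
```

With one thread the simulated fetches run back to back; with several they overlap, which is the effect the proposed mapper aims for inside a single map task. The thread count is a tuning knob and the value 4 above is arbitrary.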

Moreover, I would like to know whether it is a good or bad idea to use HBase for this kind of use case.

Review request changed
Updated (Feb. 23, 2012, 4:22 a.m.)
Removed whitespace.
Ship it!
Posted (Feb. 23, 2012, 4:32 a.m.)
This looks great. Does it work? Have you tried it? +1 on commit if it works. It would also be nice in things like PE for putting up more load.
  1. This works fine. I've tested it in the use case I mentioned on JIRA HBASE-5166.
  2. So it works nicely for your crawling then? Mind writing a sweet release note for this? I'll go commit it.
  3. Oh, mind uploading the final version of the patch to the issue itself? Then we can run hadoopqa on the patch and make sure it plays well with the rest of HBase (should be fine given it's standalone). Thanks, Jai.
  4. Yes, it works great with the web crawling scenario.
    Release note: "MultiThreadedTableMapper for [N/W] IO bound jobs"
    Updated the patch on JIRA.