Review Board 1.7.22


HIVE-1803: Implement bitmap indexing in Hive

Review Request #466 - Created March 4, 2011 and updated

John Sichi
trunk
HIVE-1803
Reviewers
hive
hive-git
Review by JVS.

 
Posted (March 4, 2011, 2:32 p.m.)

   

  
Suggested rename:  TableBasedIndexHandler.  No need to include Abstract in the class name.
I know this is copied from Yongqiang's existing code, but I noticed that it doesn't deal correctly with partition values with oddball characters.

Safe version is:

sb.append(HiveUtils.escapeString(p.getValue()));
package goes after the file header (before the first import)
Let's just call this HiveIndexedInputFormat.
Let's just call this HiveIndexResult
The point of factoring out the compact index handler base class was so that you could eliminate most of the code in this class, right?  :)
  1. I don't know what happened with the refactoring here. I swear I did it right. Anyways, I'll fix it in the next patch.
Where is the code for requiring that map-side aggregation is in use (to ensure the desired ordering)?
Need Apache header.
Change all these TODO's into

throw new UnsupportedOperationException();
You need to make this extend the new abstract class, and then get rid of the duplicated code.
Always use curly braces for the "if" statement body, even if it's a single line.
Thinking about it more, it seems like these UDAF's should all be generically useful, so we could just name them EWAH_BITMAP etc to indicate their representation (and then get rid of the "Do not use" warnings).
Couldn't you avoid this copying by having the BitmapObjectInput/Output already work in terms of LongWritable?
  1. When we convert the BitmapObjectInput/Output to work with LongWritables, we then need to worry about the bitmap_and, bitmap_or, (and the new bitmap_empty) udfs, since they take the hive datatype array<bigint> as an argument, which is never represented as an Array<LongWritable>.
    
    For the next patch I'm preparing, I think I'm just going to keep the copying code in those UDFs except copy stuff into an Array<LongWritable> to pass to the BitmapObjectInput constructor. Do you think it would be better to pass the entire array object to BitmapObjectInput along with the ListObjectInspector and other classes required to read from the array?
Couldn't you make this a single-parameter UDF which just tests whether a bitmap is empty or not?  Then use the existing UDAF collect_set to collect the distinct block offsets.
  1. Duh. Should have thought of that.
Hmmm...looking at the EWAH code, we could actually make our decision by reading just the header to avoid having to deserialize the whole thing.
  1. I don't think we can. The header shows the actual size in bits, but all of those bits can be zero. I think we still need to deserialize the entire bitmap in order to decide.
When copy-and-paste goes bad:  this class keeps referring to BITMAP_AND.

This is usually a "code smell" that you could have factored 90% out of this into an abstract class and just parameterized it.
Why are you grouping by key and ds here?  That will cause lots of duplicates to be written out.  Don't you want to collapse across all keys?
For this test, it would be better to have a table with two different columns rather than faking it by creating two indexes on the same column.

Also, since you have a test for bitmap_or, you should also have one for bitmap_and.
Can we leave in redundant support for hive.index.compact.file property and (almost empty) inputformat class for a transition period?  Otherwise it will be more difficult to get this deployed at Facebook.

Then you can also leave these tests unmodified.
Posted (March 5, 2011, 12:06 p.m.)

   

  
It would be better if the declaration of the loop-private variables is done just once outside the loop and their values re-initialized in every iteration of the loop. Re-declaration in every iteration isn't such a good idea. This need be taken at many places i guess.
Please don't hard-code the class name. It becomes difficult in future if the code-refactoring is to be carried out. A better way would be as below:
public static final Log l4j = LogFactory.getLog(HiveIndexTableIndexResult.class.getSimpleName());
If an exception is thrown, then the LineReader may not be closed at all, lead to a resource leakage. It would be a safe bet to protect the code in try {} block and close it in finally block.
Using IndexMetadataChangeTask.class.getSimpleName() would be a better way, I suppose.
A generated unique serialVersionUID is preferrable.