Review Board 1.7.22

HIVE-1694: Accelerate GROUP BY execution using indexes

Review Request #392 - Created Feb. 3, 2011 and updated

John Sichi
Preliminary review.

Posted (Feb. 27, 2011, 3:49 p.m.)


I suggest hive.optimize.index.groupby for this property.
Typo:  Unknown
I don't understand why these state variables are maintained as conf variables rather than just data members of a class.  Could you explain why that is necessary?
  1. The intent was to have these variables available as part of the current configuration. I think that was unnecessary. We will change it to data members of RewriteCanApplyCtx class.
Why is the exception being ignored here?
Need Apache headers on all new files.
Use "expr" instead of engfd
Typo:  groub-by
What about nested function invocations?
  1. Yes we missed that. We will upload a new patch with this change.
Eliminate commented-out code
Need to abstract out this dependency on _c convention.
  1. Can you suggest a way using which we can get rid of this dependency?
  2. Actually, isn't just checking internalToAlias.get != null already good enough?  If so we can just eliminate the _c check.
For HIVE-1644, the HMC team is creating a new package org.apache.hadoop.hive.ql.optimizer.index.  I think any optimizer code related to indexing should go in there.
  1. I understand we just need to place our code in the 'org.apache.hadoop.hive.ql.optimizer.index' package. Right?
  2. Correct.  But for generic optimizer code (e.g. RewriteParseContextGenerator, which is unrelated to indexing) can stay where it is.
This code is currently tied to the compact index representation.

We mentioned earlier that we'll need a new index representation (summary) instead in order to implement the counts correctly (we should leave the compact representation as is).


* until the summary representation is added, we can't enable this

* in general, it would be good to find a way to make this pluggable; for example, the bitmap index representation can also be utilized by counting the bits, but the rewrite expression would be slightly different
  1. We have a couple of designs ready for the new index handler. I will post them on JIRA. 
    When you say "pluggable" does it mean that we should be able to change the rewrite expression depending on which index handler is used? For example, we will get different rewrite expressions if we use either the new index handler or the bitmap index handler. Our optimization code should work with any rewrite expression. Is this understanding correct?
  2. Correct.  If you look at what's going on in the latest HIVE-1644 patch, they are extending the HiveIndexHandler interface with an optional method for dealing with a WHERE clause.  You can do the same for aggregation.  It may take a few iterations to get it generic enough, but it helps that we already have the bitmap work in progress so that's two cases to generalize from.
Might want to use a finer level than
Just a sanity check to avoid huge payloads coming back from thrift.
Hmm, no, I think we should fail hard here.  If the underlying problem is fatal (e.g. the metastore went down), we should not try to hide it.