HIVE-1644 Use filter pushdown for automatically accessing indexes
Review Request #558 - Created April 7, 2011
Review request for HIVE-1644.12.patch
For consistency with my review in HIVE-1694, I suggest hive.optimize.index.filter as the name for this configuration parameter. (In HIVE-1694 I suggested hive.optimize.index.groupby, and we want it to be possible to enable/disable them independently.)
In line with the previous comment, suggest hive.optimize.index.filter.compact.minSize/maxSize. Namit's suggestion for minSize was 5G. I think the default for maxSize should be infinity (I can't think of a case where we want it in effect by default).
HIVE-1803 is changing this to hive.index.blockfilter.file. Assuming that gets committed first, we should use that, since it's generic rather than tied to the index type.
What are the units here? Also, don't use a colon after the parameter name.
The non-functional changes in this file are gonna conflict with HIVE-1803, so get rid of them.
Use HiveUtils.unparseIdentifier for quoting table names in generated SQL.
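For example, something like this (a fragment; qlCommand/tableName are stand-ins for whatever variables the builder code uses here):

    // HiveUtils.unparseIdentifier backtick-quotes the identifier, so
    // unusual table names survive re-parsing of the generated SQL.
    String quoted = HiveUtils.unparseIdentifier(tableName);
    qlCommand.append(" FROM ").append(quoted);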
Isn't it incorrect to set properties on the original table scan here since this is only tentative?
Likewise, modifying inputs is incorrect before we have a definite plan. Some more work on the new HiveIndexHandler interface method is required for resolving this plus the residuals.
If searchConditions.size() == 0, it means we didn't find anything which could be handled by the index. In that case, we should bail out immediately and not try to do anything more with this index.
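i.e., something like (a sketch; the enclosing method's return convention may differ):

    if (searchConditions.size() == 0) {
      // The predicate analyzer found nothing this index can handle;
      // don't generate an index query or touch the plan for it.
      return;
    }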
We collect the residual here, but we don't do anything with it. Don't we need to pass it back so that Hive can decide what to leave in the Filter operator?
The list actually contains index objects, not index table names. Also typo: "is exists"
Only cast once.
Indentation is wrong here.
In my review for HIVE-1694, I noted that we should not be swallowing exceptions. I think some of this code was copied from there. If we can't access the metastore during optimization, it should be treated as a fatal error.
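Something like this instead (a sketch; the exact metastore call and exception wrapping here are from memory):

    List<Index> indexes;
    try {
      indexes = hive.getIndexes(table.getDbName(), table.getTableName(),
          (short) -1);
    } catch (HiveException e) {
      // Fatal: if the metastore is unreachable during optimization,
      // abort the compile rather than silently skipping indexes.
      throw new SemanticException("Failed to retrieve indexes", e);
    }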
The plan still looks wrong (there are two Stage-0's, one for the index scan, one for the final fetch), so the relabeling is still not quite working correctly.
no space after !
Suggested rename for method: arePartitionsCoveredByIndex
This checks that the metadata matches. But it does not actually check that the index partitions exist.
Is that true? Why couldn't an index be used to optimize a map-only task?
As noted above, we DO want to fail here.
Could you add more tests for cases where automatic indexing should decide not to kick in:
* index can't be used because of min/max size config
* index can't be used because the predicate isn't supported
* index can't be used because the columns aren't covered
Also:
* case where multiple indexes apply and we pick one (currently arbitrarily, but make sure it's at least deterministic so that the test doesn't become flaky)
Could you add a test which shows the index NOT being used in the case where required partitions haven't been built yet?
Here's the duplicate Stage-0 I referred to in the code.
I would have liked to just make a copy of pctx before calling rewriteForIndex(...) for each index, and then use whichever copy corresponded to the index rewrite we chose. However, there did not seem to be an easy way to copy the pctx.
Do we need to propagate the residual predicate any further?
I'm kind of confused about how to check the actual table and not the metadata. When we call indexTable.getPartitionKeys() and part.getTable().getPartitionKeys(), that method calls getPartitionKeys() on the underlying Thrift Tables. Is there a way besides getPartitionKeys() that we should be using?
I have not yet added the additional unit tests.
I fixed the labeling for this case, but would it make sense to label our stages differently for indexing? We only relabel correctly as long as we're overwriting the highest-numbered stage, since we only relabel a single task. Or should it relabel all tasks in the whole plan? We only have easy access to context.currentTask when we iterate through in IndexWhereProcessor (line 153).
A few comments here.
1) Rather than passing in the entire table scan object and letting the handler set properties on it, I think we should just have the handler pass back the necessary information (input format and intermediate file).
2) The generateIndexQuery method's parameter list is growing. For plugin interfaces, a good pattern we've been using in other places is to introduce a new context class (say HiveIndexQueryContext) with getters and setters for the information to be communicated in both directions. The caller instantiates one of these and passes it in; the plugin reads from and writes to the context; on return, the caller gets the modified information out. The main benefit is that if we need to pass more information in the future, we just add new members to the context class, and none of the existing plugin implementations break. In this case, you could also put the context objects in a map (instead of having to keep multiple maps like indexQueryTasks/additionalInputs). A sketch of this follows.
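Something like (field names illustrative, not a final API):

    public class HiveIndexQueryContext {
      // filled in by the handler for the caller to read back
      private String indexInputFormat;
      private String indexIntermediateFile;
      private Set<ReadEntity> additionalSemanticInputs;
      private List<Task<? extends Serializable>> queryTasks;

      public String getIndexInputFormat() {
        return indexInputFormat;
      }
      public void setIndexInputFormat(String indexInputFormat) {
        this.indexInputFormat = indexInputFormat;
      }
      // ...analogous getters/setters for the other fields...
    }

and the handler method shrinks to roughly:

    void generateIndexQuery(Index index, ExprNodeDesc predicate,
        ParseContext pctx, HiveIndexQueryContext queryContext);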
Just put it as a TODO for now; create the followup JIRA issue and reference it in the TODO.
Look in Hive.java; there are methods like public List<Partition> getPartitionsByNames(Table tbl, List<String> partNames) which look up the actual partitions for a table from the metastore. You can pass in indexTable.
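i.e., roughly (a sketch):

    // Ask the metastore which of the needed partitions actually exist
    // on the index table, instead of only comparing metadata.
    List<String> partNames = new ArrayList<String>();
    for (Partition part : queryPartitions) {
      partNames.add(part.getName());
    }
    List<Partition> builtParts =
        hive.getPartitionsByNames(indexTable, partNames);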
Hmm...what if we could avoid relabeling altogether? If you look in Driver.java, there's a method compile which calls TaskFactory.resetId(). This is what causes us to start back over from 0. If you add an optional parameter resetTaskIds=true, and then pass false for the Driver instance used for compiling the reentrant query, that might do it.
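Roughly (a hypothetical overload; today compile() always resets):

    public int compile(String command) {
      return compile(command, true);
    }

    public int compile(String command, boolean resetTaskIds) {
      if (resetTaskIds) {
        TaskFactory.resetId();
      }
      // ...rest of the existing compile() body...
    }

The reentrant Driver used for the index query would call compile(command, false), so its task ids continue from where the outer plan left off and no relabeling is needed.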
Still need to change hive.index.compact.file to hive.index.blockfilter.file, but hopefully bitmap gets committed soon.
I'm not sure the way I'm doing it currently will work with partitions. I don't take them into account when generating the index query.
See the later comment about why this abort needs to be skipped for anything to run.
This doesn't seem to work (it always returns false here). This checks whether the partitions equal each other, which I don't think can happen since they're on different tables. What information in a partition do I need to be checking?
Is there an existing multi-column table? Or, what's the best way to create a multi-column table and populate it with data? I can't figure out a good way to query the value column, so the src table seems less than ideal.
How do unbuilt partitions work? I didn't see any way to delay the building, so I don't know how to have an index with unbuilt partitions.
BTW, these property names should be all-lowercase.
When you add an overload, add Javadoc as well (including the new param's meaning).
Could you explain the usage interaction better (along the lines of what I explained in my review comment)?
You're right. Either we need to treat them as index columns (so that the predicates on them will automatically be collected by the predicate analyzer), or we need to explicitly generate corresponding equality predicates based on the partition values which have already been identified by partition pruning.
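The second option would look roughly like this (API shapes from memory; treat as illustrative):

    // AND an equality predicate for each pruned partition key/value
    // onto the predicate passed to the index handler.
    ExprNodeDesc extended = predicate;
    for (Map.Entry<String, String> entry : part.getSpec().entrySet()) {
      ExprNodeDesc col = new ExprNodeColumnDesc(
          TypeInfoFactory.stringTypeInfo, entry.getKey(), null, true);
      ExprNodeDesc val = new ExprNodeConstantDesc(entry.getValue());
      ExprNodeDesc eq = TypeCheckProcFactory.DefaultExprProcessor
          .getFuncExprNodeDesc("=", col, val);
      extended = TypeCheckProcFactory.DefaultExprProcessor
          .getFuncExprNodeDesc("and", extended, eq);
    }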
From an efficiency perspective, you certainly don't want to be doing this over and over inside the outer for loop; just do it once first outside. Also, for a table with a huge number of partitions, fetching all of them is a bad idea; it's better to selectively query the partitions of interest (but batching them if possible).
This doesn't work because the Partition class does not override the default Java equals method (which is based on object identity rather than value), and different metastore queries return different object instances for the same underlying entity.
I don't understand what you mean? src has two columns, key and value.
From the index design doc, there's an optional PARTITION clause when rebuilding an index which allows you to build just one specific partition, leaving the others unbuilt. I think there are some examples in the unit tests.

    ALTER INDEX index_name ON table_name [ PARTITION (...) ] REBUILD
Oh, and reading your original comment more carefully: yeah, they are two separate entities (one for the table partition, and one for the index partition), so even if the equals method were tied to metastore object identity, it still wouldn't work. The getSpec() method on the Partition class is what gives you the actual key/value pairs for the partition, suitable for comparison.
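So the existence check becomes a spec comparison, something like (a sketch; variable names illustrative):

    // Compare by partition spec (key -> value map) rather than equals().
    boolean built = false;
    for (Partition indexPart : indexTableParts) {
      if (indexPart.getSpec().equals(queryPart.getSpec())) {
        built = true;
        break;
      }
    }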
When we run a query on a non-partitioned table, we get a single partition in queryPartitions of the whole table (with an empty partSpec). Then, when we add the partition columns to the list of indexed column, we end up adding all the columns in the src table, instead of just the partitioned ones. If we make sure the partSpec isn't empty, this doesn't happen.
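i.e., a guard along these lines (a sketch; names from the surrounding code):

    // A query on an unpartitioned table yields one pseudo-partition
    // with an empty spec; its columns are not partition columns.
    if (part.getSpec().isEmpty()) {
      continue;
    }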
We need to cast the work in this task to MapredWork in order to get the input size out (line 176). I'm not sure if this is the best place to do that checking.
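i.e. (a sketch):

    // Only MapredWork exposes the input size, so guard the cast.
    if (!(task.getWork() instanceof MapredWork)) {
      return null; // nothing to check for non-MapReduce tasks
    }
    MapredWork work = (MapredWork) task.getWork();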
See above comment about MapredWork
Create a followup task for dealing with jobs which access multiple tables. For that, we need to associate the index formats/files with specific tables, and that requires modifying the way the index input format works.
Create a followup task for displaying these in the plan (to indicate that a table scan's input is being filtered by the intermediate file). We only want to do that when they are non-null (to avoid upsetting all the existing test reference files).
When logging errors being propagated, use the two-arg version of the method and pass e as the second arg. Same thing in a few other places.
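i.e.:

    // The two-arg overload records the stack trace along with the message.
    LOG.error("Failed to generate index query", e);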
curly bracket placement
create a followup for this one
This is not an error, just a condition that prevents usage of the index, so it should be logged as info rather than error.
I thought this might be what caused the original table to be used instead of the stale index. By adding the index table, we keep the original table around. However, clearing the inputs before adding the index table didn't change anything.
We shouldn't be seeing this output. We're still generating the right plan, but something is wrong when we run it.
This does not work because GenMRTableScan1.process() does not get called on these objects after we set them, as far as I can tell.
At this point, the inputFormat for this work is actually null. I do not know which work I need to change the inputFormat on, and I can't think of an easy way to get to the other work objects.