PIG-3562 Implement combiner optimizations for DISTINCT in Tez

Review Request #16717 - Created Jan. 8, 2014 and updated

Alex Bain
cheolsoo, daijy, mwagner, rohini
Implement DISTINCT combiner optimizations in Tez

1. Use a combiner for plain uses of DISTINCT. MR Pig handles this with some global variables and a special DistinctCombiner class that throws away the duplicate tuples. We could hack this into Pig-on-Tez, but instead I reused the reduce plan as the combiner plan, which does the same thing (through a POPackage->POProject->POForEach with the setDistinct property set to true).

I'm a little bit concerned that this combiner plan could somehow be slower than the special DistinctCombiner class, but I don't see how.

There is also a special CombinerPackager that I did NOT use for this. I think that packager is really intended only for the algebraic UDF combiner optimizations.
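
To illustrate why reusing the reduce plan as the combiner plan is safe for DISTINCT, here is a minimal standalone sketch. It is not Pig or Tez code: the class and method names are hypothetical, and plain strings stand in for tuples. It only shows that running the same distinct step on each partition and then again on the merged output gives the same result as running it once at the end.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical illustration only; none of these names come from Pig or Tez.
class DistinctCombinerSketch {
    // Stand-in for the distinct plan: drop duplicates, keep first-seen order.
    static List<String> distinct(List<String> tuples) {
        return new ArrayList<>(new LinkedHashSet<>(tuples));
    }

    // Run the same distinct step on each map-side partition (the "combiner"),
    // then once more over the concatenated partial results (the "reducer").
    static List<String> distinctWithCombiner(List<List<String>> partitions) {
        List<String> shuffled = new ArrayList<>();
        for (List<String> partition : partitions) {
            shuffled.addAll(distinct(partition));
        }
        return distinct(shuffled);
    }

    public static void main(String[] args) {
        List<List<String>> partitions = Arrays.asList(
            Arrays.asList("a", "b", "a"),
            Arrays.asList("b", "c", "c"));
        System.out.println(distinctWithCombiner(partitions)); // prints [a, b, c]
    }
}
```

In this sketch the combiner pass shrinks the shuffled data from six values to four before the final pass removes the remaining duplicate.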

2. I carefully verified that DISTINCT nested inside a FOREACH code block is optimized by the CombinerOptimizer into an algebraic UDF version of DISTINCT. I added TestTezCompiler and e2e tests for this. Cheolsoo already made all the combiner changes needed for this to work correctly, so I didn't make any code changes here.
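
As a rough picture of what an algebraic version of DISTINCT means, the following standalone sketch splits the work into initial, intermediate, and final stages. It is an assumption-laden illustration, not Pig's implementation: the class name, method names, and the use of string sets in place of bags of tuples are all hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; class and method names are illustrative, not Pig's API.
class AlgebraicDistinctSketch {
    // Initial stage (map side): wrap one value in a single-element set.
    static Set<String> initial(String value) {
        Set<String> single = new LinkedHashSet<>();
        single.add(value);
        return single;
    }

    // Intermediate stage (combiner): union partial sets into one set.
    static Set<String> intermediate(List<Set<String>> partials) {
        Set<String> merged = new LinkedHashSet<>();
        for (Set<String> partial : partials) {
            merged.addAll(partial);
        }
        return merged;
    }

    // Final stage (reduce side): union whatever partial sets arrive for the group.
    static Set<String> finalStep(List<Set<String>> partials) {
        return intermediate(partials);
    }

    public static void main(String[] args) {
        // Two map tasks each combine their own values for the same group.
        Set<String> map1 = intermediate(Arrays.asList(initial("x"), initial("y"), initial("x")));
        Set<String> map2 = intermediate(Arrays.asList(initial("y"), initial("z")));
        System.out.println(finalStep(Arrays.asList(map1, map2))); // prints [x, y, z]
    }
}
```

Because set union is associative and idempotent, the intermediate stage can run any number of times without changing what the final stage produces, which is the property a combiner relies on.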
Updated the golden file for the existing TestTezCompiler DISTINCT test to include the combiner plan
Added a TestTezCompiler test and golden file for the DISTINCT algebraic UDF combiner
Added an e2e test that runs DISTINCT with the algebraic UDF combiner
I am getting some test-e2e-tez failures in ORDER BY tests, but I am also getting these in a clean Tez branch. My new e2e test passes.
Ship it!
Posted (Jan. 8, 2014, 10:53 p.m.)
I will commit it after running tests.