Review Board 1.7.22

SchemaTuple in Pig

Review Request #4651 - Created April 5, 2012 and updated

Jonathan Coveney
This work builds on Dmitriy's PrimitiveTuple work. The idea is that, knowing the Schema on the frontend, we can code generate Tuples which can be used for fun and profit. In rudimentary tests, the memory efficiency is 2-4x better, and it's ~15% smaller serialized (heavily heavily depends on the data, though). Need to do get/set tests, but assuming that it's on par (or even faster) than Tuple, the memory gain is huge.

Need to clean up the code and add tests.

Right now, it generates a SchemaTuple for every inputSchema and outputSchema given to UDF's. The next step is to make a SchemaBag, where I think the serialization savings will be really huge.

Needs tests and comments, but I want the code to settle a bit.

Description From Last Updated Status
any reason you decided to extend HashMap as opposed to just convert when inserting? Julien Le Dem July 2, 2012, 10:50 p.m. Open
Review request changed
Updated (June 29, 2012, 9:55 p.m.)
Posted (July 2, 2012, 10:50 p.m.)
Great work!
some minor comments.
This is getting really good!
  1. Thanks Julien!
longer term we should probably have a DistributedCacheManager to centralize those things (not now)
maybe this instead:
SchemaTupleBackend.initialize(job, pigContext.getExecType());
and check inside
  1. agreed
is this needed ?
  1. confusingly (and I did not notice until now), but this is actually a different Pair than the one I implemented. That's...funny
  1. I generated the patch to not affect the existing whitespace. We don't really have any clear guidelines in the project around fixing whitespace formatting (afaik the current consensus is "don't fix whitespace")
if you override get, you should really override containsKey as well, or this could hide some hard to debug side effects.
  1. k
trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/ (Diff revision 10)
any reason you decided to extend HashMap as opposed to just convert when inserting?
  1. I originally converted when inserting. The nice thing about this pattern is that you centralize the conversion (at insertion), instead of having to have an if-then-else at any point where you actually use the HashMap.
factor this out in a method.

Is there a case when this is already a SchemaTuple?
  1. It depends. In some cases, it is's hard to know since SchemaTuple could be integrated deeper in the pipeline. I factored it out and added checks.
why not just convert the tuple here, instead of extending ArrayList?
It would seem a little more obvious.
If you want a strategy pattern, it does not have to be in List.
  1. See above. Give thoughts w.r.t. that and I'll go with it.
  2. In this vein, I could create a side interface (kind of like TupleMaker) that would encapsulate the proper datatype, and not have the potential pitfalls of "oh the containsKey was or wasn't implement" or something like that?
are there cases where the tuple is already a SchemaTuple?
  1. see above
what is this for ?
  1. It's old code that can go. whoops!
trunk/src/org/apache/pig/data/ (Diff revision 10)
what are those for ?
It's unlikely we want UDFs to be dependent on SchemaTuples (or their absence)
  1. There is a comment above that comment unclear? Basically, this is the mechanism by which the code generation step can communicate the context in which a given SchemaTuple should be used. Based on the contexts in which a SchemaTuple was registered, we annotate the generated class. This way, on the backend when a given piece of code "requests" a SchemaTuple for a given Schema, we can inspect the SchemaTuple that matches the Schema to see if it was intended to be used in that context (imagine the case where we have SchemaTuple turned on for merge join, but turned off for UDF's, but there is a merge join and udf tuple that both have the same Schema. This is the mechanism that is used to give a SchematupleFactory to the former but not the latter)
trunk/src/org/apache/pig/newplan/logical/relational/ (Diff revision 10)
is there some code somewhere that does this already ?
  1. I didn't find any, but I didn't look too hard.
  1. TDD!