Review Board 1.7.22

HCatalog-42 review request on behalf of Sushanth

Review Request #1184 - Created July 22, 2011 and updated

Ashutosh Chauhan
Unit tests are included
Posted (July 22, 2011, 6:01 p.m.)


trunk/build.xml (Diff revision 1)
Instead of checking it in lib, is it possible to pull this lib through ivy ?
  1. I don't know if this is exported by any particular artifact. It was part of hadoop-core, but as of 0.20.203/CDH3, it is not present there. Back in hadoop trunk and 0.20.204 onwards, it's supposed to have made a comeback. For now it's better to keep it here, and then pull it out once the primary hadoop version we depend on has it, I think.
Do we need these keys? Looks like you are propagating the token already acquired in the first job to the second one.
  1. This is done to mimic our usage of the hive delegation token, because we need to refer to it to be able to cancel this.
    It might be possible to experiment with pulling the credentials out of the JobContext and, after checking .getKind() to see if it's an MR token, checking whether we set it as opposed to it having been set by Oozie or other external clients, but this was the easier way to do it. It's worth experimenting on a secure cluster to optimize this.
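The credentials-inspection idea above could be sketched roughly as follows. This is a hedged, self-contained illustration: the `Token` class here is a stand-in for Hadoop's `org.apache.hadoop.security.token.Token`, whose `getKind()` identifies the token type; the kind string and the filtering helper are assumptions for the sketch, not HCatalog code.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenFilterSketch {
    // Stand-in for Hadoop's Token class; getKind() names the token type.
    public static class Token {
        private final String kind;
        public Token(String kind) { this.kind = kind; }
        public String getKind() { return kind; }
    }

    // Keep only tokens whose kind marks them as MR delegation tokens,
    // mimicking a walk over the JobContext credentials.
    public static List<Token> mrDelegationTokens(List<Token> all) {
        List<Token> out = new ArrayList<>();
        for (Token t : all) {
            if ("MAPREDUCE_DELEGATION_TOKEN".equals(t.getKind())) {
                out.add(t);
            }
        }
        return out;
    }
}
```

The open question from the reply remains: even after filtering by kind, distinguishing a token we set ourselves from one set by Oozie would need some marker of our own.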
Will it be better to externalize this property? Otherwise one has to recompile hcatalog to turn this off.
  1. The current idea is that this check is not really necessary, and there has to be an architectural discussion before we decide to enable it again. I used a boolean setting and if-guards instead of outright commenting out the code (or removing it altogether) so that it's easy to reverse later on. If we discuss this and decide to have this feature enabled or configurable, we can make it configurable.
Do we need all these methods? Looks like you no longer need to acquire the token from the JT.
  1. I do need to acquire the token from the JobTracker and I do use it - it is being used even if not explicitly set - that's what the exporting of the tokens dir does.
trunk/src/java/org/apache/hcatalog/common/ (Diff revision 1)
Looks a little unnatural to do logging this way. We should take advantage of the logging library to accomplish these tasks.
  1. Agreed on that. :)
    We should pull a bunch of these log functions out into a proper log class, and also start using log4j properly instead of only when we debug. Not immediately in the scope of this patch, I'd suggest.
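The refactor suggested in the reply might look something like the sketch below. It is an assumption-laden illustration, not HCatalog code: it uses `java.util.logging` purely so the sketch stays self-contained, whereas the project would use log4j behind commons-logging; the class and method names are invented for the example.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical shared log helper replacing scattered per-class helpers.
public class HCatLogSketch {
    private static final Logger LOG = Logger.getLogger("org.apache.hcatalog");

    // Build the message separately so it can be unit-tested; TreeMap
    // gives a stable key order in the rendered output.
    static String render(String label, Map<String, ?> entries) {
        return label + " : " + new TreeMap<>(entries);
    }

    public static void debug(String label, Map<String, ?> entries) {
        // Guard so the message is only built when debug output is enabled.
        if (LOG.isLoggable(Level.FINE)) {
            LOG.fine(render(label, entries));
        }
    }
}
```

Centralizing logging like this would also remove the need to instantiate a static LOG in every class, as the next reply notes.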
Commented code. Either remove it or uncomment it. I think these logging statements should be enabled.
  1. (refer above to needing to pull logging out to a separate class, then we can stop instantiating static LOGs altogether)
trunk/src/java/org/apache/hcatalog/har/ (Diff revision 1)
Commented code.
  1. I figured it might be useful in the future if we need to debug whether or not the har is valid. This I can live without.
    (TODO:remove commented code)
Looks like you are not using this map later on. We should probably remove it.
  1. Agreed, the functionality was replaced by discoverPartitions(), since we needed more info than this provided. This I'll remove and update the patch.
    (TODO:remove baseCommitters)
This, I think, is because you don't have a list of base committers initially. But it may be possible to get that list from HCatRecordWriter at this point and then call commitTask() on them. Calling commitTask() from close() looks awkward; if there is a way, we should avoid it.
  1. There isn't a really good way for HCatRecordWriter to communicate this to the HCatOutputCommitter. The current approach comes closest to what we consider the equivalent of the underlying committer's commitTask(), and the RecordWriter is a good place to put it, because that is very much in the task scope.
Instead of special-casing for dynamic partitioning, we should always call baseCommitter.setupTask() here. In the case of dynamic partitions this will become a no-op here, and we can later call it again from write() to do the real work.
  1. Again, as above, there is no way for us to know the baseCommitter in HCatOutputCommitter in a dynamic partitioning usecase.
Can you explain the purpose of this boolean? Looks like it can be derived from existing variables.
  1. It serves as a multiple-call guard for discoverPartitions(), which can be a costly call to make; calling it multiple times should be avoided where possible.
    One thing I will agree with - it's cleaner to put this callguard inside discoverPartitions and have it be in only one place rather than in other functions that call it. I can make that change.
    (TODO: move partitionsDiscovered check to inside discoverPartitions() )
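The TODO above, moving the call-guard inside discoverPartitions() itself, could be sketched like this. The class, field, and the fake "scan" are hypothetical stand-ins for the real partition-discovery logic; only the guard pattern is the point.

```java
import java.util.ArrayList;
import java.util.List;

public class PartitionDiscoverySketch {
    private boolean partitionsDiscovered = false;
    private final List<String> partitions = new ArrayList<>();
    int scanCount = 0; // exposed only so the sketch can show idempotence

    // Callers may invoke this freely; the costly scan runs at most once
    // because the guard now lives inside the method itself.
    public List<String> discoverPartitions() {
        if (!partitionsDiscovered) {
            scanCount++;
            partitions.add("ds=20110722"); // stand-in for the real directory scan
            partitionsDiscovered = true;
        }
        return partitions;
    }
}
```

With the guard here, the other call sites no longer need to check partitionsDiscovered themselves, which was the cleanup proposed in the reply.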
Looks like this isn't needed anymore.
  1. This is still needed and used. Without getting the jobclient token, we cannot launch the har task.
Why do we depend on this variable? Shouldn't we always do an fs.exists() check?
  1. There are two issues here.
    a) The first is needing to check before a move, and fs.rename will fail in cases where fs.exists would have alerted us. So yes, we do always do a check, and we're good.
    b) The second is an ability to do a check for all the moves that we would do, to see if any of them would have failed, before we even begin trying to move. This is needed especially for the case where we har, because that will not have the luxury of falling back on the add_partitions() failing before we do the move, and so, if there are any existing partitions in the directory, then it's needed that we know before we start trying to copy and fail mid-way. The dryRun flag gives us that ability to pre-check before we begin moving.
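The dryRun pre-check described in (b) can be sketched as a pass over every planned destination before any move begins. This is a hedged illustration using `java.nio.file` in place of Hadoop's `FileSystem` so it runs standalone; the real code would call `fs.exists()` on each target, and the class and method names here are invented.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class MovePrecheckSketch {
    // Returns the first destination that already exists, or null if the
    // whole batch of moves is safe to attempt. Running this before any
    // rename avoids failing mid-way through a multi-partition publish.
    public static Path findCollision(List<Path> destinations) {
        for (Path dest : destinations) {
            if (Files.exists(dest)) {
                return dest;
            }
        }
        return null;
    }
}
```

For the har case this matters most, since there is no add_partitions() failure to fall back on before the copy starts.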
I wonder if we are too early in introducing this interface. There is only one impl, which is always called (in the case of dynamic partitions). I think we should introduce interfaces only if there is a known use case for them. Further, when other use cases do come up, there is a good likelihood that we will need to tweak the interface to accommodate them.
  1. This I agree with, and that's why I didn't attempt to make the usage so generic as to allow other PostProcessors beyond har. The reason it's there at all is because I wanted to keep the har code pluggable in case we want to modify it later. I'm okay with removing the interface and just using the HarPostProcessor itself as is, without the @Overrides.
    (TODO: discuss and potentially remove)
trunk/src/java/org/apache/hcatalog/pig/ (Diff revision 1)
Looks like some of these are not required.
  1. No, as in comments above, we still do use it.
trunk/src/java/org/apache/hcatalog/pig/ (Diff revision 1)
retrieve.. do you want to rename it to getFromUDF..
  1. True. Unnecessary obfuscation. Sorry! :)
In all of these test cases, you should also assert metadata, by retrieving them from metastore and verifying as you are expecting them.
  1. Hmm... the metadata isn't affected by this patch so much as created at all - if a read succeeds, the metadata is in order. That would be tested more by add_partitions() tests than by dynamic publish tests. But yes, you're right that we should add more explicit test coverage; we can open another jira for that - it applies across the board.