
PIG-3642 Direct HDFS access for small jobs (fetch)

Review Request #16507 - Created Dec. 29, 2013 and updated

Lorand Bendig
With this patch I'd like to add the ability to read data directly from HDFS instead of launching MR jobs for simple (map-only) tasks. Hive already has this feature (fetch), and the patch shares some similarities with the local mode of Pig 0.6. Fetch is used when all of the following hold for a script:

    it contains only LIMIT, FILTER, UNION (if no split is generated), STREAM, (nested) FOREACH with expression operators, custom UDFs, etc.
    no scalar aliases
    no SampleLoader
    single leaf job
    DUMP (no STORE)
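
The conditions above can be read as a single predicate over a script. The following is a minimal sketch of that reading, not the patch's actual code: FetchEligibilitySketch, Op, ScriptSummary and isFetchable are hypothetical names invented for illustration, and LOAD is assumed to be allowed because every script has to read its input.

```java
import java.util.Arrays;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: none of these types exist in Pig.
class FetchEligibilitySketch {

    // Simplified stand-ins for Pig operators.
    enum Op { LOAD, LIMIT, FILTER, UNION, STREAM, FOREACH, GROUP, JOIN, ORDER }

    // Operators the description lists as fetch-friendly (LOAD is an assumption).
    static final Set<Op> FETCHABLE_OPS =
            EnumSet.of(Op.LOAD, Op.LIMIT, Op.FILTER, Op.UNION, Op.STREAM, Op.FOREACH);

    // Hypothetical summary of the properties the description mentions.
    static final class ScriptSummary {
        final List<Op> operators;
        final boolean generatesSplit;
        final boolean usesScalarAlias;
        final boolean usesSampleLoader;
        final int leafJobCount;
        final boolean endsWithDump;

        ScriptSummary(List<Op> operators, boolean generatesSplit, boolean usesScalarAlias,
                      boolean usesSampleLoader, int leafJobCount, boolean endsWithDump) {
            this.operators = operators;
            this.generatesSplit = generatesSplit;
            this.usesScalarAlias = usesScalarAlias;
            this.usesSampleLoader = usesSampleLoader;
            this.leafJobCount = leafJobCount;
            this.endsWithDump = endsWithDump;
        }
    }

    // True only if fetch is enabled and every listed condition holds.
    static boolean isFetchable(boolean fetchEnabled, ScriptSummary s) {
        return fetchEnabled
                && FETCHABLE_OPS.containsAll(s.operators)
                && !s.generatesSplit
                && !s.usesScalarAlias
                && !s.usesSampleLoader
                && s.leafJobCount == 1
                && s.endsWithDump;
    }

    public static void main(String[] args) {
        ScriptSummary simple = new ScriptSummary(
                Arrays.asList(Op.LOAD, Op.FILTER, Op.LIMIT), false, false, false, 1, true);
        ScriptSummary grouped = new ScriptSummary(
                Arrays.asList(Op.LOAD, Op.GROUP, Op.FOREACH), false, false, false, 1, true);
        System.out.println("simple fetchable:  " + isFetchable(true, simple));
        System.out.println("grouped fetchable: " + isFetchable(true, grouped));
    }
}
```

In this sketch a script with a GROUP is rejected because GROUP is not in the allowed set, which matches the map-only restriction in the description.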

The feature is enabled by default and can be toggled with:

    -N or -no_fetch
    set opt.fetch true/false;

There's no STORE support because I wanted to make it explicit that this "optimization" is meant for running small, simple scripts during development, rather than for querying and filtering a large number of rows on the client machine. However, a threshold on the (estimated) input size could be used to decide whether to prefer fetch over MR jobs, similar to what Hive's 'hive.fetch.task.conversion.threshold' does (perhaps through Pig's LoadMetadata#getStatistics?).
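
The threshold idea could look roughly like the sketch below. It is only an illustration of the proposal, not part of the patch: FetchThresholdSketch, Mode and chooseMode are hypothetical names, and the convention that a negative size means "no estimate available" is an assumption made for this example.

```java
// Hypothetical sketch: not Pig or Hive code.
class FetchThresholdSketch {

    enum Mode { FETCH, MAPREDUCE }

    // Assumed convention: a negative estimate means the loader could not
    // report an input size, so the safe choice is a regular MR job.
    static Mode chooseMode(long estimatedInputBytes, long thresholdBytes) {
        if (estimatedInputBytes < 0) {
            return Mode.MAPREDUCE;
        }
        return estimatedInputBytes <= thresholdBytes ? Mode.FETCH : Mode.MAPREDUCE;
    }

    public static void main(String[] args) {
        long threshold = 64L * 1024 * 1024; // illustrative 64 MB limit
        System.out.println(chooseMode(10L * 1024 * 1024, threshold));
        System.out.println(chooseMode(10L * 1024 * 1024 * 1024, threshold));
        System.out.println(chooseMode(-1L, threshold));
    }
}
```

Falling back to MR when the size is unknown keeps the client from accidentally pulling a large input, at the cost of skipping fetch for loaders that report no statistics.
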
- new test case added: TestFetch
- the patch was checked against test-commit and test-core
- because opt.fetch is enabled by default, the test cases used fetch instead of MR jobs wherever possible
Review request changed
Updated (Jan. 3, 2014, 10:57 p.m.)
Updated patch: PIG-3642-3.patch
Ship it!
Posted (Jan. 5, 2014, 12:49 a.m.)
Looks good to me. I will commit it after running unit tests and e2e tests.

I found a minor bug below. Let me fix it when I commit it.
I think a "return" is missing here. Explain still outputs the MR plan even if the plan is fetchable.
  1. This is ugly. Thanks for fixing it!
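
The kind of bug described in the review comment can be illustrated with a small sketch. This is not the Pig source: ExplainSketch, explainBuggy and explainFixed are hypothetical names, and the plan text is a placeholder. It only shows how a missing early return lets control fall through to the MR branch.

```java
// Hypothetical sketch: not the actual Pig explain code.
class ExplainSketch {

    // Buggy shape: the fetch branch does not return, so the MR plan
    // is appended even when the plan is fetchable.
    static String explainBuggy(boolean fetchable) {
        StringBuilder out = new StringBuilder();
        if (fetchable) {
            out.append("fetch plan\n");
            // missing "return" here
        }
        out.append("MR plan\n");
        return out.toString();
    }

    // Fixed shape: return right after printing the fetch plan.
    static String explainFixed(boolean fetchable) {
        StringBuilder out = new StringBuilder();
        if (fetchable) {
            out.append("fetch plan\n");
            return out.toString();
        }
        out.append("MR plan\n");
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(explainBuggy(true));
        System.out.println("---");
        System.out.print(explainFixed(true));
    }
}
```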