PIG-1702. Fix for task output logs for streaming jobs containing null input-split information.

Review Request #547 - Created April 4, 2011 and submitted

Adam Warrington
Reviewers
pig
pig
This is a patch for PIG-1702, which describes an issue where the task output logs for PIG streaming jobs contain null input-split information. The ability to query the input-split information through the JobConf went away with the new MR API. We must now obtain a reference to the underlying FileSplit and query that reference for the information.
To test this, I wrote a very simple Python script and streamed data through it using PIG. After checking the logs of the completed task, the stderr logs now contain valid input-split information. Below are the scripts and test data used.

### PIG commands run ###
DEFINE testpy `test.py` SHIP ('test.py');
raw_records = LOAD '/test.txt2'; 
T1 = STREAM raw_records THROUGH testpy;
dump T1;

### test.py ###
#!/usr/bin/python
import sys

cnt = 0
for line in sys.stdin:
    print line.strip() + " " + str(cnt)
    cnt += 1

### contents of /test.txt on hdfs ###
one line
two line
three line
four line
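
For reference, here is a minimal Python 3 sketch of what test.py does to the four sample lines above. It is not part of the patch. The function name `number_lines` and the in-memory `sample` list are assumptions made for illustration; the real script reads from stdin under Python 2.

```python
def number_lines(lines):
    """Mimic test.py: strip each line and append its zero-based counter."""
    return [line.strip() + " " + str(cnt) for cnt, line in enumerate(lines)]


# Hypothetical in-memory copy of the test data listed above.
sample = ["one line\n", "two line\n", "three line\n", "four line\n"]

for out in number_lines(sample):
    print(out)
```

The streamed relation T1 should therefore carry each input line followed by its counter. The input-split information itself is not in this output; it appears in the task's stderr log.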
Review request changed
Updated (May 19, 2011, 4:27 p.m.)
Sigh...I edited this a while back, but didn't publish what I wrote.
Posted (May 31, 2011, 11:10 p.m.)

Referencing PigMapReduce.sJobContext may cause a race condition in local Pig jobs, similar to the one described in PIG-1831. Should a similar fix be applied here, with the context in PigMapReduce kept in thread-local storage?
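
To illustrate the thread-local idea raised in this comment, here is a rough sketch in Python. It is not Pig's code and not the PIG-1831 fix. The names `set_context`, `get_context`, `run_task`, and `run_all` are hypothetical, and `threading.local` stands in for a Java ThreadLocal holding the job context.

```python
import threading

# Hypothetical per-thread holder, standing in for a thread-local job context.
_local = threading.local()


def set_context(ctx):
    """Store a context visible only to the calling thread."""
    _local.ctx = ctx


def get_context():
    """Return the calling thread's context, or None if it never set one."""
    return getattr(_local, "ctx", None)


def run_task(name, results, barrier):
    set_context(name)
    # Wait until every thread has written its context. With one shared
    # static field, later writers would have overwritten earlier ones here.
    barrier.wait()
    results[name] = get_context()


def run_all(names):
    """Run one thread per name and record the context each thread read back."""
    results = {}
    barrier = threading.Barrier(len(names))
    threads = [
        threading.Thread(target=run_task, args=(n, results, barrier))
        for n in names
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Each simulated task reads back only the context it stored itself, even though all tasks run concurrently in one process, as local-mode jobs do.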