PIG-3059 Global configurable minimum 'bad record' thresholds

Review Request #8765 - Created Dec. 26, 2012 and updated

Cheolsoo Park
PIG-3059
Reviewers
pig
jadler, jcoveney, sms
pig-git
This patch implements configurable bad-record thresholds, based on work done by Jonathan in PIG-2614.

The changes include:
- Adds two new Pig properties: pig.load.bad.record.threshold and pig.load.bad.record.min.
- Removes 'ignore_bad_files' option from AvroStorage since it's no longer needed.
- Incorporates InputErrorTracker class written by Jonathan in PIG-2614.
- Adds a try-catch block to the nextKeyValue() method in PigRecordReader.
- Adds new test cases to TestAvroStorage for these new properties.
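
To make the intended behavior of the two properties concrete, here is a minimal sketch of a threshold check. It is not the InputErrorTracker class from PIG-2614: the class name BadRecordTracker, its methods, and the exact semantics (the threshold read as a maximum error fraction, the minimum read as the number of errors tolerated before the fraction is enforced, and a strict greater-than comparison) are all assumptions made for illustration.

```java
// Hypothetical sketch only; not the InputErrorTracker class from PIG-2614.
class BadRecordTracker {
    // Assumed meaning of pig.load.bad.record.threshold: maximum tolerated error fraction.
    private final double threshold;
    // Assumed meaning of pig.load.bad.record.min: errors needed before the fraction is enforced.
    private final long minErrors;
    private long records;
    private long errors;

    BadRecordTracker(double threshold, long minErrors) {
        this.threshold = threshold;
        this.minErrors = minErrors;
    }

    // Count one record that was read successfully.
    void recordRead() {
        records++;
    }

    // Count one record that failed to load. Fails the job (here: an unchecked
    // exception) once at least minErrors errors have been seen and the error
    // fraction is strictly above the threshold.
    void recordError(Throwable cause) {
        records++;
        errors++;
        if (errors < minErrors) {
            return; // too few errors so far to judge the rate
        }
        double rate = (double) errors / records;
        if (rate > threshold) {
            throw new IllegalStateException(
                "error rate " + rate + " exceeds threshold " + threshold, cause);
        }
    }

    // Fraction of attempted records that failed; 0.0 before anything is read.
    double errorRate() {
        return records == 0 ? 0.0 : (double) errors / records;
    }
}
```

In a reader, recordError would be called from the catch block around the record read and recordRead after each successful read, so that a small number of bad records is skipped while a high error rate still fails the job.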
ant clean commit-test
ant clean compile-test jar-withouthadoop
cd contrib/piggybank/java
ant clean test -Dtestcase=TestAvroStorage
Review request changed
Updated (Dec. 31, 2012, 1:56 a.m.)
- The error rate is printed as part of job stats.
- The error message is improved: it now prints the location of the bad split that caused the run-time exception.
- InputErrorTracker counts the number of splits instead of records.
- For backward compatibility, ignore_bad_files is not removed. When the ignore_bad_files option is enabled in AvroStorage, it is equivalent to setting pig.load.bad.split.threshold to 1.0.
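
The equivalence between ignore_bad_files and a threshold of 1.0 can be illustrated with a small sketch. The class BadSplitConfig, its method names, the default of 0.0 when the property is unset, and the strict greater-than comparison are assumptions for illustration, not code from the patch; only the property name pig.load.bad.split.threshold comes from the update notes above.

```java
import java.util.Properties;

// Hypothetical sketch only; names and defaults are not taken from the patch.
class BadSplitConfig {
    static final String THRESHOLD_KEY = "pig.load.bad.split.threshold";

    // Effective threshold: 1.0 when ignore_bad_files is enabled, otherwise the
    // configured value. The "0.0" default is an assumption.
    static double effectiveThreshold(Properties props, boolean ignoreBadFiles) {
        if (ignoreBadFiles) {
            return 1.0;
        }
        return Double.parseDouble(props.getProperty(THRESHOLD_KEY, "0.0"));
    }

    // True when the fraction of bad splits is strictly above the threshold.
    static boolean shouldFail(long badSplits, long totalSplits, double threshold) {
        if (totalSplits == 0) {
            return false;
        }
        return (double) badSplits / totalSplits > threshold;
    }
}
```

Under these assumptions, a threshold of 1.0 can never be exceeded, because the fraction of bad splits is at most 1.0. That is why enabling ignore_bad_files behaves like tolerating every bad split.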