Review Board 1.7.22


SchemaTuple in Pig

Review Request #4651 - Created April 5, 2012 and updated

Jonathan Coveney
PIG-2632
Reviewers
pig
julien
pig
This work builds on Dmitriy's PrimitiveTuple work. The idea is that, knowing the Schema on the frontend, we can code-generate Tuples which can be used for fun and profit. In rudimentary tests, memory efficiency is 2-4x better, and the serialized form is ~15% smaller (though this depends heavily on the data). I still need to run get/set tests, but assuming performance is on par with (or even faster than) Tuple, the memory gain is huge.

Need to clean up the code and add tests.

Right now, it generates a SchemaTuple for every inputSchema and outputSchema given to UDFs. The next step is to make a SchemaBag, where I think the serialization savings will be really huge.

Needs tests and comments, but I want the code to settle a bit.
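
To make the idea concrete, here is a minimal sketch of the kind of class a generator might emit for a schema such as (id:int, score:double), next to a generic tuple that boxes every field. The names GenericTuple and IntDoubleTuple, the type-tag values, and the wire layout are all hypothetical and are not taken from the patch. The sketch also ignores null fields. It only illustrates why unboxed fields and a schema-aware writer can save space.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class SchemaTupleSketch {

    /** Generic tuple: fields are boxed Objects, and each is written with a type tag. */
    static class GenericTuple {
        private final List<Object> fields = new ArrayList<>();

        void append(Object value) { fields.add(value); }

        void write(DataOutputStream out) throws IOException {
            out.writeInt(fields.size());              // field count
            for (Object f : fields) {
                if (f instanceof Integer) {
                    out.writeByte(1);                 // hypothetical tag: int
                    out.writeInt((Integer) f);
                } else if (f instanceof Double) {
                    out.writeByte(2);                 // hypothetical tag: double
                    out.writeDouble((Double) f);
                } else {
                    throw new IOException("type not handled in this sketch");
                }
            }
        }
    }

    /** Hypothetical generated class for the schema (id:int, score:double). */
    static class IntDoubleTuple {
        private final int id;        // stored unboxed
        private final double score;  // stored unboxed

        IntDoubleTuple(int id, double score) { this.id = id; this.score = score; }

        void write(DataOutputStream out) throws IOException {
            // The schema fixes the field count and types, so no count or tags are written.
            out.writeInt(id);
            out.writeDouble(score);
        }
    }

    /** Serialized size in bytes of the generic representation. */
    static int genericSize(int id, double score) {
        GenericTuple t = new GenericTuple();
        t.append(id);
        t.append(score);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            t.write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    /** Serialized size in bytes of the schema-specific representation. */
    static int schemaSize(int id, double score) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            new IntDoubleTuple(id, score).write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    public static void main(String[] args) {
        System.out.println("generic: " + genericSize(7, 1.5) + " bytes");
        System.out.println("schema:  " + schemaSize(7, 1.5) + " bytes");
    }
}
```

In this toy layout the generic form takes 18 bytes and the schema-specific form 12. The ratio is an artifact of the sketch and says nothing about the numbers measured for the actual patch.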

 

Changes between revision 1 and 8


  1. trunk/src/docs/src/documentation/content/xdocs/perf.xml
  2. trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java
  3. trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MRCompiler.java
  4. trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigGenericMapBase.java
  5. trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigGenericMapReduce.java
  6. trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigTupleDefaultRawComparator.java
  7. trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/PhysicalOperator.java
  8. trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java
  9. trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POMergeJoin.java
  10. trunk/src/org/apache/pig/data/AppendableSchemaTuple.java
  11. trunk/src/org/apache/pig/data/BinInterSedes.java
  12. trunk/src/org/apache/pig/data/BinSedesTupleFactory.java
  13. trunk/src/org/apache/pig/data/DataByteArray.java
  14. trunk/src/org/apache/pig/data/FieldIsNullException.java
  15. trunk/src/org/apache/pig/data/PBooleanTuple.java
  16. trunk/src/org/apache/pig/data/PDoubleTuple.java
  17. trunk/src/org/apache/pig/data/PFloatTuple.java
  18. trunk/src/org/apache/pig/data/PIntTuple.java
  19. trunk/src/org/apache/pig/data/PLongTuple.java
  20. trunk/src/org/apache/pig/data/PStringTuple.java
This diff has been split across 2 pages.
trunk/src/docs/src/documentation/content/xdocs/perf.xml
--- Revision 1351931
+++ New Change
@@ -1093,12 +1093,12 @@
 This provides a significant performance improvement compared to passing all of the data through
 unneeded sort and shuffle phases.
 </p>
 
 <p>
-Pig has implemented a merge join algorithm, or sort-merge join, although in this case the sort is already
-assumed to have been done (see the Conditions, below).
+Pig has implemented a merge join algorithm, or sort-merge join. It works on pre-sorted data, and does not
+sort data for you. See Conditions, below, for restrictions that apply when using this join algorithm.
 
 Pig implements the merge join algorithm by selecting the left input of the join to be the input file for the map phase,
 and the right input of the join to be the side file. It then samples records from the right input to build an
  index that contains, for each sampled record, the key(s) the filename and the offset into the file the record
  begins at. This sampling is done in the first MapReduce job. A second MapReduce job is then initiated,
@@ -1117,20 +1117,20 @@
 <section>
 <title>Conditions</title>
 <p><strong>Condition A</strong></p>
 <p>Inner merge join (between two tables) will only work under these conditions: </p>
 <ul>
-<li>Between the load of the sorted input and the merge join statement there can only be filter statements and
-foreach statement where the foreach statement should meet the following conditions:
+<li>Data must come directly from either a Load or an Order statement.
+<li>There may be filter statements and foreach statements between the sorted data source and the join statement. The foreach statement should meet the following conditions:
 <ul>
 <li>There should be no UDFs in the foreach statement. </li>
 <li>The foreach statement should not change the position of the join keys. </li>
 <li>There should be no transformation on the join keys which will change the sort order. </li>
 </ul>
 </li>
 <li>Data must be sorted on join keys in ascending (ASC) order on both sides.</li>
-<li>Right-side loader must implement either the {OrderedLoadFunc} interface or {IndexableLoadFunc} interface.</li>
+<li>If sort is provided by the loader, rather than an explicit Order operation, the right-side loader must implement either the {OrderedLoadFunc} interface or {IndexableLoadFunc} interface.</li>
 <li>Type information must be provided for the join key in the schema.</li>
 </ul>
 <p></p>
 <p>The PigStorage loader satisfies all of these conditions.</p>
 <p></p>
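
For readers unfamiliar with the algorithm the perf.xml text describes, the following is a rough sketch of a sort-merge join over two in-memory lists that are already sorted by key in ascending order. It is not POMergeJoin and does not model the sampled index over the right input. The class name MergeJoinSketch and the {key, value} int-pair row format are assumptions made only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class MergeJoinSketch {

    /**
     * Inner-joins two lists of {key, value} rows. Both inputs must already be
     * sorted by key in ascending order; the method does not sort them.
     * Each output row is {key, leftValue, rightValue}.
     */
    static List<int[]> mergeJoin(List<int[]> left, List<int[]> right) {
        List<int[]> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < left.size() && j < right.size()) {
            int leftKey = left.get(i)[0];
            int rightKey = right.get(j)[0];
            if (leftKey < rightKey) {
                i++;                       // left row has no match yet, advance left
            } else if (leftKey > rightKey) {
                j++;                       // right row has no match yet, advance right
            } else {
                // Find the end of the run of right rows sharing this key.
                int runEnd = j;
                while (runEnd < right.size() && right.get(runEnd)[0] == leftKey) {
                    runEnd++;
                }
                // Pair every left row with this key against the whole right run.
                while (i < left.size() && left.get(i)[0] == leftKey) {
                    for (int k = j; k < runEnd; k++) {
                        out.add(new int[] {leftKey, left.get(i)[1], right.get(k)[1]});
                    }
                    i++;
                }
                j = runEnd;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<int[]> left = Arrays.asList(
                new int[] {1, 10}, new int[] {2, 20}, new int[] {2, 21}, new int[] {4, 40});
        List<int[]> right = Arrays.asList(
                new int[] {2, 200}, new int[] {3, 300}, new int[] {4, 400}, new int[] {4, 401});
        for (int[] row : mergeJoin(left, right)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Because each input is scanned once from front to back, the join needs no sort or shuffle of its own, which is why the conditions in the diff insist that both sides arrive sorted on the join keys in ascending order.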