Review Board 1.7.22


HIVE-4113: Optimize select count(1) with RCFile and Orc

Review Request #11770 - Created June 9, 2013 and updated

Brock Noland
trunk
HIVE-4113
Reviewers
hive
hive-git
Modifies ColumnProjectionUtils such there are two flags. One for the column ids and one indicating whether all columns should be read. Additionally the patch updates all locations which uses the old method of empty string indicating all columns should be read.

The automatic formatter generated by ant eclipse-files is fairly aggressive so there are some unrelated import/whitespace cleanup.
All unit tests pass with the patch. ColumnProjectionUtils has new unit tests covering it's functionality. Additionally I verified manually the select count(1) from RCFile/Orc resulted in less IO after the change.

Before:

hive> select count(1) from users_orc;
Job 0: Map: 1  Reduce: 1   Cumulative CPU: 17.75 sec   HDFS Read: 28782851 HDFS Write: 9 SUCCESS

hive> select count(1) from users_rc; 
Job 0: Map: 3  Reduce: 1   Cumulative CPU: 23.72 sec   HDFS Read: 825865962 HDFS Write: 9 SUCCESS

After:


hive> select count(1) from users_orc;
Job 0: Map: 1  Reduce: 1   Cumulative CPU: 9.9 sec   HDFS Read: 67325 HDFS Write: 9 SUCCESS

hive> select count(1) from users_rc; 
Job 0: Map: 3  Reduce: 1   Cumulative CPU: 16.96 sec   HDFS Read: 96045618 HDFS Write: 9 SUCCESS
Review request changed
Updated (July 15, 2013, 7:51 p.m.)
Test was missed, included it now.