HIVE-1362: Support for column statistics in Hive

Review Request #6878 - Created Aug. 31, 2012 and updated

Shreepadma Venugopalan
column-statistics
HIVE-1362
Reviewers
hive
carl
hive-git
This patch implements version 1 of the column statistics project in Hive. It adds support for computing and persisting statistical summaries of column values in Hive tables and partitions. To support column statistics in Hive, this patch does the following:

* Adds a new compute_stats UDAF to compute scalar statistics for all primitive Hive data types. Version 1 of the project supports the following scalar statistics on primitive types: an estimate of the number of distinct values; the number of null values; the number of trues and falses for boolean-typed columns; the maximum and average length for string- and binary-typed columns; and the maximum and minimum value for long- and double-typed columns. Note that version 1 of the column stats project includes support for column statistics at both the table and partition level.

* Adds Metastore schema tables to persist the newly added statistics at both the table and partition level.
* Adds a Metastore Thrift API to persist, retrieve, and delete column statistics at both the table and partition level.
Please refer to the following wiki link for the details of the schema and the Thrift API changes - https://cwiki.apache.org/confluence/display/Hive/Column+Statistics+in+Hive

* Extends the analyze table compute statistics statement to trigger statistics computation and persistence for one or more columns. Please note that statistics for multiple columns are computed through a single scan of the table data. Please refer to the following wiki link for the syntax changes - https://cwiki.apache.org/confluence/display/Hive/Column+Statistics+in+Hive
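
To make the single-scan point concrete, here is a minimal Java sketch of per-column accumulators that are all updated in one pass over the rows. It is not the patch's compute_stats implementation: the class names, the example rows, the exact distinct-value set (the patch describes an estimate), and the choice to average string lengths over non-null values only are all assumptions made for illustration.

```java
import java.util.HashSet;
import java.util.Set;

public class ColumnStatsSketch {

    /** Scalar summary for a long-typed column (hypothetical class, not from the patch). */
    static final class LongColumnStats {
        long numNulls = 0;
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        // Exact distinct set, for illustration only; the patch describes an estimate.
        final Set<Long> distinct = new HashSet<Long>();

        void update(Long value) {
            if (value == null) {
                numNulls++;
                return;
            }
            min = Math.min(min, value);
            max = Math.max(max, value);
            distinct.add(value);
        }
    }

    /** Scalar summary for a string-typed column (hypothetical class). */
    static final class StringColumnStats {
        long numNulls = 0;
        long numNonNulls = 0;
        long maxLength = 0;
        long totalLength = 0;

        void update(String value) {
            if (value == null) {
                numNulls++;
                return;
            }
            numNonNulls++;
            totalLength += value.length();
            maxLength = Math.max(maxLength, value.length());
        }

        // Assumption: the average is taken over non-null values only.
        double avgLength() {
            return numNonNulls == 0 ? 0.0 : (double) totalLength / numNonNulls;
        }
    }

    /** Scalar summary for a boolean-typed column (hypothetical class). */
    static final class BooleanColumnStats {
        long numNulls = 0;
        long numTrues = 0;
        long numFalses = 0;

        void update(Boolean value) {
            if (value == null) {
                numNulls++;
            } else if (value) {
                numTrues++;
            } else {
                numFalses++;
            }
        }
    }

    public static void main(String[] args) {
        // Hypothetical rows with columns (id, name, active).
        Object[][] rows = {
            {1L, "alice", Boolean.TRUE},
            {2L, null, Boolean.FALSE},
            {null, "bob", null},
            {2L, "carol", Boolean.TRUE},
        };

        LongColumnStats ids = new LongColumnStats();
        StringColumnStats names = new StringColumnStats();
        BooleanColumnStats active = new BooleanColumnStats();

        // One scan of the data updates the accumulators of every column.
        for (Object[] row : rows) {
            ids.update((Long) row[0]);
            names.update((String) row[1]);
            active.update((Boolean) row[2]);
        }

        System.out.println("id: nulls=" + ids.numNulls + " min=" + ids.min
            + " max=" + ids.max + " distinct=" + ids.distinct.size());
        System.out.println("name: nulls=" + names.numNulls + " maxLen=" + names.maxLength
            + " avgLen=" + names.avgLength());
        System.out.println("active: nulls=" + active.numNulls + " trues=" + active.numTrues
            + " falses=" + active.numFalses);
    }
}
```

Keeping one accumulator per column means that adding another column to the analyze statement adds work per row but no extra pass over the data.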

One thing missing from the patch at this point is the metastore upgrade scripts for MySQL/Derby/Postgres/Oracle. I'm waiting for the review to finalize the metastore schema changes before I go ahead and add the upgrade scripts.

In a follow-on patch, as part of version 2 of the column statistics project, we will add support for computing, persisting, and retrieving histograms on long- and double-typed column values.

Generated Thrift files have been removed from this diff to make it easier to review. The JIRA page has the patch with the generated Thrift files.
All the existing Hive tests pass. Additionally, this patch adds the following unit tests:

* Tests in TestHiveMetaStore.java to test the Metastore schema and Thrift API changes.
* Tests to exercise the compute_stats UDAF for all primitive types.
* End-to-end tests at both the table and partition level for computing stats on multiple columns. Note that these tests use the extended syntax of the analyze command.

Total: 32, Open: 32, Resolved: 0, Dropped: 0
Description From Last Updated Status
This javadoc should go in IMetaStoreClient, and in place of it here we should use "/** {@inheritDoc} */" Carl Steinbach Sept. 13, 2012, 9:58 p.m. Open
Does it make sense to add a Thrift API for updating statistics? There doesn't exist an interface for updating ... namit jain Oct. 3, 2012, 11:49 a.m. Open
Can you use the full variable name instead of Rwt? namit jain Oct. 3, 2012, 11:49 a.m. Open
LHS should not be an ArrayList. Please fix all such occurrences. namit jain Oct. 3, 2012, 11:49 a.m. Open
I'll replace LHS with generic java types. Shreepadma Venugopalan Oct. 3, 2012, 4:41 p.m. Open
Will remove this change to MapRedTask.java. Sorry about this. Shreepadma Venugopalan Oct. 3, 2012, 6:51 p.m. Open
It looks like most of the classes that extend BaseSemanticAnalyzer are overriding init() with a NoOp method. If that's the ... Carl Steinbach Oct. 6, 2012, 1:12 a.m. Open
Please change this back. Static imports decrease noise, and it's easy to figure out where a token is defined using ... Carl Steinbach Oct. 6, 2012, 1:12 a.m. Open
Please change this back. Carl Steinbach Oct. 6, 2012, 1:12 a.m. Open
s/setups/sets up/ Carl Steinbach Oct. 6, 2012, 1:12 a.m. Open
StatsSemanticAnalyzer should catch these exceptions and convert them to the proper SemanticException before rethrowing them. Carl Steinbach Oct. 6, 2012, 1:12 a.m. Open
Might be a good idea to add a comment explaining that the table stats are implemented elsewhere. Carl Steinbach Oct. 6, 2012, 1:12 a.m. Open
s/Lvl/Level/ Carl Steinbach Oct. 6, 2012, 1:12 a.m. Open
Formatting Carl Steinbach Oct. 6, 2012, 1:12 a.m. Open
Formatting Carl Steinbach Oct. 6, 2012, 1:12 a.m. Open
Formatting Carl Steinbach Oct. 6, 2012, 1:12 a.m. Open
Please use a StringBuilder instead of doing lots of String concatenation. Carl Steinbach Oct. 6, 2012, 1:12 a.m. Open
Formatting Carl Steinbach Oct. 6, 2012, 1:12 a.m. Open
StringBuilder. Carl Steinbach Oct. 6, 2012, 1:12 a.m. Open
Done. Shreepadma Venugopalan Oct. 22, 2012, 6:27 a.m. Open
Done. Shreepadma Venugopalan Oct. 22, 2012, 6:27 a.m. Open
Done Shreepadma Venugopalan Oct. 22, 2012, 6:27 a.m. Open
Done Shreepadma Venugopalan Oct. 22, 2012, 6:27 a.m. Open
Done Shreepadma Venugopalan Oct. 22, 2012, 6:27 a.m. Open
Done Shreepadma Venugopalan Oct. 22, 2012, 6:27 a.m. Open
Done Shreepadma Venugopalan Oct. 22, 2012, 6:27 a.m. Open
Done Shreepadma Venugopalan Oct. 22, 2012, 6:27 a.m. Open
Done Shreepadma Venugopalan Oct. 22, 2012, 6:27 a.m. Open
Done Shreepadma Venugopalan Oct. 22, 2012, 6:27 a.m. Open
Done Shreepadma Venugopalan Oct. 22, 2012, 6:27 a.m. Open
Done Shreepadma Venugopalan Oct. 22, 2012, 6:27 a.m. Open
Done Shreepadma Venugopalan Oct. 22, 2012, 6:27 a.m. Open
Review request changed
Updated (Oct. 30, 2012, 6:39 p.m.)
Fixes the lint problems from the previous revision.