

HIVE-4732 Reduce or eliminate the expensive Schema equals() check for AvroSerde

Review Request #12480 - Created July 11, 2013 and updated

Submitter: Mohammad Islam
Branch: trunk
Bugs: HIVE-4732
Reviewers: hive (group); ashutoshc, jghoman (people)
Repository: hive-git
From our performance analysis, we found that AvroSerde's schema.equals() call consumed a substantial share (nearly 40%) of the time. This patch minimizes the number of schema.equals() calls by deferring the check as late as possible and performing it as few times as possible.

First, we added a unique ID to each record reader, which is then included in every AvroGenericRecordWritable. Then we introduced two new data structures (a HashSet and a HashMap) to store intermediate data and avoid duplicate checks. The HashSet contains the IDs of all record readers that don't need any re-encoding. The HashMap holds the re-encoders that have already been used; it works as a cache and allows re-encoders to be reused. With this change, our tests show a nearly 40% reduction in Avro record reading time.
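The caching scheme described above can be sketched roughly as follows. This is a minimal illustration, not the code from the patch: the class name `ReencoderCache`, the method `lookup`, the generic stand-ins `S` (schema) and `R` (re-encoder), and the use of `java.util.UUID` as the reader ID are all assumptions made for the example. The point is only that the expensive equals() check runs once per record reader instead of once per record.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.UUID;
import java.util.function.Function;

// Hypothetical sketch: S stands in for an Avro Schema, R for a re-encoder.
public class ReencoderCache<S, R> {
    private final S targetSchema;
    private final Function<S, R> reencoderFactory;

    // IDs of record readers whose schema already matches the target schema.
    private final Set<UUID> noEncodingNeeded = new HashSet<>();
    // Re-encoders that were already built, keyed by record reader ID.
    private final Map<UUID, R> reencoders = new HashMap<>();

    // Counts how often the expensive schema comparison actually runs.
    private int equalsCalls = 0;

    public ReencoderCache(S targetSchema, Function<S, R> reencoderFactory) {
        this.targetSchema = targetSchema;
        this.reencoderFactory = reencoderFactory;
    }

    /**
     * Returns the re-encoder to use for a record produced by the given
     * reader, or null if the record can be used as is.
     */
    public R lookup(UUID readerId, S fileSchema) {
        if (noEncodingNeeded.contains(readerId)) {
            return null; // known match: skip equals() entirely
        }
        R cached = reencoders.get(readerId);
        if (cached != null) {
            return cached; // known mismatch: reuse the re-encoder
        }
        // First record from this reader: pay for the comparison once.
        equalsCalls++;
        if (fileSchema.equals(targetSchema)) {
            noEncodingNeeded.add(readerId);
            return null;
        }
        R created = reencoderFactory.apply(fileSchema);
        reencoders.put(readerId, created);
        return created;
    }

    public int equalsCalls() {
        return equalsCalls;
    }
}
```

In this sketch, reading any number of records from the same reader triggers at most one schema comparison, because every later record hits either the set or the map first.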
 
   

 
Issues: Total: 1 | Open: 1 | Resolved: 0 | Dropped: 0
Description: And this would indicate a bug.
From: Jakob Homan
Last Updated: Aug. 26, 2013, 5:35 a.m.
Status: Open
Review request changed
Updated (Aug. 30, 2013, 6:49 p.m.)
Updated with Jakob's comments
Ship it!
Posted (Sept. 12, 2013, 10:33 p.m.)
Ship It!