As a first step, I would like to take a closer look at collecting metadata at the column level. Issue HIVE-33 proposes five statistics as column metadata: the number of distinct values, the number of null values, the 3 minimum values, the 3 maximum values, and the average size of the column. As a reference, I would use the implementation of the table/partition metadata collection.
As far as I can tell, deriving histograms is somewhat more complex than collecting these column statistics, which is why I want to start with the latter.
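To make the five statistics concrete, here is a rough single-pass sketch in Python. It is not Hive code: `ColumnStats` and `collect_column_stats` are hypothetical names, and it assumes string-typed values, that the 3 min/max values are taken over distinct values, and that "size" means length in characters. It also keeps an exact set of distinct values, which a real implementation on large tables would probably have to replace with an approximation.

```python
import heapq
from dataclasses import dataclass, field
from typing import Iterable, List, Optional


@dataclass
class ColumnStats:
    """Hypothetical container for the five column-level statistics."""
    num_distinct: int = 0
    num_nulls: int = 0
    min_values: List[str] = field(default_factory=list)  # k smallest distinct values
    max_values: List[str] = field(default_factory=list)  # k largest distinct values
    avg_size: float = 0.0  # average length of the non-null values


def collect_column_stats(values: Iterable[Optional[str]], k: int = 3) -> ColumnStats:
    """Scan one column once and derive the five statistics (exact, in memory)."""
    distinct = set()
    nulls = 0
    total_size = 0
    non_null = 0
    for v in values:
        if v is None:
            nulls += 1
            continue
        distinct.add(v)
        total_size += len(v)
        non_null += 1
    return ColumnStats(
        num_distinct=len(distinct),
        num_nulls=nulls,
        min_values=heapq.nsmallest(k, distinct),
        max_values=heapq.nlargest(k, distinct),
        # Guard against a column that is empty or entirely null.
        avg_size=total_size / non_null if non_null else 0.0,
    )


stats = collect_column_stats(["b", None, "a", "ccc", "a", "dd", None])
```

All five values come out of the same scan, which suggests they could be gathered alongside the existing table/partition statistics rather than in a separate job.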
Is there an up-to-date MetaStore DDL script or an E/R model?