Details
Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Description
It would be very useful to have a built-in Summarizer that computes summary information about field lengths, specifically key length, row length, family length, qualifier length, visibility length, and value length. Whatever stats are computed must be computable incrementally. For example, min, max, count, sum, and a log2 histogram can all be computed incrementally, and I think these would be good stats to start with. Count and sum can be used to compute the average. There is an example of computing a log2 histogram in the Summarizer javadoc.
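To illustrate what "incrementally" means here, the following is a minimal sketch in plain Java, not the Accumulo Summarizer API. The class name LengthStats and its methods are hypothetical, and bucketing by floor(log2(length)) with length 0 placed in bucket 0 is an assumption, since the rounding rule is not specified above. The point is that each statistic can be updated one length at a time, and two partial results can be merged without revisiting the data.

```java
/** Hypothetical helper (not part of Accumulo): incremental length statistics. */
final class LengthStats {
  long min = Long.MAX_VALUE; // stays at MAX_VALUE until the first length is added
  long max = 0;
  long sum = 0;
  long count = 0;
  // logHist[i] counts lengths with floor(log2(length)) == i (assumed rounding rule).
  final long[] logHist = new long[32];

  /** Bucket index for a non-negative length; 0 and 1 both map to bucket 0. */
  static int bucket(int length) {
    return length <= 1 ? 0 : 31 - Integer.numberOfLeadingZeros(length);
  }

  /** Folds a single observed length into the running statistics. */
  void add(int length) {
    min = Math.min(min, length);
    max = Math.max(max, length);
    sum += length;
    count++;
    logHist[bucket(length)]++;
  }

  /** Combines another partial result into this one, e.g. stats from another file. */
  void merge(LengthStats other) {
    min = Math.min(min, other.min);
    max = Math.max(max, other.max);
    sum += other.sum;
    count += other.count;
    for (int i = 0; i < logHist.length; i++) {
      logHist[i] += other.logHist[i];
    }
  }

  /** Average derived from sum and count, so it needs no state of its own. */
  double average() {
    return count == 0 ? 0.0 : (double) sum / count;
  }
}
```

Under these assumptions, adding lengths to two separate LengthStats objects and then merging them should give the same result as adding every length to a single object, which is the property that lets the stats be computed incrementally.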
The Summarizer could be named EntryLengthSummarizer and could produce summaries like the following.
count=XXX //do not need to track this per field, it's the same for all
key.min=XXX
key.max=XXX
key.sum=XXX
key.logHist.8=XXX //only output non zero exponents
key.logHist.9=XXX
row.min=XXX
row.max=XXX
row.sum=XXX
row.logHist.7=XXX
row.logHist.8=XXX
row.logHist.10=XXX
family.min=XXX
family.max=XXX
family.sum=XXX
family.logHist.6=XXX
family.logHist.7=XXX
etc...
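The naming scheme above could be produced along these lines. This is only a sketch: SummaryKeys and emit are hypothetical names, and a plain Map stands in for whatever the real Summarizer uses to report statistics. It shows the per-field prefix and the rule of emitting only non-zero histogram exponents.

```java
import java.util.Map;

/** Hypothetical helper (not part of Accumulo): builds the per-field summary keys. */
final class SummaryKeys {
  /**
   * Writes min, max, sum and the non-zero log2 histogram buckets for one field,
   * using keys of the form "field.min" and "field.logHist.N".
   */
  static void emit(String field, long min, long max, long sum, long[] logHist,
      Map<String, Long> out) {
    out.put(field + ".min", min);
    out.put(field + ".max", max);
    out.put(field + ".sum", sum);
    for (int exp = 0; exp < logHist.length; exp++) {
      if (logHist[exp] != 0) { // skip empty buckets to keep the summary small
        out.put(field + ".logHist." + exp, logHist[exp]);
      }
    }
  }
}
```

The shared count would be written once by the caller rather than per field, since every entry contributes exactly one length to each field.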
This new summarizer would be placed in the summarizers package.