Uploaded image for project: 'HBase'
  1. HBase
  2. HBASE-27224

HFile tool statistic sampling produces misleading results

    XMLWordPrintableJSON

Details

    • Hide
      Fixes HFilePrettyPrinter's calculation of min and max size for an HFile so that it will truly be the min and max for the whole file. Previously was based on just a sampling, as with the histograms. Additionally adds a new argument to the tool '-d' which prints detailed range counts for each summary. The range counts give you the exact count of rows/cells that fall within the pre-defined ranges, useful for giving more detailed insight into outliers.
      Show
      Fixes HFilePrettyPrinter's calculation of min and max size for an HFile so that it will truly be the min and max for the whole file. Previously was based on just a sampling, as with the histograms. Additionally adds a new argument to the tool '-d' which prints detailed range counts for each summary. The range counts give you the exact count of rows/cells that fall within the pre-defined ranges, useful for giving more detailed insight into outliers.

    Description

      HFile tool uses codahale metrics for collecting statistics about key/values in an HFile. We recently had a case where the statistics printed out that the max row size was only 25k. This was confusing because I was seeing bucket cache allocation failures for blocks as large as 1.5mb. 

      Digging in, I was able to find the large row using the "-p" argument (which was obviously very verbose). Once I found the row, I saw the vlen was listed as ~1.5mb which made much more sense.

      First thing I notice here is that default codahale metrics histogram is using ExponentiallyDecayingReservoir. This probably makes sense for a long-lived histogram, but the HFile tool is run at a point in time. It might be best to use UniformReservoir instead.

      Secondly, we do not need sampling for min/max. Let's supplement the histogram with our own calculation which is guaranteed to be accurate for the entirety of the file.

      Attachments

        Activity

          People

            bbeaudreault Bryan Beaudreault
            bbeaudreault Bryan Beaudreault
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: