  1. Hive
  2. HIVE-11160 Auto-gather column stats
  3. HIVE-18149

Stats: rownum estimation from datasize underestimates in most cases


    • Type: Sub-task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.0.0
    • Component/s: Statistics
    • Labels:


      Row count estimation currently works as follows:

      • The data size is taken from one of the following sources:
        • Basic stats aggregate the "on-heap" sizes of the loaded rows. Other readers can supply a "raw size" estimate; I've checked ORC, and I expect the others do the same, although the API docs are a bit vague about the methods' purpose.
        • If basic-stats-level info is not available, the filesystem-level sum of file sizes is used as the "raw data size", multiplied by the deserialization ratio, which is currently 1.
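The choice between the two data size sources can be sketched as follows. This is a minimal illustration of the fallback described in the list, not Hive's actual code: `estimate_data_size`, its parameters, and the use of `None` to mean "no basic stats" are hypothetical names chosen for this example.

```python
from typing import Optional, Sequence


def estimate_data_size(raw_data_size: Optional[int],
                       file_sizes: Sequence[int],
                       deser_factor: float = 1.0) -> int:
    """Pick the data size used for row count estimation (hypothetical helper).

    raw_data_size: aggregated "raw size" from basic stats, or None if missing.
    file_sizes:    sizes in bytes of the files backing the table or partition.
    deser_factor:  deserialization ratio applied to the on-disk size.
    """
    if raw_data_size is not None:
        # Basic stats are available: use the aggregated raw size directly.
        return raw_data_size
    # Fallback: sum of file sizes, scaled by the deserialization ratio.
    return int(sum(file_sizes) * deser_factor)
```

With the ratio left at 1, the fallback simply returns the on-disk byte count, which is typically smaller than the in-memory size of the same rows.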

      The problem with all of this is that the deserialization factor is 1, and that the row size also counts the object headers of the online (in-memory) objects.

      Example: 20 rows are loaded into a partition in columnstats_partlvl_dp.q.

      After HIVE-18108, the explain for this query estimates the row size of the table as 404 bytes; however, the 20 rows of text total only 169 bytes, so the estimate ends up at 0 rows.
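The arithmetic behind this example, as a rough sketch. It assumes the row count is the data size divided by the estimated row size and truncated to an integer; the function name and parameters are hypothetical, not Hive's API.

```python
def estimate_row_count(data_size: int, avg_row_size: int) -> int:
    """Estimate the number of rows from a data size (hypothetical helper)."""
    if avg_row_size <= 0:
        raise ValueError("avg_row_size must be positive")
    # Integer division: anything smaller than one estimated row becomes 0.
    return data_size // avg_row_size


# Numbers from the example: 169 bytes of text, 404 bytes estimated per row.
print(estimate_row_count(169, 404))  # 0, although 20 rows were loaded
```

Under this assumption, any table whose data size is below one estimated row size is reported as empty, which is the underestimation the issue title describes.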


        1. HIVE-18149.03wip02.patch
          2.61 MB
          Zoltan Haindrich
        2. HIVE-18149.03wip01.patch
          2.61 MB
          Zoltan Haindrich
        3. HIVE-18149.03.patch
          2.64 MB
          Zoltan Haindrich
        4. HIVE-18149.02.patch
          2.60 MB
          Zoltan Haindrich
        5. HIVE-18149.01wip01.patch
          36 kB
          Zoltan Haindrich
        6. HIVE-18149.01.patch
          2.58 MB
          Zoltan Haindrich

              • Assignee:
                kgyrtkirk Zoltan Haindrich
              • Votes:
                0
              • Watchers:
                3


                • Created: