Hive / HIVE-7177

percentile_approx very inaccurate with high multiplicities in the data

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.12.0
    • Fix Version/s: None
    • Component/s: UDF
    • Labels:
      None
    • Environment:

      Redhat 5.10 running Cloudera 5.0.1

      Description

      To reproduce:
      1) Create a table with a single integer column, col_0.
      2) Load the values 1000000, 2000000, 3000000, and 4000000, each repeated 250000 times (1000000 rows in total).
      3) Evaluate percentile_approx(cast(col_0 as double), array(0.33, 0.34), 1000000) over the table.

      Expected results: [2000000.0,2000000.0]. In sorted order, rows 250001 through 500000 all hold 2000000, so both the 0.33 and the 0.34 quantile fall inside that run.

      Actual results: [1280000.0,1320000.0] (I might be off by 40000 here)
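
The arithmetic behind the expected result can be checked with a short Python sketch that works directly on the four (value, count) pairs from the steps above. The second function is only an assumption about how such a discrepancy could arise: it interpolates linearly between adjacent bin centers, treating each distinct value as one histogram bin. Nothing in this report confirms that percentile_approx works this way, and all function names here are hypothetical.

```python
import math

# Data from the report: four distinct values, each repeated 250000 times.
values = [1_000_000, 2_000_000, 3_000_000, 4_000_000]
count_each = 250_000
total = count_each * len(values)


def exact_percentile(q):
    """Nearest-rank percentile computed from the (value, count) pairs."""
    rank = math.ceil(q * total)  # 1-based rank of the requested quantile
    seen = 0
    for v in values:
        seen += count_each
        if seen >= rank:
            return float(v)
    return float(values[-1])


def interpolated_percentile(q):
    """Hypothetical estimator: linear interpolation between bin centers.

    This is an assumed mechanism for illustration, not a confirmed
    description of percentile_approx.
    """
    target = q * total
    below = 0  # rows in bins strictly before the current one
    for i, v in enumerate(values):
        if below + count_each >= target:
            if i == 0:
                return float(v)
            prev = values[i - 1]
            return prev + (target - below) * (v - prev) / count_each
        below += count_each
    return float(values[-1])


exact = [exact_percentile(q) for q in (0.33, 0.34)]
approx = [interpolated_percentile(q) for q in (0.33, 0.34)]
print("exact:       ", exact)
print("interpolated:", approx)
```

Under that assumption the exact quantiles are both 2000000, while the interpolated estimates land near 1320000 and 1360000, which is within the stated 40000 uncertainty of the values reported above.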

            People

            • Assignee:
              Unassigned
            • Reporter:
              Tom Temple (tom.temple)