Hive / HIVE-23530

Use SQL functions instead of compute_stats UDAF to compute column statistics

Details

    Description

      Currently we compute column statistics by relying on the compute_stats UDAF. For instance, for a given table tbl, the query to compute statistics for columns is translated internally into:

      SELECT compute_stats(c1),
             compute_stats(c2),
             ...
      FROM tbl;
      

      compute_stats returns a struct holding the statistics available for the column's type, e.g., struct<"max":long,"min":long,"countnulls":long,...>.

      This issue is to produce a query that relies purely on SQL functions instead:

      SELECT max(c1), min(c1), count(case when c1 is null then 1 else null end),
             ...
      FROM tbl;
      

      This will allow us to deprecate the compute_stats UDAF, since it mostly duplicates functionality already provided by those SQL functions. Additionally, many of those functions already have a vectorized implementation, so this approach can potentially improve the performance of column stats collection.
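As a rough illustration of the SQL-only query described above, the following sketch expands it for a hypothetical table tbl with one numeric column c1 and one string column c2. The column aliases, the string-column statistics, and the use of count(DISTINCT ...) for the distinct-value count are assumptions made for this example; the issue does not specify which functions the final implementation uses for each statistic.

```sql
-- Sketch only: tbl, c1 (numeric) and c2 (string) are hypothetical.
-- Aliases are illustrative and are not names used by Hive internally.
SELECT
  -- numeric column c1
  max(c1)                                          AS c1_max,
  min(c1)                                          AS c1_min,
  count(CASE WHEN c1 IS NULL THEN 1 ELSE NULL END) AS c1_countnulls,
  -- Assumption: exact distinct count; an approximate estimator may be
  -- preferable on large tables.
  count(DISTINCT c1)                               AS c1_ndv,
  -- string column c2 (assumed statistics: max/avg length, null count)
  max(length(c2))                                  AS c2_maxlength,
  avg(length(c2))                                  AS c2_avglength,
  count(CASE WHEN c2 IS NULL THEN 1 ELSE NULL END) AS c2_countnulls
FROM tbl;
```

Each statistic becomes its own output column instead of a field inside a per-column struct, so whatever consumes the result would need to map columns back to statistics by position or alias. The null count could equally be written as count(*) - count(c1), since count(c1) skips NULL values.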

      Attachments

        1. HIVE-23530.01.patch
          249 kB
          jcamachorodriguez
        2. HIVE-23530.02.patch
          5.46 MB
          jcamachorodriguez
        3. HIVE-23530.03.patch
          5.46 MB
          jcamachorodriguez
        4. HIVE-23530.04.patch
          5.43 MB
          jcamachorodriguez
        5. HIVE-23530.05.patch
          5.43 MB
          jcamachorodriguez
        6. HIVE-23530.patch
          76 kB
          jcamachorodriguez

            People

              Assignee: jcamacho Jesús Camacho Rodríguez
              Reporter: jcamacho Jesús Camacho Rodríguez
              Votes: 0
              Watchers: 4

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 2h