Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.15.0
    • Component/s: None
    • Labels:
      None

      Description

      Profiling looks at a data set and infers characteristics and constraints about the data.

      Some applications:

      • it helps users understand their data,
      • inferred constraints may allow additional optimizations (e.g. a foreign key allows semi-join removal),
      • column statistics help the optimizer estimate the selectivity of filters and joins,
      • joint cardinalities drive the algorithm that chooses which tiles of a lattice to materialize.

Imagine you ran a profiler on a data set with 1 million rows and the columns [orderId, gender, state, zipcode, productId, productName, brand]. Here is some sample output:

      {"type": "distribution", "columns": [], "cardinality": 1000000)},
      {"type": "distribution", "columns": ["gender"], "cardinality": 2, nullCount: 0, values: ["F", "M"])},
      {"type": "distribution", "columns": ["state"], "cardinality": 50)},
      {"type": "distribution", "columns": ["zipcode"], "cardinality": 43000, "nullCount": 0)},
      {"type": "distribution", "columns": ["state", "zipcode"], "cardinality": 43419)},
      {"type": "unique", "columns": ["orderId"]},
      {"type": "fd", "columns": ["productId"], "depend": ["brand", "productName"]},
      

      Note:

      • the cardinality of 0 columns is the count of the data set;
      • "nullCount" and "values" are only present for distributions of 1 column;
      • "nullCount" may be is omitted if 0;
      • "values" is only present if there are N or fewer values
      • "distribution" of 2 or more columns is only output if it is "interesting"; in the case of ["state", "zipcode"] it is interesting because the joint cardinality 43,419 is fewer than the cardinality 999,982 that would be expected if they were uniform and independent;
      • "fd" and "unique" are minimal. For example, we don't output "unique(orderId, productId)" if we have "unique(orderId)".

      Other ideas:

• Some measure of skewness. Does one value occur many more times than others?
• Don't compute joint distributions for the power set of columns; that requires memory exponential in the number of columns. In pass 1 compute single-column distributions; in pass N compute N-column distributions only for the combinations of columns that the previous pass indicates will be interesting (a sketch of this search follows the list).
      • Use HyperLogLog to compute cardinalities.
      • Add low-cardinality columns to joint distributions. Rather than computing cardinality(zipcode, state) compute cardinality(zipcode, state, gender="F") and cardinality(zipcode, state, gender="M"). Because HLL rolls up losslessly, with 2x the memory you can compute 3 results: cardinality(zipcode, state), cardinality(zipcode, gender), cardinality(state, gender).
      • Approximate histograms: approximate quartiles? Or buckets with exact counts in each range?
      • Allow passing previous results into the algorithm. If you know the previous histogram of the orderDate column it is easier to compute its new histogram than if you start from scratch.
• HyperLogLog is inaccurate for low cardinalities, so keep all values until the number of values exceeds a threshold, then transition to buckets (a hybrid exact/approximate counter is sketched after this list).
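
The pass-by-pass search above might look roughly like the following. This is a sketch only: the ComboProfiler interface and the "interesting" test are placeholders, not Calcite's actual API. Pass 1 profiles every single column; each later pass extends only the combinations the previous pass found interesting by one more column, so the search never enumerates the full power set.

    import java.util.*;

    public class CombinationSearch {
      /** Placeholder for whatever profiles one column combination and decides
       *  whether it is interesting (e.g. its joint cardinality is well below
       *  the independence baseline). */
      interface ComboProfiler {
        boolean profileAndTest(SortedSet<String> columns);
      }

      /** Profiles combinations pass by pass and returns the interesting ones. */
      static List<SortedSet<String>> run(List<String> columns,
          ComboProfiler profiler) {
        List<SortedSet<String>> interesting = new ArrayList<>();
        // Pass 1: profile every single column; interesting ones seed pass 2.
        List<SortedSet<String>> previousPass = new ArrayList<>();
        for (String c : columns) {
          SortedSet<String> single = new TreeSet<>(Collections.singleton(c));
          if (profiler.profileAndTest(single)) {
            previousPass.add(single);
            interesting.add(single);
          }
        }
        // Pass N: extend each interesting (N-1)-column combination by one column.
        while (!previousPass.isEmpty()) {
          Set<SortedSet<String>> candidates = new LinkedHashSet<>();
          for (SortedSet<String> combo : previousPass) {
            for (String c : columns) {
              if (!combo.contains(c)) {
                SortedSet<String> extended = new TreeSet<>(combo);
                extended.add(c);
                candidates.add(extended);
              }
            }
          }
          List<SortedSet<String>> thisPass = new ArrayList<>();
          for (SortedSet<String> candidate : candidates) {
            if (profiler.profileAndTest(candidate)) {
              thisPass.add(candidate);
              interesting.add(candidate);
            }
          }
          previousPass = thisPass;
        }
        return interesting;
      }
    }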
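And a minimal sketch of the last bullet combined with the HyperLogLog idea: stay exact for low cardinalities, then cut over to an HLL sketch. The class, method names and threshold are illustrative; the HllSketch calls assume the com.yahoo.datasketches library (the same dependency mentioned in the comments below).

    import com.yahoo.sketches.hll.HllSketch;

    import java.util.HashSet;
    import java.util.Set;

    /** Distinct counter that stays exact for low cardinalities and switches to
     *  a HyperLogLog sketch once a threshold is exceeded. Illustrative only. */
    public class HybridDistinctCounter {
      private static final int THRESHOLD = 1000; // arbitrary cut-over point

      private Set<String> exactValues = new HashSet<>();
      private HllSketch sketch; // null while the count is still exact

      public void add(String value) {
        if (sketch != null) {
          sketch.update(value);
          return;
        }
        exactValues.add(value);
        if (exactValues.size() > THRESHOLD) {
          // Too many distinct values: replay them into an HLL sketch and
          // continue approximately from here on.
          sketch = new HllSketch(14); // lgConfigK = 14, i.e. 2^14 buckets
          for (String v : exactValues) {
            sketch.update(v);
          }
          exactValues = null;
        }
      }

      /** Exact count below the threshold, HLL estimate above it. */
      public double cardinality() {
        return sketch == null ? exactValues.size() : sketch.getEstimate();
      }
    }

Because HLL sketches built over disjoint subsets of rows can be unioned, the per-gender sketches mentioned above can be merged to recover cardinality(zipcode, state) without an extra pass over the data.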

Activity

Julian Hyde added a comment -

Resolved in release 1.15.0 (2017-12-11).
Julian Hyde added a comment -

Fixed in dad58186.

Does not work on JDK 7 (due to com.yahoo.data-sketches requiring JDK 8); applicable tests no-op on JDK 7.
Julian Hyde added a comment -

I spoke about this work at DataWorks Summit: https://www.slideshare.net/julianhyde/data-profiling-with-apache-calcite. I tried to explain the algorithm and motivating cases.
Julian Hyde added a comment -

Calcite stays out of the business of persisting things. (Otherwise it would be a database.) But a system using Calcite could definitely use the profiler to populate statistics tables (or statistics files) and use them in the planner and also to implement LatticeStatisticProvider. When running LatticeTest we could generate a profile on the fly – no need to persist.
Vladimir Sitnikov added a comment -

Julian Hyde, do you mean to persist these statistics and reuse them in query planning?

People

• Assignee: Julian Hyde
• Reporter: Julian Hyde
• Votes: 0
• Watchers: 10
