Hive
  1. Hive
  2. HIVE-4435

Column stats: Distinct value estimator should use hash functions that are pairwise independent

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.10.0, 0.11.0
    • Fix Version/s: 0.12.0
    • Component/s: Statistics
    • Labels:
      None

      Description

      The current implementation of Flajolet-Martin estimator to estimate the number of distinct values doesn't use hash functions that are pairwise independent. This is problematic because the input values don't distribute uniformly. When run on large TPC-H data sets, this leads to a huge discrepancy for primary key columns. Primary key columns are typically a monotonically increasing sequence.

      1. chart_1(1).png
        19 kB
        Shreepadma Venugopalan
      2. HIVE-4435.1.patch
        5 kB
        Shreepadma Venugopalan
      3. HIVE-4435.2.patch
        7 kB
        Shreepadma Venugopalan

        Activity

        Shreepadma Venugopalan created issue -
        Shreepadma Venugopalan made changes -
        Field Original Value New Value
        Status Open [ 1 ] Patch Available [ 10002 ]
        Shreepadma Venugopalan made changes -
        Attachment HIVE-4435.1.patch [ 12581033 ]
        Shreepadma Venugopalan made changes -
        Description The current implementation of Flajolet-Martin estimator to estimate the number of distinct values doesn't use hash functions that are pairwise independent. This is problematic because the input values don't distribute uniformly. When run on large TPC-H data sets, this leads to a huge discrepancy for primary key columns. Primary key columns are typically a monotonically increasing sequence.
        Hide
        Shreepadma Venugopalan added a comment -

        The fix is to use hash functions that are pairwise independent. More on pairwise independence and family of hash functions - http://people.csail.mit.edu/ronitt/COURSE/S12/handouts/lec5.pdf

        Show
        Shreepadma Venugopalan added a comment - The fix is to use hash functions that are pairwise independent. More on pairwise independence and family of hash functions - http://people.csail.mit.edu/ronitt/COURSE/S12/handouts/lec5.pdf
        Hide
        Shreepadma Venugopalan added a comment -
        Show
        Shreepadma Venugopalan added a comment - review board: https://reviews.apache.org/r/10841/
        Shreepadma Venugopalan made changes -
        Attachment chart_1(1).png [ 12581038 ]
        Hide
        Shreepadma Venugopalan added a comment -

        Attached plot of relative error vs. number of distinct values after the fix.
        Dataset: TPC-H of varying sizes up to 10TB
        hive.stats.ndv.error = 5% (standard error for the estimator)
        Column types: String, Long, Double

        Show
        Shreepadma Venugopalan added a comment - Attached plot of relative error vs. number of distinct values after the fix. Dataset: TPC-H of varying sizes up to 10TB hive.stats.ndv.error = 5% (standard error for the estimator) Column types: String, Long, Double
        Hide
        Shreepadma Venugopalan added a comment -

        Can a committer take a look at this?

        Show
        Shreepadma Venugopalan added a comment - Can a committer take a look at this?
        Hide
        Shreepadma Venugopalan added a comment -

        Ping

        Show
        Shreepadma Venugopalan added a comment - Ping
        Hide
        Ashutosh Chauhan added a comment -

        Sorry for the delay. +1 Will commit if tests pass.

        Show
        Ashutosh Chauhan added a comment - Sorry for the delay. +1 Will commit if tests pass.
        Hide
        Shreepadma Venugopalan added a comment -

        Thanks Ashutosh!

        Show
        Shreepadma Venugopalan added a comment - Thanks Ashutosh!
        Hide
        Ashutosh Chauhan added a comment -

        Following tests failed:

        • compute_stats_double.q
        • compute_stats_long.q
        • compute_stats_string.q

        I am assuming since we have better estimates now, we just need to update .q.out files for these. Can you verify and if so can you update the patch with it?

        Show
        Ashutosh Chauhan added a comment - Following tests failed: compute_stats_double.q compute_stats_long.q compute_stats_string.q I am assuming since we have better estimates now, we just need to update .q.out files for these. Can you verify and if so can you update the patch with it?
        Hide
        Ashutosh Chauhan added a comment -

        Canceling patch since current patch is resulting in test failures.

        Show
        Ashutosh Chauhan added a comment - Canceling patch since current patch is resulting in test failures.
        Ashutosh Chauhan made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        Affects Version/s 0.11.0 [ 12323587 ]
        Hide
        Shreepadma Venugopalan added a comment -

        Ashutosh Chauhan: I've updated the .q files in the patches. Thanks!

        Show
        Shreepadma Venugopalan added a comment - Ashutosh Chauhan : I've updated the .q files in the patches. Thanks!
        Shreepadma Venugopalan made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Shreepadma Venugopalan made changes -
        Attachment HIVE-4435.2.patch [ 12586388 ]
        Hide
        Ashutosh Chauhan added a comment -

        Committed to trunk. Thanks, Shreepadma!

        Show
        Ashutosh Chauhan added a comment - Committed to trunk. Thanks, Shreepadma!
        Ashutosh Chauhan made changes -
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Fix Version/s 0.12.0 [ 12324312 ]
        Resolution Fixed [ 1 ]
        Hide
        Hudson added a comment -

        Integrated in Hive-trunk-h0.21 #2132 (See https://builds.apache.org/job/Hive-trunk-h0.21/2132/)
        HIVE-4435 : Column stats: Distinct value estimator should use hash functions that are pairwise independent (Shreepadma Venugopalan via Ashutosh Chauhan) (Revision 1490323)

        Result = FAILURE
        hashutosh : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1490323
        Files :

        • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/NumDistinctValueEstimator.java
        • /hive/trunk/ql/src/test/results/clientpositive/compute_stats_double.q.out
        • /hive/trunk/ql/src/test/results/clientpositive/compute_stats_long.q.out
        • /hive/trunk/ql/src/test/results/clientpositive/compute_stats_string.q.out
        Show
        Hudson added a comment - Integrated in Hive-trunk-h0.21 #2132 (See https://builds.apache.org/job/Hive-trunk-h0.21/2132/ ) HIVE-4435 : Column stats: Distinct value estimator should use hash functions that are pairwise independent (Shreepadma Venugopalan via Ashutosh Chauhan) (Revision 1490323) Result = FAILURE hashutosh : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1490323 Files : /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/NumDistinctValueEstimator.java /hive/trunk/ql/src/test/results/clientpositive/compute_stats_double.q.out /hive/trunk/ql/src/test/results/clientpositive/compute_stats_long.q.out /hive/trunk/ql/src/test/results/clientpositive/compute_stats_string.q.out
        Hide
        Ashutosh Chauhan added a comment -

        This issue has been fixed and released as part of 0.12 release. If you find further issues, please create a new jira and link it to this one.

        Show
        Ashutosh Chauhan added a comment - This issue has been fixed and released as part of 0.12 release. If you find further issues, please create a new jira and link it to this one.
        Ashutosh Chauhan made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Transition Time In Source Status Execution Times Last Executer Last Execution Date
        Patch Available Patch Available Open Open
        36d 17h 41m 1 Ashutosh Chauhan 05/Jun/13 15:16
        Open Open Patch Available Patch Available
        2d 8h 48m 2 Shreepadma Venugopalan 05/Jun/13 22:22
        Patch Available Patch Available Resolved Resolved
        18h 1m 1 Ashutosh Chauhan 06/Jun/13 16:23
        Resolved Resolved Closed Closed
        131d 8h 5m 1 Ashutosh Chauhan 16/Oct/13 00:29

          People

          • Assignee:
            Shreepadma Venugopalan
            Reporter:
            Shreepadma Venugopalan
          • Votes:
            1 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development