Apache Drill / DRILL-5816

Hash function produces skewed results on String values with same leading prefix

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.12.0
    • Component/s: None
    • Labels:

      Description

      Reported by Aman Sinha

      Hashing string values for the hash exchange can produce substantial skew when the strings share the same leading prefix.
      Here is the sample data (note that all strings begin with 'mscId=' followed by a numeric value):

      0: jdbc:drill:drillbit=10.10.103.111> select a from dfs.tmp.vv3 limit 20;
      ---------------------
      a
      ---------------------
      mscId=100139170495
      mscId=100103806655
      mscId=100229137840
      mscId=100362859440
      mscId=100032583600
      mscId=100125021360
      mscId=100243775920
      mscId=100152820405
      mscId=100084724405
      mscId=100297398970
      mscId=100059560890
      mscId=100106108090
      mscId=100032092090
      mscId=100029460410
      mscId=100110390995
      mscId=100019105235
      mscId=100354644435
      mscId=100288523475
      mscId=100214507475
      mscId=100296418515

      ---------------------
      20 rows selected (0.33 seconds)

      Here are the hash values produced by the hash function that Drill uses for the HashToRandomExchange (note that they are all even numbers):

      0: jdbc:drill:drillbit=10.10.103.111> select hash32AsDouble(a, 1301011) from dfs.tmp.vv3 limit 20;
      --------------
      EXPR$0
      --------------
      1180062632
      -1322734784
      2096701320
      2075007536
      -1970336592
      1614574192
      1592743936
      -1053691072
      -689805200
      1893061072
      1660328376
      1852126136
      1927731344
      616840056
      -1997249184
      1588717872
      193019624
      880839008
      1879415496
      1726850216

      --------------
      20 rows selected (0.311 seconds)

      Taking these hash values mod 56 produces only one distinct value, which demonstrates the skew:
      0: jdbc:drill:drillbit=10.10.103.111> select distinct mod(hash32AsDouble(a, 1301011), 56) from dfs.tmp.vv3 limit 20;
      ---------
      EXPR$0
      ---------
      0

      ---------
      1 row selected (1.041 seconds)
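
The observation above can be reproduced offline from the 20 hash values listed in this report. The following is a minimal sketch, not Drill code: it assumes a partitioner that assigns each row to bucket `hash % 56`, mirroring the `mod(..., 56)` query. How the HashToRandomExchange actually maps hash values to receivers is not shown in this report, and `bucket_counts` is a hypothetical helper introduced only for illustration.

```python
from collections import Counter

# Hash values copied from the hash32AsDouble(a, 1301011) output in this report.
SAMPLE_HASHES = [
    1180062632, -1322734784, 2096701320, 2075007536, -1970336592,
    1614574192, 1592743936, -1053691072, -689805200, 1893061072,
    1660328376, 1852126136, 1927731344, 616840056, -1997249184,
    1588717872, 193019624, 880839008, 1879415496, 1726850216,
]

NUM_BUCKETS = 56  # modulus used in the query above


def bucket_counts(hashes, num_buckets):
    """Count how many hash values land in each bucket under hash % num_buckets.

    Hypothetical helper: modulo assignment is an assumption made for this sketch.
    """
    return Counter(h % num_buckets for h in hashes)


counts = bucket_counts(SAMPLE_HASHES, NUM_BUCKETS)
all_even = all(h % 2 == 0 for h in SAMPLE_HASHES)

print(f"all even: {all_even}")
print(f"buckets used: {len(counts)} of {NUM_BUCKETS}")
print(f"rows per bucket: {dict(counts)}")
```

For these 20 values, every hash is even and all of them fall into bucket 0, so under modulo assignment a single bucket would receive every row while the other 55 stay empty.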

              People

              • Assignee: Sorabh Hamirwasia (shamirwasia)
              • Reporter: Sorabh Hamirwasia (shamirwasia)
              • Reviewer: Aman Sinha
              • Votes: 0
              • Watchers: 3