Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Description
Reported by amansinha100
Hashing of string values (for the hash exchange) can produce substantial skew for certain kinds of strings that share the same leading prefix.
Here is the sample data (note that all strings begin with 'mscId=' followed by numeric values):
0: jdbc:drill:drillbit=10.10.103.111> select a from dfs.tmp.vv3 limit 20;
---------------------
a |
---------------------
mscId=100139170495 |
mscId=100103806655 |
mscId=100229137840 |
mscId=100362859440 |
mscId=100032583600 |
mscId=100125021360 |
mscId=100243775920 |
mscId=100152820405 |
mscId=100084724405 |
mscId=100297398970 |
mscId=100059560890 |
mscId=100106108090 |
mscId=100032092090 |
mscId=100029460410 |
mscId=100110390995 |
mscId=100019105235 |
mscId=100354644435 |
mscId=100288523475 |
mscId=100214507475 |
mscId=100296418515 |
---------------------
20 rows selected (0.33 seconds)
Here are the hash values produced by the hash function that Drill uses for the HashToRandomExchange (note that they are all even numbers):
0: jdbc:drill:drillbit=10.10.103.111> select hash32AsDouble(a, 1301011) from dfs.tmp.vv3 limit 20;
--------------
EXPR$0 |
--------------
1180062632 |
-1322734784 |
2096701320 |
2075007536 |
-1970336592 |
1614574192 |
1592743936 |
-1053691072 |
-689805200 |
1893061072 |
1660328376 |
1852126136 |
1927731344 |
616840056 |
-1997249184 |
1588717872 |
193019624 |
880839008 |
1879415496 |
1726850216 |
--------------
20 rows selected (0.311 seconds)
Applying mod 56 to the hash values produces only one distinct value (0), which indicates the skew:
0: jdbc:drill:drillbit=10.10.103.111> select distinct mod(hash32AsDouble(a, 1301011), 56) from dfs.tmp.vv3 limit 20;
---------
EXPR$0 |
---------
0 |
---------
1 row selected (1.041 seconds)
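The same check can be reproduced outside Drill on the 20 hash values listed above. The following is a minimal sketch in Python, not Drill code: the helper name `distinct_buckets` is hypothetical, and it assumes the receiver for a row is chosen as the hash value modulo the number of receivers, which is how the mod 56 query above is read here.

```python
from functools import reduce
from math import gcd

# The 20 values copied from the hash32AsDouble(a, 1301011) output above.
HASHES = [
    1180062632, -1322734784, 2096701320, 2075007536, -1970336592,
    1614574192, 1592743936, -1053691072, -689805200, 1893061072,
    1660328376, 1852126136, 1927731344, 616840056, -1997249184,
    1588717872, 193019624, 880839008, 1879415496, 1726850216,
]


def distinct_buckets(values, num_buckets):
    """Return the set of residues obtained from value mod num_buckets.

    Hypothetical helper for illustration. Python's % always returns a
    non-negative result for a positive modulus, which can differ from SQL
    mod on negative inputs, but both agree when the residue is 0.
    """
    return {v % num_buckets for v in values}


# Largest factor shared by every sampled hash value. Any bucket count that
# divides this factor maps all sampled rows to the same bucket.
common_factor = reduce(gcd, (abs(v) for v in HASHES))

if __name__ == "__main__":
    print("all even:", all(v % 2 == 0 for v in HASHES))
    print("distinct buckets mod 56:", distinct_buckets(HASHES, 56))
    print("shared factor of the sample:", common_factor)
```

For this sample, every value is a multiple of 56, so the mod 56 check yields the single bucket 0, matching the query result above. The sketch covers only the 20 sampled rows, not the full table.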