Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Description
Reported by Philip Kromer here:
https://github.com/linkedin/datafu/issues/93
Details reported there by Philip:
---------------------
The Hash UDFs in 'hex' mode do not always return a string of the same length, because BigInteger.toString() omits leading zeros. So amid a stream in which about 94% of the strings have the full length, 1/16th are shorter by one or more characters, 1/256th by two or more, and in the unlikely case that an MD5 hash's value were 124 bits of zeros followed by 4 bits of ones, it would return the one-character string 'f'.
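A minimal Java sketch of the behavior described above, assuming the hex string is produced by converting the digest bytes with BigInteger.toString(16). The class and method names (HexDigestDemo, unpaddedHex, paddedHex) are hypothetical and are not taken from the DataFu source; the padded variant is one possible fix, not necessarily the one that was applied.

```java
import java.math.BigInteger;

class HexDigestDemo {
    // Reproduces the reported behavior: toString(16) drops leading zeros,
    // so the result can be shorter than two hex characters per byte.
    static String unpaddedHex(byte[] digest) {
        return new BigInteger(1, digest).toString(16);
    }

    // One possible fix (an assumption, not the project's confirmed patch):
    // left-pad with '0' up to two hex characters per digest byte.
    static String paddedHex(byte[] digest) {
        String hex = new BigInteger(1, digest).toString(16);
        int width = digest.length * 2;
        StringBuilder sb = new StringBuilder(width);
        for (int i = hex.length(); i < width; i++) {
            sb.append('0');
        }
        return sb.append(hex).toString();
    }

    public static void main(String[] args) {
        // A 128-bit value made of 124 zero bits followed by 4 one bits.
        byte[] digest = new byte[16];
        digest[15] = 0x0f;
        System.out.println(unpaddedHex(digest)); // prints "f"
        System.out.println(paddedHex(digest));   // prints 31 zeros followed by "f"
    }
}
```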
This is surprising behavior, and a trap for those practicing the frequent trick of generating a hash and chopping off just the number of bits you need:
-- returns one-fifteenth, not one-sixteenth, of the input.
sampled_lines = FILTER(FOREACH lines GENERATE MD5(val) AS digest, val) BY (STARTSWITH(digest, 'f'));
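The one-fifteenth figure can be checked by exhaustive counting. The sketch below uses every 16-bit value as a small stand-in for a 128-bit MD5 digest (an assumption made so the whole space can be enumerated); the class name PrefixSampleSkew is hypothetical. Without padding, the first character is the first non-zero hex digit, which has only 15 possible values, so 'f' is over-represented.

```java
import java.math.BigInteger;

class PrefixSampleSkew {
    // Counts the 16-bit values whose hex form starts with 'f',
    // either zero-padded to 4 characters or left unpadded.
    static int countStartingWithF(boolean padded) {
        int count = 0;
        for (int v = 0; v <= 0xFFFF; v++) {
            String hex = BigInteger.valueOf(v).toString(16);
            if (padded) {
                while (hex.length() < 4) {
                    hex = "0" + hex;
                }
            }
            if (hex.startsWith("f")) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countStartingWithF(false)); // prints 4369, about 1/15 of 65536
        System.out.println(countStartingWithF(true));  // prints 4096, exactly 1/16 of 65536
    }
}
```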
Attachments
Issue Links
- is depended upon by: DATAFU-47 UDF for Murmur3 (and other) Hash functions (Closed)