  1. DataFu
  2. DATAFU-46

Hash UDFs should return zero-padded strings of uniform length even when leading bits are zero


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.3.0
    • Labels: None

      Description

      Reported by Philip Kromer here:

      https://github.com/linkedin/datafu/issues/93

      Details reported there by Philip:

      ---------------------

      The Hash UDFs in 'hex' mode do not always return strings of the same length, because BigInteger.toString() omits leading zeros. So in a stream where about 94% of the strings have the full length, 1/16 are shorter by one or more characters and 1/256 by two or more. In the unlikely case that an MD5 hash's value were 124 zero bits followed by 4 one bits, the UDF would return the one-character string 'f'.
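The effect described above can be reproduced outside Pig. The following is a minimal Java sketch, not the DataFu implementation: the class and method names (HexPadding, unpaddedHex, paddedHex) are hypothetical, and it assumes the digest is converted to hex through BigInteger with radix 16. It contrasts the unpadded conversion with one that left-pads to two hex characters per digest byte.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper class for illustration only; not part of DataFu.
class HexPadding {
    // Unpadded conversion: BigInteger drops leading zero digits.
    static String unpaddedHex(byte[] digest) {
        return new BigInteger(1, digest).toString(16);
    }

    // Padded conversion: always two hex characters per digest byte.
    static String paddedHex(byte[] digest) {
        String hex = new BigInteger(1, digest).toString(16);
        int width = digest.length * 2;
        StringBuilder sb = new StringBuilder(width);
        for (int i = hex.length(); i < width; i++) {
            sb.append('0');
        }
        return sb.append(hex).toString();
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        // A 128-bit value with 124 zero bits followed by 4 one bits.
        byte[] extreme = new byte[16];
        extreme[15] = 0x0f;
        System.out.println(unpaddedHex(extreme));          // one character: f
        System.out.println(paddedHex(extreme).length());   // 32

        // A real MD5 digest always pads to 32 hex characters.
        byte[] md5 = MessageDigest.getInstance("MD5")
                .digest("hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(paddedHex(md5).length());       // 32
    }
}
```

Padding by digest length rather than a fixed 32 keeps the same helper usable for hashes of other widths.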

      This is surprising behavior, and a trap for anyone using the common trick of generating a hash and keeping only as many bits as needed:

      -- returns one-fifteenth, not one-sixteenth, of the input.
      sampled_lines = FILTER(FOREACH lines GENERATE MD5(val) AS digest, val) BY (STARTSWITH(digest, 'f'));
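The one-fifteenth figure in the comment above can be checked by exhaustive counting on a toy model. The sketch below is an illustration under stated assumptions, not a test of the UDFs themselves: it uses every 12-bit value (three hex digits) in place of MD5 output, and the class and method names are hypothetical. Without padding, values such as 0f3 and 00f also print with a leading 'f', so the matching fraction rises from 256/4096 (exactly 1/16) to 273/4096 (roughly 1/15).

```java
// Hypothetical toy model for illustration only; not part of DataFu.
class PrefixSampling {
    // Count the 12-bit values whose hex string starts with 'f'.
    static int countStartingWithF(boolean zeroPad) {
        int count = 0;
        for (int v = 0; v < 4096; v++) {
            // Integer.toHexString drops leading zeros; "%03x" keeps them.
            String hex = zeroPad ? String.format("%03x", v) : Integer.toHexString(v);
            if (hex.startsWith("f")) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println("unpadded: " + countStartingWithF(false) + " of 4096"); // 273
        System.out.println("padded:   " + countStartingWithF(true) + " of 4096");  // 256
    }
}
```

The unpadded count is 256 + 16 + 1, one term per possible number of dropped leading zeros, which is the geometric series behind the 1/15 limit for longer hashes.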
      

              People

              • Assignee: Flip Kromer (mrflip)
              • Reporter: Matthew Hayes (mhayes)
