DataFu / DATAFU-46

Hash UDFs should return zero-padded strings of uniform length even when leading bits are zero

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • None
    • 1.3.0
    • None

    Description

      Reported by Philip Kromer here:

      https://github.com/linkedin/datafu/issues/93

      Details reported there by Philip:

      ---------------------

      The Hash UDFs in 'hex' mode do not always return strings of the same length, because BigInteger.toString() omits leading zeros. So in a stream where about 94% of the strings have the full length, 1/16th are shorter by one or more characters and 1/256th by two or more. In the unlikely case that an MD5 hash's value were 124 zero bits followed by 4 one bits, the UDF would return the one-character string 'f'.
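
      The effect can be reproduced outside Pig with plain Java. The following is a minimal sketch, not DataFu's actual code: the class HexDigestDemo and its methods are hypothetical names chosen for illustration, and left-padding to two hex characters per digest byte is one possible way to get uniform-length output.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper class for illustration; not part of DataFu.
class HexDigestDemo {
    // Unpadded form: BigInteger.toString(16) drops leading zero digits.
    static String unpaddedHex(byte[] digest) {
        return new BigInteger(1, digest).toString(16);
    }

    // Padded form: always two hex characters per digest byte.
    static String paddedHex(byte[] digest) {
        String hex = new BigInteger(1, digest).toString(16);
        int width = digest.length * 2;
        StringBuilder sb = new StringBuilder(width);
        for (int i = hex.length(); i < width; i++) {
            sb.append('0');
        }
        return sb.append(hex).toString();
    }

    // MD5 of a UTF-8 string, rendered as a fixed-width hex string.
    static String md5Hex(String input) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            return paddedHex(md.digest(input.getBytes(StandardCharsets.UTF_8)));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        // 124 zero bits followed by 4 one bits, as in the report.
        byte[] digest = new byte[16];
        digest[15] = 0x0f;
        System.out.println(unpaddedHex(digest));          // f
        System.out.println(paddedHex(digest).length());   // 32
        System.out.println(md5Hex("").length());          // 32
    }
}
```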

      This is surprising behavior, and a trap for those practicing the frequent trick of generating a hash and chopping off just the number of bits you need:

      -- returns one-fifteenth, not one-sixteenth, of the input.
      sampled_lines = FILTER(FOREACH lines GENERATE MD5(val) AS digest, val) BY (STARTSWITH(digest, 'f'));
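
      The one-fifteenth figure can be checked by exhaustive counting. The sketch below is an illustration under stated assumptions: it stands in a toy 16-bit value space for the 128-bit MD5 space, and PrefixSampleDemo is a hypothetical class name. Once leading zeros are dropped, the first character can only be 1 through f, so 'f' is matched about one time in fifteen.

```java
// Hypothetical demo class: exhaustive count over a toy 16-bit "hash" space.
class PrefixSampleDemo {
    // Counts how many of the 65536 values have a hex string starting with 'f'.
    static int countStartingWithF(boolean padded) {
        int count = 0;
        for (int v = 0; v <= 0xFFFF; v++) {
            // Padded: always 4 hex digits. Unpadded: leading zeros are dropped.
            String hex = padded ? String.format("%04x", v) : Integer.toHexString(v);
            if (hex.startsWith("f")) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countStartingWithF(false)); // 4369, about 1/15 of 65536
        System.out.println(countStartingWithF(true));  // 4096, exactly 1/16 of 65536
    }
}
```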
      

            People

              mrflip Flip Kromer
              mhayes Matthew Hayes
