Uploaded image for project: 'Commons Codec'
  1. Commons Codec
  2. CODEC-268

Deprecate methods in MurmurHash3 with no equivalent in the reference c++ source.

    XMLWordPrintableJSON

Details

    • Wish
    • Status: Open
    • Minor
    • Resolution: Unresolved
    • 1.13
    • None
    • None

    Description

      The following methods have no equivalent in the MurmurHash3 c++ source code

          public static long hash64(final long data) {
          public static long hash64(final int data) {
          public static long hash64(final short data) {
          public static long hash64(final byte[] data) {
          public static long hash64(final byte[] data, final int offset, final int length) {
          public static long hash64(final byte[] data, final int offset, final int length, final int seed) {
      

      They are documented to return the upper 64-bits of the 128-bit hash method. This is false as the code neglects to mix in alternating blocks of 64-bits with the corresponding other hash that would be built in the 128-bit method. Thus this passes using a NotEquals check:

      @Test
      public void testHash64InNotEqualToHash128() {
          for (int i = 0; i < 32; i++) {
              final byte[] bytes = Arrays.copyOf(RANDOM_BYTES, i);
              final long h1 = MurmurHash3.hash64(bytes);
              final long[] hash = MurmurHash3.hash128(bytes);
              Assert.assertNotEquals("Did not expect hash64 to match upper bits of hash128", hash[0], h1);
              Assert.assertNotEquals("Did not expect hash64 to match lower bits of hash128", hash[1], h1);
          }
      }
      

      The following methods are convenience methods that use an arbitrary default seed:

          public static final int DEFAULT_SEED = 104729;
          public static int hash32(final long data1, final long data2) {
          public static int hash32(final long data1, final long data2, final int seed) {
          public static int hash32(final long data) {
          public static int hash32(final long data, final int seed) {
      

      Although they match calling hash32() with the equivalent bytes this assumes big-endian format. The reference c++ code is designed for little endian. If you hash the corresponding values using a Google Guava implementation with the same seed then they do not match as Guava faithfully converts primitives as little endian. Also note that the available methods for hashing primitives are not the same as those in hash64.

      These methods thus make very little sense and should be removed as an unwanted legacy from the port from Apache Hive.

      The following methods are convenience methods that use default encoding via String.getBytes():

          public static int hash32(final String data) {
          public static long[] hash128(final String data) {
      

      These effectively do nothing and save 0 lines of code for the caller:

          String s;
          int hash1 = MurmurHash3.hash32(s);
          int hash2 = MurmurHash3.hash32(s.getBytes());
          assert hash1 == hash2;
      

      This is not even used:

          public static final long NULL_HASHCODE = 2862933555777941757L;
      

      The following methods for hash32, hash64 and hash128 are not consistent with the available arguments:

          public static int hash32(final byte[] data) {
          public static int hash32(final String data) {
          public static int hash32(final byte[] data, final int length) {
          public static int hash32(final byte[] data, final int length, final int seed) {
          public static int hash32(final byte[] data, final int offset, final int length, final int seed) {
      
          public static long hash64(final byte[] data) {
          // Two int arguments specify (offset, length) not (length, seed)
          public static long hash64(final byte[] data, final int offset, final int length) {
          public static long hash64(final byte[] data, final int offset, final int length, final int seed) {
      
          public static long[] hash128(final byte[] data) {
          public static long[] hash128(final String data) {
          // No other (offset, length, seed) combinations
          public static long[] hash128(final byte[] data, final int offset, final int length, final int seed) {
      

      I would suggest deprecating:

      • All hash64 methods
      • The default seed
      • The null hashcode
      • The helper methods that hash primitives
      • The helper methods that hash String using default encoding
      • Those methods that accept various combinations of offset, length and seed

      This would leave a cleaner API:

          public static int hash32(final byte[] data) {
          public static int hash32(final byte[] data, final int offset, final int length, final int seed) {
      
          public static long[] hash128(final byte[] data) {
          public static long[] hash128(final byte[] data, final int offset, final int length, final int seed) {
      

      There are outstanding bugs in CODEC-264 and CODEC-267 for sign extension errors in hash32 and hash128. These should be fixed using a minimal API with names that reference the x86 and x64 names from the MurmurHash3 reference c++ code:

      I would argue that the default seed should be zero (it is zero in Google Guava and Python mmh3 implementations of MurmurHash3), not an arbitrary random int.

          public static int hash32x86(final byte[] data) {
          public static int hash32x86(final byte[] data, final int offset, final int length, final int seed) {
      
          public static long[] hash128x64(final byte[] data) {
          public static long[] hash128x64(final byte[] data, final int offset, final int length, final int seed) {
      

      All deprecated methods should be referred to use these instead.

       

      Attachments

        Activity

          People

            aherbert Alex Herbert
            aherbert Alex Herbert
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated: