Avro
  1. Avro
  2. AVRO-182

hashCode and equals are not consistent with compareTo

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.3.0
    • Component/s: java
    • Labels:
      None

      Description

      Java's specific and generic APIs implement compareTo according to the schema, where some fields might be ignored. To be consistent, fields that are ignored when comparing for ordering should also be ignored when comparing for equality and for computing hashCodes.

      1. AVRO-182.patch
        14 kB
        Doug Cutting
      2. AVRO-182.patch
        13 kB
        Doug Cutting
      3. AVRO-182.patch
        8 kB
        Doug Cutting
      4. AVRO-182.patch
        7 kB
        Doug Cutting

        Activity

        Hide
        Doug Cutting added a comment -

        Here's a patch that fixes this.

        Show
        Doug Cutting added a comment - Here's a patch that fixes this.
        Hide
        Philip Zeyliger added a comment -

        A few comments. I think the equals() thing is a bug. Everything else is commentary.

            public boolean equals(Object that) {
              return this.compareTo((Record)that) == 0;
            }
        // also: return this.compareTo((Array)that) == 0;
        

        Doesn't this throw a runtime cast exception? equals() should check type-compatibility first. You should add a quick test ("assertFalse(o1.equals(new Object()))").

        public int hashCode(Object o, Schema s)

        Do you feel it beneficial to include the hashcode of the schema in the hashcode as well?

        hashCode = 31*hashCode + hashCode(e, elementType);

        The google collections library does something clever (I think) by delegating to java.util.Arrays.hashCode(). That hides the "31*foo + bar" logic. (Arrays.hashCode() in 1.5.0 is "31*foo + (bar ^ (bar >>> 32))" if I'm reading it right. Either way, it seems like extracting the logic for combining two hash codes into a function might be nice.

          /**
           * Generates a hash code for multiple values. The hash code is generated by
           * calling {@link Arrays#hashCode(Object[])}.
           *
           * <p>This is useful for implementing {@link Object#hashCode()}. For example,
           * in an object that has three properties, {@code x}, {@code y}, and
           * {@code z}, one could write:
           * <pre>
           * public int hashCode() {
           *   return Objects.hashCode(getX(), getY(), getZ());
           * }</pre>
           *
           * <b>Warning</b>: When a single object is supplied, the returned hash code
           * does not equal the hash code of that object.
           */
          public static int hashCode(Object... objects) {
            return Arrays.hashCode(objects);
          }
        

        GenericArray a = (GenericArray)o;

        You chose GenericArray to delegate to GenericData, versus the other way around. For my own edification (mostly to further my understanding of how the Generic* stuff works), why choose that direction and not the other?

        Show
        Philip Zeyliger added a comment - A few comments. I think the equals() thing is a bug. Everything else is commentary. public boolean equals(Object that) { return this.compareTo((Record)that) == 0; } // also: return this.compareTo((Array)that) == 0; Doesn't this throw a runtime cast exception? equals() should check type-compatibility first. You should add a quick test ("assertFalse(o1.equals(new Object()))"). public int hashCode(Object o, Schema s) Do you feel it beneficial to include the hashcode of the schema in the hashcode as well? hashCode = 31*hashCode + hashCode(e, elementType); The google collections library does something clever (I think) by delegating to java.util.Arrays.hashCode(). That hides the "31*foo + bar" logic. (Arrays.hashCode() in 1.5.0 is "31*foo + (bar ^ (bar >>> 32))" if I'm reading it right. Either way, it seems like extracting the logic for combining two hash codes into a function might be nice. /** * Generates a hash code for multiple values. The hash code is generated by * calling {@link Arrays#hashCode(Object[])}. * * <p>This is useful for implementing {@link Object#hashCode()}. For example, * in an object that has three properties, {@code x}, {@code y}, and * {@code z}, one could write: * <pre> * public int hashCode() { * return Objects.hashCode(getX(), getY(), getZ()); * }</pre> * * <b>Warning</b>: When a single object is supplied, the returned hash code * does not equal the hash code of that object. */ public static int hashCode(Object... objects) { return Arrays.hashCode(objects); } GenericArray a = (GenericArray)o; You chose GenericArray to delegate to GenericData, versus the other way around. For my own edification (mostly to further my understanding of how the Generic* stuff works), why choose that direction and not the other?
        Hide
        Doug Cutting added a comment -

        > I think the equals() thing is a bug.

        It was intentional. I think of it as a feature, but that's probably idiosyncratic. The canonical use of equals() is in HashMap. With TreeMap, if you try to mix classes, you'll probably get ClassCastException from the compareTo implementation. But, with a HashMap, if you get a ClassCastException from an equals() implementation it's a bug?

        We could add 'if (!(that instanceof Record)) return false;' to the front of all of these (I had it at one point and removed it). It makes the code bigger and slower and may make it harder to find uses of null pointers and incorrect types, but, on the other hand, it may permit some uses that would not otherwise be permitted. I lean towards smaller code that's finicky about types to greater flexibility. But if you still think it's a bug, I'll change it.

        > Do you feel it beneficial to include the hashcode of the schema in the hashcode as well?

        No. It would make hashCode slower. It would prevent false-positives for data that has the same leaf values in the same order but in different types, but I think such cases are rare. Most Java hashCode implementations do not, e.g., include the hash of the class name, just the instance data.

        > it seems like extracting the logic for combining two hash codes into a function might be nice.

        That would be nice. I'm leery of varrargs, since they'll allocate an array per call, and hashCode should be wicked fast. We might instead add a method like:

        public int incrementHash(int hashCode, Object o, Schema s) {
          return 31*hashCode + hashCode(o, s)
        }
        

        Do you like that? Do you think we should throw in an xor or a shift?

        > You chose GenericArray to delegate to GenericData [ ... ]

        I'm not sure what you mean. GenericArray is the interface that GenericDatumReader and GenericDatumWriter use. So one can change one's array implementation, and, as long as one implements GenericArray, one can use GenericDatumReader and GenericDatumWriter unchanged. The standard implementation of GenericArray is GenericData.Array. But if one does not like the GenericArray interface, then one can override methods in GenericDatumReader and GenericDatumWriter to use an arbitrary class to represent arrays and use GenericArray at all.

        GenericData has code that's shared by GenericDatumReader and GenericDatumWriter, or that's otherwise common to generic data. GenericData is a singleton, so that its methods may be overidden by singleton subclasses SpecificData and ReflectData. In this way, SpecificDatumReader and SpecificDatumWriter can, e.g., share much of the GenericDatumReader and GenericDatumWriter logic, with minimal code duplication.

        So, back to your question, I don't see that GenericArray delegates to GenericData, but maybe the explanations above help to better explain the structure of this stuff?

        Show
        Doug Cutting added a comment - > I think the equals() thing is a bug. It was intentional. I think of it as a feature, but that's probably idiosyncratic. The canonical use of equals() is in HashMap. With TreeMap, if you try to mix classes, you'll probably get ClassCastException from the compareTo implementation. But, with a HashMap, if you get a ClassCastException from an equals() implementation it's a bug? We could add 'if (!(that instanceof Record)) return false;' to the front of all of these (I had it at one point and removed it). It makes the code bigger and slower and may make it harder to find uses of null pointers and incorrect types, but, on the other hand, it may permit some uses that would not otherwise be permitted. I lean towards smaller code that's finicky about types to greater flexibility. But if you still think it's a bug, I'll change it. > Do you feel it beneficial to include the hashcode of the schema in the hashcode as well? No. It would make hashCode slower. It would prevent false-positives for data that has the same leaf values in the same order but in different types, but I think such cases are rare. Most Java hashCode implementations do not, e.g., include the hash of the class name, just the instance data. > it seems like extracting the logic for combining two hash codes into a function might be nice. That would be nice. I'm leery of varrargs, since they'll allocate an array per call, and hashCode should be wicked fast. We might instead add a method like: public int incrementHash( int hashCode, Object o, Schema s) { return 31*hashCode + hashCode(o, s) } Do you like that? Do you think we should throw in an xor or a shift? > You chose GenericArray to delegate to GenericData [ ... ] I'm not sure what you mean. GenericArray is the interface that GenericDatumReader and GenericDatumWriter use. So one can change one's array implementation, and, as long as one implements GenericArray, one can use GenericDatumReader and GenericDatumWriter unchanged. The standard implementation of GenericArray is GenericData.Array. But if one does not like the GenericArray interface, then one can override methods in GenericDatumReader and GenericDatumWriter to use an arbitrary class to represent arrays and use GenericArray at all. GenericData has code that's shared by GenericDatumReader and GenericDatumWriter, or that's otherwise common to generic data. GenericData is a singleton, so that its methods may be overidden by singleton subclasses SpecificData and ReflectData. In this way, SpecificDatumReader and SpecificDatumWriter can, e.g., share much of the GenericDatumReader and GenericDatumWriter logic, with minimal code duplication. So, back to your question, I don't see that GenericArray delegates to GenericData, but maybe the explanations above help to better explain the structure of this stuff?
        Hide
        Philip Zeyliger added a comment -

        > equals()

        I think you should include the instanceof check. I read the contract for equals (http://java.sun.com/j2se/1.4.2/docs/api/java/lang/Object.html#equals%28java.lang.Object%29) as indicating that equals() should never throw on valid objects. I agree that in practice this hardly ever crops up, but as a Java user, I would be very confused by the error.

        > public int incrementHash(int hashCode, Object o, Schema s) {

        I like this. I'm nowhere near expert enough in hash functions to say whether there should be xors or shifts, but it seems fine as is.

        > GenericData, GenericArray, etc.

        Thanks for the explanation. I think what I meant to say was that GenericData.Array (itself an implementation of GenericArray) calls back into GenericData. I see now that if you want people to use different GenericArrays, you have to do it this way. I hadn't noticed that GenericDatumReader is designed to be subclassed to have different newArray() implementations. So all is clear(er) now, thanks.

        – Philip

        Show
        Philip Zeyliger added a comment - > equals() I think you should include the instanceof check. I read the contract for equals ( http://java.sun.com/j2se/1.4.2/docs/api/java/lang/Object.html#equals%28java.lang.Object%29 ) as indicating that equals() should never throw on valid objects. I agree that in practice this hardly ever crops up, but as a Java user, I would be very confused by the error. > public int incrementHash(int hashCode, Object o, Schema s) { I like this. I'm nowhere near expert enough in hash functions to say whether there should be xors or shifts, but it seems fine as is. > GenericData, GenericArray, etc. Thanks for the explanation. I think what I meant to say was that GenericData.Array (itself an implementation of GenericArray) calls back into GenericData. I see now that if you want people to use different GenericArrays, you have to do it this way. I hadn't noticed that GenericDatumReader is designed to be subclassed to have different newArray() implementations. So all is clear(er) now, thanks. – Philip
        Hide
        Doug Cutting added a comment -

        > I think you should include the instanceof check.

        Note that including it at the top-level will not prevent ClassCastException, since I implemented equals with compareTo(), which may throw this. For example, a GenericRecord with a value for "foo" that's a float will throw ClassCastException if it's checked for equality with a GenericRecord whose value for "foo" is a long. We could to some degree prevent this (at some performance penality) by first comparing schemas, but even that will fail if an object doesn't actually validate against its schema. So the only way to really prevent ClassCastException is to implement equals() independently of compareTo(), and check with instanceof before each cast. Would you prefer that? If so, I can implement equals independently.

        > I would be very confused by the error.

        Really? If you got a ClassCastException when you tried to put something into a HashMap that wasn't the same class as other stuff in that hash table you'd be confused? Are you similarly confused when you try to do this with a TreeMap? Or do you have some other scenario in mind?

        Show
        Doug Cutting added a comment - > I think you should include the instanceof check. Note that including it at the top-level will not prevent ClassCastException, since I implemented equals with compareTo(), which may throw this. For example, a GenericRecord with a value for "foo" that's a float will throw ClassCastException if it's checked for equality with a GenericRecord whose value for "foo" is a long. We could to some degree prevent this (at some performance penality) by first comparing schemas, but even that will fail if an object doesn't actually validate against its schema. So the only way to really prevent ClassCastException is to implement equals() independently of compareTo(), and check with instanceof before each cast. Would you prefer that? If so, I can implement equals independently. > I would be very confused by the error. Really? If you got a ClassCastException when you tried to put something into a HashMap that wasn't the same class as other stuff in that hash table you'd be confused? Are you similarly confused when you try to do this with a TreeMap? Or do you have some other scenario in mind?
        Hide
        Doug Cutting added a comment -

        New version of patch that updates hashcodes with a common method, hashCodeAdd.

        Show
        Doug Cutting added a comment - New version of patch that updates hashcodes with a common method, hashCodeAdd.
        Hide
        Philip Zeyliger added a comment -

        +1 to hashCodeAdd.

        For equals(), "confused" may be too strong. Perhaps "annoyed". I realize that it's a bit silly to put things in a TreeMap that aren't the same object, but it's perfectly reasonable to have a collection (say, a HashSet) of objects of different types.

        You make a good point that you can get the ClassCastException if the schemas aren't checked. It seems like two objects with different schemas, even if they contain the same data, are not equal, so the schemas should be checked for equality. (There's some leeway here: for a given class, equals() can be defined however, as long as it's clear, consistent, and documented. You could make a strong case that the schemas need not be equal, but merely compatible.) From a performance perspective, it seems likely that the schema objects themselves are going to be the same object (in the "==" sense) most of the time, so that comparison is likely to be fast. You could also just catch ClassCastException, and return false, though that seems less elegant.

        I'm less concerned about an Exception happening if the object itself doesn't validate against the schema. That's a reasonable runtime error.

        – Philip

        Show
        Philip Zeyliger added a comment - +1 to hashCodeAdd. For equals(), "confused" may be too strong. Perhaps "annoyed". I realize that it's a bit silly to put things in a TreeMap that aren't the same object, but it's perfectly reasonable to have a collection (say, a HashSet) of objects of different types. You make a good point that you can get the ClassCastException if the schemas aren't checked. It seems like two objects with different schemas, even if they contain the same data, are not equal, so the schemas should be checked for equality. (There's some leeway here: for a given class, equals() can be defined however, as long as it's clear, consistent, and documented. You could make a strong case that the schemas need not be equal, but merely compatible.) From a performance perspective, it seems likely that the schema objects themselves are going to be the same object (in the "==" sense) most of the time, so that comparison is likely to be fast. You could also just catch ClassCastException, and return false, though that seems less elegant. I'm less concerned about an Exception happening if the object itself doesn't validate against the schema. That's a reasonable runtime error. – Philip
        Hide
        Doug Cutting added a comment -

        > it seems likely that the schema objects themselves are going to be the same object (in the "==" sense) most of the time

        For specific & reflect, we can use == on the classes. So performance is only an issue for generic.

        "Most of the time" may not be good enough: we don't want applications that, for whatever reason, use different copies of a schema to suffer unduly. Schema.equals() showed up in benchmarks as a bottleneck when we used to have a HashMap<Schema,Class> cache, and has since been replaced with a name-based cache.

        Here's a compromise: for generic records, we compare the full, namespace-qualified names of the schemas, and, if those match, we assume that the data can be compared. For generated code, we use == on classes. Patch forthcoming.

        Show
        Doug Cutting added a comment - > it seems likely that the schema objects themselves are going to be the same object (in the "==" sense) most of the time For specific & reflect, we can use == on the classes. So performance is only an issue for generic. "Most of the time" may not be good enough: we don't want applications that, for whatever reason, use different copies of a schema to suffer unduly. Schema.equals() showed up in benchmarks as a bottleneck when we used to have a HashMap<Schema,Class> cache, and has since been replaced with a name-based cache. Here's a compromise: for generic records, we compare the full, namespace-qualified names of the schemas, and, if those match, we assume that the data can be compared. For generated code, we use == on classes. Patch forthcoming.
        Hide
        Philip Zeyliger added a comment -

        +1 to comparing schema names.

        Show
        Philip Zeyliger added a comment - +1 to comparing schema names.
        Hide
        Doug Cutting added a comment -

        Attaching a new version of the patch.

        This adds a new method, Schema#getFullName(), that returns the namespace-qualified name for named schemas. This is used by GenericData.Record#equals() to efficiently check schema compatibility. I've also updated a other places where fully-qualified names are used.

        The equals methods for generated classes now check schema compatibility by using == on class names.

        This patch now conflicts with the patch for AVRO-185. Sigh.

        Show
        Doug Cutting added a comment - Attaching a new version of the patch. This adds a new method, Schema#getFullName(), that returns the namespace-qualified name for named schemas. This is used by GenericData.Record#equals() to efficiently check schema compatibility. I've also updated a other places where fully-qualified names are used. The equals methods for generated classes now check schema compatibility by using == on class names. This patch now conflicts with the patch for AVRO-185 . Sigh.
        Hide
        Philip Zeyliger added a comment -

        I like the new patch. One assertion suggestion and one set of test suggestions below.

        public Name(String name, String space)

        In the constructor for Name, would it be reasonable to assert "space==null" in the case where name contains a ".". (There's a branch based on whether name is qualified or not.) Alternately, just have two constructors?

        assert(o2.equals(o2));

        I think it'd be prudent to add a test that checks avroRecord.equals(

        {null, new Object()}

        ) is false. It would also be great to exercise the schema checking, by having two records that have the same data, but schemas that differ in (say) namespace. You might wish to check that Name works as expected, though there's very little code there.

        Show
        Philip Zeyliger added a comment - I like the new patch. One assertion suggestion and one set of test suggestions below. public Name(String name, String space) In the constructor for Name, would it be reasonable to assert "space==null" in the case where name contains a ".". (There's a branch based on whether name is qualified or not.) Alternately, just have two constructors? assert(o2.equals(o2)); I think it'd be prudent to add a test that checks avroRecord.equals( {null, new Object()} ) is false. It would also be great to exercise the schema checking, by having two records that have the same data, but schemas that differ in (say) namespace. You might wish to check that Name works as expected, though there's very little code there.
        Hide
        Doug Cutting added a comment -

        > would it be reasonable to assert "space==null" in the case where name contains a "."

        I don't think so. If the name is fully qualified, then the namespace should be ignored. When parsing a nested schema that specifies no namespace, the namespace of the containing schema is passed. But we should support using a fully-qualified name in place of a namespace declaration. And, if someone specifies both a fully-qualified name and a namespace, I think the fully-qualified name should win, as the namespace is like a package declaration, determining only the default.

        Here's a new patch with the tests you suggested added.

        Show
        Doug Cutting added a comment - > would it be reasonable to assert "space==null" in the case where name contains a "." I don't think so. If the name is fully qualified, then the namespace should be ignored. When parsing a nested schema that specifies no namespace, the namespace of the containing schema is passed. But we should support using a fully-qualified name in place of a namespace declaration. And, if someone specifies both a fully-qualified name and a namespace, I think the fully-qualified name should win, as the namespace is like a package declaration, determining only the default. Here's a new patch with the tests you suggested added.
        Hide
        Philip Zeyliger added a comment -

        +1 Looks good to me. Thanks for adding tests.

        I had assumed that specifying both a namespace and a fully-qualified name would be user error, but I'm cool with the semantics that one overrides the other.

        Show
        Philip Zeyliger added a comment - +1 Looks good to me. Thanks for adding tests. I had assumed that specifying both a namespace and a fully-qualified name would be user error, but I'm cool with the semantics that one overrides the other.
        Hide
        Doug Cutting added a comment -

        I just committed this. Thanks for all the reviews, Philip!

        Show
        Doug Cutting added a comment - I just committed this. Thanks for all the reviews, Philip!

          People

          • Assignee:
            Doug Cutting
            Reporter:
            Doug Cutting
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development