Avro
  1. Avro
  2. AVRO-667

GenericArray fails to compare with List. SpecificRecord compare gets ClassCastException

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Blocker Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.4.0
    • Fix Version/s: 1.4.1
    • Component/s: java
    • Labels:
      None

      Description

      AVRO-637 is incomplete.

      I am unable to convert my SpecificRecord project to 1.4. It compiles, but unit tests get runtime exceptions.

      compareTo in GenericData does not handle List vs GenericArray and I get class cast exceptions.

      java.lang.ClassCastException: java.util.ArrayList cannot be cast to org.apache.avro.generic.GenericArray
      	at org.apache.avro.generic.GenericData.compare(GenericData.java:502)
      	at org.apache.avro.specific.SpecificData.compare(SpecificData.java:190)
      	at org.apache.avro.generic.GenericData.compare(GenericData.java:494)
      	at org.apache.avro.specific.SpecificData.compare(SpecificData.java:190)
      	at org.apache.avro.specific.SpecificRecordBase.compareTo(SpecificRecordBase.java:45)
      	at org.apache.avro.specific.SpecificRecordBase.equals(SpecificRecordBase.java:35)
      	at com.rr.eventdata.ViewRecord.equals(ViewRecord.java:350)
      	at com.rr.eventdata.WriteStuffTest.main(WriteStuffTest.java:143)
      	at com.rr.eventdata.WriteStuffTest.testReadWriteDataFile(WriteStuffTest.java:55)
      

      Also, an array of string in SpecificRecord ends up as List<CharSequence>.
      However, it should be List<? extends CharSequence> or else you can't assign a List<String> to it.

      1. AVRO-667.patch
        6 kB
        Scott Carey
      2. AVRO-667.patch
        7 kB
        Scott Carey
      3. AVRO-667.patch
        7 kB
        Doug Cutting
      4. AVRO-667.patch
        7 kB
        Doug Cutting

        Activity

        Hide
        Scott Carey added a comment -

        This patch makes GenericData compatible with Lists and CharSequence for equals() and hashCode().

        It also adds unit tests for those two functions that are more complete and demonstrate the issue. The current code coverage, after these tests, is 32% for GenericData. We should increase that to 90%+, but perhaps in another ticket.

        Show
        Scott Carey added a comment - This patch makes GenericData compatible with Lists and CharSequence for equals() and hashCode(). It also adds unit tests for those two functions that are more complete and demonstrate the issue. The current code coverage, after these tests, is 32% for GenericData. We should increase that to 90%+, but perhaps in another ticket.
        Hide
        Scott Carey added a comment -

        Hmm, I just realized this patch doesn't keep hashCode() consistent with equals(). We can only make this consistent for CharSequence if we .toString() the result... which is slow. Although for Utf8 we could just change hashCode() in utf8 into:

        this.toString().hashCode(); 
        

        which will at least cache the result.

        Show
        Scott Carey added a comment - Hmm, I just realized this patch doesn't keep hashCode() consistent with equals(). We can only make this consistent for CharSequence if we .toString() the result... which is slow. Although for Utf8 we could just change hashCode() in utf8 into: this .toString().hashCode(); which will at least cache the result.
        Hide
        Scott Carey added a comment -

        Updated patch with consistend hashCode() and equals() for GenericData.

        Show
        Scott Carey added a comment - Updated patch with consistend hashCode() and equals() for GenericData.
        Hide
        Doug Cutting added a comment -

        My intent is that, for arrays, the generic implementation will accept arbitrary Collection implementations. I've amended your patch to permit this and augmented the test you added. We need more non-List Collection tests, but there's reflection-based code in Hadoop that requires this.

        Other than that, +1.

        Show
        Doug Cutting added a comment - My intent is that, for arrays, the generic implementation will accept arbitrary Collection implementations. I've amended your patch to permit this and augmented the test you added. We need more non-List Collection tests, but there's reflection-based code in Hadoop that requires this. Other than that, +1.
        Hide
        Scott Carey added a comment -

        +1

        Looks great to me!

        Show
        Scott Carey added a comment - +1 Looks great to me!
        Hide
        Doug Cutting added a comment -

        I just realized that this could have a big performance impact on GenericData#compare() and hashCode(). The former is called in the inner loop of MapReduce sorts and is thus performance critical. In this context, both instances will be Utf8. Also, the correct ordering is by codepoint, which is the same as Utf8, while String's ordering may differ for surrogate pairs I believe. So I'd rather convert Strings to Utf8 than vice versa, to better preserve both correctness and performance. Here's a patch that does this. All tests pass.

        Show
        Doug Cutting added a comment - I just realized that this could have a big performance impact on GenericData#compare() and hashCode(). The former is called in the inner loop of MapReduce sorts and is thus performance critical. In this context, both instances will be Utf8. Also, the correct ordering is by codepoint, which is the same as Utf8, while String's ordering may differ for surrogate pairs I believe. So I'd rather convert Strings to Utf8 than vice versa, to better preserve both correctness and performance. Here's a patch that does this. All tests pass.
        Hide
        Scott Carey added a comment -

        If you are concerned about performance, then consider changing:

            case STRING:
              Utf8 u1 = o1 instanceof Utf8 ? (Utf8)o1 : new Utf8(o1.toString());
              Utf8 u2 = o2 instanceof Utf8 ? (Utf8)o2 : new Utf8(o2.toString());
              return u1.compareTo(u2);
        

        to use
        getClass().equals()
        rather than
        instanceof

        since instanceof is slower (though not dramatically).

        This has limitations for anyone who extends Utf8, but it might be best to mark it final anyway. Even if it is not final, we can't rely on a subclass implementing .equals() properly, it might be overridden and break equals() symmetry. So I think Utf8.class.equals() or Utf8.class == are the right things to use, possibly while making Utf8 final.

            case STRING:
          Utf8 u1 = Utf8.class.equals(o1.getClass()) ? (Utf8)o1 : new Utf8(o1.toString());
          Utf8 u2 = Utf8.class.equals(o2.getClass()) ? (Utf8)o2 : new Utf8(o2.toString());
          return u1.compareTo(u2);
        
        Show
        Scott Carey added a comment - If you are concerned about performance, then consider changing: case STRING: Utf8 u1 = o1 instanceof Utf8 ? (Utf8)o1 : new Utf8(o1.toString()); Utf8 u2 = o2 instanceof Utf8 ? (Utf8)o2 : new Utf8(o2.toString()); return u1.compareTo(u2); to use getClass().equals() rather than instanceof since instanceof is slower (though not dramatically). This has limitations for anyone who extends Utf8, but it might be best to mark it final anyway. Even if it is not final, we can't rely on a subclass implementing .equals() properly, it might be overridden and break equals() symmetry. So I think Utf8.class.equals() or Utf8.class == are the right things to use, possibly while making Utf8 final. case STRING: Utf8 u1 = Utf8.class.equals(o1.getClass()) ? (Utf8)o1 : new Utf8(o1.toString()); Utf8 u2 = Utf8.class.equals(o2.getClass()) ? (Utf8)o2 : new Utf8(o2.toString()); return u1.compareTo(u2);
        Hide
        Doug Cutting added a comment -

        I've never found instanceof to be particularly slow, especially compared to new String() or new Utf8(). I'm fairly confident that allocating and copying a new string instance for every key sorted in a MapReduce will slow things significantly. I'm considerably less confident that replacing instanceof with .getClass()== for each key would speed things significantly. We could make Utf8 final to permit that, but then we'd have to remember why it's final when someone asks if they can subclass it. So I'd prefer this patch as-is.

        Show
        Doug Cutting added a comment - I've never found instanceof to be particularly slow, especially compared to new String() or new Utf8(). I'm fairly confident that allocating and copying a new string instance for every key sorted in a MapReduce will slow things significantly. I'm considerably less confident that replacing instanceof with .getClass()== for each key would speed things significantly. We could make Utf8 final to permit that, but then we'd have to remember why it's final when someone asks if they can subclass it. So I'd prefer this patch as-is.
        Hide
        Scott Carey added a comment -

        +1 on the current patch.

        Yes, instanceof is orders of magnitude off of object construction, and several times faster than the compareTo of two small strings.

        If we are trying to optimize comparing GenericRecords for Hadoop there are probably several other more significant things to change if we spent time profiling or experimenting.

        The benefit of marking Utf8 final is that we can guarantee that it uses Utf8 ordering and no subclasses do improper things. That may not be a big concern however.

        Show
        Scott Carey added a comment - +1 on the current patch. Yes, instanceof is orders of magnitude off of object construction, and several times faster than the compareTo of two small strings. If we are trying to optimize comparing GenericRecords for Hadoop there are probably several other more significant things to change if we spent time profiling or experimenting. The benefit of marking Utf8 final is that we can guarantee that it uses Utf8 ordering and no subclasses do improper things. That may not be a big concern however.
        Hide
        Scott Carey added a comment -

        I have committed Doug's latest patch variation.

        Show
        Scott Carey added a comment - I have committed Doug's latest patch variation.

          People

          • Assignee:
            Scott Carey
            Reporter:
            Scott Carey
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development