Avro
  1. Avro
  2. AVRO-997

Union of enum and null cannot be serialized

    Details

    • Type: Bug Bug
    • Status: Patch Available
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: 1.5.1
    • Fix Version/s: 1.8.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Incompatible change

      Description

      I have a schema like:

      [
      {
        "type": "enum",
        "name": "Gender",
        "symbols": ["M", "F"]
      },
      
      {
        "type" : "record",
        "name" : "Foo",
        "fields" : [
          { "type" : ["Gender", "null"], "name" : "gender" },
          ...
        ]
      }
      ]
      

      I build a record like Foo foo = new Foo(); foo.gender = Gender.M;

      When I go to serialize this, I get:

      Not in union [{"type":"enum","name":"Gender","symbols":["M","F"]},"null"]: M
      	at org.apache.avro.generic.GenericData.resolveUnion(GenericData.java:482)
      	at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:70)
      	at org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:104)
      	at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:65)
      	at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:57)
      
      1. AVRO-997.permissive-generic-api.patch
        15 kB
        Sean Busbey
      2. AVRO-997.patch
        3 kB
        Doug Cutting
      3. AVRO-997.patch
        7 kB
        Sean Busbey

        Issue Links

          Activity

          Hide
          Matthew Campbell added a comment -

          We're running into this issue as well on Avro 1.7.6. We've got a IDL that imports some enum schemas in JSON format (provided by another tool) and we want to have some optional fields in a record of the enum type. These types can't be serialized, though. Just putting a +1 in for this being a blocker for us. We'll work around it in our model for now, but I wanted to leave a note here since the issue has a patch but hasn't been touched in a while.

          Show
          Matthew Campbell added a comment - We're running into this issue as well on Avro 1.7.6. We've got a IDL that imports some enum schemas in JSON format (provided by another tool) and we want to have some optional fields in a record of the enum type. These types can't be serialized, though. Just putting a +1 in for this being a blocker for us. We'll work around it in our model for now, but I wanted to leave a note here since the issue has a patch but hasn't been touched in a while.
          Hide
          Sean Busbey added a comment -

          The only reason I think enums are different is because I've seen people use them differently to date. Going with a consistent, stict enforcement would be fine; the only down side is pain to existing users. I only included the patch so we have a clearer view of the additional maintenance cost of maintaining compatibility.

          I'm more concerned that one of the patches get applied then I am with which one it is. I'm fine with fixing Hive downstream whichever way this goes.

          Show
          Sean Busbey added a comment - The only reason I think enums are different is because I've seen people use them differently to date. Going with a consistent, stict enforcement would be fine; the only down side is pain to existing users. I only included the patch so we have a clearer view of the additional maintenance cost of maintaining compatibility. I'm more concerned that one of the patches get applied then I am with which one it is. I'm fine with fixing Hive downstream whichever way this goes.
          Hide
          Doug Cutting added a comment -

          Sean, I prefer the simpler more consistent approach of my patch. All other schemas use distinct types in Java, why shouldn't enums?

          Show
          Doug Cutting added a comment - Sean, I prefer the simpler more consistent approach of my patch. All other schemas use distinct types in Java, why shouldn't enums?
          Hide
          Sean Busbey added a comment -

          Alternative non-breaking patch. Makes the generic api more permissive about use of things other than GenericEnumSymbol with enum schemas. See review board for details.

          Show
          Sean Busbey added a comment - Alternative non-breaking patch. Makes the generic api more permissive about use of things other than GenericEnumSymbol with enum schemas. See review board for details.
          Hide
          Sean Busbey added a comment -

          Oh no, I definitely think the current behavior is a bug. Specifically having validate pass but write fail is incorrect. I also think that my enum shouldn't stop working just because it's in a union, omitting ambiguities such as the string-enum union.

          Making everything consistent by checking to see if the toString is in the enum member of a union after GenericDate.resolveUnion has done the existing membership checks (~line 574) would solve the part of this that I consider a bug, while making things easier for existing users (like the Hive issue that brought me here). I'll post a patch tomorrow showing how this would work (and an alternate set of tests that would correspond to the failing prior to the fix).

          I think in any case there'll be a documentation component, either calling out the use of GenericEnumSymbol or warning users about the possibly surprising case of enum members ending up as avro strings in serialized data.

          Show
          Sean Busbey added a comment - Oh no, I definitely think the current behavior is a bug. Specifically having validate pass but write fail is incorrect. I also think that my enum shouldn't stop working just because it's in a union, omitting ambiguities such as the string-enum union. Making everything consistent by checking to see if the toString is in the enum member of a union after GenericDate.resolveUnion has done the existing membership checks (~line 574) would solve the part of this that I consider a bug, while making things easier for existing users (like the Hive issue that brought me here). I'll post a patch tomorrow showing how this would work (and an alternate set of tests that would correspond to the failing prior to the fix). I think in any case there'll be a documentation component, either calling out the use of GenericEnumSymbol or warning users about the possibly surprising case of enum members ending up as avro strings in serialized data.
          Hide
          Doug Cutting added a comment -

          Sean: To be clear, I won't be upset if we don't commit this. I don't feel very strongly one way or the other. I thought that you felt the current implementation was a bug. I wrote the patch to see how hard this would be to change, and, having it written it, posted it. Right now the implementation is lenient, only requiring GenericEnumSymbol when writing a union containing an enum, otherwise accepting anything whose toString() returns a matching symbol. That can be confusing, but it also simplifies many programs.

          Maybe this is just a documentation issue?

          Show
          Doug Cutting added a comment - Sean: To be clear, I won't be upset if we don't commit this. I don't feel very strongly one way or the other. I thought that you felt the current implementation was a bug. I wrote the patch to see how hard this would be to change, and, having it written it, posted it. Right now the implementation is lenient, only requiring GenericEnumSymbol when writing a union containing an enum, otherwise accepting anything whose toString() returns a matching symbol. That can be confusing, but it also simplifies many programs. Maybe this is just a documentation issue?
          Hide
          Sean Busbey added a comment -

          Scott, I could write a patch that maintains the current behavior for unions of "things that look like enum members when toString is called" and enums, which is to say the non-enum type gets picked if it is present. eg ["string", "long",

          {"type":"enum", "symbols":["ONE", "TWO", "42"]}

          ] says that "ONE" is a string and 42 is a long. Since it would be consistent with current behavior, a warning in the API about the coercion might be sufficient. It would be less efficient than picking the enum member, but doing that would be just as breaking as reworking the generic api to be strict.

          The only time I've seen this come up is when someone has a union field and then wants to mark it as optional so they move it to a union with null. While their original use of Java String for field values was incorrect, the generic api as is would have worked fine up until the addition of the union. In general I'm a fan of "just works" behavior from APIs, but I can understand Doug's desire to favor consistent strict application as a way of steering users away from behavior that will lead to confusing non-roundtrip-able edge cases.

          Show
          Sean Busbey added a comment - Scott, I could write a patch that maintains the current behavior for unions of "things that look like enum members when toString is called" and enums, which is to say the non-enum type gets picked if it is present. eg ["string", "long", {"type":"enum", "symbols":["ONE", "TWO", "42"]} ] says that "ONE" is a string and 42 is a long. Since it would be consistent with current behavior, a warning in the API about the coercion might be sufficient. It would be less efficient than picking the enum member, but doing that would be just as breaking as reworking the generic api to be strict. The only time I've seen this come up is when someone has a union field and then wants to mark it as optional so they move it to a union with null. While their original use of Java String for field values was incorrect, the generic api as is would have worked fine up until the addition of the union. In general I'm a fan of "just works" behavior from APIs, but I can understand Doug's desire to favor consistent strict application as a way of steering users away from behavior that will lead to confusing non-roundtrip-able edge cases.
          Hide
          Doug Cutting added a comment -

          Representations of the different types must be distinguishable at runtime to correctly write unions. If you have a union like ["string",

          {"type":"enum", ...}

          ] and an enum symbol could be represented as a string then you wouldn't know which branch of the union was intended. (The implementation of union writing uses a hash-table keyed by type name, so types are not tried in order.)

          Currently generic accepts strings when writing enums outside of unions, but when writing a union an enum must be represented as a GenericEnumSymbol. Enum values are always read by generic as GenericEnumSymbol. The patch I attached changes generic to be consistent, so that strings are not acceptable as enum symbols anywhere, a GenericEnumSymbol must always be used. With specific & reflect this is extended to include Java Enum types.

          Show
          Doug Cutting added a comment - Representations of the different types must be distinguishable at runtime to correctly write unions. If you have a union like ["string", {"type":"enum", ...} ] and an enum symbol could be represented as a string then you wouldn't know which branch of the union was intended. (The implementation of union writing uses a hash-table keyed by type name, so types are not tried in order.) Currently generic accepts strings when writing enums outside of unions, but when writing a union an enum must be represented as a GenericEnumSymbol. Enum values are always read by generic as GenericEnumSymbol. The patch I attached changes generic to be consistent, so that strings are not acceptable as enum symbols anywhere, a GenericEnumSymbol must always be used. With specific & reflect this is extended to include Java Enum types.
          Hide
          Scott Carey added a comment -

          Without looking at this in depth:

          Why is it not a good idea to let the Generic API use a String or Java Enum? It would not break compatibility. What is the downside of expanding what GenericDatumWriter can handle for Enums?

          Show
          Scott Carey added a comment - Without looking at this in depth: Why is it not a good idea to let the Generic API use a String or Java Enum? It would not break compatibility. What is the downside of expanding what GenericDatumWriter can handle for Enums?
          Hide
          Doug Cutting added a comment -

          Thanks, Sean. The tests look great.

          I'll commit this soon unless there are objections.

          I think the minor incompatible runtime change here should cause our next release to be called 1.8.0 rather than 1.7.3. But I'll delay changing the Fix-for on other 1.7.3 issues for now though in case we decide to do a compatible 1.7.3 bugfix release.

          Show
          Doug Cutting added a comment - Thanks, Sean. The tests look great. I'll commit this soon unless there are objections. I think the minor incompatible runtime change here should cause our next release to be called 1.8.0 rather than 1.7.3. But I'll delay changing the Fix-for on other 1.7.3 issues for now though in case we decide to do a compatible 1.7.3 bugfix release.
          Hide
          Sean Busbey added a comment -

          Updated patch with tests that fail if GenericData.validate or GenericDatumWriter.write allow the use of String/Java Enum with an Avro enum schema.

          Also adds a constructor to GenericData.EnumSymbol to ease proper use of the generic api.

          Show
          Sean Busbey added a comment - Updated patch with tests that fail if GenericData.validate or GenericDatumWriter.write allow the use of String/Java Enum with an Avro enum schema. Also adds a constructor to GenericData.EnumSymbol to ease proper use of the generic api.
          Hide
          Doug Cutting added a comment -

          > I can add some tests that fail suitably.

          Great! Thanks!

          > would it be fine if I added a constructor to GenericData.EnumSymbol that allowed for Java Enums for the passed in symbol?

          I don't see a problem with that. We might have a constructor that accepts Object and calls toString()?

          Show
          Doug Cutting added a comment - > I can add some tests that fail suitably. Great! Thanks! > would it be fine if I added a constructor to GenericData.EnumSymbol that allowed for Java Enums for the passed in symbol? I don't see a problem with that. We might have a constructor that accepts Object and calls toString()?
          Hide
          Sean Busbey added a comment -

          I can add some tests that fail suitably. To ease fixing existing users of the generic api, would it be fine if I added a constructor to GenericData.EnumSymbol that allowed for Java Enums for the passed in symbol?

          Show
          Sean Busbey added a comment - I can add some tests that fail suitably. To ease fixing existing users of the generic api, would it be fine if I added a constructor to GenericData.EnumSymbol that allowed for Java Enums for the passed in symbol?
          Hide
          Doug Cutting added a comment -

          I think GenericData#validate() and GenericDatumWriter#writeEnum() could both be stricter, failing when isEnum(datum) is false. That would be an incompatible change though, possibly breaking applications.

          Here's the patch I used to try this. Tests pass. We should probably add some tests that fail without this change too before committing something like this.

          Show
          Doug Cutting added a comment - I think GenericData#validate() and GenericDatumWriter#writeEnum() could both be stricter, failing when isEnum(datum) is false. That would be an incompatible change though, possibly breaking applications. Here's the patch I used to try this. Tests pass. We should probably add some tests that fail without this change too before committing something like this.
          Hide
          Sean Busbey added a comment -

          I just ran into this as part of Hive's integration with Avro 1.7.1 (HIVE-3538). AFAICT, Hive only makes use of the Generic API.

          GenericData.validate returns true for a record with an Avro enum or a union that contains an Avro enum so long as the result of datum.toString is in the set of elements. Actually serializing works fine for arbitrary incoming values when the field is an Avro enum, but fails in the union case if the value isn't GenericEnumSymbol. In addition to Java enums, this seems likely to come up for GenericData users when they attempt to use Strings.

          This seems like a bug for GenericData. I could provide a patch that makes GenericDatumWriter.write consistent with GenericData.validate for the union case, if there's interest. Alternatively, I could provide one that causes the validate and write calls to be stricter wrt to the plain enum case, which I think would help avoid user confusion if enums are only supposed to be GenericEnumSybmol.

          Thoughts?

          Show
          Sean Busbey added a comment - I just ran into this as part of Hive's integration with Avro 1.7.1 ( HIVE-3538 ). AFAICT, Hive only makes use of the Generic API. GenericData.validate returns true for a record with an Avro enum or a union that contains an Avro enum so long as the result of datum.toString is in the set of elements. Actually serializing works fine for arbitrary incoming values when the field is an Avro enum, but fails in the union case if the value isn't GenericEnumSymbol. In addition to Java enums, this seems likely to come up for GenericData users when they attempt to use Strings. This seems like a bug for GenericData. I could provide a patch that makes GenericDatumWriter.write consistent with GenericData.validate for the union case, if there's interest. Alternatively, I could provide one that causes the validate and write calls to be stricter wrt to the plain enum case, which I think would help avoid user confusion if enums are only supposed to be GenericEnumSybmol. Thoughts?
          Hide
          Doug Cutting added a comment -

          I think in current trunk you'd get "Unknown datum type: M" which is somewhat more informative than the 1.5.1 message. Maybe this could be improved to something like, "Unknown datum for GenericData: M" to hint that specific, reflect, thrift, protobuf or some other data model might be better?

          Show
          Doug Cutting added a comment - I think in current trunk you'd get "Unknown datum type: M" which is somewhat more informative than the 1.5.1 message. Maybe this could be improved to something like, "Unknown datum for GenericData: M" to hint that specific, reflect, thrift, protobuf or some other data model might be better?
          Hide
          Aaron Kimball added a comment -

          Thanks. In that case, this might be resolved as Invalid, unless you think it merits an exception, documentation fix, etc. as I suggest above.

          Show
          Aaron Kimball added a comment - Thanks. In that case, this might be resolved as Invalid, unless you think it merits an exception, documentation fix, etc. as I suggest above.
          Hide
          Doug Cutting added a comment -

          SpecificDatumWriter can write both generic and specific instances, while GenericDatumWriter is only intended to correctly write generic instances.

          Show
          Doug Cutting added a comment - SpecificDatumWriter can write both generic and specific instances, while GenericDatumWriter is only intended to correctly write generic instances.
          Hide
          Aaron Kimball added a comment -

          Doug, I'm using Avro 1.5.1 as noted in the affectsVersion field of this ticket.

          Scott, you're correct: our system uses a single codepath to serialize all values passed in from the client/data layer to the underlying storage. This codepath accepts generic and specific values, and we currently always allocate a GenericDatumWriter for each schema. At least until this particular schema, this has always worked.

          Is this an illegal use of GenericDatumWriter? If so, maybe its write() method could warn (or more strongly, throw an exception) if an instance of SpecificRecord is passed as an argument?

          Show
          Aaron Kimball added a comment - Doug, I'm using Avro 1.5.1 as noted in the affectsVersion field of this ticket. Scott, you're correct: our system uses a single codepath to serialize all values passed in from the client/data layer to the underlying storage. This codepath accepts generic and specific values, and we currently always allocate a GenericDatumWriter for each schema. At least until this particular schema, this has always worked. Is this an illegal use of GenericDatumWriter? If so, maybe its write() method could warn (or more strongly, throw an exception) if an instance of SpecificRecord is passed as an argument?
          Hide
          Scott Carey added a comment -

          Ah, I missed that test.

          In that case, a GenericDatumWriter may be being used to serialize a SpecificRecord, which won't work. SpecificRecord however is an instance of IndexedRecord, so the API won't complain and it will work for 'most' operations.

          Show
          Scott Carey added a comment - Ah, I missed that test. In that case, a GenericDatumWriter may be being used to serialize a SpecificRecord, which won't work. SpecificRecord however is an instance of IndexedRecord, so the API won't complain and it will work for 'most' operations.
          Hide
          Doug Cutting added a comment -

          SpecificData#isEnum() overrides GenericData#isEnum() and returns true for instances of Enum. So this should work. Also, TestSpecificDatumWriter#testResolveUnion writes and reads an instance of a record with a field whose type is a union of "null" and an enum using generated, specific code. So this appears to be tested.

          Aaron, can you provide a simple, complete test that fails? Also, what version of Avro are you using?

          Show
          Doug Cutting added a comment - SpecificData#isEnum() overrides GenericData#isEnum() and returns true for instances of Enum. So this should work. Also, TestSpecificDatumWriter#testResolveUnion writes and reads an instance of a record with a field whose type is a union of "null" and an enum using generated, specific code. So this appears to be tested. Aaron, can you provide a simple, complete test that fails? Also, what version of Avro are you using?
          Hide
          Aaron Kimball added a comment -

          For now I'm working around this by adding a third Unspecified symbol to my enum, and removing the union with null entirely. Thanks for taking a look.

          Show
          Aaron Kimball added a comment - For now I'm working around this by adding a third Unspecified symbol to my enum, and removing the union with null entirely. Thanks for taking a look.
          Hide
          Scott Carey added a comment -

          As a work-around, you may be able to wrap the Enum inside another record.

          [
          {
            "type": "enum",
            "name": "Gender",
            "symbols": ["M", "F"]
          },
          
          { "type": "record",
            "name": "GenderWrapper",
            "fields": [
            {"name": "gender", "type":"Gender"}
            ]
          },
          
          {
            "type" : "record",
            "name" : "Foo",
            "fields" : [
              { "type" : ["GenderWrapper", "null"], "name" : "gender" },
              ...
            ]
          }
          ]
          
          Show
          Scott Carey added a comment - As a work-around, you may be able to wrap the Enum inside another record. [ { "type" : " enum " , "name" : "Gender" , "symbols" : [ "M" , "F" ] }, { "type" : "record" , "name" : "GenderWrapper" , "fields" : [ { "name" : "gender" , "type" : "Gender" } ] }, { "type" : "record" , "name" : "Foo" , "fields" : [ { "type" : [ "GenderWrapper" , " null " ], "name" : "gender" }, ... ] } ]
          Hide
          Scott Carey added a comment -

          If that is the case, then this is a bug. GenericData.resolveUnion() only operates on 'instanceof GenericEnumSymbol' for enums.

          This means that an Enum in a union of any kind cannot be serialized in a SpecificRecord. This is either a bug where SpecificData should override GenericData's behavior and handle the Java enum, or GenericData should handle java Enums as well.

          It also means we need better test coverage on SpecificRecords.

          Show
          Scott Carey added a comment - If that is the case, then this is a bug. GenericData.resolveUnion() only operates on 'instanceof GenericEnumSymbol' for enums. This means that an Enum in a union of any kind cannot be serialized in a SpecificRecord. This is either a bug where SpecificData should override GenericData's behavior and handle the Java enum, or GenericData should handle java Enums as well. It also means we need better test coverage on SpecificRecords.
          Hide
          Aaron Kimball added a comment -

          Foo is a SpecificRecord. The Gender type generated by Avro is a Java enum; is this a bug in the Avro code generator, or am I just misusing it?

          Show
          Aaron Kimball added a comment - Foo is a SpecificRecord. The Gender type generated by Avro is a Java enum ; is this a bug in the Avro code generator, or am I just misusing it?
          Hide
          Scott Carey added a comment -

          The following works in Avro 1.6.1, not sure about Avro 1.5.1 (it would need to be modified since it uses Schema.Parser)
          This does not test an object created by the Specific Record API.

            @Test
            public void testWriteNullableEnum() throws IOException {
              Schema.Parser p = new Schema.Parser();
              Schema genderSchema = p.parse(
                  "{"
                  + "\"type\": \"enum\","
                  + "\"name\": \"Gender\","
                  + "\"symbols\": [\"M\", \"F\"]"
                  +"}");
              Schema fooSchema = p.parse(
                  "{"
                  + "\"type\" : \"record\","
                  + "\"name\" : \"Foo\","
                  + "\"fields\" : ["
                  + "{\"type\" : [\"Gender\", \"null\"], \"name\" : \"gender\" }"
                  +"]}");
              
              GenericData.Record foo = new GenericData.Record(fooSchema);
              
              foo.put(0, new GenericData.EnumSymbol(genderSchema, "M"));
              
              GenericDatumWriter<GenericData.Record> w = new GenericDatumWriter<GenericData.Record>(fooSchema);
              
              ByteArrayOutputStream baos = new ByteArrayOutputStream();
              Encoder e = EncoderFactory.get().binaryEncoder(baos, null);
              
              w.write(foo, e);
              
            }
          

          Note that with a SpecificRecord, the enum must be a GenericData.EnumSymbol, it does not currently support a Java enum.

          Show
          Scott Carey added a comment - The following works in Avro 1.6.1, not sure about Avro 1.5.1 (it would need to be modified since it uses Schema.Parser) This does not test an object created by the Specific Record API. @Test public void testWriteNullableEnum() throws IOException { Schema.Parser p = new Schema.Parser(); Schema genderSchema = p.parse( "{" + "\" type\ ": \" enum \ "," + "\" name\ ": \" Gender\ "," + "\" symbols\ ": [\" M\ ", \" F\ "]" + "}" ); Schema fooSchema = p.parse( "{" + "\" type\ " : \" record\ "," + "\" name\ " : \" Foo\ "," + "\" fields\ " : [" + "{\" type\ " : [\" Gender\ ", \" null \ "], \" name\ " : \" gender\ " }" + "]}" ); GenericData.Record foo = new GenericData.Record(fooSchema); foo.put(0, new GenericData.EnumSymbol(genderSchema, "M" )); GenericDatumWriter<GenericData.Record> w = new GenericDatumWriter<GenericData.Record>(fooSchema); ByteArrayOutputStream baos = new ByteArrayOutputStream(); Encoder e = EncoderFactory.get().binaryEncoder(baos, null ); w.write(foo, e); } Note that with a SpecificRecord, the enum must be a GenericData.EnumSymbol, it does not currently support a Java enum.

            People

            • Assignee:
              Sean Busbey
              Reporter:
              Aaron Kimball
            • Votes:
              3 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

              • Created:
                Updated:

                Development