  1. Apache Avro
  2. AVRO-3313

Enum default value, which should let the deserializer fall back to a known symbol when it encounters new enum symbols, doesn't work


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.9.0, 1.9.1, 1.9.2, 1.10.0, 1.10.1, 1.10.2, 1.11.0
    • Fix Version/s: None
    • Component/s: java

    Description

      I wanted to use Avro enums and evolve my schema over time by adding new symbols.

      The documentation says:

      default: A default value for this enumeration, used during resolution when the reader encounters a symbol from the writer that isn't defined in the reader's schema (optional). The value provided here must be a JSON string that's a member of the symbols array. See documentation on schema resolution for how this gets used. 

       

      The section of the documentation about schema resolution says:

      https://avro.apache.org/docs/current/spec.html#Schema+Resolution

      if both are enums:
      if the writer's symbol is not present in the reader's enum and the reader has a default value, then that value is used, otherwise an error is signalled. 
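
      The quoted rule can be sketched in plain Java as follows. This is only an illustration of the wording above, not Avro's implementation: the class `EnumResolutionSketch` and its `resolve` method are hypothetical names, and the exception type is an arbitrary choice.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical helper, not part of the Avro API: models the quoted enum resolution rule.
final class EnumResolutionSketch {

    // Returns the symbol the reader should see for a symbol written by the writer.
    static String resolve(String writerSymbol,
                          List<String> readerSymbols,
                          Optional<String> readerEnumDefault) {
        if (readerSymbols.contains(writerSymbol)) {
            // The symbol is known to the reader: it passes through unchanged.
            return writerSymbol;
        }
        // Unknown symbol: use the reader's enum default if there is one,
        // otherwise signal an error.
        return readerEnumDefault.orElseThrow(
            () -> new IllegalStateException("No match for " + writerSymbol));
    }
}
```

      Under this reading, `resolve("C", List.of("A", "B"), Optional.of("A"))` yields `"A"`, while the same call with `Optional.empty()` throws.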

      This feature was introduced in Avro 1.9.0; see:

      https://avro.apache.org/docs/current/spec.html#Enums

       

      However, I have found that it doesn't work the way the specification says.

      Here is an example.

       

      Suppose version 1 of the schema used for writing has two symbols (A and B) and specifies symbol A as the default.

      {
          "type": "record",
          "name": "RecordA",
          "fields":
          [
              {
                  "name": "fieldA",
                  "type":
                  {
                      "type": "enum",
                      "name": "Enum1",
                      "symbols":
                      [
                          "A",
                          "B"
                      ]
                  },
                  "default": "A"
              }
          ]
      } 

      Later, when the writer's schema needs to evolve, we add a new symbol (C) and publish version 2 of the schema. The default value is still A.

      {
          "type": "record",
          "name": "RecordA",
          "fields":
          [
              {
                  "name": "fieldA",
                  "type":
                  {
                      "type": "enum",
                      "name": "Enum1",
                      "symbols":
                      [
                          "A",
                          "B",
                          "C"
                      ]
                  },
                  "default": "A"
              }
          ]
      } 

      According to the documentation, a reader using the old schema (version 1) should be able to deserialize a payload containing the enum value C that was produced by a writer using the version 2 schema. Since the value C is unknown to the reader, it should be deserialized as A.

      Again, the documentation says:

      A default value for this enumeration, used during resolution when the reader encounters a symbol from the writer that isn't defined in the reader's schema 

      Either the documentation is wrong or the Avro deserialization code is wrong. Since this was an intended feature, I assume this is a bug in the code.

       

      I have forked the repository and created a test to demonstrate the issue:

      https://github.com/idkw/avro/commit/7d36203c137aa6a728d5b85b87969a3f743b45ee

      The test verifies that a reader using the old schema deserializes the value A when it receives the value C. However, it fails with the exception `org.apache.avro.AvroTypeException: No match for C`

      @Test
      public void enumRecordWithExtendedSchemaCanBeReadIfNewValuesAreUsedUsingDefault() throws Exception {
        // The reader uses version 1 (symbols A, B); the writer uses version 2 (symbols A, B, C).
        Schema readerSchemaV1 = ENUM_AB_RECORD_DEFAULT_A;
        Schema writerSchemaV2 = ENUM_ABC_RECORD_DEFAULT_A;
        // Write a record whose enum field holds C, a symbol the reader does not know.
        Record record = defaultRecordWithSchema(
          writerSchemaV2,
          FIELD_A,
          new EnumSymbol(writerSchemaV2, "C")
        );
        byte[] encoded = encodeGenericBlob(record);
        // Decode with the old reader schema; the unknown symbol should resolve to the default A.
        Record decodedRecord = decodeGenericBlob(
          readerSchemaV1,
          writerSchemaV2,
          encoded
        );
        Assert.assertEquals("A", decodedRecord.get(FIELD_A).toString());
      }

       

      The test should not fail; the value should deserialize to "A".
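
      One thing that may be worth checking, offered as an assumption rather than a confirmed diagnosis: in the schemas above, `"default": "A"` sits on the record field, next to `name` and `type`, while the quoted specification text describes a `default` attribute of the enumeration itself. If only the enum-level default takes part in symbol resolution, the two placements behave differently. The plain-Java sketch below models that assumption; `DefaultPlacementSketch`, `ReaderField` and `readSymbol` are hypothetical names, not Avro classes.

```java
import java.util.List;

// Hypothetical model, not the Avro API: keeps a record field's default
// separate from the default declared on the enum type.
final class DefaultPlacementSketch {

    // Reader-side view of one enum-typed record field.
    static final class ReaderField {
        final List<String> symbols;
        final String fieldDefault; // "default" next to the field's "name" and "type"; null if absent
        final String enumDefault;  // "default" inside the enum type definition; null if absent

        ReaderField(List<String> symbols, String fieldDefault, String enumDefault) {
            this.symbols = symbols;
            this.fieldDefault = fieldDefault;
            this.enumDefault = enumDefault;
        }
    }

    // Assumption: only the enum-level default is consulted for unknown symbols;
    // the field-level default is deliberately ignored here.
    static String readSymbol(ReaderField reader, String writerSymbol) {
        if (reader.symbols.contains(writerSymbol)) {
            return writerSymbol;
        }
        if (reader.enumDefault != null) {
            return reader.enumDefault;
        }
        throw new IllegalStateException("No match for " + writerSymbol);
    }
}
```

      With `new ReaderField(List.of("A", "B"), "A", null)`, which mirrors the placement used in this report, `readSymbol(reader, "C")` throws; with `new ReaderField(List.of("A", "B"), null, "A")` it returns `"A"`.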

            People

              Assignee: Unassigned
              Reporter: Valentin (idkw0)