Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Not A Bug
- Affects Version/s: 1.9.0, 1.10.0, 1.9.1, 1.9.2, 1.11.0, 1.10.1, 1.10.2
- Fix Version/s: None
Description
I wanted to use Avro enums and evolve my schema over time by adding new symbols.
The documentation says:
default: A default value for this enumeration, used during resolution when the reader encounters a symbol from the writer that isn't defined in the reader's schema (optional). The value provided here must be a JSON string that's a member of the symbols array. See documentation on schema resolution for how this gets used.
The section of the documentation on schema resolution says:
https://avro.apache.org/docs/current/spec.html#Schema+Resolution
if both are enums: if the writer's symbol is not present in the reader's enum and the reader has a default value, then that value is used, otherwise an error is signalled.
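The rule quoted above can be modelled in a few lines. This is only a sketch of the rule as worded in the specification, not Avro's implementation: the class `EnumResolutionSketch` and the method `resolve` are hypothetical names, and a `null` default stands for a reader enum that declares no default.

```java
import java.util.Arrays;
import java.util.List;

final class EnumResolutionSketch {

    // Hypothetical helper, not part of the Avro API: applies the quoted
    // resolution rule to a single symbol written by the writer.
    static String resolve(String writerSymbol, List<String> readerSymbols, String readerDefault) {
        if (readerSymbols.contains(writerSymbol)) {
            return writerSymbol;        // symbol is known to the reader
        }
        if (readerDefault != null) {
            return readerDefault;       // unknown symbol, reader enum has a default
        }
        // unknown symbol and no default: an error is signalled
        throw new IllegalStateException("No match for " + writerSymbol);
    }

    public static void main(String[] args) {
        List<String> readerSymbols = Arrays.asList("A", "B");
        System.out.println(resolve("B", readerSymbols, "A")); // prints B
        System.out.println(resolve("C", readerSymbols, "A")); // prints A
        try {
            resolve("C", readerSymbols, null);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());               // prints No match for C
        }
    }
}
```

In this model the outcome depends only on whether the reader's enum itself carries a default.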
This feature was introduced in Avro 1.9.0 with this issue:
https://avro.apache.org/docs/current/spec.html#Enums
However, I have found that it does not behave as the specification describes.
Here is an example.
Suppose version 1 of the schema is used for writing.
It has two symbols (A and B) and specifies symbol A as the default.
{
  "type": "record",
  "name": "RecordA",
  "fields": [
    {
      "name": "fieldA",
      "type": {
        "type": "enum",
        "name": "Enum1",
        "symbols": ["A", "B"]
      },
      "default": "A"
    }
  ]
}
Later, when the writer's schema needs to evolve, we add a new symbol (C) and publish version 2 of the schema.
The default value is still A.
{
  "type": "record",
  "name": "RecordA",
  "fields": [
    {
      "name": "fieldA",
      "type": {
        "type": "enum",
        "name": "Enum1",
        "symbols": ["A", "B", "C"]
      },
      "default": "A"
    }
  ]
}
According to the documentation, a reader using the old schema (version 1) should be able to deserialize a payload containing the enum value C that was written with the version 2 schema. Since the value C is unknown to the reader, it should be deserialized as A.
Again, the documentation says:
A default value for this enumeration, used during resolution when the reader encounters a symbol from the writer that isn't defined in the reader's schema
Either the documentation is wrong or the Avro deserialization code is wrong. Since this was an intended feature, I assume that the code is wrong and that this is a bug.
I have forked the repository and created a test to demonstrate the issue:
https://github.com/idkw/avro/commit/7d36203c137aa6a728d5b85b87969a3f743b45ee
The test verifies that a reader using the old schema deserializes the value A when it receives the value C. However, it fails with the exception `org.apache.avro.AvroTypeException: No match for C`
@Test
public void enumRecordWithExtendedSchemaCanBeReadIfNewValuesAreUsedUsingDefault() throws Exception {
  Schema readerSchemaV1 = ENUM_AB_RECORD_DEFAULT_A;
  Schema writerSchemaV2 = ENUM_ABC_RECORD_DEFAULT_A;
  Record record = defaultRecordWithSchema(writerSchemaV2, FIELD_A, new EnumSymbol(writerSchemaV2, "C"));
  byte[] encoded = encodeGenericBlob(record);
  Record decodedRecord = decodeGenericBlob(readerSchemaV1, writerSchemaV2, encoded);
  Assert.assertEquals("A", decodedRecord.get(FIELD_A).toString());
}
The test should not fail; the value should deserialize to "A".
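For comparison, here is a small model of where the `default` attribute sits in the schemas above. It rests on one assumption, taken from the spec wording "a default value for this enumeration": that enum resolution consults only a `default` placed inside the enum type object, and that the `default` written next to the field's `name` and `type` (as in both schemas above) is a separate, field-level attribute. The nested maps stand in for parsed schema JSON, and `EnumDefaultPlacementSketch`, `field` and `resolve` are hypothetical names, not Avro API.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class EnumDefaultPlacementSketch {

    // Builds a map shaped like the "fieldA" definition of the reader schema (symbols A and B).
    // Either default may be null, meaning the attribute is absent at that level.
    static Map<String, Object> field(String fieldLevelDefault, String enumLevelDefault) {
        Map<String, Object> enumType = new LinkedHashMap<>();
        enumType.put("type", "enum");
        enumType.put("name", "Enum1");
        enumType.put("symbols", Arrays.asList("A", "B"));
        if (enumLevelDefault != null) {
            enumType.put("default", enumLevelDefault);   // default inside the enum type
        }
        Map<String, Object> field = new LinkedHashMap<>();
        field.put("name", "fieldA");
        field.put("type", enumType);
        if (fieldLevelDefault != null) {
            field.put("default", fieldLevelDefault);     // default next to "name" and "type"
        }
        return field;
    }

    // Assumption: only the enum-level default takes part in symbol resolution.
    @SuppressWarnings("unchecked")
    static String resolve(Map<String, Object> readerField, String writerSymbol) {
        Map<String, Object> enumType = (Map<String, Object>) readerField.get("type");
        List<String> symbols = (List<String>) enumType.get("symbols");
        if (symbols.contains(writerSymbol)) {
            return writerSymbol;
        }
        Object enumDefault = enumType.get("default");
        if (enumDefault != null) {
            return (String) enumDefault;
        }
        throw new IllegalStateException("No match for " + writerSymbol);
    }

    public static void main(String[] args) {
        // Default inside the enum type: the unknown symbol C falls back to A.
        System.out.println(resolve(field(null, "A"), "C"));
        // Default only at field level, as in the schemas above: resolution fails.
        try {
            resolve(field("A", null), "C");
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Under that assumption, a reader schema with only a field-level default produces the same "No match for C" message as the failing test, while a default inside the enum type yields A.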
Issue Links
- Discovered while testing: AVRO-1340 use default to allow old readers to specify default enum value when encountering new enum symbols (Resolved)