  1. Apache Avro
  2. AVRO-3029

Specification is a little ambiguous about where enum defaults should be defined which might be causing library differences


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 1.10.1
    • Fix Version/s: None
    • Component/s: java, python, ruby
    • Labels: None

    Description

      In the specification, an enum type can have a `default` attribute. At the same time, each field in a record can have a default. On top of that, the table of example default values for record fields includes an enum entry.

      So, if I want to define a record with an enum field, where do I put the default? Do I define it like this:

      {
          "type": "record",
          "name": "test",
          "fields": [
              {
                  "name": "enum",
                  "type": {
                      "type": "enum",
                      "name": "enum_field",
                      "symbols": ["FOO", "BAR"]
                  },
                  "default": "FOO"
              }
          ]
      }
      

      Or like this:

      {
          "type": "record",
          "name": "test",
          "fields": [
              {
                  "name": "enum",
                  "type": {
                      "type": "enum",
                      "name": "enum_field",
                      "symbols": ["FOO", "BAR"],
                      "default": "FOO"
                  }
              }
          ]
      }
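
When reading someone else's schema it can help to see at a glance which of the two placements it uses. The following is a minimal sketch using only the standard library; `default_locations` is a hypothetical helper for illustration and is not part of any Avro library.

```python
import json


def default_locations(schema_json):
    """Map each enum field name to the places where a default is declared.

    Hypothetical helper: it only inspects top-level record fields whose
    type is an inline enum definition.
    """
    schema = json.loads(schema_json)
    result = {}
    for field in schema.get("fields", []):
        ftype = field.get("type")
        if isinstance(ftype, dict) and ftype.get("type") == "enum":
            places = []
            if "default" in field:
                places.append("field")  # default sits next to "name"/"type"
            if "default" in ftype:
                places.append("enum")   # default sits inside the enum definition
            result[field["name"]] = places
    return result


# The two variants from the examples above.
FIELD_LEVEL = json.dumps({
    "type": "record",
    "name": "test",
    "fields": [{
        "name": "enum",
        "type": {"type": "enum", "name": "enum_field", "symbols": ["FOO", "BAR"]},
        "default": "FOO",
    }],
})

ENUM_LEVEL = json.dumps({
    "type": "record",
    "name": "test",
    "fields": [{
        "name": "enum",
        "type": {"type": "enum", "name": "enum_field",
                 "symbols": ["FOO", "BAR"], "default": "FOO"},
    }],
})

print(default_locations(FIELD_LEVEL))  # {'enum': ['field']}
print(default_locations(ENUM_LEVEL))   # {'enum': ['enum']}
```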
      

      I was confused, so I started looking for examples, and it seems I'm not the only one. A Stack Overflow post and https://issues.apache.org/jira/browse/AVRO-2518 put the default at the field level, whereas https://issues.apache.org/jira/browse/AVRO-2879 puts it at the enum level.

      So then I started looking at examples in the codebase. There is a Ruby test case and a Java test case that both put the default at the enum level.

      Okay, solved, right? Since the test cases have the default at the enum level, that's where it should be. But then I wrote a simple Python script (since I'm a Python user) to double-check this, and the Python library seems to disagree. Here's the example script, which puts the default at the enum level:

      import json
      from io import BytesIO
      import avro.schema
      from avro.datafile import DataFileReader, DataFileWriter
      from avro.io import DatumReader, DatumWriter
      
      writer_schema = {
          "type": "record",
          "name": "test",
          "fields": [
              {
                  "name": "foo",
                  "type": "string"
              }
          ],
      }
      
      reader_schema = {
          "type": "record",
          "name": "test",
          "fields": [
              {
                  "name": "foo",
                  "type": "string"
              },
              {
                  "name": "enum",
                  "type": {
                      "type": "enum",
                      "name": "enum_field",
                      "symbols": ["FOO", "BAR"],
                      "default": "FOO",
                  },
              },
          ],
      }
      
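      # Parse both schemas from their JSON form.
      w_schema = avro.schema.parse(json.dumps(writer_schema))
      r_schema = avro.schema.parse(json.dumps(reader_schema))
      
      # Write one record with the writer schema into an in-memory buffer.
      bio = BytesIO()
      
      writer = DataFileWriter(bio, DatumWriter(), w_schema)
      writer.append({"foo": "bar"})
      writer.flush()
      
      # Rewind and read back, resolving the writer schema against the reader schema.
      bio.seek(0)
      
      reader = DataFileReader(bio, DatumReader(w_schema, r_schema))
      for record in reader:
          print(record)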
      

      But when I run that, I get an exception:

      avro.io.SchemaResolutionException: No default value for field enum
      Writer's Schema: {
        "type": "record",
        "name": "test",
        "fields": [
          {
            "type": "string",
            "name": "foo"
          }
        ]
      }
      Reader's Schema: {
        "type": "record",
        "name": "test",
        "fields": [
          {
            "type": "string",
            "name": "foo"
          },
          {
            "type": {
              "type": "enum",
              "default": "FOO",
              "name": "enum_field",
              "symbols": [
                "FOO",
                "BAR"
              ]
            },
            "name": "enum"
          }
        ]
      }
      

      And if I change the script to use a reader_schema that has the default at the field level, like this:

      reader_schema = {
          "type": "record",
          "name": "test",
          "fields": [
              {
                  "name": "foo",
                  "type": "string"
              },
              {
                  "name": "enum",
                  "type": {
                      "type": "enum",
                      "name": "enum_field",
                      "symbols": ["FOO", "BAR"],
                  },
                  "default": "FOO",
              },
          ],
      }
      

      Then it works and prints out the record with the default value for the enum:

      {'foo': 'bar', 'enum': 'FOO'}
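
The two outcomes above are consistent with a resolver that fills a missing reader field only from the field-level `default`. The sketch below is a simplified model of that observed behaviour, written with the standard library only; it is not the avro library's actual code, and `fill_missing_fields` and `MissingDefault` are names invented for this illustration.

```python
class MissingDefault(Exception):
    """Stand-in for avro.io.SchemaResolutionException in this sketch."""


def fill_missing_fields(written_record, reader_schema):
    """Model of missing-field resolution that looks only at field-level defaults."""
    result = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in written_record:
            result[name] = written_record[name]
        elif "default" in field:
            # Only the field-level default is consulted here.
            result[name] = field["default"]
        else:
            raise MissingDefault(f"No default value for field {name}")
    return result


ENUM_LEVEL_READER = {
    "type": "record",
    "name": "test",
    "fields": [
        {"name": "foo", "type": "string"},
        {"name": "enum",
         "type": {"type": "enum", "name": "enum_field",
                  "symbols": ["FOO", "BAR"], "default": "FOO"}},
    ],
}

FIELD_LEVEL_READER = {
    "type": "record",
    "name": "test",
    "fields": [
        {"name": "foo", "type": "string"},
        {"name": "enum",
         "type": {"type": "enum", "name": "enum_field", "symbols": ["FOO", "BAR"]},
         "default": "FOO"},
    ],
}

try:
    fill_missing_fields({"foo": "bar"}, ENUM_LEVEL_READER)
except MissingDefault as exc:
    print(exc)  # No default value for field enum

print(fill_missing_fields({"foo": "bar"}, FIELD_LEVEL_READER))
# {'foo': 'bar', 'enum': 'FOO'}
```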
      

      I don't have a Java environment set up to run the same kind of script against that implementation, but based on the test case I would assume it behaves the opposite way and expects the default at the enum level.

      I think making the libraries consistent could cause massive breakage for whichever library doesn't currently conform to what the specification intends (which I'm honestly not sure about, given how the spec is currently written). Therefore, I think it might be easiest to allow an enum's default to be defined at either the field level or the enum level. I maintain the `fastavro` library, whose behavior matches the avro Python implementation, and I would hate to force a massive breaking change on its users if the specification is updated to say that enum default values must be defined at the enum level rather than the field level.
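
To make the "either level" idea concrete, here is a rough sketch of a schema preprocessing step that treats an enum-level default as a fallback when the field itself declares none. It uses only the standard library; `lift_enum_defaults` is a hypothetical helper, not an existing function in avro or `fastavro`, and it only handles inline enum definitions on top-level record fields.

```python
import copy


def lift_enum_defaults(schema):
    """Return a copy of a record schema where an enum-level default is also
    declared at the field level, unless the field already has its own default."""
    out = copy.deepcopy(schema)  # leave the caller's schema untouched
    for field in out.get("fields", []):
        ftype = field.get("type")
        is_enum = isinstance(ftype, dict) and ftype.get("type") == "enum"
        if is_enum and "default" in ftype and "default" not in field:
            field["default"] = ftype["default"]
    return out


reader_schema = {
    "type": "record",
    "name": "test",
    "fields": [
        {"name": "foo", "type": "string"},
        {"name": "enum",
         "type": {"type": "enum", "name": "enum_field",
                  "symbols": ["FOO", "BAR"], "default": "FOO"}},
    ],
}

lifted = lift_enum_defaults(reader_schema)
print(lifted["fields"][1]["default"])  # FOO
```

The result could then be passed through `json.dumps` to `avro.schema.parse` as in the script above. An explicit field-level default always wins in this sketch, so schemas that already work today are left unchanged.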

      Please let me know your thoughts and thank you for taking the time to read this lengthy message.

       

       


          People

            Assignee: Unassigned
            Reporter: Scott (scottbelden)
            Votes: 0
            Watchers: 2
