Avro / AVRO-600

add support for type and field name aliases

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.4.0
    • Component/s: java, spec
    • Labels:
      None

      Description

      It would be good if Avro permitted one to still read data after a type or field name has been changed. I propose we add a notion of name aliases. Aliases could be listed for every named type and for record fields. The writer's schema would be permitted to contain any of the aliases.

      In general, this permits one to construct schemas that can read different types into a single type. One could use this not just to handle renamings, but also to join different datasets. For example, if two datasets each contain differently named records with a date and an IP address field, this could be used to project both onto a single record with just those fields.
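The projection described above can be sketched as follows. This is only an illustration under stated assumptions: schemas and records are plain Python dicts rather than objects from an Avro library, and every name in it (`org.z.Hit`, `org.x.WebLog`, `org.y.AuthEvent`, the field names, and the `project` helper) is hypothetical.

```python
# Hypothetical reader schema: one record type that aliases two differently
# named writer records, each of which has a date field and an IP address field.
READER = {
    "type": "record", "name": "org.z.Hit",
    "aliases": ["org.x.WebLog", "org.y.AuthEvent"],
    "fields": [
        {"name": "date", "type": "long", "aliases": ["timestamp", "when"]},
        {"name": "ip", "type": "string", "aliases": ["client_ip", "addr"]},
    ],
}


def project(datum, writer_name, reader):
    """Project a written record (a dict) onto the reader's record type."""
    if writer_name != reader["name"] and writer_name not in reader.get("aliases", []):
        raise ValueError("no alias for record type " + writer_name)
    result = {}
    for field in reader["fields"]:
        # Accept the reader's own field name or any of its aliases.
        for candidate in [field["name"]] + field.get("aliases", []):
            if candidate in datum:
                result[field["name"]] = datum[candidate]
                break
        else:
            if "default" not in field:
                raise ValueError("no value or default for field " + field["name"])
            result[field["name"]] = field["default"]
    return result


web = project({"timestamp": 1, "client_ip": "10.0.0.1", "url": "/"},
              "org.x.WebLog", READER)
auth = project({"when": 2, "addr": "10.0.0.2", "user": "u"},
               "org.y.AuthEvent", READER)
```

Both calls produce records with only the `date` and `ip` fields; the extra `url` and `user` values are dropped.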

      1. AVRO-600.patch
        16 kB
        Doug Cutting
      2. AVRO-600.patch
        16 kB
        Doug Cutting
      3. AVRO-600.patch
        13 kB
        Doug Cutting

          Activity

          Doug Cutting added a comment -

          An example:

          Data written with:

          {"type": "record", "name": "org.x.Foo", "fields": [
              {"name": "a", "type": "int"},
              {"name": "b", "type": "int"}
            ]
          }
          

          Could be read with:

          {"type": "record", "name": "org.y.Bar", "fields": [
              {"name": "c", "type": "int", "aliases": ["a"]},
              {"name": "d", "type": "int", "default": 0}
            ],
            "aliases": ["org.x.Foo"]
          }
          

          It would be an error for a type alias to name an already-defined type or for a field alias to name an already-defined field.
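A check for this rule might look like the sketch below. It assumes that "already-defined" means defined within the same schema, which the comment does not spell out, and it uses plain dicts instead of an Avro library. The `check_aliases` helper and the `defined_types` parameter are hypothetical.

```python
def check_aliases(record, defined_types):
    """Raise ValueError if an alias collides with an already-defined name.

    `defined_types` is assumed to hold the full names of the other types
    defined in the same schema.
    """
    for alias in record.get("aliases", []):
        if alias == record["name"] or alias in defined_types:
            raise ValueError("type alias names a defined type: " + alias)
    field_names = {f["name"] for f in record["fields"]}
    for f in record["fields"]:
        for alias in f.get("aliases", []):
            if alias in field_names:
                raise ValueError("field alias names a defined field: " + alias)


GOOD = {"type": "record", "name": "org.y.Bar", "aliases": ["org.x.Foo"], "fields": [
    {"name": "c", "type": "int", "aliases": ["a"]},
    {"name": "d", "type": "int", "default": 0},
]}

# Invalid under the rule: field "c" claims the alias "d", but "d" is a field.
BAD = {"type": "record", "name": "org.y.Bar", "fields": [
    {"name": "c", "type": "int", "aliases": ["d"]},
    {"name": "d", "type": "int", "default": 0},
]}

check_aliases(GOOD, set())  # passes silently
```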

          The semantics would be equivalent to rewriting the writer's schema, replacing matching aliased types and fields with their names in the reader's schema. In the above example, the writer's schema would be rewritten as:

          {"type": "record", "name": "org.y.Bar", "fields": [
              {"name": "c", "type": "int"},
              {"name": "b", "type": "int"}
            ]
          }
          

          When instances are read, values for "a" would be read into the "c" field, values for "b" would be dropped, and "d" would have the default value of zero.
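The example can be traced with a minimal sketch. It models schemas and records as plain dicts and handles only flat records; `apply_aliases` and `read_record` are hypothetical helper names for illustration, not the API of any Avro implementation.

```python
WRITER = {"type": "record", "name": "org.x.Foo", "fields": [
    {"name": "a", "type": "int"},
    {"name": "b", "type": "int"},
]}

READER = {"type": "record", "name": "org.y.Bar", "aliases": ["org.x.Foo"], "fields": [
    {"name": "c", "type": "int", "aliases": ["a"]},
    {"name": "d", "type": "int", "default": 0},
]}


def apply_aliases(writer, reader):
    """Return a copy of a flat writer record with aliased names replaced."""
    # Map each field alias in the reader's schema to the reader's field name.
    renames = {alias: f["name"]
               for f in reader["fields"]
               for alias in f.get("aliases", [])}
    name = writer["name"]
    if name in reader.get("aliases", []):
        name = reader["name"]
    fields = [dict(f, name=renames.get(f["name"], f["name"]))
              for f in writer["fields"]]
    return {"type": "record", "name": name, "fields": fields}


def read_record(datum, writer, reader):
    """Project a datum written with `writer` onto the `reader` schema."""
    rewritten = apply_aliases(writer, reader)
    if rewritten["name"] != reader["name"]:
        raise ValueError("record names do not match")
    # Key each written value by its rewritten field name.
    values = {new["name"]: datum[old["name"]]
              for old, new in zip(writer["fields"], rewritten["fields"])}
    result = {}
    for f in reader["fields"]:
        if f["name"] in values:
            result[f["name"]] = values[f["name"]]
        elif "default" in f:
            result[f["name"]] = f["default"]
        else:
            raise ValueError("no value or default for field " + f["name"])
    return result


rewritten = apply_aliases(WRITER, READER)
record = read_record({"a": 1, "b": 2}, WRITER, READER)
```

Here `rewritten` has the name `org.y.Bar` with fields `c` and `b`, and `record` holds the written value of "a" under `c` and the default under `d`, with "b" dropped.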

          Philip Zeyliger added a comment -

          This seems like it adds quite a bit of complexity to the base Avro system. Could this be layered on top? Perhaps, a separate way to indicate a transformation from one schema to another, which could then be used at read-time?

          – Philip

          Doug Cutting added a comment -

          > This seems like it adds quite a bit of complexity to the base Avro system.

          I think this should be easy to implement as a single-pass re-write of the writer's schema, rewriting any names that are aliases in the reader's schema. In Java, this will be a single recursive method, plus a single call to this method in GenericDatumReader just before the ResolvingDecoder is created.
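A recursive rewrite along these lines might look like the sketch below. It is not the Java method from the patch: it is written in Python over dict-based schemas, it assumes all names are already full names, and it walks the reader's schema once to gather aliases before rewriting the writer's schema. The function names are hypothetical.

```python
NAMED = ("record", "enum", "fixed")


def collect_aliases(schema, types, fields):
    """Walk the reader's schema, recording type aliases and field aliases."""
    if isinstance(schema, list):  # union
        for branch in schema:
            collect_aliases(branch, types, fields)
    elif isinstance(schema, dict):
        kind = schema["type"]
        if kind in NAMED:
            for alias in schema.get("aliases", []):
                types[alias] = schema["name"]
        if kind == "record":
            for f in schema["fields"]:
                for alias in f.get("aliases", []):
                    # Field aliases are scoped to the record that declares them.
                    fields.setdefault(schema["name"], {})[alias] = f["name"]
                collect_aliases(f["type"], types, fields)
        elif kind == "array":
            collect_aliases(schema["items"], types, fields)
        elif kind == "map":
            collect_aliases(schema["values"], types, fields)


def rewrite(schema, types, fields):
    """Return a copy of the writer's schema with aliased names replaced."""
    if isinstance(schema, str):  # primitive, or a reference to a named type
        return types.get(schema, schema)
    if isinstance(schema, list):  # union
        return [rewrite(branch, types, fields) for branch in schema]
    out = dict(schema)
    kind = schema["type"]
    if kind in NAMED:
        out["name"] = types.get(schema["name"], schema["name"])
    if kind == "record":
        renames = fields.get(out["name"], {})
        out["fields"] = [dict(f,
                              name=renames.get(f["name"], f["name"]),
                              type=rewrite(f["type"], types, fields))
                         for f in schema["fields"]]
    elif kind == "array":
        out["items"] = rewrite(schema["items"], types, fields)
    elif kind == "map":
        out["values"] = rewrite(schema["values"], types, fields)
    return out


def rewrite_with_aliases(writer, reader):
    types, fields = {}, {}
    collect_aliases(reader, types, fields)
    return rewrite(writer, types, fields)
```

The rewritten schema is a new object, so the schema stored with the data is left unchanged.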

          Moreover, this can be an optional feature. The schema stored with the data always fully and accurately describes the data. Applications built using implementations without this feature would have to manually correlate data which has different names, as they do today.

          Consider an alternate, functionally-equivalent, implementation that puts such aliases in a separate data structure that's passed to the reader, i.e., an aliasing feature of that particular reader implementation. Such a feature would be useful, and would be completely consistent with the Avro specification. The only difference between that and the proposal here is that the aliases are made available via the schema to every implementation in a standard form should they choose to implement this feature.
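To illustrate the equivalence, the sketch below keeps the aliases in a separate table and merges them into an alias-free reader schema. The table layout and the `attach_aliases` helper are assumptions made for this example, and only flat records are handled.

```python
import copy

# Reader schema without any aliases.
PLAIN_READER = {"type": "record", "name": "org.y.Bar", "fields": [
    {"name": "c", "type": "int"},
    {"name": "d", "type": "int", "default": 0},
]}

# Hypothetical side table: record full name -> type aliases and field aliases.
ALIAS_TABLE = {
    "org.y.Bar": {"aliases": ["org.x.Foo"], "fields": {"c": ["a"]}},
}


def attach_aliases(reader, alias_table):
    """Return a copy of a flat reader record annotated from the side table."""
    annotated = copy.deepcopy(reader)
    entry = alias_table.get(reader["name"], {})
    if entry.get("aliases"):
        annotated["aliases"] = list(entry["aliases"])
    for f in annotated["fields"]:
        names = entry.get("fields", {}).get(f["name"])
        if names:
            f["aliases"] = list(names)
    return annotated


annotated = attach_aliases(PLAIN_READER, ALIAS_TABLE)
```

The result carries the same information as a reader schema that declares the aliases inline, so the two designs differ only in where the aliases live.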

          Doug Cutting added a comment -

          Here's a patch for this, with tests.

          Doug Cutting added a comment -

          Updated patch with documentation added to spec.

          Doug Cutting added a comment -

          I think this is ready to commit.

          Doug Cutting added a comment -

          Here's a new version of the patch, updated for conflicts with AVRO-557.

          Note that now the re-written, aliased schema is cached in the resolving decoder. Performance of the Perf benchmarks is not measurably different.

          Doug Cutting added a comment -

          I will commit this today unless someone objects.

          Doug Cutting added a comment -

          I committed this.


            People

            • Assignee: Doug Cutting
            • Reporter: Doug Cutting
            • Votes: 0
            • Watchers: 3