  1. Apache Avro
  2. AVRO-2159

Naming Limitations of Schemas in Stricter Reference Contexts

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: spec
    • Labels:
      None

      Description

      (Excuse the lengthiness of this ticket description - it was initially written as an email that became too long. Feel free to correct any misguided reasoning.)

      I've come to realize that there are some undesirable constraints on how Avro schemas can be used in Java code generation and IDL; these constraints appear only as minor annoyances when schemas are used generically. In particular, I'm focused on cases where it's desirable to use two schemas that have the same name in some context.
       
      Issue:
      Suppose I'm writing an application that publishes many different kinds of data somewhere, with each type of data having its own schema. And then suppose that some number of those schemas would like to share some kind of common schema, to start with.
       
      If I do this, and I happen to be using Java code generation to manage schemas, I'll soon find difficulty in two directions:
       

      • I would find it difficult to upgrade the data shared among all of these external schemas by way of the common schema, without upgrading all of those schemas at the same time. The problem is that neither Java's classpath nor an IDL protocol can hold two schemas with the same name at once, because Avro's name field maps directly to a class name on the classpath or to a reference name among a protocol's symbols.
         
        The intermediate step of the application being partially migrated between version 1 and version 2 of a common schema has no representation in either of these contexts. Using a different name becomes a very annoying option in many cases, since it is an incompatible change (or with aliases, it's at least not consistently compatible across implementations).
      • I would find it difficult to migrate away from the external schemas using that shared schema, for the same reasons listed above.

      In IDL (without code generation), these issues can usually be avoided by creating a second protocol; in generic Avro, they can be avoided by using a different schema parser or schema builder.
       
      Analysis:
      At first glance, it is tempting to blame the name-matching requirement for schema resolution as the culprit, and it may be correct in many cases that requiring schemas to have a compatible structure is all that is needed.
       
      However, the way I see it is that the name-matching requirement for schema resolution is there to ensure that there is the intent for two schemas to resolve with each other, and the rest of the checks are just there to make sure that such an intent can be reasonably carried out.
       
      The difficulty in either of the two examples above arises not from a lack of pre-determined intent for schemas to resolve, but from the inability to supply a unique reference for each of the schemas while still intending that the correct groups of schemas resolve with each other.
       
      Thus, the way to avoid these issues so far has been to create a new reference context, and the severity of the issue in each case corresponds to the difficulty of creating a new reference context:

      • For generic schemas, create a new parser or schema builder [easy - minorly annoying]
      • For IDL, create a new protocol [minorly annoying - somewhat annoying]
      • For Java code generation, create a new classpath [very annoying (Java 9) - impossible]
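
      The three cases above share one mechanism, which can be sketched in a few lines. This is only an illustration: ReferenceContext is a hypothetical stand-in for a parser, an IDL protocol, or a classpath, and the example.Common schemas are invented for the example. It does not use the Avro library.

```python
class ReferenceContext:
    """Hypothetical stand-in for a parser, an IDL protocol, or a classpath."""

    def __init__(self):
        self._by_name = {}

    def define(self, schema):
        # A context can hold at most one schema per full name.
        full_name = schema["namespace"] + "." + schema["name"]
        if full_name in self._by_name:
            raise ValueError("cannot redefine " + full_name)
        self._by_name[full_name] = schema
        return full_name

    def lookup(self, full_name):
        return self._by_name[full_name]


# Two versions of the same (invented) common schema; both keep the same name
# so that they are still intended to resolve with each other.
common_v1 = {"type": "record", "namespace": "example", "name": "Common",
             "fields": [{"name": "id", "type": "long"}]}
common_v2 = {"type": "record", "namespace": "example", "name": "Common",
             "fields": [{"name": "id", "type": "long"},
                        {"name": "source", "type": "string", "default": ""}]}

# One context: the second definition collides with the first.
shared = ReferenceContext()
shared.define(common_v1)
try:
    shared.define(common_v2)
    collided = False
except ValueError:
    collided = True

# The workaround described above: one context per version.
old_context, new_context = ReferenceContext(), ReferenceContext()
old_context.define(common_v1)
new_context.define(common_v2)
```

      Creating a second ReferenceContext is trivial here, which corresponds to the generic case; the sketch says nothing about how costly the equivalent step is for a protocol or a classpath.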

      Based on that, I understand a schema's name as expressing two overlapping meanings:

      • the intent to be able to resolve with other schemas with the same name (let's call this the resolveName)
      • the ability to be uniquely referenced from some context (let's call this the referenceName)

       
      If these two meanings were able to be specified independently, I think that schemas would be much easier to use in contexts where references are more limited.
       
      Speculative Solutions:
      Minimally, I think it's reasonable to create at least one new field to separate the meaning of a schema's referenceName from its resolveName, and use the old name field to compatibly handle missing values. Other tools that don't immediately apply schema resolution can then optionally upgrade to support using the referenceName instead of the resolveName.
       
      Beyond that, having name continue to mean resolveName would mean that old avro implementations would be able to treat newer schemas as valid and resolve against them correctly. So I think it's reasonable to say that referenceName should be the new field introduced (not necessarily with that name).
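
      A minimal sketch of that split, assuming the new field is literally called referenceName (the name is not settled, as noted above) and that it falls back to name when absent. The User schemas and the helper functions are invented for illustration; only the name-based intent check is modelled, not aliases or the structural checks of real schema resolution.

```python
def resolve_name(schema):
    # The existing name field keeps its resolution meaning.
    return schema["name"]


def reference_name(schema):
    # Hypothetical new field; a missing value falls back to name,
    # so schemas written today behave exactly as before.
    return schema.get("referenceName", schema["name"])


def intends_to_resolve(writer, reader):
    # Only the intent check: do the two schemas claim the same resolve name?
    return resolve_name(writer) == resolve_name(reader)


user_v1 = {"type": "record", "name": "User",
           "fields": [{"name": "id", "type": "long"}]}
user_v2 = {"type": "record", "name": "User", "referenceName": "UserV2",
           "fields": [{"name": "id", "type": "long"},
                      {"name": "email", "type": "string", "default": ""}]}

# A single reference context, keyed by reference name, can now hold both.
context = {}
for schema in (user_v1, user_v2):
    key = reference_name(schema)
    if key in context:
        raise ValueError("cannot redefine " + key)
    context[key] = schema
```

      Both versions live in one context under the keys User and UserV2, while intends_to_resolve(user_v1, user_v2) still holds because name is unchanged.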
       
      Assuming I've made no mistakes up until this point, there are some remaining questions:
       
      1. How should this appear in IDL? There are two solutions that come to mind, using the existing intuition of how annotations work:
       
        a. Declared as: @ref("UserV2") record User {}
            Used as: @ref("UserV2") User user;
       
            - Can be quite verbose
            - The annotated type (User) only really exists as a placeholder
            - Requires a new error message for when a type is used but needs a reference name.

        b. Declared as: @name("User") record UserV2 {}
            Used as: UserV2 user;
       
            - Relies on a weird intuition of how a "type name" maps to a raw schema. Normally, the type name becomes the schema's name field, but when the name field is specified by annotation, the type name just becomes the referenceName field.
            - Requires more work to change the implementation of type annotations, since name is normally a reserved field.
            - Matches the current intuition that two schemas resolve only if their name fields match, and that type names are always used for references.
       
      2. How should the namespace field be handled?

      Should both names share the namespace field and act like the name field does now (where any name that contains a dot is assumed to specify the full name)? Or should reference names ignore the namespace field (and maybe have one of their own)?
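
      The two options can be compared side by side. In this sketch, full_name follows the existing rule for name (a name containing a dot is taken as the full name, otherwise the namespace is prepended). The referenceName and referenceNamespace fields are hypothetical, as are both helper functions.

```python
def full_name(name, namespace=None):
    # Existing rule for name: a dotted name is already the full name.
    if "." in name:
        return name
    if namespace:
        return namespace + "." + name
    return name


def reference_full_name_shared(schema):
    # Option 1: referenceName shares the schema's namespace field.
    ref = schema.get("referenceName", schema["name"])
    return full_name(ref, schema.get("namespace"))


def reference_full_name_independent(schema):
    # Option 2: referenceName ignores namespace and may carry its own
    # (hypothetical) referenceNamespace field instead.
    if "referenceName" not in schema:
        return full_name(schema["name"], schema.get("namespace"))
    return full_name(schema["referenceName"], schema.get("referenceNamespace"))


user = {"type": "record", "name": "User", "namespace": "com.example",
        "referenceName": "UserV2", "fields": []}

shared_result = reference_full_name_shared(user)            # "com.example.UserV2"
independent_result = reference_full_name_independent(user)  # "UserV2"
```

      Under the shared option the generated reference stays inside com.example; under the independent option it lands in the default namespace unless a separate namespace is given.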
       
      3. If Avro parsers are changed to use referenceName instead of name, when present, how much concern is there about old parsers not being able to parse the new referenceName field if it is used sparingly (only generated when necessary)?

            People

            • Assignee:
              Unassigned
              Reporter:
              howellbridger Bridger Howell
            • Votes:
              1
              Watchers:
              2
