Avro
  1. Avro
  2. AVRO-872

Allow interdependancies across IDL schema imports

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.5.3
    • Component/s: java
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      This currently doesn't work because Player depends on Position, but it should:

      $ cat position.avsc 
      {"type":"enum", "name": "Position", "namespace": "avro.examples.baseball",
          "symbols": ["P", "C", "B1", "B2", "B3", "SS", "LF", "CF", "RF", "DH"]
      }
      
      $ cat player.avsc 
      {"type":"record", "name":"Player", "namespace": "avro.examples.baseball",
        "fields": [
         {"name": "number", "type": "int"},
         {"name": "first_name", "type": "string"},
         {"name": "last_name", "type": "string"},
         {"name": "position", "type": {"type": "array", "items": "avro.examples.baseball.Position"} }
        ]
      }
      
      $ cat baseball.avdl 
      @namespace("avro.examples.baseball")
      protocol Baseball {
         import schema "position.avsc";
         import schema "player.avsc";
      }
      
      $ java -jar avro-tools-1.5.1.jar idl baseball.avdl baseball.avpr
      
      1. AVRO_872_commit.patch
        9 kB
        Bill Graham
      2. AVRO-872.patch
        9 kB
        Doug Cutting
      3. AVRO-872_4.patch
        8 kB
        Bill Graham
      4. AVRO-872.patch
        8 kB
        Doug Cutting
      5. AVRO-872.patch
        8 kB
        Doug Cutting
      6. AVRO-872.patch
        8 kB
        Doug Cutting

        Activity

        Hide
        Doug Cutting added a comment -

        Here's a patch that fixes this.

        It replaces the Schema.parse() methods with a more flexible Schema.Parser API. If folks think this new API is reasonable, then we should perhaps switch all of the calls to Schema.parse() to the new API.

        This includes the test case you provided.

        Show
        Doug Cutting added a comment - Here's a patch that fixes this. It replaces the Schema.parse() methods with a more flexible Schema.Parser API. If folks think this new API is reasonable, then we should perhaps switch all of the calls to Schema.parse() to the new API. This includes the test case you provided.
        Hide
        Doug Cutting added a comment -

        Updated patch. Schema.Parser setter methods now permit chaining.

        Show
        Doug Cutting added a comment - Updated patch. Schema.Parser setter methods now permit chaining.
        Hide
        Bill Graham added a comment -

        Thanks Doug. I've verified that the idl tool now generates a protocol file. I'm unable to parse this using the Schema.parse(File file) method though. Is that supposed to work, or am I doing it wrong?

        I've also verified that this now works:

        Schema.Parser parser = new Schema.Parser();
        parser.parse(new File("position.avsc"));
        Schema playerSchema = parser.parse(new File("player.avsc"));
        

        On the email list we've been discussion alternate APIs (http://search-hadoop.com/m/wzbMK11aTL82). Something like this:

        public Schema Schema.parse(File[] files);
        public Schema Schema.parse(File[] files, Map<Name, Schema> context);
        

        I propose morphing these two approaches to something that could be used like this:

        Schema.Parser parser = new Schema.Parser();
        parser.parse(new File("position.avsc"));
        parser.parse(new File("player.avsc"));
        Schema schema = parser.getSchemaByName("Player");
        
        // or alternatively you can pass multiple files to the parse method once
        parser.parse(mySchemaFiles);
        

        Thoughts?

        Show
        Bill Graham added a comment - Thanks Doug. I've verified that the idl tool now generates a protocol file. I'm unable to parse this using the Schema.parse(File file) method though. Is that supposed to work, or am I doing it wrong? I've also verified that this now works: Schema.Parser parser = new Schema.Parser(); parser.parse(new File("position.avsc")); Schema playerSchema = parser.parse(new File("player.avsc")); On the email list we've been discussion alternate APIs ( http://search-hadoop.com/m/wzbMK11aTL82 ). Something like this: public Schema Schema.parse(File[] files); public Schema Schema.parse(File[] files, Map<Name, Schema> context); I propose morphing these two approaches to something that could be used like this: Schema.Parser parser = new Schema.Parser(); parser.parse(new File("position.avsc")); parser.parse(new File("player.avsc")); Schema schema = parser.getSchemaByName("Player"); // or alternatively you can pass multiple files to the parse method once parser.parse(mySchemaFiles); Thoughts?
        Hide
        Doug Cutting added a comment -

        I believe the former will work with the current patch: sequential calls using the same parser will accumulate names in the parser. The latter would be an easy addition to the patch, if desired.

        To parse a protocol file use Protocol.parse(File). We should perhaps convert that to a Schema.Parser-style API too.

        Show
        Doug Cutting added a comment - I believe the former will work with the current patch: sequential calls using the same parser will accumulate names in the parser. The latter would be an easy addition to the patch, if desired. To parse a protocol file use Protocol.parse(File). We should perhaps convert that to a Schema.Parser-style API too.
        Hide
        Doug Cutting added a comment -

        Note that the 'getSchemaByName' method you use in the first case above does not exist. Instead you can either use the value of parser.parse(). If we did want to add a 'getSchemaByName' method it would need to accept a fully-qualified name, e.g., parser.getSchemaByName("org.apache.avro.examples.Player"). Or we could change parser.getTypes() to instead return Map<String,Schema> so you could call parser.getTypes().get("org.apache.avro.examples.Player").

        Show
        Doug Cutting added a comment - Note that the 'getSchemaByName' method you use in the first case above does not exist. Instead you can either use the value of parser.parse(). If we did want to add a 'getSchemaByName' method it would need to accept a fully-qualified name, e.g., parser.getSchemaByName("org.apache.avro.examples.Player"). Or we could change parser.getTypes() to instead return Map<String,Schema> so you could call parser.getTypes().get("org.apache.avro.examples.Player").
        Hide
        Bill Graham added a comment -

        Yes, the former parsing approach will work with the current patch. We'd just need to add the convenience method to do

        Schema schema = parser.getSchemaByName("Player");

        (Or maybe getType aligns more with the existing API?) This would allow the caller to load a number of files and then fetch the schema(s) they're interested in, without necessarily knowing which file contained it. They'd still need to be sure to parse in reverse-dependent order.

        Thanks for the pointer re Protocol parsing, that worked.

        Show
        Bill Graham added a comment - Yes, the former parsing approach will work with the current patch. We'd just need to add the convenience method to do Schema schema = parser.getSchemaByName("Player"); (Or maybe getType aligns more with the existing API?) This would allow the caller to load a number of files and then fetch the schema(s) they're interested in, without necessarily knowing which file contained it. They'd still need to be sure to parse in reverse-dependent order. Thanks for the pointer re Protocol parsing, that worked.
        Hide
        Bill Graham added a comment -

        My last comment was sent before seeing yours FYI. Sure we can either expose the Map or a method to get a value from the map. Either way works. Exposing the Map might be be better since it provides more flexibility to the caller.

        Show
        Bill Graham added a comment - My last comment was sent before seeing yours FYI. Sure we can either expose the Map or a method to get a value from the map. Either way works. Exposing the Map might be be better since it provides more flexibility to the caller.
        Hide
        Doug Cutting added a comment -

        Here's a version where getTypes() returns Map<String,Schema>. This implementation not very efficient since it creates new Map entries each time its called. This could be optimized, either by defining and returning a Map implementation wrapping the Names instance, or by converting Names to be a Map<String,Schema> and returning it directly. But I'm not sure that's worth the effort now, since this doesn't seem a likely performance bottleneck.

        Show
        Doug Cutting added a comment - Here's a version where getTypes() returns Map<String,Schema>. This implementation not very efficient since it creates new Map entries each time its called. This could be optimized, either by defining and returning a Map implementation wrapping the Names instance, or by converting Names to be a Map<String,Schema> and returning it directly. But I'm not sure that's worth the effort now, since this doesn't seem a likely performance bottleneck.
        Hide
        Doug Cutting added a comment -

        I'm beginning to question the motivation for this a bit. The example player.avsc file is not a well-formed JSON schema, since it's not standalone, but rather depends on another .avsc file. To date, we've only consumed or produced standalone JSON, not fragmentary. JSON is meant to be the low-level schema language, with IDL as the higher level language, better supporting manually maintained schemas.

        So, before we commit this, I'd like to understand the use case a bit more. Is there a reason one couldn't define the two schemas in a single .avsc, .avpr file, or multiple .avdl files?

        We could I suppose change the Schema parser to, when it encounters an undefined name, look for a file defining it, much like the Java compiler looks for a .java file on the CLASSPATH, but I feel such features should be confined to IDL and that JSON should be primarily used for self-contained schemas. Standalone JSON schemas are what we save with files and exchange in RPC handshakes. Currently the API will not permit one to write a non-standalone Schema so I'm a bit reluctant to permit reading them.

        Do others have thoughts on this?

        Show
        Doug Cutting added a comment - I'm beginning to question the motivation for this a bit. The example player.avsc file is not a well-formed JSON schema, since it's not standalone, but rather depends on another .avsc file. To date, we've only consumed or produced standalone JSON, not fragmentary. JSON is meant to be the low-level schema language, with IDL as the higher level language, better supporting manually maintained schemas. So, before we commit this, I'd like to understand the use case a bit more. Is there a reason one couldn't define the two schemas in a single .avsc, .avpr file, or multiple .avdl files? We could I suppose change the Schema parser to, when it encounters an undefined name, look for a file defining it, much like the Java compiler looks for a .java file on the CLASSPATH, but I feel such features should be confined to IDL and that JSON should be primarily used for self-contained schemas. Standalone JSON schemas are what we save with files and exchange in RPC handshakes. Currently the API will not permit one to write a non-standalone Schema so I'm a bit reluctant to permit reading them. Do others have thoughts on this?
        Hide
        Bill Graham added a comment -

        The last patch works well, thanks. Attached is patch #4 which also has a parse(File[]) method, if we choose to go this route.

        The value of being able to parse multiple JSON files into a single schema is that it allows for a more modular approach when creating and managing schema definitions. Without support for this at the JSON level, users will resort to copy and pasting common schemas into much larger and less manageable schema definitions.

        It seems like a defacto best-practice is emerging to concat multiple schemas together into a union as a way to partially get around repeatedly in-lining JSON child schemas. This approach gets the job done, but has manageability problems.

        This problem can be solved at the IDL level, but that provides yet another level of abstraction, a new language syntax and a compilation step to complicate what would otherwise be a very simple use case.

        Regarding consuming/producing fragmentary JSON, with the proposed approach producing JSON fragments will still not occur, since the in-memory schema is always complete, due to the reverse-dependency ordering that is required at parse time (not unlike parsing a union). Also, parsing a JSON fragment will still fail without parsing it's dependancies first so it's not loosening the contract of how parsing is handled in any way.

        I'd also like to hear others thoughts on this though.

        Show
        Bill Graham added a comment - The last patch works well, thanks. Attached is patch #4 which also has a parse(File[]) method, if we choose to go this route. The value of being able to parse multiple JSON files into a single schema is that it allows for a more modular approach when creating and managing schema definitions. Without support for this at the JSON level, users will resort to copy and pasting common schemas into much larger and less manageable schema definitions. It seems like a defacto best-practice is emerging to concat multiple schemas together into a union as a way to partially get around repeatedly in-lining JSON child schemas. This approach gets the job done, but has manageability problems. This problem can be solved at the IDL level, but that provides yet another level of abstraction, a new language syntax and a compilation step to complicate what would otherwise be a very simple use case. Regarding consuming/producing fragmentary JSON, with the proposed approach producing JSON fragments will still not occur, since the in-memory schema is always complete, due to the reverse-dependency ordering that is required at parse time (not unlike parsing a union). Also, parsing a JSON fragment will still fail without parsing it's dependancies first so it's not loosening the contract of how parsing is handled in any way. I'd also like to hear others thoughts on this though.
        Hide
        Scott Carey added a comment - - edited

        Two things block me from using AvroIDL:

        • I started maintaining and using Avro schemas before AvroIDL existed, so it is natural for developers on my project to use the JSON form, not AvroIDL.
        • AvroIDL only supports protocols. I only use schemas.

        The former can be overcome, AvroIDL should be easier to use anyway.
        The latter needs work, and I haven't looked at what it would take to extend AvroIDL to work with schemas.

        Show
        Scott Carey added a comment - - edited Two things block me from using AvroIDL: I started maintaining and using Avro schemas before AvroIDL existed, so it is natural for developers on my project to use the JSON form, not AvroIDL. AvroIDL only supports protocols. I only use schemas. The former can be overcome, AvroIDL should be easier to use anyway. The latter needs work, and I haven't looked at what it would take to extend AvroIDL to work with schemas.
        Hide
        Doug Cutting added a comment -

        There's not much overhead to using IDL for schemas: just use an idl file without any messages:

        @namespace("foo.bar")
        protocol MyProtocol {
          import idl "Bar.avdl" ;
          record Foo {
            ...
          }
        }
        

        If you're generating specific code, then Foo will have the same name as if you defined it in a .avsc file, so no changes to clients should be required.

        The only unnecessary part is the protocol name and the fact that an interface is generated with no methods that you'll never use. We could change that, e.g., by changing the idl parser to accept files of the form:

        @namespace("foo.bar")
        record Foo {
          import idl "Bar.avdl";
          ...
        }
        

        The return type for the IDL parser could then be either a schema or a protocol, and clients would need to do different things depending.

        Show
        Doug Cutting added a comment - There's not much overhead to using IDL for schemas: just use an idl file without any messages: @namespace( "foo.bar" ) protocol MyProtocol { import idl "Bar.avdl" ; record Foo { ... } } If you're generating specific code, then Foo will have the same name as if you defined it in a .avsc file, so no changes to clients should be required. The only unnecessary part is the protocol name and the fact that an interface is generated with no methods that you'll never use. We could change that, e.g., by changing the idl parser to accept files of the form: @namespace( "foo.bar" ) record Foo { import idl "Bar.avdl" ; ... } The return type for the IDL parser could then be either a schema or a protocol, and clients would need to do different things depending.
        Hide
        Bill Graham added a comment -

        It is possible to generate the Protocol at run time directly from an IDL file. That way the Project can be generated and consumed without the extra step to pre-generate an avpr file. Something like this works:

        Idl parser = new Idl(idfFile);
        Protocol p = parser.CompilationUnit();
        

        The downside is that now the avro-compiler jar would be needed at run time. I agree though that it would be nice to not have to create and use Protocols if you only need Schemas.

        @Doug if all your schema objects are defined in your avsc files, then defining a dummy "record Foo" would be needed, right? Could you instead do something like this to load a collection of Schemas instead?:

        @namespace("foo.bar")
        schema MySchema {
          import schema "Bar.avsc";
          ...
        }
        

        Granted, now there's a dummy MySchema thing, but it's not a record.

        Show
        Bill Graham added a comment - It is possible to generate the Protocol at run time directly from an IDL file. That way the Project can be generated and consumed without the extra step to pre-generate an avpr file. Something like this works: Idl parser = new Idl(idfFile); Protocol p = parser.CompilationUnit(); The downside is that now the avro-compiler jar would be needed at run time. I agree though that it would be nice to not have to create and use Protocols if you only need Schemas. @Doug if all your schema objects are defined in your avsc files, then defining a dummy "record Foo" would be needed, right? Could you instead do something like this to load a collection of Schemas instead?: @namespace("foo.bar") schema MySchema { import schema "Bar.avsc"; ... } Granted, now there's a dummy MySchema thing, but it's not a record.
        Hide
        Doug Cutting added a comment -

        Bill, I don't understand the need for the dummy. Why would all your schemas be defined in avsc files? Why wouldn't you use avdl files to define them?

        Show
        Doug Cutting added a comment - Bill, I don't understand the need for the dummy. Why would all your schemas be defined in avsc files? Why wouldn't you use avdl files to define them?
        Hide
        Bill Graham added a comment -

        We'd like to continue to use avsc files because they're easier to read and author and our developers are already familiar with them. They're also not experimental and changing like the IDL language. So we'd just use IDL as a mechanism to combine fragmented avsc files, like the initial problem statement:

        @namespace("avro.examples.baseball")
        protocol Baseball {
           import schema "position.avsc";
           import schema "player.avsc";
        }
        
        Show
        Bill Graham added a comment - We'd like to continue to use avsc files because they're easier to read and author and our developers are already familiar with them. They're also not experimental and changing like the IDL language. So we'd just use IDL as a mechanism to combine fragmented avsc files, like the initial problem statement: @namespace("avro.examples.baseball") protocol Baseball { import schema "position.avsc"; import schema "player.avsc"; }
        Hide
        Doug Cutting added a comment -

        We should perhaps remove the 'experimental' declaration from the IDL documentation. I don't think we should make incompatible changes to the syntax of IDL, so we might as well declare it stable. But that's another issue...

        Bill, will you use the parse(File[]) method, or would you instead use an IDL file? It's not yet clear to me that method is so common a pattern that it warrants adding here. If we think it's a common pattern then we should at add some javadoc, otherwise we should remove it. Other than that, I'm willing to commit this.

        Show
        Doug Cutting added a comment - We should perhaps remove the 'experimental' declaration from the IDL documentation. I don't think we should make incompatible changes to the syntax of IDL, so we might as well declare it stable. But that's another issue... Bill, will you use the parse(File[]) method, or would you instead use an IDL file? It's not yet clear to me that method is so common a pattern that it warrants adding here. If we think it's a common pattern then we should at add some javadoc, otherwise we should remove it. Other than that, I'm willing to commit this.
        Hide
        Bill Graham added a comment -

        If I could I'd prefer to just use the parse(File[]) method approach without having to deal with the IDL files.

        Show
        Bill Graham added a comment - If I could I'd prefer to just use the parse(File[]) method approach without having to deal with the IDL files.
        Hide
        Doug Cutting added a comment -

        Here's a version that includes javadoc for the parse(File[]) method and improves the javadoc for Parser in general.

        Any objections to committing this?

        Show
        Doug Cutting added a comment - Here's a version that includes javadoc for the parse(File[]) method and improves the javadoc for Parser in general. Any objections to committing this?
        Hide
        Bill Graham added a comment -

        +1

        Thanks, Doug!

        Show
        Bill Graham added a comment - +1 Thanks, Doug!
        Hide
        Scott Carey added a comment -

        +1

        I like it. Parser is a big improvement.

        Show
        Scott Carey added a comment - +1 I like it. Parser is a big improvement.
        Hide
        Bill Graham added a comment -

        Doug if you're hesitant about the parse(File[]) method, we can always leave that one out. Calling parse(File) repeatedly is not a big deal for the client.

        Show
        Bill Graham added a comment - Doug if you're hesitant about the parse(File[]) method, we can always leave that one out. Calling parse(File) repeatedly is not a big deal for the client.
        Hide
        Scott Carey added a comment -

        The most recent patch has:

        +    /** Parse schemas from the provided files, in the specified order.
        +     * If named, each schema is added to the names known to this parser. */
        +    public Schema[] parse(File[] files) throws IOException {
        +      Schema[] schemas = new Schema[files.length];
        +      for (int i = 0; i < files.length; i++)
        +        schemas[i] = parse(files[i]);
        +      return schemas;
        +    }
        

        in the Parser class.

        Show
        Scott Carey added a comment - The most recent patch has: + /** Parse schemas from the provided files, in the specified order. + * If named, each schema is added to the names known to this parser. */ + public Schema[] parse(File[] files) throws IOException { + Schema[] schemas = new Schema[files.length]; + for (int i = 0; i < files.length; i++) + schemas[i] = parse(files[i]); + return schemas; + } in the Parser class.
        Hide
        Doug Cutting added a comment -

        My slight preference would be to leave parse(File[]) out, since I can imagine many use cases that are slightly different, e.g., 'List<Schema> parse(List<File>)' or 'Schema parse(File[])' that returns the last schema, etc. All of these are just a few lines of code that I think is reasonable to leave to applications. On the other hand, if lots of applications are using the same few lines of code, then it makes sense to capture it in a utility, but I don't yet know what the common idioms are here. Meh.

        Show
        Doug Cutting added a comment - My slight preference would be to leave parse(File[]) out, since I can imagine many use cases that are slightly different, e.g., 'List<Schema> parse(List<File>)' or 'Schema parse(File[])' that returns the last schema, etc. All of these are just a few lines of code that I think is reasonable to leave to applications. On the other hand, if lots of applications are using the same few lines of code, then it makes sense to capture it in a utility, but I don't yet know what the common idioms are here. Meh.
        Hide
        Bill Graham added a comment -

        +1 on leaving parse(File[]) out.

        Another reason to omit it it to keep other similar APIs (i.e. Protocol) consistant and concise. No need to re-implement Array-based support everywhere else. Also I'm already thinking about contributing a parse(URL) method. Not having to support each type as both single and array signatures will keep the code from bloating.

        Show
        Bill Graham added a comment - +1 on leaving parse(File[]) out. Another reason to omit it it to keep other similar APIs (i.e. Protocol) consistant and concise. No need to re-implement Array-based support everywhere else. Also I'm already thinking about contributing a parse(URL) method. Not having to support each type as both single and array signatures will keep the code from bloating.
        Hide
        Doug Cutting added a comment -

        I committed this, without the parse(File[]) method.

        Show
        Doug Cutting added a comment - I committed this, without the parse(File[]) method.
        Hide
        Bill Graham added a comment -

        Attaching the patch that reflects what was committed, for those playing along at home.

        Show
        Bill Graham added a comment - Attaching the patch that reflects what was committed, for those playing along at home.

          People

          • Assignee:
            Doug Cutting
            Reporter:
            Bill Graham
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development