Avro
AVRO-258

Higher-level language for authoring schemata

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.3.0
    • Component/s: spec
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      Early users of Avro have noted that authoring schemas and especially protocols in JSON feels unnatural. This JIRA is to work on a higher-level language that feels more like defining interfaces and classes in Java/C/etc.

      Attachments

      1. AVRO-258.txt
        33 kB
        Todd Lipcon
      2. simple.avpr
        0.5 kB
        Todd Lipcon
      3. simple-genned.avpr
        1 kB
        Todd Lipcon
      4. avro-258.txt
        59 kB
        Todd Lipcon
      5. avro-258.txt
        74 kB
        Todd Lipcon
      6. genavro.pdf
        25 kB
        Todd Lipcon

        Activity

        Todd Lipcon added a comment -

        Here's the "Simple" example in my made-up language:

        protocol Simple {
          enum Kind {
            FOO,
            BAR,
            BAZ
          }
          
          fixed MD5(16);
        
          record TestRecord {
            @order("ignore")
            string name;
        
            @order("descending")
            Kind kind;
        
            MD5 hash;
          }
        
          error TestError {
            string message;
          }
        }
        

        Currently I'm parsing this with JavaCC, but not generating any schema or AST or anything. I think the next steps are:

        • See if people like the above style (and this idea at all)
        • Make the parser actually generate a Schema object
        • Dump that Schema object to JSON

        I'm proposing this as a way for developers to author and generate schemas, and do not expect that each language binding would have to implement a parser. We could keep the authoritative high-level-language code in Java. This has a side benefit of being able to do some semantic checking of schemata, too.
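
        For reference, the JSON protocol that the "Simple" example above would map to looks roughly like the following. This is a sketch based on the standard Avro protocol JSON format, not the tool's actual output:

        {
          "protocol": "Simple",
          "types": [
            {"type": "enum", "name": "Kind", "symbols": ["FOO", "BAR", "BAZ"]},
            {"type": "fixed", "name": "MD5", "size": 16},
            {"type": "record", "name": "TestRecord", "fields": [
              {"name": "name", "type": "string", "order": "ignore"},
              {"name": "kind", "type": "Kind", "order": "descending"},
              {"name": "hash", "type": "MD5"}
            ]},
            {"type": "error", "name": "TestError", "fields": [
              {"name": "message", "type": "string"}
            ]}
          ],
          "messages": {}
        }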

        Jeff Hammerbacher added a comment -

        We could also use this layer to do protocol inheritance.

        Jeff Hammerbacher added a comment -

        And comments.

        Todd Lipcon added a comment -

        Other open questions:

        • Should this support complex defaults? If so, do we end up essentially embedding JSON inside the language? E.g.:
        record SomeOtherRecord {
          TestRecord myfield = {
            'name': 'bob',
            'kind': 'FOO',
            'hash': 0x1234234
          }
        }
        

        or should it be something with less quoting? In the first iteration, I'll probably just leave out non-literal defaults. (A sketch at the end of this comment shows roughly how such a default would appear in the generated JSON.)

        • How do we write nested non-reference-style records? For example, part of interop.avsc might be written something like:
        @namespace(org.apache.avro)
        record Interop {
          record Node {
            string label;
            array<Node> children;
          } recordField;
        }
        

        but that looks a little bizarre to me. The other option is to force records to be defined more like inner classes:

        @namespace(org.apache.avro)
        record Interop {
          record Node {
            string label;
            array<Node> children;
          }
          Node recordField;
        }
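
        As a rough illustration of the first question above, a complex default written in the language would presumably end up as an embedded JSON value in the generated schema, along these lines (a sketch using the standard Avro JSON representation of field defaults, reusing the "myfield" example from above; the fixed-type hash value is elided):

        {"name": "myfield", "type": "TestRecord",
         "default": {"name": "bob", "kind": "FOO", "hash": "..."}}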
        
        Philip Zeyliger added a comment -

        We could also use this layer to do protocol inheritance.

        I don't think protocol inheritance is a good idea. It's better to support servers that "implement" a number of protocols.

        Doug Cutting added a comment -

        First, I don't think we want to make such a tool a part of the spec: we don't expect there to be more than a single implementation of it, do we? Given that, we should implement it in such a way that it's available to the widest variety of platforms: Perl or Python might thus be preferable to Java.

        Support for comments and includes would be wonderful to have.

        Another approach, rather than trying to make the syntax more Java-like by implementing a full parser, is to just remove the most annoying things from JSON. A good pre-processor that supports includes and comments and makes quotes optional would make things vastly more readable and functional. Beyond that, it starts to become lisp-versus-algol, unresolvable and a tremendous time sink.

        If we wanted to get fancy, we could try to do more complex JSON transformations, like transform "foo bar {}" (no comma) into "{type: foo, name: bar, ...}", and "a; b; c;" into [a, b, c], etc. For example:

        {type: record, name: Foo, fields: [
          {name: f, type: string}
        ], java-class: FooImpl}

        {type: enum, name: Bar, symbols: [X, Y, Z]}

        become

        record Foo { fields: string f {}; java-class: FooImpl}
        enum Bar { symbols: X; Y; Z; }

        A JSON-preprocessor approach lets us more easily support default values, metadata, etc., and makes the transition to JSON easier for folks, since developers will see JSON-format schemas too. My first choice would probably be simply to support comments and includes and to make quotes optional. That would give us great bang for very little buck. But I may not be able to convince anyone else of this...
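
        As a rough sketch of that minimal version (comments and optional quotes only; the relaxed input syntax here is hypothetical), a Foo record like the one above might be written as:

        // a comment
        {type: record, name: Foo, fields: [
          {name: f, type: string}
        ]}

        and the preprocessor would expand it to today's strict JSON:

        {"type": "record", "name": "Foo", "fields": [
          {"name": "f", "type": "string"}
        ]}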

        Todd Lipcon added a comment -

        First, I don't think we want to make such a tool a part of the spec

        Fair enough - I'm ambivalent there.

        Perl or Python might thus be preferable to Java.

        I looked at some Python-based parsers, but the issue is that many of them rely on libraries rather than code generation. Many of those libraries are GPL- or LGPL-licensed, and also aren't available on CentOS/RHEL 5, which means that in a lot of ways it's less deployable than Java. Pyparsing, which I like a lot and have used before, has a friendly license but still has the library requirement, and would still have to be bundled with the script. Having recently worked on some Python software that bundles a lot of library dependencies, I can say it's a huge pain.

        I actually almost did this in C/C++ with straight lex/yacc, but went towards Java since it was easier for a quick first pass. Moving to C in the long run would be fine by me for the reasons you outlined.

        Another approach, rather than trying to make the syntax more Java-like by implementing a full parser, is to just remove the most annoying things from JSON... more complex JSON transformations... etc.

        So, maybe I'm misunderstanding you, but it seems like you're proposing either (a) writing a custom JSON parser that has some extensions to make the syntax more palatable, or (b) writing a text-based preprocessor that outputs JSON which is then fed into the parser. Solution (a) seems to me like it has all the same difficulties as writing our own language, but with a less familiar syntax. Solution (b) seems hackish, and has the downside that it inherits the syntactic strangeness of using JSON while not getting the benefits of using a standard language (editor support, preexisting familiarity, etc).

        Beyond that, it starts to become lisp-versus-algol, unresolvable and a tremendous time sink.

        I'm not convinced that implementing our own language is really that tough. In about 3 hours of work I got the above stuff done, and I'd never used JavaCC before. As for the religious lisp-versus-algol question, I think it's already been resolved, in the sense that most existing protocol/data description languages are more algol-like than JSON-like (e.g. XDR, CORBA IDL, protobufs, Apache Thrift, Apache Etch). The counterexamples are things like WSDL, which no one seems to really like.

        To reiterate, I'm definitely not suggesting that JSON be supplanted as the definitive schema definition language for Avro. It's great in that there are existing parsers for it in most languages, and it's readily machine-readable.

        Todd Lipcon added a comment -

        Here's a patch that shows what works so far. It does not include proper unit tests yet, but you can try it like so:

        $ ant compile-java
        $ java -cp build/lib/*:build/classes/ org.apache.avro.genavro.GenAvro < src/test/genavro/simple.avpr
        

        I also attached input and output for a schema that is basically the same as test/schemata/simple.avpr.

        Certainly more work to be done here, but I want to get feedback from the community before I spend the time to get it properly tested, documented, corner-cases worked out, etc. Most importantly, is this something people want? Does this style of syntax seem reasonable?

        Doug Cutting added a comment -

        The syntax looks reasonable enough to me. I think we'll want both an avroj command-line tool for this and an Ant task.

        Ryan King added a comment -

        I like the general approach here, but am not a fan of the decorator-like syntax. That sort of syntax is necessary when you can't make backwards-incompatible changes to a language. We don't have that constraint here. I would do this:

        string name ignore;

        Todd Lipcon added a comment -

        Ryan: how do you extend that to support arbitrary properties? 'order' isn't the only property that can be attached to schemas. I certainly don't love the Java annotation-style syntax, but I went with it because it was familiar-looking.

        Ryan King added a comment -

        We could make the properties key-value:

        string name sort=ignore future=property

        I think my aversion to the annotation-style syntax is that annotations look very unfamiliar to me and just seem unnecessary when we're starting from scratch.

        Thiruvalluvan M. G. added a comment -

        The syntax looks fine to me. The only addition I'd like to make to the grammar is support for optional fields in records. We use the idiom that a union of a "Type" and "null" makes the field optional. I'd like the language to support it directly. That is, I'd like something like:

        record r {
          string name;
          optional string org;
        }

        as a shorthand for

        record r {
          string name;
          union { null, string } org;
        }
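
        For reference, in the generated JSON schema this shorthand would presumably just produce an ordinary union field, something like:

        {"name": "org", "type": ["null", "string"]}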

        Todd Lipcon added a comment -

        Thanks for the input, guys.

        We could make the properties key-value: string name sort=ignore future=property

        I'm not a huge fan of this because it makes user-properties intermix with language constructors in a non-clearly-defined way. But I don't feel that strongly - anyone else have some opinions?

        The only addition I'd like to make to the grammar is support for optional fields in records

        Definitely +1. I was planning on using the keyword "nullable" instead of "optional", since to me "optional" seems to indicate a ternary state (i.e. unset, set to null, or set to something else). But in general I like that kind of syntactic sugar.

        Todd Lipcon added a comment -

        Err, by "language constructors" I meant "language constructs". Sorry for the typo.

        Todd Lipcon added a comment -

        New patch. This needs to go on top of AVRO-263 and includes the following changes since the last patch:

        • "avroj genavro" tool implementation
        • fixes to build.xml so clean build works properly with javacc stuff
        • test harness and some tests in src/test/genavro/
        • added support for namespaces
        • added support for errors
        • added support for using reserved words bare in certain contexts, and escaped by backticks in others (a rough sketch follows this list); see src/test/genavro/input/reservedwords.genavro
        • cleaned up the .jj file a bit
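
        A hypothetical illustration of the backtick escaping described above (not the actual contents of reservedwords.genavro) would be a field whose name collides with a keyword:

        record Foo {
          string `error`;
        }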
        Doug Cutting added a comment -

        This is looking good.

        • GenAvroTool needs javadoc.
        • The language needs documentation. This probably belongs in the Forrest rather than javadoc.
        Todd Lipcon added a comment -

        New patch includes better docs for the tool, as well as Forrest source for docs on the language. I'm also attaching the generated PDF for easier reference here.

        I modified build.xml a bit to separate out a "forrestdoc" target from the general "doc" target (I got tired of waiting for all the javadoc, etc., to build just to test the Forrest docs).

        Doug Cutting added a comment -

        I just committed this. I took the liberty of making a few changes:

        • added the Apache license header to the genavro files and removed the change to rat-excludes.txt
        • moved GenAvroTool.java to the tool package, since the genavro package had no package.html and no other public classes.
        • removed verbose test output.
        • added a missing final newline to TestGenAvro, as flagged by checkstyle.

        Thanks Todd!

        Doug Cutting added a comment -

        One more change, Todd: Schema.Names should not have been made public: it has no Javadoc, it references other non-public classes, and it otherwise clutters the core Javadoc with something that few need to know about. Perhaps, in a separate issue, we could consider moving a lot of Schema.java's nested classes to an implementation package that most users don't need to see, but for now using Schema.Names didn't actually save you more than a couple of lines of code, so I just removed its use and made it package-private again.


          People

          • Assignee: Todd Lipcon
          • Reporter: Todd Lipcon
          • Votes: 0
          • Watchers: 5
