Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      Umbrella issue.
      Goal is to provide Scala friendly APIs for Avro records and protocols (RPCs).

      Related project: http://code.google.com/p/avro-scala-compiler-plugin/ looks dead (no change since Sep 2010).

      1. avro-scala.patch
        111 kB
        Christophe Taton

        Activity

        Christophe Taton added a comment -

        Here is a first shot of a compiler for avro schemas in Scala.

        This is incomplete in many ways, but I hope and believe this can be used as a starting point.

        Scott Carey added a comment -

        From the mailing list, from Michael Armbrust:

        We have a plugin for the Scala compiler that takes case classes that extend a special marker trait (AvroRecord) and generates the code needed for Avro serialization. It has mostly been used for research thus far, but we use it quite a bit as the serialization for our K/V store, for storing experimental results, and in our own homegrown message-passing system.

        Details can be found here: https://github.com/radlab/SCADS/wiki/Avro-Plugin

        Let me know if you have any questions!

        Michael
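To make the marker-trait model concrete: a hypothetical sketch of what the plugin's input looks like, under the assumption (from the wiki) that users write plain case classes extending `AvroRecord` and the compiler plugin fills in the serialization code. The trait and `Point` class here are illustrative, not the plugin's actual sources.

```scala
// Hypothetical sketch of the SCADS plugin's model. In the real plugin the
// marker trait triggers compile-time generation of Avro serialization code;
// here it is just an empty trait for illustration.
trait AvroRecord

// An ordinary case class; extending AvroRecord is all the user writes.
case class Point(x: Int, y: Int) extends AvroRecord

object PluginModelDemo {
  def main(args: Array[String]): Unit =
    println(Point(1, 2)) // Point(1,2)
}
```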

        Christophe Taton added a comment -

        This plugin is very appealing for users who live in a pure Scala world, but is fairly impractical in the general case:

        • the 22 fields limit (due to Scala case class limitations) is not acceptable when you already have lots of legacy records;
        • Avro IDL is a much better way to declare records and protocols if you need to generate bindings in multiple languages.
        Scott Carey added a comment -

        Is there anyone watching who can comment on this patch? It would be great to move this forward; are there folks with a Scala background interested in reviewing it? I would like to review it in more depth, but I am not a Scala expert.

        Quinn Slack added a comment -

        I'm not an Avro committer, but I'll try out this patch and post feedback/patches. My company has several Scala developers who use Avro quite heavily, and we'd love to see native Scala codegen for Avro.

        Quinn Slack added a comment -

        I've done some more work on Christophe's patch to get Avro Scala codegen working. We're now using it in production, although there are still some edge cases where it generates Scala code that doesn't compile (unions of complex types).

        https://github.com/sqs/avro/tree/sqs-scala-2.10.0-RC2/lang/scala (the sqs-scala-2.10.0-RC2 branch contains the Scala 2.10 version; the "sqs" branch contains the 2.9 version)

        Posting it here to solicit feedback and make sure others who are interested don't repeat effort. I'll work with Christophe to push this along and prepare a patch.

        John A. De Goes added a comment -

        I have reviewed the Scala patch. Although this would provide some Avro functionality for Scala, I cannot recommend the patch, because (1) it's not idiomatic Scala, (2) it uses code generation when there are far better facilities for providing the same functionality (2.10 macros or type-level programming), and (3) I cannot see the Scala community embracing this as an officially sanctioned means of Scala-Avro interop.

        I know Avro needs to have broad language support, but before Avro adds functionality for some language community, I think it's essential to get buy-in from that community. I don't represent the Scala community by any stretch of the imagination, but I think a lot of Scala devs will look at the patch and think, "I'd never use that." And if I won't use it, you can bet I won't be maintaining it in the future.

        Scott Carey added a comment -

        @Quinn: based on my glance at GitHub, it seems that the implementation is code-gen based and wraps the existing Java implementation for most of its work. Is that correct? That is fine; code gen is a common use case with Avro (along with two other common patterns, which I'll discuss shortly). As you indicate, there are Scala devs who would like to use it. We don't have to start out with all use cases available, or with a pure Scala implementation.

        Common Avro use patterns

        There are three common patterns for interacting with Avro data from code:

        • "Schema First" (e.g. code gen) : Schemas are managed outside of the code, and shared across products / languages. These generally represent business objects and result in pure-data classes available to the programmer.
        • "Code First" (e.g. reflection) : The canonical representation for data is in code, and Avro schemas are generated based on that code for persistence of data and schema evolution.
        • "Dynamic" (e.g. Java generic API) : Code has no a priori knowledge of schemas and programs interpret Avro data dynamically based on inputs or directives.
        • Schema first patterns work well with long-lived data types and applications built to directly work with those data types, or exchange them with other applications. These applications often want to expose the data types to the programmer directly (e.g. make record "Foo" appear as class "Foo" with field accessors named the same as the fields, for compile-time safety).
        • Code first patterns have low programming overhead and fit well with agile use cases, prototypes, or situations where a single language can host the canonical representation of a long living data type.
        • Dynamic patterns are required for general data processing and storage, generic data access and transformation tools, or any other use case where a priori knowledge of the schemas passing through the system by the programmer is impossible or a burden.
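The "Dynamic" pattern above can be made concrete with the existing Java generic API, used directly from Scala; a minimal sketch (the `User` record and its fields are illustrative; `Schema.Parser`, `GenericData.Record`, `put`, and `get` are the real Avro Java API):

```scala
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericRecord}

object DynamicDemo {
  // Parse a schema at runtime -- the code has no compile-time knowledge
  // of the record type, which is the essence of the dynamic pattern.
  val schema: Schema = new Schema.Parser().parse(
    """{"type":"record","name":"User","fields":[
      |  {"name":"name","type":"string"},
      |  {"name":"age","type":"int"}
      |]}""".stripMargin)

  // Build a record generically, addressing fields by name.
  def mkUser(name: String, age: Int): GenericRecord = {
    val r = new GenericData.Record(schema)
    r.put("name", name)
    r.put("age", age)
    r
  }

  def main(args: Array[String]): Unit =
    println(mkUser("Ada", 36).get("name")) // Ada
}
```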

        If this patch only addresses one of the three use cases, that is OK with me; we simply need to be clear about what it does not do, and encourage others to contribute work that completes the other use cases. This is really a Scala code gen wrapper around the Java implementation; we need to be clear that this is not a full language implementation – maybe it is simply a module within the Java implementation.
        On the other hand, if there are ways to improve this work and achieve the same use cases then that is something to consider now, especially if it improves buy-in from the Scala community.

        Typically, once a language has all three use case types, much of the implementation overlaps on the back-end.

        @John: This patch does not appear to address the dynamic use cases where macros and type level programming would really shine, nor any code first style. That would require a different contribution effort. However, for a schema first style, are Scala 2.10 macros truly an alternative to code generation? I believe they can generate classes conforming to types defined at compile time from a schema, but are they powerful enough to inject type and field names that correspond to the schema record and field names? I want to make sure we are talking about solving the same use cases.
        On the idiomatic Scala objection, I see a few things in the implementation that are a result of using the Avro Java implementation's APIs for encoding, decoding, and schemas; changing that does not make sense for a Scala wrapper around the Java API. I am more concerned about things that are exposed to users.

        Scott Carey added a comment -

        I did a little bit of research on Scala macros. As of Scala 2.10 (nearing release), macros cannot create named types or methods, but they can implement or override methods defined in the source or create anonymous types.
        Until there are type macros they cannot be used to fully replace code generation. However, they can be used to optimize / implement an API defined in advance, such as the "Dynamic" use cases. They also do not yet interact with annotations deeply, which would be very useful for some "Code first" use cases – think the equivalent of Java reflection on annotations at compile time rather than runtime, so that a macro can be triggered to generate code declaratively by the presence of an annotation.

        John A. De Goes added a comment -

        Sorry for the delay. Here are my thoughts:

        1. For the Dynamic use case, at some point we'll want an idiomatic Scala API, which leverages Scala collections, embraces immutability, and so forth. The patch above does not address this use case.

        2. For the Code First use case, macros can be used to generate a compile-time schema from Scala classes (of course, it's possible to do this using reflection, as well, but you risk the possibility of blowing up at runtime if you can't handle the mapping between Scala and Avro). The patch above does not address this use case.

        3. For the Schema First use case, I think code generation is acceptable for some scenarios. When you have to support multiple conflicting schemas, it becomes easier to manage migration in code with user-defined classes (and, perhaps, a compile-time mapping between those classes and Avro schemas). However, for the simple case where there exists only one schema or multiple compatible schemas, some developers may prefer just to generate the case classes using code similar to the above patch.
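The runtime-reflection route mentioned in (2) already exists in the Java implementation as `ReflectData`, usable from Scala today; a minimal sketch (the `Sensor` class is illustrative, and this assumes Avro's field reflection handles the case class's private JVM fields — with the runtime-failure caveat noted above):

```scala
import org.apache.avro.reflect.ReflectData

// A plain Scala case class; ReflectData inspects its JVM fields at runtime
// to derive an Avro schema -- "code first", no .avsc file on disk.
case class Sensor(id: String, reading: Double)

object CodeFirstDemo {
  // Derive the schema from the class itself.
  val schema = ReflectData.get().getSchema(classOf[Sensor])

  def main(args: Array[String]): Unit =
    println(schema.getName) // Sensor
}
```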

        As for comments on the patch specifically:

        1. I would prefer if you could decode into immutable classes directly.

        2. The union type should be a sealed trait, not an abstract class, and commonalities between the case classes should be factored out into the trait.

        3. Consider an option to generate pimps for the types so users can "add" their own methods without having to actually edit the code.
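Point 2 might look like this in generated code; a hypothetical sketch (the `Animal` union and its member records are illustrative, not from the patch):

```scala
// Illustrative: a union generated as a sealed trait rather than an
// abstract class. Shared fields are factored into the trait, and the
// compiler checks pattern matches on it for exhaustiveness.
sealed trait Animal {
  def name: String // commonality factored out of the case classes
}
case class Cat(name: String, lives: Int) extends Animal
case class Dog(name: String, breed: String) extends Animal

object UnionDemo {
  // A missing case here would produce a compiler warning.
  def describe(a: Animal): String = a match {
    case Cat(n, l) => s"cat $n with $l lives"
    case Dog(n, b) => s"$b dog $n"
  }

  def main(args: Array[String]): Unit =
    println(describe(Cat("Whiskers", lives = 9))) // cat Whiskers with 9 lives
}
```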

        Quinn Slack added a comment -

        Thanks for the feedback, John and Scott. I improved the API a little bit by having it deserialize into case classes (eliminating the MutableRecords entirely). It is still very, very rough. I'd like to find others who are interested in using and maintaining it before investing more time into it.

        https://github.com/sqs/avro/tree/scala-codegen/lang/scala

        Sample codegenned file:
        https://github.com/sqs/avro/blob/scala-codegen/lang/scala/src/test/scala/org/apache/avro/scala/test/generated/scala/Animal.scala (the original schema is at the bottom)

        Here are some more possible improvements for the schema-first approach:

        • Make it truly immutable. Right now, the codegenned case classes have "var" fields (because of the way Avro deserialization works).
        • Once Scala 2.11 draws nearer, it would be great to make this use type macros and eliminate codegen entirely.
        • Use lenses to simplify updating nested values.
        • Make union type fields easier to assign (without union case class indirection), in addition to John's suggestion about making it a sealed trait.
        • Per John's suggestion, make it easy to generate pimps so users can add methods to codegenned classes.
        • Create an sbt plugin to automatically run codegen when the AVDL/AVPR changes (a Makefile has sufficed for us so far).

        If anybody else would like to contribute, I am happy to assist.
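The "pimps" item could be done with Scala 2.10 implicit classes, so user methods live outside the generated file; a hypothetical sketch (the generated `Animal` class here is a stand-in for codegen output):

```scala
// Stand-in for a codegenned case class.
case class Animal(name: String, legs: Int)

// User-written enrichment in a separate file: "adds" a method to Animal
// without editing the generated source (Scala 2.10 implicit class).
object AnimalOps {
  implicit class RichAnimal(a: Animal) {
    def isBiped: Boolean = a.legs == 2
  }
}

object PimpDemo {
  import AnimalOps._
  def main(args: Array[String]): Unit =
    println(Animal("ostrich", 2).isBiped) // true
}
```

Because the enrichment lives outside the generated file, re-running codegen never clobbers user code.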

        Connor Doyle added a comment -

        I've done a little bit of work to support the "code-first" (reflective) use case here: https://github.com/GenslerAppsPod/scalavro

        Scalavro depends upon runtime reflection to wrap the Java implementation. I've been doing some research on 2.10 macros which may provide better performance, as discussed above.

        All feedback welcome.


          People

          • Assignee: Unassigned
          • Reporter: Christophe Taton
          • Votes: 6
          • Watchers: 12
