Crunch
  1. Crunch
  2. CRUNCH-293

Injection of reader into AvroRecordReader

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.7.0, 0.8.0
    • Fix Version/s: 0.9.0, 0.8.2
    • Component/s: Core
    • Labels:
      None

      Description

      With CRUNCH-243, I wanted to support injecting custom readers to handle the cases like passivity between Avro Schema. The changes made however were not complete as we also need to be able to inject a reader into the AvroRecordReader which constructs its own SpecificDatumReader.

      We could create a SpecificDataFactory which emulates the ReflectDataFactory. Or simplify to a single DataFactory which will create either Reflect/Specific/Generic. Thoughts?

      1. 0001-CRUNCH-293-Add-AvroMode-to-inject-avro-readers.patch
        27 kB
        Ryan Blue
      2. CRUNCH-293_v2.patch
        27 kB
        Micah Whitacre
      3. CRUNCH-293.patch
        9 kB
        Micah Whitacre

        Activity

        Hide
        Micah Whitacre added a comment -

        Here's a first pass at what I was thinking. I think it could be more combined with the ReflectDataFactory but hit a few random failures doing that in the some scrunch tests.

        Show
        Micah Whitacre added a comment - Here's a first pass at what I was thinking. I think it could be more combined with the ReflectDataFactory but hit a few random failures doing that in the some scrunch tests.
        Hide
        Josh Wills added a comment -

        I'd prefer to merge them into one, but yeah, there will be some Scrunch changes required to support that. Scrunch has its own impl of ReflectData to support Scala collection types (e.g., Buffer). The key bit of code that is causing some trouble is in PTypeFamily.scala:

        object Avros extends PTypeFamily

        { override def ptf = AvroTypeFamily.getInstance() CAvros.REFLECT_DATA_FACTORY = new ScalaReflectDataFactory() def reflects[T: ClassManifest]() = CAvros.reflects(classManifest[T].erasure) }

        That CAvros.REFLECT_DATA_FACTORY line would need to be updated w/a ScalaReflectDataFactory that supported the new interface.

        Show
        Josh Wills added a comment - I'd prefer to merge them into one, but yeah, there will be some Scrunch changes required to support that. Scrunch has its own impl of ReflectData to support Scala collection types (e.g., Buffer). The key bit of code that is causing some trouble is in PTypeFamily.scala: object Avros extends PTypeFamily { override def ptf = AvroTypeFamily.getInstance() CAvros.REFLECT_DATA_FACTORY = new ScalaReflectDataFactory() def reflects[T: ClassManifest]() = CAvros.reflects(classManifest[T].erasure) } That CAvros.REFLECT_DATA_FACTORY line would need to be updated w/a ScalaReflectDataFactory that supported the new interface.
        Hide
        Micah Whitacre added a comment -

        Here's a patch that combines the two factories. Altered the method names to differentiate the reflect and the specific.

        Still need a bit more verification from some consumers that this handles injection of the reader in all needed cases.

        Show
        Micah Whitacre added a comment - Here's a patch that combines the two factories. Altered the method names to differentiate the reflect and the specific. Still need a bit more verification from some consumers that this handles injection of the reader in all needed cases.
        Hide
        Josh Wills added a comment -

        Looks basically right to me-- which cases are you worried about?

        Show
        Josh Wills added a comment - Looks basically right to me-- which cases are you worried about?
        Hide
        Micah Whitacre added a comment -

        The ones I don't know about mainly. I just want to verify there aren't cases that I overlooked. Like I didn't make the change to the AvroDeepCopier to use the factory mostly because I assumed it was used for already deserialized values where the schema needed to change and therefore a custom reader wasn't needed.

        Show
        Micah Whitacre added a comment - The ones I don't know about mainly. I just want to verify there aren't cases that I overlooked. Like I didn't make the change to the AvroDeepCopier to use the factory mostly because I assumed it was used for already deserialized values where the schema needed to change and therefore a custom reader wasn't needed.
        Hide
        Ryan Blue added a comment - - edited

        Micah, Josh just pointed me at this issue, which is well-timed. I just ran into the same problem recently and implemented a solution I was going to submit a patch for. My problem was that I wanted to override the avro generic classes rather than the specific. My implementation is very similar to yours, only I changed the reflect factory to a ReaderWriterFactory and added an AvroMode enum to handle each case (REFLECT, SPECIFIC, GENERIC). The enum approach brings all of the code together like your AvroDataFactory, but keeps the handling for each mode separate. Each mode can be individually overridden:

          AvroMode.SPECIFIC.override(specificFactory);
        

        It also cleans up some of the places that hard-code specific or reflect readers to use the right one based on the AvroType:

          AvroMode.fromType(atype).configure(bundle)
        

        This is what makes it possible for me to override the generics correctly. A lot of places simply used Reflect because it is the most general... but that causes problems if you need to change specific (e.g., use a different ClassLoader) or generic.

        Let me know what you think of the branch and whether it might work for your needs. Please ignore all of the unnecessary import changes, I still need to clean it up.

        Show
        Ryan Blue added a comment - - edited Micah, Josh just pointed me at this issue, which is well-timed. I just ran into the same problem recently and implemented a solution I was going to submit a patch for. My problem was that I wanted to override the avro generic classes rather than the specific. My implementation is very similar to yours, only I changed the reflect factory to a ReaderWriterFactory and added an AvroMode enum to handle each case (REFLECT, SPECIFIC, GENERIC). The enum approach brings all of the code together like your AvroDataFactory, but keeps the handling for each mode separate. Each mode can be individually overridden: AvroMode.SPECIFIC.override(specificFactory); It also cleans up some of the places that hard-code specific or reflect readers to use the right one based on the AvroType: AvroMode.fromType(atype).configure(bundle) This is what makes it possible for me to override the generics correctly. A lot of places simply used Reflect because it is the most general... but that causes problems if you need to change specific (e.g., use a different ClassLoader) or generic. Let me know what you think of the branch and whether it might work for your needs. Please ignore all of the unnecessary import changes, I still need to clean it up.
        Hide
        Ryan Blue added a comment -

        One more thing: I was careful to make my branch compatible with the current method to override the ReflectDataFactory, so that puts off the need to make simultaneous changes to scrunch. We can also go through a proper deprecation cycle.

        Show
        Ryan Blue added a comment - One more thing: I was careful to make my branch compatible with the current method to override the ReflectDataFactory, so that puts off the need to make simultaneous changes to scrunch. We can also go through a proper deprecation cycle.
        Hide
        Micah Whitacre added a comment -

        The changes look good to me. The one part I'm still trying to think through is if configuring the custom factory on a global scope vs per source is sufficient. Generally most teams will have a single custom implementation of a custom DataReader but there might be cases where they need to pull sources from multiple teams and need a custom reader.

        Show
        Micah Whitacre added a comment - The changes look good to me. The one part I'm still trying to think through is if configuring the custom factory on a global scope vs per source is sufficient. Generally most teams will have a single custom implementation of a custom DataReader but there might be cases where they need to pull sources from multiple teams and need a custom reader.
        Hide
        Josh Wills added a comment -

        Global seems easier to me from a serialization perspective. Can we get this one in soon-ish? I'd like to add a new built-in Source that reads PCollection<GenericRecord> by parsing the schema of an input file, and I'd like to do it against the new API here.

        Show
        Josh Wills added a comment - Global seems easier to me from a serialization perspective. Can we get this one in soon-ish? I'd like to add a new built-in Source that reads PCollection<GenericRecord> by parsing the schema of an input file, and I'd like to do it against the new API here.
        Hide
        Ryan Blue added a comment -

        I'll get a pull request cleaned up and submitted tomorrow. I'll go with the current global implementation. we can always update it later when the use case comes up.

        Show
        Ryan Blue added a comment - I'll get a pull request cleaned up and submitted tomorrow. I'll go with the current global implementation. we can always update it later when the use case comes up.
        Hide
        Micah Whitacre added a comment -

        I'm ok with global for now. Worst case scenario the teams who write their own custom readers need to start coordinating and working together.

        Show
        Micah Whitacre added a comment - I'm ok with global for now. Worst case scenario the teams who write their own custom readers need to start coordinating and working together.
        Hide
        Ryan Blue added a comment -

        I just submitted a pull request on github. If a patch is preferred, let me know and I'll post one here. The imports are cleaned up and the previous overrides for the ReflectDataFactory are marked as deprecated.

        Show
        Ryan Blue added a comment - I just submitted a pull request on github. If a patch is preferred, let me know and I'll post one here. The imports are cleaned up and the previous overrides for the ReflectDataFactory are marked as deprecated.
        Hide
        Micah Whitacre added a comment -

        +1 for a patch if you don't mind. The changes look good to me so I can get it merged later today.

        Show
        Micah Whitacre added a comment - +1 for a patch if you don't mind. The changes look good to me so I can get it merged later today.
        Hide
        Ryan Blue added a comment -

        Patch to add AvroMode

        Show
        Ryan Blue added a comment - Patch to add AvroMode
        Hide
        Micah Whitacre added a comment -

        Patch looks good. Pushed to master.

        Show
        Micah Whitacre added a comment - Patch looks good. Pushed to master.
        Hide
        Ryan Blue added a comment -

        Thanks!

        Show
        Ryan Blue added a comment - Thanks!

          People

          • Assignee:
            Micah Whitacre
            Reporter:
            Micah Whitacre
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development