Crunch
  1. Crunch
  2. CRUNCH-90

Object reuse is not accounted for in mapper fusion

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.4.0
    • Component/s: None
    • Labels:
      None

      Description

      When multiple DoFns are run over the same output (i.e. in the case of mapper fusion), the same value object is passed to multiple underlying DoFns. If the state of that value object is changed by one DoFn, other DoFns are called with the updated object.

      This is a situation that can happen quite easily when the input of a DoFn is simply updated and then emitted. In general, this bug will only affect values whose type is the same as the underlying serialization type.

      1. CRUNCH-90.patch
        12 kB
        Gabriel Reid
      2. CRUNCH-90-reflect.patch
        74 kB
        Gabriel Reid
      3. CRUNCH-90-reflect.patch
        48 kB
        Josh Wills

        Activity

        Hide
        Matthias Friedrich added a comment -

        Closing all resolved issues for 0.4.0.

        Show
        Matthias Friedrich added a comment - Closing all resolved issues for 0.4.0.
        Hide
        Gabriel Reid added a comment -

        Committed updated patch

        Show
        Gabriel Reid added a comment - Committed updated patch
        Hide
        Josh Wills added a comment -

        Most definitely. +1.

        Show
        Josh Wills added a comment - Most definitely. +1.
        Hide
        Gabriel Reid added a comment -

        Thanks for figuring that out Josh! And sorry for getting you to do all my monkey work for me, I feel pretty guilty about that now (but it's definitely extra motivation for me to get my act together in terms of figuring Scala out).

        I reworked the patch a bit to pass the configuration in via the PType#initialize method instead of PType#getDetachedValue. My dream (or at least my intention) is to have the PType get initialized and automatically made available within DoFns. Sound good to you?

        Show
        Gabriel Reid added a comment - Thanks for figuring that out Josh! And sorry for getting you to do all my monkey work for me, I feel pretty guilty about that now (but it's definitely extra motivation for me to get my act together in terms of figuring Scala out). I reworked the patch a bit to pass the configuration in via the PType#initialize method instead of PType#getDetachedValue. My dream (or at least my intention) is to have the PType get initialized and automatically made available within DoFns. Sound good to you?
        Hide
        Josh Wills added a comment -

        Attached a patch that fixes the PageRankClassTest. In order to do it, I had to besmirch the really elegant DeepCopier infrastructure (like, seriously elegant-- I haven't looked at it that closely before and it was a joy to read) by passing Configuration objects all over the place so that the Avro deepCopy's will use the correct instance of ReflectData-- the Scrunch stuff has to use a different version than the Java stuff.

        Show
        Josh Wills added a comment - Attached a patch that fixes the PageRankClassTest. In order to do it, I had to besmirch the really elegant DeepCopier infrastructure (like, seriously elegant-- I haven't looked at it that closely before and it was a joy to read) by passing Configuration objects all over the place so that the Avro deepCopy's will use the correct instance of ReflectData-- the Scrunch stuff has to use a different version than the Java stuff.
        Hide
        Josh Wills added a comment -

        Will take a look this evening and see if I can come up with a fix.

        Show
        Josh Wills added a comment - Will take a look this evening and see if I can come up with a fix.
        Hide
        Gabriel Reid added a comment -

        I've just discovered that this patch breaks the PageRankClassTest in scrunch. Unfortunately, I only ran the crunch integration tests (instead of the full suite) before committing it. The exception being thrown is as follows:

        java.lang.IllegalArgumentException: Can not set final [Ljava.lang.String; field org.apache.crunch.scrunch.PageRankData.urls to org.apache.avro.generic.GenericData$Array

        It appears that the deep copying is running into an issue with reading in a PageRankData object using reflection-based serialization. I'm not sure why this is only coming out now, as I would have thought that existing serialization logic would have caused an issue with it without even using deep copying. This appears to be a bug in Avro (similar to AVRO-1046).

        Josh Wills I'm totally clueless when it comes to Scala – any chance you could take a look and possibly try changing the type of the urls field to something that Avro can deal with, or give me any pointers on what you think might be going on here? I definitely want to report this to the Avro JIRA as well if it is indeed an Avro bug, but it would be good to work around it for now.

        Show
        Gabriel Reid added a comment - I've just discovered that this patch breaks the PageRankClassTest in scrunch. Unfortunately, I only ran the crunch integration tests (instead of the full suite) before committing it. The exception being thrown is as follows: java.lang.IllegalArgumentException: Can not set final [Ljava.lang.String; field org.apache.crunch.scrunch.PageRankData.urls to org.apache.avro.generic.GenericData$Array It appears that the deep copying is running into an issue with reading in a PageRankData object using reflection-based serialization. I'm not sure why this is only coming out now, as I would have thought that existing serialization logic would have caused an issue with it without even using deep copying. This appears to be a bug in Avro (similar to AVRO-1046 ). Josh Wills I'm totally clueless when it comes to Scala – any chance you could take a look and possibly try changing the type of the urls field to something that Avro can deal with, or give me any pointers on what you think might be going on here? I definitely want to report this to the Avro JIRA as well if it is indeed an Avro bug, but it would be good to work around it for now.
        Hide
        Gabriel Reid added a comment -

        Committed

        Show
        Gabriel Reid added a comment - Committed
        Hide
        Josh Wills added a comment -

        +1.

        Show
        Josh Wills added a comment - +1.
        Hide
        Gabriel Reid added a comment -

        Patch to fix the issue, as well as tests that demonstrate it. If there are no objections, I'll commit this shortly.

        Show
        Gabriel Reid added a comment - Patch to fix the issue, as well as tests that demonstrate it. If there are no objections, I'll commit this shortly.

          People

          • Assignee:
            Gabriel Reid
            Reporter:
            Gabriel Reid
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development