Details

    • Type: Improvement
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.7.1
    • Fix Version/s: None
    • Component/s: build
    • Labels:
      None

      Description

      Hadoop has depended upon Avro 1.7.4 for a couple of years now (see HADOOP-9672), but Apache Spark depends upon what is currently the latest version of Avro (1.7.7).

      This can cause issues if Spark is configured to include the full Hadoop classpath, as the classpath would then contain both Avro 1.7.4 and 1.7.7, with the 1.7.4 classes possibly winning depending on ordering. Here is an example of this issue: http://stackoverflow.com/questions/33159254/avro-error-on-aws-emr/33403111#33403111
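One way to confirm which Avro jar actually wins on a given classpath is to ask the JVM where it loaded an Avro class from. The following is only a minimal diagnostic sketch, not part of any Hadoop or Spark tooling: the class name `AvroClasspathCheck` and its helper methods are hypothetical, and `org.apache.avro.Schema` is used as the default probe on the assumption that Avro is on the classpath being checked.

```java
import java.security.CodeSource;

// Hypothetical helper: reports which jar or directory a class was loaded from.
public class AvroClasspathCheck {

    // Returns the code source location of an already loaded class.
    static String locationOf(Class<?> cls) {
        CodeSource src = cls.getProtectionDomain().getCodeSource();
        // JDK core classes typically have no code source.
        if (src == null || src.getLocation() == null) {
            return "<bootstrap or unknown>";
        }
        return src.getLocation().toString();
    }

    // Loads the named class without initializing it and reports its origin.
    static String locate(String className) {
        try {
            ClassLoader loader = AvroClasspathCheck.class.getClassLoader();
            return locationOf(Class.forName(className, false, loader));
        } catch (ClassNotFoundException | LinkageError e) {
            return "not on classpath: " + className;
        }
    }

    public static void main(String[] args) {
        // Default probe is an Avro class; pass another class name to check it instead.
        String name = args.length > 0 ? args[0] : "org.apache.avro.Schema";
        System.out.println(name + " loaded from " + locate(name));
    }
}
```

Run with the same classpath as the failing job; if the printed location is an avro-1.7.4 jar while the application was built against 1.7.7, the ordering problem described above is the likely cause.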

      Would it be possible to upgrade Hadoop's Avro dependency to 1.7.7 now?

          Activity

          busbey Sean Busbey added a comment -

          sure, that works for me.

          stevel@apache.org Steve Loughran added a comment -

          Sean Busbey, SPARK-16617 is looking at this. How about you do a patch for 3.x and we can worry about the branch-2 implications?

          ajisakaa Akira Ajisaka added a comment -

          FWIW, I'd like a bump to 1.7.7 in Hadoop 2.8+ and 1.8.1 in Hadoop 3 alpha2.

          +1 for the idea.

          stevel@apache.org Steve Loughran added a comment -

          As an aside, "Serializable" shouldn't be considered a good feature in distributed systems. As well as being brittle unless well engineered, Java serialization is now a common attack point.

          stevel@apache.org Steve Loughran added a comment -

          "Jackson" is a word to bring fear into the upgrade path. It sounds like moving up to Avro 1.7.7 is OK for Hadoop 2.8. Avro 1.8 sounds more traumatic all round. Joda-Time has had problems with Java versions; if it gets pulled into more than just hadoop-aws, then we should start by managing it explicitly.

          chengas123 Ben McCann added a comment -

          That seems like a reasonable path forward to me

          busbey Sean Busbey added a comment -

          FWIW, I'd like a bump to 1.7.7 in Hadoop 2.8+ and 1.8.1 in Hadoop 3 alpha2.

          chengas123 Ben McCann added a comment -

          Going from Avro 1.7.4 to Avro 1.7.6 or 1.7.7 bumps Jackson from 1.8.x to 1.9.x. This should be a no-op though because Hadoop is already using Jackson 1.9.x.

          Going to Avro 1.8.1 bumps paranamer from 2.3 to 2.7 and commons-compress from 1.4.1 to 1.8.1. It also adds dependencies on xz 1.5 and joda-time 2.7.

          Given that there are essentially no transitive dependency changes from bumping Avro to 1.7.7, and that bumping Avro is a low-risk upgrade, could we upgrade to 1.7.7 at least?
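To make the size of the Avro 1.8.1 change concrete, the dependency versions quoted in this comment can be compared mechanically. This is only an illustrative sketch: the class `TransitiveDiff` and its `diff` method are hypothetical, and the version maps simply restate the numbers given above rather than being read from any real POM.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: summarizes how a set of dependency versions changes.
public class TransitiveDiff {

    // Returns a sorted map of dependency name to a description of its change.
    static Map<String, String> diff(Map<String, String> before, Map<String, String> after) {
        Map<String, String> changes = new TreeMap<>();
        for (Map.Entry<String, String> e : after.entrySet()) {
            String old = before.get(e.getKey());
            if (old == null) {
                changes.put(e.getKey(), "added " + e.getValue());
            } else if (!old.equals(e.getValue())) {
                changes.put(e.getKey(), old + " -> " + e.getValue());
            }
        }
        for (String name : before.keySet()) {
            if (!after.containsKey(name)) {
                changes.put(name, "removed");
            }
        }
        return changes;
    }

    // Versions as quoted in the comment for the pre-1.8 baseline.
    static Map<String, String> baseline() {
        Map<String, String> m = new HashMap<>();
        m.put("paranamer", "2.3");
        m.put("commons-compress", "1.4.1");
        return m;
    }

    // Versions as quoted in the comment for Avro 1.8.1.
    static Map<String, String> avro181() {
        Map<String, String> m = new HashMap<>();
        m.put("paranamer", "2.7");
        m.put("commons-compress", "1.8.1");
        m.put("xz", "1.5");
        m.put("joda-time", "2.7");
        return m;
    }

    public static void main(String[] args) {
        diff(baseline(), avro181())
            .forEach((name, change) -> System.out.println(name + ": " + change));
    }
}
```

In a real build, the equivalent information would come from comparing the resolved dependency trees before and after the version bump.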

          busbey Sean Busbey added a comment -

          This ends up being related to a similar request in HBase (and probably Spark), especially if we're talking about going to 1.8.z (which I'd prefer).

          HBase being on 1.7.6 is probably an accident and not in any release. FWIW, I'll be trying to update HBase to match the version shipping with whatever version of Hadoop we default to when HBase 2.0.0 RCs come around (I hope that will be a Hadoop 3 of some stripe).

          As I mentioned on that HBase JIRA, the two relevant breaking changes in Avro 1.8 AFAICT are AVRO-1502 and AVRO-997. I believe I fixed Hive ages ago wrt AVRO-997. I believe HBase's (currently unreleased) Spark SQL-over-Avro-in-HBase code is currently incompatible with AVRO-997 (but should be fixable). I haven't examined any other Spark use.

          chengas123 Ben McCann added a comment -

          Note that there are no incompatible changes introduced in Avro 1.7.5-1.7.7. See https://github.com/apache/avro/blob/master/CHANGES.txt

          chengas123 Ben McCann added a comment -

          Hive is already using 1.7.7. See here: https://github.com/apache/hive/blob/master/pom.xml#L112

          And it looks like HBase is using 1.7.6: https://github.com/apache/hbase/blob/master/hbase-spark/pom.xml#L45

          githubbot ASF GitHub Bot added a comment -

          Github user benmccann commented on the issue:

          https://github.com/apache/hadoop/pull/39

          No, I don't think it's necessary. It just seems like it'd be good to do at some point

          githubbot ASF GitHub Bot added a comment -

          Github user ejono commented on the issue:

          https://github.com/apache/hadoop/pull/39

          @benmccann, is that actually necessary for this patch? I don't actually know. Also, that line you point to currently has an open-ended max version, so it's already going to use the latest, right?

          githubbot ASF GitHub Bot added a comment -

          Github user benmccann commented on the issue:

          https://github.com/apache/hadoop/pull/39

          Also upgrade avro-maven-plugin? https://github.com/apache/hadoop/blob/trunk/pom.xml#L225

          chengas123 Ben McCann added a comment - - edited

          +1 (although I'd really like to see Avro upgraded to 1.8.x)

          srowen Sean Owen added a comment -

          I mean, I suspect it's in everyone's interest to update the dependency. I'd also hope there are in fact no significant changes in a maintenance release, and I don't see anyone signaling that there are problems. So yes, it seems like the thing to do if there are no reasons to believe it would be incompatible and no evidence that it is.

          stevel@apache.org Steve Loughran added a comment -

          What does Sean Owen think here?

          jonathak Jonathan Kelly added a comment -

          Yes, definitely, but I just have not yet taken the time to test it more thoroughly. I just wanted to cut this JIRA as soon as possible in order to start the discussion.

          stevel@apache.org Steve Loughran added a comment -

          Jon, don't you have your own test suites here to play with? If there's anything that can be done with downstream tests, going beyond the core Hadoop code, that's what's really critical to finding bugs.

          Specific things I'd worry about here are HBase and Hive.

          jonathak Jonathan Kelly added a comment -

          Yes, I did try building Hadoop (2.6.0) with this change, and I verified that Hadoop still seems to work in general, but I did not actually try using Avro directly in a Hadoop job. I also have not yet run the Hadoop project tests.

          vinodkv Vinod Kumar Vavilapalli added a comment -

          Yes, please, let's document / ensure its compatibility before we make the jump.

          stevel@apache.org Steve Loughran added a comment -

          We aren't (currently) aware of any major backwards compatibility problems with Avro versions, so this isn't an immediate veto request, the way changes to Guava are.

          Have you tried swapping Avro 1.7.7 into your AWS Hadoop build? What happens?


            People

            • Assignee:
              busbey Sean Busbey
              Reporter:
              jonathak Jonathan Kelly
            • Votes:
              2
              Watchers:
              10
