SQOOP-384

Sqoop is incompatible with Hadoop prior to 0.21

    Details

      Description

      The following are the APIs Sqoop relies upon, which are not available in Hadoop prior to 0.21:
      org.apache.hadoop.conf.Configuration.getInstances
      org.apache.hadoop.mapreduce.lib.db.DBWritable
      org.apache.hadoop.mapreduce.lib.input.CombineFileSplit
      org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat
      org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader
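
      For reference, a minimal sketch (hypothetical interface and property name, not from Sqoop itself) of the kind of call that fails to compile against pre-0.21 Hadoop, because Configuration.getInstances does not exist there:

      import java.util.List;
      import org.apache.hadoop.conf.Configuration;

      public class GetInstancesProbe {
        /** Illustrative plugin interface; any interface works the same way. */
        public interface Validator {
          void validate(String row);
        }

        public static List<Validator> load(Configuration conf) {
          // Compiles against Hadoop 0.21+; against 0.20.x this line fails,
          // since Configuration has no getInstances method there.
          return conf.getInstances("example.validators", Validator.class);
        }
      }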

          Activity

          Eric Wadsworth added a comment -

          I checked hadoop-0.20.204.0, and found some more information.

          The Configuration class lacks the getInstances method, which is what Sqoop uses.

          The rest of these files exist, but are in different places:
          src/mapred/org/apache/hadoop/mapred/lib/db/DBWritable.java
          src/mapred/org/apache/hadoop/mapred/lib/CombineFileSplit.java
          src/mapred/org/apache/hadoop/mapred/lib/CombineFileInputFormat.java
          src/mapred/org/apache/hadoop/mapred/lib/CombineFileRecordReader.java

          I wonder if it would work just to point to the mapred version of them?

          — wad

          Jarek Jarcec Cecho added a comment -

          Hi Eric,
          unfortunately you can't use those files, because they are incompatible.

          Generally speaking, classes under org.apache.hadoop.mapreduce use the new context API, whereas classes under org.apache.hadoop.mapred use the old API. The two APIs are incompatible, so you can't mix classes written against the two APIs.

          Jarcec
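
          To illustrate what Jarcec describes, here is a minimal sketch (a hypothetical mapper, not code from this issue) of the two APIs side by side; a job must use one or the other throughout:

          import java.io.IOException;
          import org.apache.hadoop.io.LongWritable;
          import org.apache.hadoop.io.Text;

          // Old API: Mapper is an interface in org.apache.hadoop.mapred,
          // and output is emitted through an OutputCollector argument.
          class OldApiMapper extends org.apache.hadoop.mapred.MapReduceBase
              implements org.apache.hadoop.mapred.Mapper<LongWritable, Text, Text, LongWritable> {
            public void map(LongWritable key, Text value,
                org.apache.hadoop.mapred.OutputCollector<Text, LongWritable> out,
                org.apache.hadoop.mapred.Reporter reporter) throws IOException {
              out.collect(value, key);
            }
          }

          // New API: Mapper is a class in org.apache.hadoop.mapreduce,
          // and output is emitted through the Context object.
          class NewApiMapper
              extends org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, Text, LongWritable> {
            @Override
            protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
              context.write(value, key);
            }
          }

          The drivers for the two kinds of mapper are also separate (JobConf/JobClient vs. Job), which is why classes from the two packages can't be mixed in one job.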

          Eric Wadsworth added a comment -

          Well, it was worth a shot. I'm going to look at the code, and see if it works to copy this functionality from a version of hadoop that has it, and stick it right into sqoop.

          — wad

          Jarek Jarcec Cecho added a comment -

          I would suggest you download the CDH3 tarball - Cloudera attaches all the patches they've applied on top of the official Hadoop release. I'm not sure what had to be changed to port those classes to the new API, but you might end up with just one additional jar that includes the new classes.

          Jarcec

          Tom White added a comment -

          > I'm going to look at the code, and see if it works to copy this functionality from a version of hadoop that has it, and stick it right into sqoop.

          That should work. In fact, Mahout and HBase did something similar to use libraries for the new MapReduce API that were missing from 0.20.

          Another approach would be to backport the classes to the Hadoop 0.20 (now 1.0) branch.

          Eric Wadsworth added a comment -

          Okay. I'm working on this right now. I'll put a patch on this JIRA once I have one ready.
          — wad

          Eric Wadsworth added a comment -

          So, before I go much further: I just submitted the patch above. It took a couple of tries before I got to this solution.

          Basically, I just made a wrapper around the Configuration class that comes from the (older) Hadoop code. The wrapper has one extra method on it, the missing getInstances(). See the JavaDoc in ConfigurationHolder for details; basically it tries to maintain backwards compatibility with the older APIs (in the com.cloudera packages) and stay compatible with anything external, while still allowing use of the missing method.

          The tests all pass, but I don't have a very good environment to actually run this code (hence my work to make sqoop compatible with my cluster!), so it could use some exercise.

          Anyone want to take a look at this? Shoot arrows at it, maybe?

          — wad
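
          As a rough illustration of the approach (a hypothetical sketch; the real code is in the attached patch's ConfigurationHolder), such a wrapper can supply getInstances() on top of the pre-0.21 Configuration:

          import java.util.ArrayList;
          import java.util.List;
          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.util.ReflectionUtils;

          public class ConfigurationHolder {
            private final Configuration conf;

            public ConfigurationHolder(Configuration conf) {
              this.conf = conf;
            }

            // Mirrors the 0.21+ Configuration.getInstances(name, xface):
            // reads a comma-separated list of class names from the property
            // and instantiates each one via ReflectionUtils.
            public <U> List<U> getInstances(String name, Class<U> xface) {
              List<U> instances = new ArrayList<U>();
              Class<?>[] classes = conf.getClasses(name);
              if (classes != null) {
                for (Class<?> cls : classes) {
                  if (!xface.isAssignableFrom(cls)) {
                    throw new RuntimeException(cls + " does not implement " + xface);
                  }
                  instances.add(xface.cast(ReflectionUtils.newInstance(cls, conf)));
                }
              }
              return instances;
            }

            public Configuration getConfiguration() {
              return conf;
            }
          }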

          Jarek Jarcec Cecho added a comment -

          Hi Eric,
          thank you very much for your time and effort. Would you be so kind as to upload your patch to the review board (available at https://reviews.apache.org/) for review?

          Also, we prefer to see patches as file attachments rather than pasted into comments. There is an "Attach File" option in the "More actions" menu above that you can use for that.

          Thank you very much,
          Jarcec

          Eric Wadsworth added a comment -

          This patch has the ConfigurationHolder code in it.

          Eric Wadsworth added a comment -

          Jarek, thanks for the direction. This exists now:
          https://reviews.apache.org/r/3126/

          Eric Wadsworth added a comment -

          For whoever is working on this: there is a binary incompatibility between Hadoop 0.20 and Hadoop 0.23 around the JobContext class, which was converted into an interface in 0.23. So when you build Sqoop, do it like this, so that the class files are generated against the older JobContext:
          ant -Dhadoopversion=20 jar-all

          I don't know how this will work against hadoop 0.23 however.
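
          For context, a sketch of why the JobContext change breaks binary (not source) compatibility: the same source compiles against both versions, but javac emits invokevirtual for calls on the 0.20 class and invokeinterface for calls on the 0.23 interface, so a jar built against one version throws IncompatibleClassChangeError on the other.

          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.mapreduce.JobContext;

          public class JobContextProbe {
            // Compiles either way; the emitted bytecode for this call site
            // depends on whether JobContext is a class (0.20) or an
            // interface (0.23) at compile time.
            public static Configuration confOf(JobContext context) {
              return context.getConfiguration();
            }
          }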

          Eric Wadsworth added a comment -

          Also, I updated the reviews page with a new (and much smaller) patch.

          — wad

          Eric Wadsworth added a comment -

          The patch is attached to issue SQOOP-412.

          Tapas added a comment -

          Is there anything else to be done in this patch to get it working with Hadoop 0.20.205 or 1.0?

          Jarek Jarcec Cecho added a comment -

          Hi Tapas,
          we already have positive feedback on this issue. It seems that the current trunk is working on Hadoop versions prior to 0.21 (or CDH3). We would greatly appreciate your feedback if you have the time and resources to try it as well.

          Jarcec

          Eric Wadsworth added a comment -

          Tapas,

          This patch fixed the problem for me.

          — wad

          Matt Foley added a comment -

          Friends,
          With respect, this is really the wrong way to do this. There is already a bug open, MAPREDUCE-3607, to backport certain missing APIs in the 'mapreduce' family to Hadoop 1.0, and it is almost ready to commit. As RM for hadoop-1, I have already agreed to include it in the Hadoop 1.1 release, and if it helps do this the right way in sqoop, I will further commit to a 1.0.1 release in the near term.

          MAPREDUCE-3607 already includes all four of the mapreduce APIs named above, and lacks only the Configuration.getInstances API. I'll ask the contributor on MAPREDUCE-3607 to add it.
          --Matt

          Roman Shaposhnik added a comment -

          @Matt,

          if you can make it happen with 1.0.1 that'll be extremely nice. Bigtop is preparing for its next big release of the entire stack based on Hadoop 1.0 (we are currently waiting for Hive 0.8.1 to come out since this is the only Hive compatible with Hadoop 1.0).

          If we can have a 1.0.1 that includes this fix as the basis for the distribution, that'll be extremely helpful.

          What timeline are you looking at for a 1.0.1 that might include the fix for Sqoop?

          Tom White added a comment -

          I think this may be my fault - I didn't open MAPREDUCE-3607 until after this had gone in (in fact, seeing this work in Sqoop prompted me to submit the MR fix). Perhaps we can keep this in Sqoop until there is a public Hadoop release with the new libraries in, at which point we can switch to using them in Sqoop. Does that sound reasonable?

          Matt Foley added a comment -

          If I could do a 1.0.1 RC in two weeks, would that be fast enough? That gives Tom time to add the last API and get the patch reviewed, and me time to manage the build.

          When is Hive 0.8.1 expected?

          I'm really allergic to putting Hadoop code into sqoop and then saying we'll remember to take it out later. Let's just do it right the first time. The projects in this ecosystem have to find a better way to accommodate each other than by swallowing obsolete chunks of each other. BigTop makes a great forcing function for this, and we shouldn't waste it.

          Roman Shaposhnik added a comment -

          @Matt,

          first of all, thanks for pursuing this issue – that's really appreciated. I'm strongly +1 on putting the code that belongs to Hadoop where it belongs.

          To answer your question, if we can have an RC by the end of January, that should give Bigtop enough time. That is also when we expect Hive 0.8.1 to materialize (as much as anybody can plan an open-source software release, of course).

          Finally, just as with Hive 0.8.1, we would appreciate it if the delta between 1.0 and 1.0.1 is small. We've been running tests against 1.0 and it seems quite nice. Keeping the delta small will allow us to somewhat rely on those test results.

          Matt Foley added a comment -

          Fine then, we'll plan on that.

          Agree the delta to 1.0.1 should be small. I will publish the short list at the end of the week.

          @Tom, will you have time to add that last API to MAPREDUCE-3607 sometime in the next week? Thanks.

          Tom White added a comment -

          Yes, I'll update the patch. Thanks Matt!

          Jarek Jarcec Cecho added a comment -

          Thank you very much for your comments, guys. I really appreciate your feedback. I agree that copy&pasting code from core Hadoop is weird and that MAPREDUCE-3607 is a far better way to achieve this goal.

          However, the work on this JIRA was driven by a community need to also support versions 0.20.2 and 0.20.20X on already deployed clusters. I'm not expecting users to upgrade to 1.0(.1) immediately. So I would (as RM for 1.4.1) prefer to leave this patch in for now and, as Tom suggested, remove it later when there is no longer a community need to support those versions.

          Also, we recently kicked off work on Sqoop 2, which will be a complete rewrite of the project, so those "backports" will definitely be removed in the future anyway.

          Does my reasoning make sense to you guys?

          Jarcec

          Tom White added a comment -

          I just tested the latest patch from MAPREDUCE-3607 (which includes Configuration.getInstances()) by reverting the changes from SQOOP-412 and SQOOP-413:

          svn merge -c -1221843 .
          svn merge -c -1221127 .
          

          Then I made the following change:

          Index: build.xml
          ===================================================================
          --- build.xml     (revision 1233602)
          +++ build.xml     (working copy)
          @@ -175,7 +175,7 @@
             <if>
               <equals arg1="${hadoopversion}" arg2="20" />
               <then>
          -      <property name="hadoop.version" value="0.20.2-cdh3u1" />
          +      <property name="hadoop.version" value="1.1.0-SNAPSHOT" />
                 <property name="hbase.version" value="0.90.3-cdh3u1" />
                 <property name="zookeeper.version" value="3.3.3-cdh3u1" />
               </then>
          

          And ran

          ant test -Dhadoopversion=20 -Dresolvers=internal
          

          All the tests passed.

          Jarek Jarcec Cecho added a comment -

          If there are no objections to my suggestion to keep both patches in the current trunk, I would prefer to close this ticket, as there are reports from both developers and users that Sqoop is working on Hadoop clusters prior to 0.21. Any objections to me doing so?

          Jarcec

          Jarek Jarcec Cecho added a comment -

          There were no objections, so I'm closing this issue. Please feel free to reopen if needed.

          Jarcec


            People

            • Assignee: Jarek Jarcec Cecho
            • Reporter: Roman Shaposhnik
            • Votes: 1
            • Watchers: 8
