SQOOP-721

Duplicating rows on export when exporting from compressed files.

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.4.2
    • Fix Version/s: 1.4.3
    • Component/s: None
    • Labels:
      None

      Description

      In some situations, export duplicates rows. This appears to happen when the user exports compressed files that are "big enough".

      1. bugSQOOP-721.patch
        26 kB
        Jarek Jarcec Cecho
      2. bugSQOOP-721.patch
        26 kB
        Jarek Jarcec Cecho

          Activity

          Jarek Jarcec Cecho added a comment - edited

          We're using a CombineFileInputFormat implementation that was copied from the Hadoop namespace into the Sqoop namespace to retain exactly the same behavior across all supported Hadoop platforms. It seems that this issue was already fixed upstream in MAPREDUCE-1597. I'll try to port the new version to our code base.

          Cheolsoo Park added a comment -

          +1.

          I diff'ed CombineFileInputFormat.java from Sqoop and Hadoop-2.0.x and confirmed that there is one change as follows:

          154c160,163
          <     return codec instanceof SplittableCompressionCodec;
          ---
          > 
          >     // Once we remove support for Hadoop < 2.0
          >     //return codec instanceof SplittableCompressionCodec;
          >     return false;
          

          As far as I understand, the only impact of this difference is that compressed files won't be split even when they are splittable, which affects performance but not correctness.
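To make the effect of that one-line difference concrete, here is a minimal sketch of how such a check could change split planning. It is not the Sqoop or Hadoop source: the class `SplitSketch`, the method `rangesFor`, and the boolean `hasCodec` flag are hypothetical stand-ins, and the sizes used are made-up numbers chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; this is not the Sqoop or Hadoop source. */
final class SplitSketch {

    /**
     * Hypothetical stand-in for the patched check: a file with no codec
     * (uncompressed) may be split, while any compressed file is treated
     * as unsplittable, mirroring the "return false" in the diff above.
     */
    static boolean isSplitable(boolean hasCodec) {
        return !hasCodec;
    }

    /** Returns {offset, length} pairs describing how one file would be divided. */
    static List<long[]> rangesFor(long fileLength, long blockSize, boolean hasCodec) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        List<long[]> ranges = new ArrayList<>();
        if (!isSplitable(hasCodec)) {
            // Unsplittable: the whole file becomes a single range.
            ranges.add(new long[] {0L, fileLength});
            return ranges;
        }
        // Splittable: cut the file into block-sized ranges.
        for (long offset = 0; offset < fileLength; offset += blockSize) {
            ranges.add(new long[] {offset, Math.min(blockSize, fileLength - offset)});
        }
        return ranges;
    }

    public static void main(String[] args) {
        // A compressed file is kept whole; an uncompressed one is cut per block.
        System.out.println(rangesFor(300, 128, true).size());
        System.out.println(rangesFor(300, 128, false).size());
    }
}
```

With a 300-byte file and 128-byte blocks, the compressed case yields one range and the uncompressed case yields three, which is the performance trade-off described above: less parallelism for compressed input, with no row read more than once.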

          I didn't run any tests with this patch, but given that the patch is identical to what's committed in MAPREDUCE-1597, I think that it is fine. Please let me know if anyone has any concerns.

          Thanks!

          Cheolsoo Park added a comment -

          Committed to trunk. Thanks Jarcec!

          Hudson added a comment -

          Integrated in Sqoop-ant-jdk-1.6-hadoop20 #339 (See https://builds.apache.org/job/Sqoop-ant-jdk-1.6-hadoop20/339/)
          SQOOP-721 Duplicating rows on export when exporting from compressed files. (Revision a840f41fc76688de6ab0d61e2cba3af32b2a9a96)

          Result = SUCCESS
          cheolsoo : https://git-wip-us.apache.org/repos/asf?p=sqoop.git&a=commit&h=a840f41fc76688de6ab0d61e2cba3af32b2a9a96
          Files :

          • src/java/org/apache/sqoop/mapreduce/CombineFileSplit.java
          • src/java/org/apache/sqoop/mapreduce/CombineFileRecordReader.java
          • src/java/org/apache/sqoop/mapreduce/CombineFileInputFormat.java
          Hudson added a comment -

          Integrated in Sqoop-ant-jdk-1.6-hadoop23 #509 (See https://builds.apache.org/job/Sqoop-ant-jdk-1.6-hadoop23/509/)
          SQOOP-721 Duplicating rows on export when exporting from compressed files. (Revision a840f41fc76688de6ab0d61e2cba3af32b2a9a96)

          Result = SUCCESS
          cheolsoo : https://git-wip-us.apache.org/repos/asf?p=sqoop.git&a=commit&h=a840f41fc76688de6ab0d61e2cba3af32b2a9a96
          Files :

          • src/java/org/apache/sqoop/mapreduce/CombineFileRecordReader.java
          • src/java/org/apache/sqoop/mapreduce/CombineFileSplit.java
          • src/java/org/apache/sqoop/mapreduce/CombineFileInputFormat.java
          Hudson added a comment -

          Integrated in Sqoop-ant-jdk-1.6-hadoop200 #342 (See https://builds.apache.org/job/Sqoop-ant-jdk-1.6-hadoop200/342/)
          SQOOP-721 Duplicating rows on export when exporting from compressed files. (Revision a840f41fc76688de6ab0d61e2cba3af32b2a9a96)

          Result = FAILURE
          cheolsoo : https://git-wip-us.apache.org/repos/asf?p=sqoop.git&a=commit&h=a840f41fc76688de6ab0d61e2cba3af32b2a9a96
          Files :

          • src/java/org/apache/sqoop/mapreduce/CombineFileRecordReader.java
          • src/java/org/apache/sqoop/mapreduce/CombineFileInputFormat.java
          • src/java/org/apache/sqoop/mapreduce/CombineFileSplit.java
          Hudson added a comment -

          Integrated in Sqoop-ant-jdk-1.6-hadoop100 #334 (See https://builds.apache.org/job/Sqoop-ant-jdk-1.6-hadoop100/334/)
          SQOOP-721 Duplicating rows on export when exporting from compressed files. (Revision a840f41fc76688de6ab0d61e2cba3af32b2a9a96)

          Result = SUCCESS
          cheolsoo : https://git-wip-us.apache.org/repos/asf?p=sqoop.git&a=commit&h=a840f41fc76688de6ab0d61e2cba3af32b2a9a96
          Files :

          • src/java/org/apache/sqoop/mapreduce/CombineFileSplit.java
          • src/java/org/apache/sqoop/mapreduce/CombineFileInputFormat.java
          • src/java/org/apache/sqoop/mapreduce/CombineFileRecordReader.java
          Jarek Jarcec Cecho added a comment -

          Failure on profile 200 is expected and will be handled by SQOOP-731.


            People

            • Assignee: Jarek Jarcec Cecho
            • Reporter: Jarek Jarcec Cecho
            • Votes: 0
            • Watchers: 3
