Sqoop
  1. Sqoop
  2. SQOOP-603

Support small intervals in IntegerSplitter implementation

    Details

    • Type: Improvement Improvement
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 1.4.2
    • Fix Version/s: 1.4.3
    • Component/s: None
    • Labels:
      None

      Description

      IntegerSplitter is currently creating splits of following nature:

      minimal value <= x < splitPoint1
      splitPoint1 <= x < splitPoint2
      ...
      splitPointN <= x <= maximal value
      

      Please notice that upper bound is always with using condition "<" with exception of the last split that is using condition "<=". This is perfectly fine when creating reasonable amount of splits on very huge interval.

      This approach will however cause issues on very small intervals. For example following splits will be created on interval [0, 5] with 5 splits:

      • 0 <= x < 1
      • 1 <= x < 2
      • 2 <= x < 3
      • 3 <= x < 4
      • 4 <= x <= 5

      Notice that all splits have equal count of numbers except the last one having two numbers - 4 and 5. This becomes very huge issue when for example user needs to create one split per one partition as one mapper will end up with moving two partitions and thus entire job will take twice as long as the other ones.

      Jarcec

      1. SQOOP-603.patch
        3 kB
        Jarek Jarcec Cecho

        Issue Links

          Activity

          Jarek Jarcec Cecho created issue -
          Jarek Jarcec Cecho made changes -
          Field Original Value New Value
          Remote Link This issue links to "Review board (Web Link)" [ 11149 ]
          Jarek Jarcec Cecho made changes -
          Attachment SQOOP-603.patch [ 12545921 ]
          Jarek Jarcec Cecho made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Hide
          Cheolsoo Park added a comment -

          +1. I will commit after running tests.

          Show
          Cheolsoo Park added a comment - +1. I will commit after running tests.
          Hide
          Cheolsoo Park added a comment - - edited

          Committed to trunk. Thanks Jarcec!

          Note that while running unit tests, I discovered that the 3rd party LobAvro tests fail with -Dhadoopversion=20. However, that's unrelated to this change, and I am going to open a jira for them.

          @Jarcec,
          Can you please do git diff --no-prefix next time? In fact, you might want to define an alias in .bashrc such as gitdiff="git diff --no-prefix".

          Thanks!

          Show
          Cheolsoo Park added a comment - - edited Committed to trunk. Thanks Jarcec! Note that while running unit tests, I discovered that the 3rd party LobAvro tests fail with -Dhadoopversion=20. However, that's unrelated to this change, and I am going to open a jira for them. @Jarcec, Can you please do git diff --no-prefix next time? In fact, you might want to define an alias in .bashrc such as gitdiff="git diff --no-prefix". Thanks!
          Cheolsoo Park made changes -
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Hide
          Jarek Jarcec Cecho added a comment -

          What's the issue you're experiencing with prefixed patch Cheolsoo?

          Jarcec

          Show
          Jarek Jarcec Cecho added a comment - What's the issue you're experiencing with prefixed patch Cheolsoo? Jarcec
          Hide
          Cheolsoo Park added a comment -

          I typically do "patch -0 -i <patch name>" to apply patches. W/o --no-prefix, it complains because of the prefixes such as "a" and "b":

          --- a/src/java/org/apache/sqoop/mapreduce/db/DataDrivenDBInputFormat.java
          +++ b/src/java/org/apache/sqoop/mapreduce/db/DataDrivenDBInputFormat.java
          

          Is there a way to apply prefixed patches nicely? What I do is to run "%s/ a\// /g" and "%s/ b\// /g" in vim before applying patches.

          Thanks!

          Show
          Cheolsoo Park added a comment - I typically do "patch -0 -i <patch name>" to apply patches. W/o --no-prefix, it complains because of the prefixes such as "a" and "b": --- a/src/java/org/apache/sqoop/mapreduce/db/DataDrivenDBInputFormat.java +++ b/src/java/org/apache/sqoop/mapreduce/db/DataDrivenDBInputFormat.java Is there a way to apply prefixed patches nicely? What I do is to run "%s/ a\// /g" and "%s/ b\// /g" in vim before applying patches. Thanks!
          Hide
          Jarek Jarcec Cecho added a comment -

          Och now I see your issue. Applying prefixed patches is actually quite simple. You just need to add "-p1" to the patch command line.

          The "p" stands for prefix and it means how many prefixed directories should be stripped from start. Obviously "-p0" means "nothing should be removed" and it's the default. Specifying "-p1" for git prefix diffs will remove this "a/" and "b/" prefixes.

          Jarcec

          Show
          Jarek Jarcec Cecho added a comment - Och now I see your issue. Applying prefixed patches is actually quite simple. You just need to add "-p1" to the patch command line. The "p" stands for prefix and it means how many prefixed directories should be stripped from start. Obviously "-p0" means "nothing should be removed" and it's the default. Specifying "-p1" for git prefix diffs will remove this "a/" and "b/" prefixes. Jarcec
          Hide
          Cheolsoo Park added a comment -

          Ah, thanks for the explanation. I will keep that in mind next time when I see prefixed patches.

          Show
          Cheolsoo Park added a comment - Ah, thanks for the explanation. I will keep that in mind next time when I see prefixed patches.
          Hide
          Hudson added a comment -

          Integrated in Sqoop-ant-jdk-1.6-hadoop100 #78 (See https://builds.apache.org/job/Sqoop-ant-jdk-1.6-hadoop100/78/)
          SQOOP-603 Support small intervals in IntegerSplitter implementation (Revision 5616152ac4c96d6c0589768b982cf67f3277df74)

          Result = SUCCESS
          cheolsoo : https://git-wip-us.apache.org/repos/asf?p=sqoop.git&a=commit&h=5616152ac4c96d6c0589768b982cf67f3277df74
          Files :

          • src/test/org/apache/sqoop/mapreduce/db/TestIntegerSplitter.java
          • src/java/org/apache/sqoop/mapreduce/db/IntegerSplitter.java
          • src/java/org/apache/sqoop/mapreduce/db/DataDrivenDBInputFormat.java
          Show
          Hudson added a comment - Integrated in Sqoop-ant-jdk-1.6-hadoop100 #78 (See https://builds.apache.org/job/Sqoop-ant-jdk-1.6-hadoop100/78/ ) SQOOP-603 Support small intervals in IntegerSplitter implementation (Revision 5616152ac4c96d6c0589768b982cf67f3277df74) Result = SUCCESS cheolsoo : https://git-wip-us.apache.org/repos/asf?p=sqoop.git&a=commit&h=5616152ac4c96d6c0589768b982cf67f3277df74 Files : src/test/org/apache/sqoop/mapreduce/db/TestIntegerSplitter.java src/java/org/apache/sqoop/mapreduce/db/IntegerSplitter.java src/java/org/apache/sqoop/mapreduce/db/DataDrivenDBInputFormat.java
          Hide
          Hudson added a comment -

          Integrated in Sqoop-ant-jdk-1.6-hadoop23 #208 (See https://builds.apache.org/job/Sqoop-ant-jdk-1.6-hadoop23/208/)
          SQOOP-603 Support small intervals in IntegerSplitter implementation (Revision 5616152ac4c96d6c0589768b982cf67f3277df74)

          Result = SUCCESS
          cheolsoo : https://git-wip-us.apache.org/repos/asf?p=sqoop.git&a=commit&h=5616152ac4c96d6c0589768b982cf67f3277df74
          Files :

          • src/java/org/apache/sqoop/mapreduce/db/IntegerSplitter.java
          • src/test/org/apache/sqoop/mapreduce/db/TestIntegerSplitter.java
          • src/java/org/apache/sqoop/mapreduce/db/DataDrivenDBInputFormat.java
          Show
          Hudson added a comment - Integrated in Sqoop-ant-jdk-1.6-hadoop23 #208 (See https://builds.apache.org/job/Sqoop-ant-jdk-1.6-hadoop23/208/ ) SQOOP-603 Support small intervals in IntegerSplitter implementation (Revision 5616152ac4c96d6c0589768b982cf67f3277df74) Result = SUCCESS cheolsoo : https://git-wip-us.apache.org/repos/asf?p=sqoop.git&a=commit&h=5616152ac4c96d6c0589768b982cf67f3277df74 Files : src/java/org/apache/sqoop/mapreduce/db/IntegerSplitter.java src/test/org/apache/sqoop/mapreduce/db/TestIntegerSplitter.java src/java/org/apache/sqoop/mapreduce/db/DataDrivenDBInputFormat.java
          Hide
          Hudson added a comment -

          Integrated in Sqoop-ant-jdk-1.6-hadoop200 #56 (See https://builds.apache.org/job/Sqoop-ant-jdk-1.6-hadoop200/56/)
          SQOOP-603 Support small intervals in IntegerSplitter implementation (Revision 5616152ac4c96d6c0589768b982cf67f3277df74)

          Result = SUCCESS
          cheolsoo : https://git-wip-us.apache.org/repos/asf?p=sqoop.git&a=commit&h=5616152ac4c96d6c0589768b982cf67f3277df74
          Files :

          • src/java/org/apache/sqoop/mapreduce/db/IntegerSplitter.java
          • src/test/org/apache/sqoop/mapreduce/db/TestIntegerSplitter.java
          • src/java/org/apache/sqoop/mapreduce/db/DataDrivenDBInputFormat.java
          Show
          Hudson added a comment - Integrated in Sqoop-ant-jdk-1.6-hadoop200 #56 (See https://builds.apache.org/job/Sqoop-ant-jdk-1.6-hadoop200/56/ ) SQOOP-603 Support small intervals in IntegerSplitter implementation (Revision 5616152ac4c96d6c0589768b982cf67f3277df74) Result = SUCCESS cheolsoo : https://git-wip-us.apache.org/repos/asf?p=sqoop.git&a=commit&h=5616152ac4c96d6c0589768b982cf67f3277df74 Files : src/java/org/apache/sqoop/mapreduce/db/IntegerSplitter.java src/test/org/apache/sqoop/mapreduce/db/TestIntegerSplitter.java src/java/org/apache/sqoop/mapreduce/db/DataDrivenDBInputFormat.java
          Hide
          Hudson added a comment -

          Integrated in Sqoop-ant-jdk-1.6-hadoop20 #74 (See https://builds.apache.org/job/Sqoop-ant-jdk-1.6-hadoop20/74/)
          SQOOP-603 Support small intervals in IntegerSplitter implementation (Revision 5616152ac4c96d6c0589768b982cf67f3277df74)

          Result = SUCCESS
          cheolsoo : https://git-wip-us.apache.org/repos/asf?p=sqoop.git&a=commit&h=5616152ac4c96d6c0589768b982cf67f3277df74
          Files :

          • src/java/org/apache/sqoop/mapreduce/db/IntegerSplitter.java
          • src/test/org/apache/sqoop/mapreduce/db/TestIntegerSplitter.java
          • src/java/org/apache/sqoop/mapreduce/db/DataDrivenDBInputFormat.java
          Show
          Hudson added a comment - Integrated in Sqoop-ant-jdk-1.6-hadoop20 #74 (See https://builds.apache.org/job/Sqoop-ant-jdk-1.6-hadoop20/74/ ) SQOOP-603 Support small intervals in IntegerSplitter implementation (Revision 5616152ac4c96d6c0589768b982cf67f3277df74) Result = SUCCESS cheolsoo : https://git-wip-us.apache.org/repos/asf?p=sqoop.git&a=commit&h=5616152ac4c96d6c0589768b982cf67f3277df74 Files : src/java/org/apache/sqoop/mapreduce/db/IntegerSplitter.java src/test/org/apache/sqoop/mapreduce/db/TestIntegerSplitter.java src/java/org/apache/sqoop/mapreduce/db/DataDrivenDBInputFormat.java
          Hide
          Oren Benjamin added a comment -

          Why not use open intervals for all the splits instead of special-casing this? Incrementing the max value and using only open intervals would result in removal of the existing special case for the closed last interval rather than the addition of another special case.

          Show
          Oren Benjamin added a comment - Why not use open intervals for all the splits instead of special-casing this? Incrementing the max value and using only open intervals would result in removal of the existing special case for the closed last interval rather than the addition of another special case.
          Hide
          Jarek Jarcec Cecho added a comment -

          Hi Oren,
          I'm afraid that your suggestion might be quite tricky in corner cases. For example what if the value is the maximal value of given type? I believe that we can't simply increment the value without additional checks.

          Jarcec

          Show
          Jarek Jarcec Cecho added a comment - Hi Oren, I'm afraid that your suggestion might be quite tricky in corner cases. For example what if the value is the maximal value of given type? I believe that we can't simply increment the value without additional checks. Jarcec

            People

            • Assignee:
              Jarek Jarcec Cecho
              Reporter:
              Jarek Jarcec Cecho
            • Votes:
              1 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development