Uploaded image for project: 'Tajo'
  1. Tajo
  2. TAJO-1107

Broadcast join on non-leaf node scans only first data file.

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.9.0
    • Component/s: None
    • Labels:
      None

      Description

      Fragments are determined by the Repartitioner. In the case of broadcast join Repartitioner uses different determine login according to node type(lead, non-leaf).
      Currently Tajo only uses first data file when broadcast join is non-leaf node. This is the cause of incorrect results.

        Activity

        Hide
        githubbot ASF GitHub Bot added a comment -

        GitHub user babokim opened a pull request:

        https://github.com/apache/tajo/pull/193

        TAJO-1107: Broadcast join on non-leaf node scans only first data file.

        You can merge this pull request into a Git repository by running:

        $ git pull https://github.com/babokim/tajo TAJO-1107

        Alternatively you can review and apply these changes as the patch at:

        https://github.com/apache/tajo/pull/193.patch

        To close this pull request, make a commit to your master/trunk branch
        with (at least) the following in the commit message:

        This closes #193


        commit 04d7b8537a69752fb62b9c2d83cade77124c8a9d
        Author: HyoungJun Kim <babokim@babokim-macbook-pro.local>
        Date: 2014-10-08T08:47:00Z

        TAJO-1107: Broadcast join on non-leaf node scans only first data file.

        commit 4ccc1bdab28acb9796035b7723f4bc53ff732e48
        Author: HyoungJun Kim <babokim@babokim-macbook-pro.local>
        Date: 2014-10-08T08:47:47Z

        Merge branch 'master' of https://git-wip-us.apache.org/repos/asf/tajo

        commit 8cc4aa3248d6a6e81c328cc53844ae1eb7290cf5
        Author: HyoungJun Kim <babokim@babokim-macbook-pro.local>
        Date: 2014-10-08T14:24:42Z

        Merge branch 'master' of https://git-wip-us.apache.org/repos/asf/tajo into TAJO-1107


        Show
        githubbot ASF GitHub Bot added a comment - GitHub user babokim opened a pull request: https://github.com/apache/tajo/pull/193 TAJO-1107 : Broadcast join on non-leaf node scans only first data file. You can merge this pull request into a Git repository by running: $ git pull https://github.com/babokim/tajo TAJO-1107 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tajo/pull/193.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #193 commit 04d7b8537a69752fb62b9c2d83cade77124c8a9d Author: HyoungJun Kim <babokim@babokim-macbook-pro.local> Date: 2014-10-08T08:47:00Z TAJO-1107 : Broadcast join on non-leaf node scans only first data file. commit 4ccc1bdab28acb9796035b7723f4bc53ff732e48 Author: HyoungJun Kim <babokim@babokim-macbook-pro.local> Date: 2014-10-08T08:47:47Z Merge branch 'master' of https://git-wip-us.apache.org/repos/asf/tajo commit 8cc4aa3248d6a6e81c328cc53844ae1eb7290cf5 Author: HyoungJun Kim <babokim@babokim-macbook-pro.local> Date: 2014-10-08T14:24:42Z Merge branch 'master' of https://git-wip-us.apache.org/repos/asf/tajo into TAJO-1107
        Hide
        githubbot ASF GitHub Bot added a comment -

        Github user hyunsik commented on a diff in the pull request:

        https://github.com/apache/tajo/pull/193#discussion_r18605091

        — Diff: tajo-core/src/main/java/org/apache/tajo/master/querymaster/Repartitioner.java —
        @@ -388,12 +392,35 @@ private static void scheduleSymmetricRepartitionJoin(QueryMasterTask.QueryMaster
        int joinTaskNum = Math.min(maxTaskNum, hashEntries.size());
        LOG.info("The determined number of join tasks is " + joinTaskNum);

        • FileFragment[] rightFragments = new FileFragment[1 + (broadcastFragments == null ? 0 : broadcastFragments.length)];
        • rightFragments[0] = fragments[1];
          + List<FileFragment> rightFragments = new ArrayList<FileFragment>();
          + rightFragments.add(fragments[1]);
          +
          if (broadcastFragments != null) {
        • System.arraycopy(broadcastFragments, 0, rightFragments, 1, broadcastFragments.length);
          + //In this phase a ScanNode has a single fragment.
          + //If there are more than one data files, that files should be added to fragments or partition path
          + AbstractStorageManager storageManager = subQuery.getStorageManager();
          + int index = 0;
          + for (FileFragment eachFragment: broadcastFragments) {
            • End diff –

        We already ensure that each broadcastFragments will must have its corresponding element in broadcastScan array. Nevertheless, what ```eachFragment``` variable is not used in the for-loop block seems to not be intuitive. Could you replace it by the loop using broadcastScans?

        Show
        githubbot ASF GitHub Bot added a comment - Github user hyunsik commented on a diff in the pull request: https://github.com/apache/tajo/pull/193#discussion_r18605091 — Diff: tajo-core/src/main/java/org/apache/tajo/master/querymaster/Repartitioner.java — @@ -388,12 +392,35 @@ private static void scheduleSymmetricRepartitionJoin(QueryMasterTask.QueryMaster int joinTaskNum = Math.min(maxTaskNum, hashEntries.size()); LOG.info("The determined number of join tasks is " + joinTaskNum); FileFragment[] rightFragments = new FileFragment [1 + (broadcastFragments == null ? 0 : broadcastFragments.length)] ; rightFragments [0] = fragments [1] ; + List<FileFragment> rightFragments = new ArrayList<FileFragment>(); + rightFragments.add(fragments [1] ); + if (broadcastFragments != null) { System.arraycopy(broadcastFragments, 0, rightFragments, 1, broadcastFragments.length); + //In this phase a ScanNode has a single fragment. + //If there are more than one data files, that files should be added to fragments or partition path + AbstractStorageManager storageManager = subQuery.getStorageManager(); + int index = 0; + for (FileFragment eachFragment: broadcastFragments) { End diff – We already ensure that each broadcastFragments will must have its corresponding element in broadcastScan array. Nevertheless, what ```eachFragment``` variable is not used in the for-loop block seems to not be intuitive. Could you replace it by the loop using broadcastScans?
        Hide
        githubbot ASF GitHub Bot added a comment -

        Github user hyunsik commented on the pull request:

        https://github.com/apache/tajo/pull/193#issuecomment-58410665

        +1
        The patch looks good to me. The patch exactly fixes the bug and includes enough unit tests for all cases. I leave one trivial comment. Before committing it, please reflect the comment if you accept.

        Thanks,
        Hyunsik

        Show
        githubbot ASF GitHub Bot added a comment - Github user hyunsik commented on the pull request: https://github.com/apache/tajo/pull/193#issuecomment-58410665 +1 The patch looks good to me. The patch exactly fixes the bug and includes enough unit tests for all cases. I leave one trivial comment. Before committing it, please reflect the comment if you accept. Thanks, Hyunsik
        Hide
        githubbot ASF GitHub Bot added a comment -

        Github user babokim commented on a diff in the pull request:

        https://github.com/apache/tajo/pull/193#discussion_r18624541

        — Diff: tajo-core/src/main/java/org/apache/tajo/master/querymaster/Repartitioner.java —
        @@ -388,12 +392,35 @@ private static void scheduleSymmetricRepartitionJoin(QueryMasterTask.QueryMaster
        int joinTaskNum = Math.min(maxTaskNum, hashEntries.size());
        LOG.info("The determined number of join tasks is " + joinTaskNum);

        • FileFragment[] rightFragments = new FileFragment[1 + (broadcastFragments == null ? 0 : broadcastFragments.length)];
        • rightFragments[0] = fragments[1];
          + List<FileFragment> rightFragments = new ArrayList<FileFragment>();
          + rightFragments.add(fragments[1]);
          +
          if (broadcastFragments != null) {
        • System.arraycopy(broadcastFragments, 0, rightFragments, 1, broadcastFragments.length);
          + //In this phase a ScanNode has a single fragment.
          + //If there are more than one data files, that files should be added to fragments or partition path
          + AbstractStorageManager storageManager = subQuery.getStorageManager();
          + int index = 0;
          + for (FileFragment eachFragment: broadcastFragments) {
            • End diff –

        Ok, I'll replace broadcastFragments with broadcastScans before committing.

        Show
        githubbot ASF GitHub Bot added a comment - Github user babokim commented on a diff in the pull request: https://github.com/apache/tajo/pull/193#discussion_r18624541 — Diff: tajo-core/src/main/java/org/apache/tajo/master/querymaster/Repartitioner.java — @@ -388,12 +392,35 @@ private static void scheduleSymmetricRepartitionJoin(QueryMasterTask.QueryMaster int joinTaskNum = Math.min(maxTaskNum, hashEntries.size()); LOG.info("The determined number of join tasks is " + joinTaskNum); FileFragment[] rightFragments = new FileFragment [1 + (broadcastFragments == null ? 0 : broadcastFragments.length)] ; rightFragments [0] = fragments [1] ; + List<FileFragment> rightFragments = new ArrayList<FileFragment>(); + rightFragments.add(fragments [1] ); + if (broadcastFragments != null) { System.arraycopy(broadcastFragments, 0, rightFragments, 1, broadcastFragments.length); + //In this phase a ScanNode has a single fragment. + //If there are more than one data files, that files should be added to fragments or partition path + AbstractStorageManager storageManager = subQuery.getStorageManager(); + int index = 0; + for (FileFragment eachFragment: broadcastFragments) { End diff – Ok, I'll replace broadcastFragments with broadcastScans before committing.
        Hide
        githubbot ASF GitHub Bot added a comment -

        Github user asfgit closed the pull request at:

        https://github.com/apache/tajo/pull/193

        Show
        githubbot ASF GitHub Bot added a comment - Github user asfgit closed the pull request at: https://github.com/apache/tajo/pull/193
        Hide
        hudson Hudson added a comment -

        SUCCESS: Integrated in Tajo-master-build #406 (See https://builds.apache.org/job/Tajo-master-build/406/)
        TAJO-1107: Broadcast join on non-leaf node scans only first data file. (babokim: rev 5f64cbb1a64e2bcc034fb3f82c65ae448cb824bd)

        • tajo-core/src/test/resources/results/TestJoinBroadcast/testMultiplePartitionedBroadcastDataFileWithZeroLength2.result
        • tajo-core/src/test/resources/results/TestJoinBroadcast/testMultiplePartitionedBroadcastDataFileWithZeroLength.result
        • tajo-core/src/test/resources/results/TestJoinBroadcast/testMultipleBroadcastDataFileWithZeroLength2.result
        • tajo-core/src/main/java/org/apache/tajo/master/querymaster/Repartitioner.java
        • tajo-core/src/test/resources/queries/TestJoinBroadcast/testMultiplePartitionedBroadcastDataFileWithZeroLength.sql
        • tajo-core/src/test/resources/queries/TestJoinBroadcast/testMultipleBroadcastDataFileWithZeroLength2.sql
        • tajo-core/src/test/java/org/apache/tajo/engine/query/TestJoinBroadcast.java
        • tajo-core/src/test/resources/queries/TestJoinBroadcast/testMultiplePartitionedBroadcastDataFileWithZeroLength2.sql
        • CHANGES
        Show
        hudson Hudson added a comment - SUCCESS: Integrated in Tajo-master-build #406 (See https://builds.apache.org/job/Tajo-master-build/406/ ) TAJO-1107 : Broadcast join on non-leaf node scans only first data file. (babokim: rev 5f64cbb1a64e2bcc034fb3f82c65ae448cb824bd) tajo-core/src/test/resources/results/TestJoinBroadcast/testMultiplePartitionedBroadcastDataFileWithZeroLength2.result tajo-core/src/test/resources/results/TestJoinBroadcast/testMultiplePartitionedBroadcastDataFileWithZeroLength.result tajo-core/src/test/resources/results/TestJoinBroadcast/testMultipleBroadcastDataFileWithZeroLength2.result tajo-core/src/main/java/org/apache/tajo/master/querymaster/Repartitioner.java tajo-core/src/test/resources/queries/TestJoinBroadcast/testMultiplePartitionedBroadcastDataFileWithZeroLength.sql tajo-core/src/test/resources/queries/TestJoinBroadcast/testMultipleBroadcastDataFileWithZeroLength2.sql tajo-core/src/test/java/org/apache/tajo/engine/query/TestJoinBroadcast.java tajo-core/src/test/resources/queries/TestJoinBroadcast/testMultiplePartitionedBroadcastDataFileWithZeroLength2.sql CHANGES
        Hide
        hudson Hudson added a comment -

        SUCCESS: Integrated in Tajo-master-CODEGEN-build #48 (See https://builds.apache.org/job/Tajo-master-CODEGEN-build/48/)
        TAJO-1107: Broadcast join on non-leaf node scans only first data file. (babokim: rev 5f64cbb1a64e2bcc034fb3f82c65ae448cb824bd)

        • tajo-core/src/test/resources/results/TestJoinBroadcast/testMultipleBroadcastDataFileWithZeroLength2.result
        • CHANGES
        • tajo-core/src/test/resources/queries/TestJoinBroadcast/testMultiplePartitionedBroadcastDataFileWithZeroLength2.sql
        • tajo-core/src/main/java/org/apache/tajo/master/querymaster/Repartitioner.java
        • tajo-core/src/test/resources/results/TestJoinBroadcast/testMultiplePartitionedBroadcastDataFileWithZeroLength2.result
        • tajo-core/src/test/resources/results/TestJoinBroadcast/testMultiplePartitionedBroadcastDataFileWithZeroLength.result
        • tajo-core/src/test/resources/queries/TestJoinBroadcast/testMultipleBroadcastDataFileWithZeroLength2.sql
        • tajo-core/src/test/java/org/apache/tajo/engine/query/TestJoinBroadcast.java
        • tajo-core/src/test/resources/queries/TestJoinBroadcast/testMultiplePartitionedBroadcastDataFileWithZeroLength.sql
        Show
        hudson Hudson added a comment - SUCCESS: Integrated in Tajo-master-CODEGEN-build #48 (See https://builds.apache.org/job/Tajo-master-CODEGEN-build/48/ ) TAJO-1107 : Broadcast join on non-leaf node scans only first data file. (babokim: rev 5f64cbb1a64e2bcc034fb3f82c65ae448cb824bd) tajo-core/src/test/resources/results/TestJoinBroadcast/testMultipleBroadcastDataFileWithZeroLength2.result CHANGES tajo-core/src/test/resources/queries/TestJoinBroadcast/testMultiplePartitionedBroadcastDataFileWithZeroLength2.sql tajo-core/src/main/java/org/apache/tajo/master/querymaster/Repartitioner.java tajo-core/src/test/resources/results/TestJoinBroadcast/testMultiplePartitionedBroadcastDataFileWithZeroLength2.result tajo-core/src/test/resources/results/TestJoinBroadcast/testMultiplePartitionedBroadcastDataFileWithZeroLength.result tajo-core/src/test/resources/queries/TestJoinBroadcast/testMultipleBroadcastDataFileWithZeroLength2.sql tajo-core/src/test/java/org/apache/tajo/engine/query/TestJoinBroadcast.java tajo-core/src/test/resources/queries/TestJoinBroadcast/testMultiplePartitionedBroadcastDataFileWithZeroLength.sql
        Hide
        hudson Hudson added a comment -

        SUCCESS: Integrated in Tajo-block_iteration-branch-build #15 (See https://builds.apache.org/job/Tajo-block_iteration-branch-build/15/)
        TAJO-1107: Broadcast join on non-leaf node scans only first data file. (babokim: rev 5f64cbb1a64e2bcc034fb3f82c65ae448cb824bd)

        • tajo-core/src/test/resources/results/TestJoinBroadcast/testMultiplePartitionedBroadcastDataFileWithZeroLength.result
        • tajo-core/src/test/resources/queries/TestJoinBroadcast/testMultiplePartitionedBroadcastDataFileWithZeroLength2.sql
        • tajo-core/src/test/resources/queries/TestJoinBroadcast/testMultipleBroadcastDataFileWithZeroLength2.sql
        • tajo-core/src/test/resources/results/TestJoinBroadcast/testMultipleBroadcastDataFileWithZeroLength2.result
        • CHANGES
        • tajo-core/src/test/java/org/apache/tajo/engine/query/TestJoinBroadcast.java
        • tajo-core/src/test/resources/results/TestJoinBroadcast/testMultiplePartitionedBroadcastDataFileWithZeroLength2.result
        • tajo-core/src/test/resources/queries/TestJoinBroadcast/testMultiplePartitionedBroadcastDataFileWithZeroLength.sql
        • tajo-core/src/main/java/org/apache/tajo/master/querymaster/Repartitioner.java
        Show
        hudson Hudson added a comment - SUCCESS: Integrated in Tajo-block_iteration-branch-build #15 (See https://builds.apache.org/job/Tajo-block_iteration-branch-build/15/ ) TAJO-1107 : Broadcast join on non-leaf node scans only first data file. (babokim: rev 5f64cbb1a64e2bcc034fb3f82c65ae448cb824bd) tajo-core/src/test/resources/results/TestJoinBroadcast/testMultiplePartitionedBroadcastDataFileWithZeroLength.result tajo-core/src/test/resources/queries/TestJoinBroadcast/testMultiplePartitionedBroadcastDataFileWithZeroLength2.sql tajo-core/src/test/resources/queries/TestJoinBroadcast/testMultipleBroadcastDataFileWithZeroLength2.sql tajo-core/src/test/resources/results/TestJoinBroadcast/testMultipleBroadcastDataFileWithZeroLength2.result CHANGES tajo-core/src/test/java/org/apache/tajo/engine/query/TestJoinBroadcast.java tajo-core/src/test/resources/results/TestJoinBroadcast/testMultiplePartitionedBroadcastDataFileWithZeroLength2.result tajo-core/src/test/resources/queries/TestJoinBroadcast/testMultiplePartitionedBroadcastDataFileWithZeroLength.sql tajo-core/src/main/java/org/apache/tajo/master/querymaster/Repartitioner.java

          People

          • Assignee:
            hjkim Hyoungjun Kim
            Reporter:
            hjkim Hyoungjun Kim
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development