  1. Tajo
  2. TAJO-931

Output file can be punctuated depending on the file size.

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.9.0
    • Component/s: Physical Operator
    • Labels: None

      Description

      Some file formats (e.g., Parquet) are not splittable. A very large file in such a format spans multiple HDFS blocks, but a single task must still read the whole file. This causes remote HDFS access and limits the degree of parallelism, resulting in significant performance degradation.

      We can solve this problem if StoreTableExec or {Col|SortBased}PartitionStoreExec can punctuate the final output file according to the written size, i.e., close the current file and start a new one once enough bytes have been written.

      In addition, we need a session variable that determines the per-file size of the final output files. So, TAJO-928 blocks this issue.
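
The idea of punctuating output by written size can be sketched as follows. This is only an illustrative, in-memory model, not Tajo's actual StoreTableExec: the class RotatingStore, its methods, and the maxBytesPerFile parameter (standing in for the proposed session variable) are all hypothetical names, and rotating before a row that would exceed the limit is just one possible policy.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of size-based output rotation; not Tajo code. */
final class RotatingStore {
    // Stands in for the per-file size session variable (assumed name).
    private final long maxBytesPerFile;
    // Each element models one output file.
    private final List<StringBuilder> files = new ArrayList<>();
    // Bytes written to the current (last) file.
    private long writtenBytes = 0;

    RotatingStore(long maxBytesPerFile) {
        if (maxBytesPerFile <= 0) {
            throw new IllegalArgumentException("maxBytesPerFile must be positive");
        }
        this.maxBytesPerFile = maxBytesPerFile;
    }

    /** Appends one row, starting a new file if the row would exceed the limit. */
    void write(String row) {
        long rowBytes = row.getBytes(StandardCharsets.UTF_8).length;
        boolean noFileYet = files.isEmpty();
        // Never rotate away from an empty file: an oversized row gets a file of its own.
        boolean wouldOverflow = writtenBytes > 0 && writtenBytes + rowBytes > maxBytesPerFile;
        if (noFileYet || wouldOverflow) {
            files.add(new StringBuilder());
            writtenBytes = 0;
        }
        files.get(files.size() - 1).append(row);
        writtenBytes += rowBytes;
    }

    int fileCount() {
        return files.size();
    }

    String fileContent(int index) {
        return files.get(index).toString();
    }

    public static void main(String[] args) {
        RotatingStore store = new RotatingStore(10);
        store.write("aaaa\n");
        store.write("bbbb\n");
        store.write("cccc\n");
        System.out.println("files written: " + store.fileCount());
    }
}
```

With a small enough limit (for example, one HDFS block), each output file can be read locally by a single task, which is the effect the description is after.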

          Activity

          hudson Hudson added a comment -

          SUCCESS: Integrated in Tajo-master-build #343 (See https://builds.apache.org/job/Tajo-master-build/343/)
          TAJO-931: Output file can be punctuated depending on the file size. (hyunsik) (hyunsik: rev a1711d16be579082fb57e5abb43ff1872d424451)

          • tajo-core/src/main/java/org/apache/tajo/engine/planner/PhysicalPlannerImpl.java
          • tajo-storage/src/main/java/org/apache/tajo/storage/thirdparty/parquet/ParquetWriter.java
          • tajo-catalog/tajo-catalog-common/src/test/java/org/apache/tajo/catalog/TestKeyValueSet.java
          • tajo-storage/src/main/java/org/apache/tajo/storage/FileAppender.java
          • CHANGES
          • tajo-storage/src/main/java/org/apache/tajo/storage/thirdparty/parquet/InternalParquetRecordWriter.java
          • tajo-storage/src/main/java/org/apache/tajo/storage/Appender.java
          • tajo-core/src/main/java/org/apache/tajo/engine/planner/logical/PersistentStoreNode.java
          • tajo-core/src/main/java/org/apache/tajo/engine/planner/physical/HashBasedColPartitionStoreExec.java
          • tajo-storage/src/test/java/org/apache/tajo/storage/TestCompressionStorages.java
          • tajo-client/src/main/java/org/apache/tajo/client/TajoGetConf.java
          • tajo-core/src/main/java/org/apache/tajo/engine/planner/physical/ColPartitionStoreExec.java
          • tajo-storage/src/main/java/org/apache/tajo/storage/parquet/ParquetAppender.java
          • tajo-core/src/main/java/org/apache/tajo/engine/planner/physical/PhysicalPlanUtil.java
          • tajo-common/src/main/java/org/apache/tajo/util/KeyValueSet.java
          • tajo-core/src/main/java/org/apache/tajo/engine/planner/physical/SortBasedColPartitionStoreExec.java
          • tajo-common/src/main/java/org/apache/tajo/util/BitArray.java
          • tajo-core/src/main/java/org/apache/tajo/engine/planner/physical/StoreTableExec.java
          • tajo-common/src/main/java/org/apache/tajo/OverridableConf.java
          • tajo-storage/src/main/java/org/apache/tajo/storage/HashShuffleAppender.java
          • tajo-core/src/test/java/org/apache/tajo/engine/planner/physical/TestPhysicalPlanner.java
          • tajo-core/src/main/java/org/apache/tajo/master/session/Session.java
          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/tajo/pull/119

          hyunsik Hyunsik Choi added a comment -

          Committed it to the master branch.

          githubbot ASF GitHub Bot added a comment -

          Github user blrunner commented on the pull request:

          https://github.com/apache/tajo/pull/119#issuecomment-52870968

          +1

          It looks good overall and 'mvn clean install -Phcatalog-0.12.0 -Dtajo.catalog.store.class=org.apache.tajo.catalog.store.HCatalogStore' finished successfully.

          githubbot ASF GitHub Bot added a comment -

          Github user hyunsik commented on the pull request:

          https://github.com/apache/tajo/pull/119#issuecomment-52804586

          I've rebased, addressed the review comments, and fixed some potential bugs. Please review this.

          githubbot ASF GitHub Bot added a comment -

          Github user hyunsik commented on the pull request:

          https://github.com/apache/tajo/pull/119#issuecomment-52660031

          Rebased.

          githubbot ASF GitHub Bot added a comment -

          Github user hyunsik commented on the pull request:

          https://github.com/apache/tajo/pull/119#issuecomment-52468899

          Hi @blrunner,

          Thank you for your comment. I've addressed it and rebased the patch against the latest revision.

          githubbot ASF GitHub Bot added a comment -

          Github user blrunner commented on a diff in the pull request:

          https://github.com/apache/tajo/pull/119#discussion_r16338875

          --- Diff: tajo-core/src/main/java/org/apache/tajo/engine/planner/physical/ColPartitionStoreExec.java ---
          @@ -67,6 +79,15 @@ public ColPartitionStoreExec(TaskAttemptContext context, StoreTableNode plan, Ph
                 meta = CatalogUtil.newTableMeta(plan.getStorageType());
               }

          +    if (!(plan instanceof InsertNode)) {
          +      String nullChar = context.getQueryContext().get(SessionVars.NULL_CHAR);
          +      meta.putOption(StorageConstants.CSVFILE_NULL, nullChar);
          --- End diff --

          You also need to handle the null characters of the other file formats, because StorageConstants.SEQUENCEFILE_NULL and StorageConstants.RCFILE_NULL exist as well.
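
One way to act on this review comment is to choose the null-character option key by storage type instead of always using the CSV key. The sketch below is a standalone illustration under stated assumptions: the StoreType enum, the key strings, and the plain Map standing in for the table meta options are hypothetical and only mirror the constants named in the comment. It is not the change that was eventually committed.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: pick the null-character option per storage type. */
final class NullCharOptions {
    // Assumed key strings standing in for the StorageConstants fields named above.
    static final String CSVFILE_NULL = "csvfile.null";
    static final String SEQUENCEFILE_NULL = "sequencefile.null";
    static final String RCFILE_NULL = "rcfile.null";

    // Hypothetical subset of storage types, for illustration only.
    enum StoreType { CSV, SEQUENCEFILE, RCFILE, PARQUET }

    /** Returns the option key for the given type, or null if the type has none. */
    static String nullOptionKey(StoreType type) {
        switch (type) {
            case CSV:
                return CSVFILE_NULL;
            case SEQUENCEFILE:
                return SEQUENCEFILE_NULL;
            case RCFILE:
                return RCFILE_NULL;
            default:
                return null;
        }
    }

    /** Stores the session's null character under the key matching the storage type. */
    static void applyNullChar(Map<String, String> options, StoreType type, String nullChar) {
        String key = nullOptionKey(type);
        if (key != null) {
            options.put(key, nullChar);
        }
    }

    public static void main(String[] args) {
        Map<String, String> options = new HashMap<>();
        applyNullChar(options, StoreType.RCFILE, "\\N");
        System.out.println(options);
    }
}
```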

          githubbot ASF GitHub Bot added a comment -

          GitHub user hyunsik opened a pull request:

          https://github.com/apache/tajo/pull/119

          TAJO-931: Output file can be punctuated depending on the file size.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/hyunsik/tajo TAJO-931

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/tajo/pull/119.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #119


          commit a3b78642abb6c160b147eae2f29a10e362c14cac
          Author: Hyunsik Choi <hyunsik@apache.org>
          Date: 2014-07-08T08:47:42Z

          Improve session variables to affect the query config.

          commit 0a0035d9b259a1a05ba790b7a778a745251d27bd
          Author: Hyunsik Choi <hyunsik@apache.org>
          Date: 2014-07-08T12:54:32Z

          Fixed.

          commit 3fb54a6dde89d2d8e972253c1eccd17f334180d4
          Author: Hyunsik Choi <hyunsik@apache.org>
          Date: 2014-07-09T02:23:28Z

          Completed output file rotating.

          commit 8028f5f876af2050bb602e277026e76ca802619a
          Author: Hyunsik Choi <hyunsik@apache.org>
          Date: 2014-07-15T03:57:29Z

          Merge branch 'master' of https://git-wip-us.apache.org/repos/asf/tajo into OUTPUT_ROTATING

          Conflicts:
          tajo-core/src/main/java/org/apache/tajo/engine/planner/global/GlobalPlanner.java
          tajo-core/src/main/java/org/apache/tajo/master/querymaster/Repartitioner.java
          tajo-core/src/main/java/org/apache/tajo/master/querymaster/SubQuery.java

          commit 50f6af418b42704ba14a4c7a084372f80c7ce1ec
          Author: Hyunsik Choi <hyunsik@apache.org>
          Date: 2014-07-15T06:25:09Z

          Merge branch 'master' of https://git-wip-us.apache.org/repos/asf/tajo into OUTPUT_ROTATING

          commit 4d0abc0dfbf6c5898bce6bd0e1ecd4c995108571
          Author: Hyunsik Choi <hyunsik@apache.org>
          Date: 2014-07-15T11:13:55Z

          Merge branch 'master' of https://git-wip-us.apache.org/repos/asf/tajo into OUTPUT_ROTATING

          Conflicts:
          tajo-core/src/main/java/org/apache/tajo/engine/planner/physical/HashBasedColPartitionStoreExec.java
          tajo-core/src/main/java/org/apache/tajo/engine/planner/physical/SortBasedColPartitionStoreExec.java

          commit dd79f666d81875bf6a547478b76fc55b60f37d09
          Author: Hyunsik Choi <hyunsik@apache.org>
          Date: 2014-07-15T12:31:11Z

          Added estimatedwrittensize.

          commit da231ca89e5cf3638ea16faad281f8296854a9dd
          Author: Hyunsik Choi <hyunsik@apache.org>
          Date: 2014-07-17T03:03:37Z

          Merge branch 'master' of https://git-wip-us.apache.org/repos/asf/tajo into OUTPUT_ROTATING

          commit c006382a3b16973872d753c9a0e0150da1c0f687
          Author: Hyunsik Choi <hyunsik@apache.org>
          Date: 2014-07-17T03:10:20Z

          Reflect session variables to GlobalPlanner, Repartitioner, and PhysicalPlannerImpl.

          commit 681aa25916f8de8a45f2b953215de76b023393a0
          Author: Hyunsik Choi <hyunsik@apache.org>
          Date: 2014-08-11T07:56:37Z

          Merge branch 'master' of https://git-wip-us.apache.org/repos/asf/tajo into OUTPUT_ROTATING

          Conflicts:
          tajo-core/src/main/java/org/apache/tajo/engine/planner/PhysicalPlannerImpl.java
          tajo-core/src/main/java/org/apache/tajo/engine/planner/global/GlobalPlanner.java
          tajo-core/src/main/java/org/apache/tajo/engine/planner/physical/SortBasedColPartitionStoreExec.java
          tajo-core/src/main/java/org/apache/tajo/engine/planner/physical/StoreTableExec.java
          tajo-core/src/main/java/org/apache/tajo/engine/query/QueryContext.java
          tajo-core/src/main/java/org/apache/tajo/master/querymaster/Repartitioner.java
          tajo-core/src/main/java/org/apache/tajo/worker/TaskAttemptContext.java
          tajo-storage/src/main/java/org/apache/tajo/storage/Appender.java

          commit b7a73bb22df1198010e2b18f3e67aaeeec30f52f
          Author: Hyunsik Choi <hyunsik@apache.org>
          Date: 2014-08-11T08:33:51Z

          Merge branch 'master' of https://git-wip-us.apache.org/repos/asf/tajo into OUTPUT_ROTATING

          Conflicts:
          tajo-core/src/main/java/org/apache/tajo/engine/planner/PhysicalPlannerImpl.java
          tajo-core/src/main/java/org/apache/tajo/engine/planner/global/GlobalPlanner.java
          tajo-core/src/main/java/org/apache/tajo/engine/planner/physical/SortBasedColPartitionStoreExec.java
          tajo-core/src/main/java/org/apache/tajo/engine/planner/physical/StoreTableExec.java
          tajo-core/src/main/java/org/apache/tajo/engine/query/QueryContext.java
          tajo-core/src/main/java/org/apache/tajo/master/querymaster/Repartitioner.java
          tajo-core/src/main/java/org/apache/tajo/worker/TaskAttemptContext.java
          tajo-storage/src/main/java/org/apache/tajo/storage/Appender.java

          commit 803fb6a677b6831faf5e602bf77961b31128b7cd
          Author: Hyunsik Choi <hyunsik@apache.org>
          Date: 2014-08-15T14:20:55Z

          Merge branch 'master' of https://git-wip-us.apache.org/repos/asf/tajo into TAJO-931

          commit 8a6f782bb3c423ee7ef17f8522fd9206803da0cb
          Author: Hyunsik Choi <hyunsik@apache.org>
          Date: 2014-08-16T18:05:20Z

          TAJO-931: Output file can be punctuated depending on the file size.


          hyunsik Hyunsik Choi added a comment -

          If there is no objection, I'll include TAJO-935 in this work because TAJO-935 is trivial.

          hyunsik Hyunsik Choi added a comment -

          To get the written file size, we need to modify ParquetFileWriter.


            People

            • Assignee: Hyunsik Choi
            • Reporter: Hyunsik Choi