ORC-91

Use hdfs v-blocks instead of zero-padding stripes

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.5.0
    • Component/s: None
    • Labels:
      None

      Description

      HDFS-3689 added variable length blocks to HDFS as a core feature.

      hsync(SyncFlag.END_BLOCK)

      can now end an HDFS block on disk without padding it out with zeros.

      The space ORC currently wastes on padding is 5%.
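
To make the trade-off concrete, here is a minimal sketch in plain Java (no Hadoop dependency) of what zero-padding costs when the next stripe has to start on a block boundary. The class name, method names, the 256 MiB block size and the example offset are all assumptions chosen for illustration. With a variable-length block the writer would instead call hsync with SyncFlag.END_BLOCK on the HDFS output stream and write no filler bytes.

```java
/** Illustrative model only: compares zero-padding with ending the block early. */
final class PaddingCostSketch {

    /** Bytes of zeros needed to move {@code offset} up to the next block boundary. */
    static long zeroPaddingNeeded(long offset, long blockSize) {
        long usedInBlock = offset % blockSize;
        return usedInBlock == 0 ? 0 : blockSize - usedInBlock;
    }

    /** File offset of the next stripe if the writer pads with zeros. */
    static long nextStripeOffsetWithPadding(long offset, long blockSize) {
        return offset + zeroPaddingNeeded(offset, blockSize);
    }

    /** File offset of the next stripe if the block is ended early: no filler is written. */
    static long nextStripeOffsetWithVariableBlock(long offset) {
        return offset;
    }

    public static void main(String[] args) {
        long blockSize = 256L * 1024 * 1024; // assumed HDFS block size
        long offset = 250L * 1024 * 1024;    // assumed end of the previous stripe
        System.out.println("zero bytes written when padding: "
            + zeroPaddingNeeded(offset, blockSize));
        System.out.println("next stripe offset with padding: "
            + nextStripeOffsetWithPadding(offset, blockSize));
        System.out.println("next stripe offset with a variable-length block: "
            + nextStripeOffsetWithVariableBlock(offset));
    }
}
```

In this model the padded file grows by the distance to the boundary, while the variable-length case leaves the file length unchanged and only moves where the next block begins.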

          Activity

          gopalv Gopal V added a comment - - edited

          There's one Hadoop release (2.7.3) where using any HDFS-only APIs will break ORC, due to HADOOP-14132.

          githubbot ASF GitHub Bot added a comment -

          GitHub user t3rmin4t0r opened a pull request:

          https://github.com/apache/orc/pull/138

          ORC-91: Use hdfs v-blocks instead of zero-padding stripes

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/t3rmin4t0r/orc orc91

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/orc/pull/138.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #138


          commit 3f72171ee6ceb72c3506a1dcac5a9fe1d6a10265
          Author: Gopal V <t3rmin4t0r@users.noreply.github.com>
          Date: 2017-07-20T07:34:01Z

          ORC-214. Upgrade Aircompressor to 0.8

          commit ebd0e135ed0ade47434b118e9b17570c80e581c0
          Author: Gopal V <gopalv@apache.org>
          Date: 2017-07-25T04:15:57Z

          ORC-91: Use hdfs v-blocks instead of zero-padding stripes


          gopalv Gopal V added a comment -

          Testing this patch on 1 TB of data to confirm the data sizes with an actual measurement.

          githubbot ASF GitHub Bot added a comment -

          Github user omalley commented on a diff in the pull request:

          https://github.com/apache/orc/pull/138#discussion_r131491100

          --- Diff: java/core/src/java/org/apache/orc/impl/HadoopShims.java ---
          @@ -124,6 +126,43 @@
          */
          public TextReaderShim getTextReaderShim(InputStream input) throws IOException;

          +
          + /**
          + * Block filler shim - make sure the DFS blocks ends at this offset.
          + */
          + public interface BlockFillerShim {
          --- End diff --

          I think it would be better to create a HadoopShims_2_5 that has the simple fill method and make HadoopShimsCurrent have the full featured one.
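
A rough sketch of the split suggested here, in plain Java with no Hadoop dependency: one shim that can only write zeros, and a richer one that ends the block when the stream allows it and otherwise falls back. Every name in it (BlockEndingStream, BlockFillSketch, ZeroFillShim, VariableBlockShim, InMemoryBlockStream) is hypothetical and is not the code from the pull request. The endBlock call stands in for hsync with SyncFlag.END_BLOCK on an HDFS stream.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

/** Hypothetical stand-in for a stream that can end its current block early. */
interface BlockEndingStream {
    void endBlock() throws IOException;
}

/** Hypothetical shim contract: make the current block end at the stream's position. */
interface BlockFillSketch {
    /** @return the number of filler bytes actually written */
    long fillBlock(OutputStream out, long bytesLeftInBlock) throws IOException;
}

/** Simple variant: the only option is to write zeros up to the boundary. */
class ZeroFillShim implements BlockFillSketch {
    @Override
    public long fillBlock(OutputStream out, long bytesLeftInBlock) throws IOException {
        byte[] zeros = new byte[64 * 1024];
        long toWrite = Math.max(0, bytesLeftInBlock);
        long remaining = toWrite;
        while (remaining > 0) {
            int n = (int) Math.min(zeros.length, remaining);
            out.write(zeros, 0, n);
            remaining -= n;
        }
        return toWrite;
    }
}

/** Full-featured variant: end the block if the stream supports it, else pad. */
class VariableBlockShim extends ZeroFillShim {
    @Override
    public long fillBlock(OutputStream out, long bytesLeftInBlock) throws IOException {
        if (out instanceof BlockEndingStream) {
            ((BlockEndingStream) out).endBlock();
            return 0; // nothing was written; the block simply ends here
        }
        return super.fillBlock(out, bytesLeftInBlock);
    }
}

/** In-memory test double that records whether the block was ended. */
class InMemoryBlockStream extends ByteArrayOutputStream implements BlockEndingStream {
    boolean ended;

    @Override
    public void endBlock() {
        ended = true;
    }
}
```

The richer shim extends the simple one, so the dependency only points from new to old and the fallback path stays identical for streams that cannot end a block.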

          githubbot ASF GitHub Bot added a comment -

          GitHub user omalley opened a pull request:

          https://github.com/apache/orc/pull/166

          ORC-91: Use hdfs v-blocks instead of zero-padding stripes

          Ok, this is my rebasing of Gopal's patch to use the zero padding after my patch for ORC-234.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/omalley/orc orc-91

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/orc/pull/166.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #166


          commit fb3f096e6724811bb9604c016b008519d34288b8
          Author: Gopal V <gopalv@apache.org>
          Date: 2017-07-25T04:15:57Z

          ORC-91: Use hdfs v-blocks instead of zero-padding stripes


          githubbot ASF GitHub Bot added a comment -

          Github user omalley commented on the issue:

          https://github.com/apache/orc/pull/138

          Ok, I've rebased this patch on top of ORC-234 in https://github.com/apache/orc/pull/166 . @t3rmin4t0r can you please review it?

          githubbot ASF GitHub Bot added a comment -

          Github user omalley commented on the issue:

          https://github.com/apache/orc/pull/166

          Changes:

          • Renamed HadoopShims_2_2 to HadoopShimsPre2_3 and moved to shims module.
          • Removed TextReaderShim since we aren't using it any more.
          • Pulled the padding method into the HadoopShim as padStreamToBlock.
          • Added HadoopShimsPre2_7.
          • Refactored a bit so that the shim classes share most of the code, while making sure that older shims never reference new shims.
          • Fixed some broken javadoc.
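
As an illustration of how version-tiered shims like these could be chosen at runtime, the following sketch maps a Hadoop version string to a shim class name. The selector class and its cut-offs are assumptions inferred from the class names in the list above, not the logic in the pull request. Returning a name rather than an instance keeps the selector free of compile-time references to the newer classes; a real implementation might then load the class by name.

```java
/** Hypothetical selector: maps a Hadoop version string to a shim tier name. */
final class ShimSelectorSketch {

    static String shimFor(String hadoopVersion) {
        // Split on '.' and '-' so "3.0.0-SNAPSHOT" parses the same way as "3.0.0".
        String[] parts = hadoopVersion.split("[.-]");
        int major = Integer.parseInt(parts[0]);
        int minor = parts.length > 1 ? Integer.parseInt(parts[1]) : 0;

        if (major < 2 || (major == 2 && minor < 3)) {
            return "HadoopShimsPre2_3";
        }
        if (major == 2 && minor < 7) {
            return "HadoopShimsPre2_7";
        }
        return "HadoopShimsCurrent";
    }

    public static void main(String[] args) {
        for (String v : new String[] {"2.2.0", "2.6.5", "2.7.3", "3.0.0-SNAPSHOT"}) {
            System.out.println(v + " -> " + shimFor(v));
        }
    }
}
```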
          gopalv Gopal V added a comment - - edited

          Found four blocks in a terabyte of inserts which go over the block boundary.

          Stripe: offset: 264747438 data: 29405432 rows: 1085000 tail: 302 index: 23780
          

          went over the block boundary at 268435456.

          Looking into that based on current patch.

          (The 3.5 MB left in the block is 5.4% of the stripe size, just below the padding threshold.)
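
The numbers in the stripe line can be checked with a few lines of arithmetic. This is a sketch with hypothetical class and method names. It assumes the stripe occupies index + data + tail bytes starting at its offset, and that the block size is the 268435456 bytes quoted above.

```java
/** Worked check of the stripe reported above, using the figures from the comment. */
final class StripeBoundaryCheck {

    /** Total stripe length, assuming it is the sum of its index, data and tail sections. */
    static long stripeLength(long index, long data, long tail) {
        return index + data + tail;
    }

    /** True if the stripe's first and last byte fall in different blocks. */
    static boolean crossesBlock(long offset, long length, long blockSize) {
        long firstBlock = offset / blockSize;
        long lastBlock = (offset + length - 1) / blockSize;
        return firstBlock != lastBlock;
    }

    /** Bytes between {@code offset} and the end of the block that contains it. */
    static long bytesLeftInBlock(long offset, long blockSize) {
        return blockSize - (offset % blockSize);
    }

    public static void main(String[] args) {
        long blockSize = 268435456L;
        long offset = 264747438L;
        long length = stripeLength(23780L, 29405432L, 302L);

        System.out.println("stripe length: " + length);
        System.out.println("stripe end: " + (offset + length));
        System.out.println("bytes left in block: " + bytesLeftInBlock(offset, blockSize));
        System.out.println("crosses boundary: " + crossesBlock(offset, length, blockSize));
    }
}
```

The stripe starts 3688018 bytes (about 3.5 MB) before the boundary and is 29429514 bytes long, so it ends at 294176952, past the boundary at 268435456.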

          githubbot ASF GitHub Bot added a comment -

          Github user t3rmin4t0r closed the pull request at:

          https://github.com/apache/orc/pull/138

          githubbot ASF GitHub Bot added a comment -

          GitHub user t3rmin4t0r opened a pull request:

          https://github.com/apache/orc/pull/167

          ORC-91: Use hdfs v-blocks instead of zero-padding stripes

          Rebase over #166 and fix the adjustedStripeSize case

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/t3rmin4t0r/orc orc91.2

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/orc/pull/167.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #167


          commit e4bf536c4cf46b4c147ed06723504f0ea0db357c
          Author: Gopal V <gopalv@apache.org>
          Date: 2017-07-25T04:15:57Z

          ORC-91: Use hdfs v-blocks instead of zero-padding stripes


          gopalv Gopal V added a comment -

          Owen O'Malley: I fixed the errant block issue and am now seeing proper block boundaries with HDFS.

          This is a flat 5% improvement for large files.

          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/orc/pull/166

          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/orc/pull/167

          owen.omalley Owen O'Malley added a comment -

          I just committed this. Thanks, Gopal!


            People

            • Assignee:
              gopalv Gopal V
              Reporter:
              gopalv Gopal V