Uploaded image for project: 'Apache Arrow'
  1. Apache Arrow
  2. ARROW-1533

[JAVA] realloc should consider the existing buffer capacity for computing target memory requirement

    Details

      Description

      We recently encountered a problem when we were trying to add JSON files with complex schema as datasets.

      Initially we started with a Float8Vector with default memory allocation of (4096 * 8) 32KB.
      Went through several iterations of setSafe() to trigger a realloc() from 32KB to 64KB.
      Another round of setSafe() calls to trigger a realloc() from 64KB to 128KB

      After that we encountered a BigInt and promoted our vector to UnionVector.

      This required us to create a UnionVector with BigIntVector and Float8Vector. The latter required us to transfer the Float8Vector we were earlier working with to the Float8Vector inside the Union.

      As part of transferTo(), the target Float8Vector got all the ArrowBuf state (capacity, buffer contents) etc transferred from the source vector.

      Later, a realloc was triggered on the Float8Vector inside the UnionVector.

      The computation inside realloc() to determine the amount of memory to be reallocated goes wrong since it makes the decision based on allocateSizeInBytes – although this vector was created as part of transfer() from 128KB source vector, allocateSizeInBytes is still at the initial/default value of 32KB

      We end up allocating a 64KB buffer and attempt to copy 128KB over 64KB and seg fault when invoking setBytes().

      There is a wrong assumption in realloc() that allocateSizeInBytes is always equal to data.capacity(). The particular scenario described above exposes where this assumption could go wrong.

        Issue Links

          Activity

          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user siddharthteotia commented on the issue:

          https://github.com/apache/arrow/pull/1097

          @jacques-n , added tests for fixed-width vector, variable width vector and bit vector. To further test correctness, I verified that without the code changes, unit tests fail with expected segmentation fault in reAlloc().

          At this point, I don't think we need tests for other vector types for the problem this patch is fixing.

          However, I have two test related patches in pipeline

          (1) tests for mapvector – we don't have any unit tests for MapVector
          (2) splitAndTransfer for all vector types – we recently added for bit vector, union vector, list vector. Need to cover other types too.

          I will ensure that those patches cover tests for this scenario as well.

          Show
          githubbot ASF GitHub Bot added a comment - Github user siddharthteotia commented on the issue: https://github.com/apache/arrow/pull/1097 @jacques-n , added tests for fixed-width vector, variable width vector and bit vector. To further test correctness, I verified that without the code changes, unit tests fail with expected segmentation fault in reAlloc(). At this point, I don't think we need tests for other vector types for the problem this patch is fixing. However, I have two test related patches in pipeline (1) tests for mapvector – we don't have any unit tests for MapVector (2) splitAndTransfer for all vector types – we recently added for bit vector, union vector, list vector. Need to cover other types too. I will ensure that those patches cover tests for this scenario as well.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user siddharthteotia commented on the issue:

          https://github.com/apache/arrow/pull/1097

          The unit tests added here cover both nullable and non-nullable vectors.

          Show
          githubbot ASF GitHub Bot added a comment - Github user siddharthteotia commented on the issue: https://github.com/apache/arrow/pull/1097 The unit tests added here cover both nullable and non-nullable vectors.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user siddharthteotia closed the pull request at:

          https://github.com/apache/arrow/pull/1097

          Show
          githubbot ASF GitHub Bot added a comment - Github user siddharthteotia closed the pull request at: https://github.com/apache/arrow/pull/1097
          Hide
          githubbot ASF GitHub Bot added a comment -

          GitHub user siddharthteotia opened a pull request:

          https://github.com/apache/arrow/pull/1112

          ARROW-1533: [JAVA] realloc should consider the existing buffer capacity for computing target memory requirement

          cc @jacques-n , This is same as https://github.com/apache/arrow/pull/1097

          The latter one was closed as I had to rename the branch correctly and use the correct JIRA number.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/siddharthteotia/arrow ARROW-1533

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/arrow/pull/1112.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #1112


          commit 66620f12f8b0cba359e13ab2e1f6cad21ef4daac
          Author: siddharth <siddharth@dremio.com>
          Date: 2017-09-12T21:58:07Z

          ARROW-1553: realloc should consider the existing buffer capacity for computing target memory requirement

          commit 75243251d4580febfc094f422fb2780191439a07
          Author: siddharth <siddharth@dremio.com>
          Date: 2017-09-17T21:35:38Z

          added tests


          Show
          githubbot ASF GitHub Bot added a comment - GitHub user siddharthteotia opened a pull request: https://github.com/apache/arrow/pull/1112 ARROW-1533 : [JAVA] realloc should consider the existing buffer capacity for computing target memory requirement cc @jacques-n , This is same as https://github.com/apache/arrow/pull/1097 The latter one was closed as I had to rename the branch correctly and use the correct JIRA number. You can merge this pull request into a Git repository by running: $ git pull https://github.com/siddharthteotia/arrow ARROW-1533 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/arrow/pull/1112.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1112 commit 66620f12f8b0cba359e13ab2e1f6cad21ef4daac Author: siddharth <siddharth@dremio.com> Date: 2017-09-12T21:58:07Z ARROW-1553 : realloc should consider the existing buffer capacity for computing target memory requirement commit 75243251d4580febfc094f422fb2780191439a07 Author: siddharth <siddharth@dremio.com> Date: 2017-09-17T21:35:38Z added tests
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user wesm commented on the issue:

          https://github.com/apache/arrow/pull/1112

          @siddharthteotia it's not necessary to use a particular branch name, as long as the PR title is correct

          Show
          githubbot ASF GitHub Bot added a comment - Github user wesm commented on the issue: https://github.com/apache/arrow/pull/1112 @siddharthteotia it's not necessary to use a particular branch name, as long as the PR title is correct
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user siddharthteotia commented on the issue:

          https://github.com/apache/arrow/pull/1112

          @wesm , For easy self-tracking, I typically have the same local branch name as the one in my fork that I used to create the PR.

          For this JIRA (ARROW-1533), I incorrectly used JIRA# 1553 almost everywhere – branch name, commit message etc.

          I didn't realize this until I had to actually work on ARROW-1553. So renamed the local branch and did a force update on my fork to get everything proper for ARROW-1533.

          Show
          githubbot ASF GitHub Bot added a comment - Github user siddharthteotia commented on the issue: https://github.com/apache/arrow/pull/1112 @wesm , For easy self-tracking, I typically have the same local branch name as the one in my fork that I used to create the PR. For this JIRA ( ARROW-1533 ), I incorrectly used JIRA# 1553 almost everywhere – branch name, commit message etc. I didn't realize this until I had to actually work on ARROW-1553 . So renamed the local branch and did a force update on my fork to get everything proper for ARROW-1533 .
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user jacques-n commented on the issue:

          https://github.com/apache/arrow/pull/1112

          Good additional questions that we should address in ARROW-1463. +1 on getting this merged.

          Show
          githubbot ASF GitHub Bot added a comment - Github user jacques-n commented on the issue: https://github.com/apache/arrow/pull/1112 Good additional questions that we should address in ARROW-1463 . +1 on getting this merged.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user icexelloss commented on the issue:

          https://github.com/apache/arrow/pull/1112

          LGTM too.

          Show
          githubbot ASF GitHub Bot added a comment - Github user icexelloss commented on the issue: https://github.com/apache/arrow/pull/1112 LGTM too.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user wesm commented on the issue:

          https://github.com/apache/arrow/pull/1112

          +1

          Show
          githubbot ASF GitHub Bot added a comment - Github user wesm commented on the issue: https://github.com/apache/arrow/pull/1112 +1
          Hide
          wesmckinn Wes McKinney added a comment -

          Issue resolved by pull request 1112
          https://github.com/apache/arrow/pull/1112

          Show
          wesmckinn Wes McKinney added a comment - Issue resolved by pull request 1112 https://github.com/apache/arrow/pull/1112
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/arrow/pull/1112

          Show
          githubbot ASF GitHub Bot added a comment - Github user asfgit closed the pull request at: https://github.com/apache/arrow/pull/1112

            People

            • Assignee:
              siddteotia Siddharth Teotia
              Reporter:
              siddteotia Siddharth Teotia
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development