Uploaded image for project: 'Accumulo'
  1. Accumulo
  2. ACCUMULO-4483

NegativeArraySizeException from scan thread right after minor compaction

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.8.0
    • Fix Version/s: 1.8.1, 2.0.0
    • Component/s: tserver
    • Labels:
    • Environment:

      Accumulo 1.8.0
      Java 1.8.0_72

      Description

      I have been getting NegativeArraySizeExceptions after upgrading to Accumulo 1.8.0. I have tracked it down to 2 or more concurrent scans on a tablet that has just undergone minor compaction. The minor compaction thread writes the in-memory map to a local temporary rfile and tries to switch the current iterators to use it instead of the native map. The iterator code in the scan thread may also switch itself to use the local temporary rfile it if notices it before the minor compaction threads performs the switch. This all works up to this point. Shortly after the switch one of the iterator threads will get a NegativeArraySizeException from:
      SourceSwitchingIterator.seek() -> SourceSwitchingIterator.readNext() -> MemKeyConversionIterator.seek() -> MemKeyConversionIterator.getTopKeyValue() ->MemValue.decode(). I added a bunch of logging to find that the length of the byte array passed to MemValue.decode is zero.

        Issue Links

          Activity

          Hide
          dlmarion Dave Marion added a comment -

          I should note that I did not see this in 1.7. This might also be related to ACCUMULO-4391.

          Show
          dlmarion Dave Marion added a comment - I should note that I did not see this in 1.7. This might also be related to ACCUMULO-4391 .
          Hide
          milleruntime Michael Miller added a comment -

          Looks like MemValue was rewritten for ACCUMULO-3913. MemKeyConversionIter, MemValue and a bunch of changes were here https://github.com/apache/accumulo/commit/45f18c174612d1a41eed1d2eec9e77d3b3e71a82#diff-595dd73ef5f2b4c5579b534704f6ca3b

          Show
          milleruntime Michael Miller added a comment - Looks like MemValue was rewritten for ACCUMULO-3913 . MemKeyConversionIter, MemValue and a bunch of changes were here https://github.com/apache/accumulo/commit/45f18c174612d1a41eed1d2eec9e77d3b3e71a82#diff-595dd73ef5f2b4c5579b534704f6ca3b
          Hide
          elserj Josh Elser added a comment -

          Good find, Michael Miller! I assume you're not using sampling, Dave Marion?

          Show
          elserj Josh Elser added a comment - Good find, Michael Miller ! I assume you're not using sampling, Dave Marion ?
          Hide
          dlmarion Dave Marion added a comment -

          cc: Ivan Bella

          Further analysis reveals that the underlying iterator is returning the same K,V twice and that MemValue.decode is modifying the source Value (which it should not). So, MemValue.decode() succeeds the first time and fails the second.

          Show
          dlmarion Dave Marion added a comment - cc: Ivan Bella Further analysis reveals that the underlying iterator is returning the same K,V twice and that MemValue.decode is modifying the source Value (which it should not). So, MemValue.decode() succeeds the first time and fails the second.
          Hide
          dlmarion Dave Marion added a comment -

          Not using sampling. Didn't know it existed until now.

          Show
          dlmarion Dave Marion added a comment - Not using sampling. Didn't know it existed until now.
          Hide
          ivan.bella Ivan Bella added a comment - - edited

          The underlying iterator in this case is the RFile.LocalityGroupReader. The seek is called several times in sequence. The second time it manages to pass through and leave the RelativeKey rf and the Value val with the same value. This is OK, however the MemValue.decode method should not be calling Value.set on the original value. It should be fairly easy to reproduce this in a test case.

          Show
          ivan.bella Ivan Bella added a comment - - edited The underlying iterator in this case is the RFile.LocalityGroupReader. The seek is called several times in sequence. The second time it manages to pass through and leave the RelativeKey rf and the Value val with the same value. This is OK, however the MemValue.decode method should not be calling Value.set on the original value. It should be fairly easy to reproduce this in a test case.
          Hide
          elserj Josh Elser added a comment -

          Not using sampling. Didn't know it existed until now.

          A shiny, new feature in 1.8. It's actually pretty slick

          Further analysis reveals that the underlying iterator is returning the same K,V twice and that MemValue.decode is modifying the source Value (which it should not). So, MemValue.decode() succeeds the first time and fails the second.

          The underlying iterator in this case is the RFile.LocalityGroupReader. The seek is called several times in sequence. The second time it manages to pass through and leave the RelativeKey rf and the Value val with the same value. This is OK,

          Loving the teamwork here . Curious that we didn't uncover this in any testing for 1.8.0. This would be good to understand why (apparently?) our testing was insufficient to catch this.

          however the MemValue.decode method should not be calling Value.set on the original value. It should be fairly easy to reproduce this in a test case.

          The Value.set() doesn't appear to have been modified in Keith's commit though. So, is it the interactions of LocalityGroupReader with what MemValue.decode(..) is doing that is causing the problem?

          Feel free to ping for a review when you get a fix put together.

          Show
          elserj Josh Elser added a comment - Not using sampling. Didn't know it existed until now. A shiny, new feature in 1.8. It's actually pretty slick Further analysis reveals that the underlying iterator is returning the same K,V twice and that MemValue.decode is modifying the source Value (which it should not). So, MemValue.decode() succeeds the first time and fails the second. The underlying iterator in this case is the RFile.LocalityGroupReader. The seek is called several times in sequence. The second time it manages to pass through and leave the RelativeKey rf and the Value val with the same value. This is OK, Loving the teamwork here . Curious that we didn't uncover this in any testing for 1.8.0. This would be good to understand why (apparently?) our testing was insufficient to catch this. however the MemValue.decode method should not be calling Value.set on the original value. It should be fairly easy to reproduce this in a test case. The Value.set() doesn't appear to have been modified in Keith's commit though. So, is it the interactions of LocalityGroupReader with what MemValue.decode(..) is doing that is causing the problem? Feel free to ping for a review when you get a fix put together.
          Hide
          dlmarion Dave Marion added a comment -

          The Value.set() doesn't appear to have been modified in Keith's commit though.

          The old code called Value.set() on the input object and didn't return anything. The new code called Value.set() on the input object and returned a new MemValue. The calling code had been changed to use the return value, and I'm assuming reuse the input object.

          Curious that we didn't uncover this in any testing for 1.8.0. This would be good to understand why (apparently?) our testing was insufficient to catch this.

          Not sure. The issue arose on multiple clusters as soon as I upgraded to 1.8.0. We run continuous ingest for our tests, but are we continuously scanning during the ingest?

          Show
          dlmarion Dave Marion added a comment - The Value.set() doesn't appear to have been modified in Keith's commit though. The old code called Value.set() on the input object and didn't return anything. The new code called Value.set() on the input object and returned a new MemValue. The calling code had been changed to use the return value, and I'm assuming reuse the input object. Curious that we didn't uncover this in any testing for 1.8.0. This would be good to understand why (apparently?) our testing was insufficient to catch this. Not sure. The issue arose on multiple clusters as soon as I upgraded to 1.8.0. We run continuous ingest for our tests, but are we continuously scanning during the ingest?
          Hide
          dlmarion Dave Marion added a comment -

          FWIW, I committed a fix for this and merged it up to master. I tested the fix and my issue disappeared. In talking with Keith Turner he suggested removing the call to Value.set() in MemValue.decode().

          Show
          dlmarion Dave Marion added a comment - FWIW, I committed a fix for this and merged it up to master. I tested the fix and my issue disappeared. In talking with Keith Turner he suggested removing the call to Value.set() in MemValue.decode().
          Hide
          kturner Keith Turner added a comment -

          The Value.set() doesn't appear to have been modified in Keith's commit though.

          Looking at the diff Mike Miller posted earlier, it seems that I re-factored the code to move where a value was cloned. Before my change it was ok to call Value.set() because it happened after the clone. After my change Value.set() is called before the clone. I don't remember why I re-factored that code, but I made a big goof in the process.

          Show
          kturner Keith Turner added a comment - The Value.set() doesn't appear to have been modified in Keith's commit though. Looking at the diff Mike Miller posted earlier, it seems that I re-factored the code to move where a value was cloned. Before my change it was ok to call Value.set() because it happened after the clone. After my change Value.set() is called before the clone. I don't remember why I re-factored that code, but I made a big goof in the process.
          Hide
          elserj Josh Elser added a comment -

          Not sure. The issue arose on multiple clusters as soon as I upgraded to 1.8.0. We run continuous ingest for our tests, but are we continuously scanning during the ingest?

          If the batchwalkers, walkers, and/or scanners are started, yes.

          Show
          elserj Josh Elser added a comment - Not sure. The issue arose on multiple clusters as soon as I upgraded to 1.8.0. We run continuous ingest for our tests, but are we continuously scanning during the ingest? If the batchwalkers, walkers, and/or scanners are started, yes.
          Hide
          elserj Josh Elser added a comment -

          FWIW, I committed a fix for this and merged it up to master. I tested the fix and my issue disappeared. In talking with Keith Turner he suggested removing the call to Value.set() in MemValue.decode().

          What about a test case? Can something be made that is reasonable (not contrived)?

          Show
          elserj Josh Elser added a comment - FWIW, I committed a fix for this and merged it up to master. I tested the fix and my issue disappeared. In talking with Keith Turner he suggested removing the call to Value.set() in MemValue.decode(). What about a test case? Can something be made that is reasonable (not contrived)?
          Hide
          ivan.bella Ivan Bella added a comment -

          I hypothesize that a simple test case where you seek an iterator that uses the RFile.LocalityGroupReader underneath, use the top key, then seek to the same range and then get that key might actually trigger the error.

          Show
          ivan.bella Ivan Bella added a comment - I hypothesize that a simple test case where you seek an iterator that uses the RFile.LocalityGroupReader underneath, use the top key, then seek to the same range and then get that key might actually trigger the error.
          Hide
          elserj Josh Elser added a comment -

          Ping Dave Marion/Ivan Bella re: my test case question (read as "can one of you please write a test for this" – not the literal interpretation..)

          Show
          elserj Josh Elser added a comment - Ping Dave Marion / Ivan Bella re: my test case question (read as "can one of you please write a test for this" – not the literal interpretation..)
          Hide
          dlmarion Dave Marion added a comment -

          Test case added. Ivan Bella might add another.

          Show
          dlmarion Dave Marion added a comment - Test case added. Ivan Bella might add another.
          Hide
          elserj Josh Elser added a comment -

          Test case added. Ivan Bella might add another.

          Many thanks to both of you!

          Show
          elserj Josh Elser added a comment - Test case added. Ivan Bella might add another. Many thanks to both of you!
          Hide
          mjwall Michael Wall added a comment -

          Curious that we didn't uncover this in any testing for 1.8.0.

          Did we not find this bug in testing because we don't have custom iterators on our tables during testing?

          Show
          mjwall Michael Wall added a comment - Curious that we didn't uncover this in any testing for 1.8.0. Did we not find this bug in testing because we don't have custom iterators on our tables during testing?

            People

            • Assignee:
              dlmarion Dave Marion
              Reporter:
              dlmarion Dave Marion
            • Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Time Tracking

                Estimated:
                Original Estimate - Not Specified
                Not Specified
                Remaining:
                Remaining Estimate - 0h
                0h
                Logged:
                Time Spent - 1.5h
                1.5h

                  Development