DERBY-4119

Compress on a large table fails with IllegalArgumentException - Illegal Capacity

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 10.5.1.1, 10.6.1.0
    • Component/s: Store
    • Labels: None

      Description

      When compressing a large table, Derby failed with the following exception:
      IllegalArgumentException; Illegal Capacity: -X

      I was able to access the database afterwards, but haven't yet checked if all the data is still available.
      The compress was started with CALL SYSCS_UTIL.SYSCS_COMPRESS_TABLE('schema', 'table', 1) from ij.

      The data in the table was inserted with 25 concurrent threads. This seems to cause excessive table growth, as the data inserted should weigh in at around 2 GB. The table size after the insert is ten times bigger, 20 GB.
      I have been able to generate the table and do a compress earlier, but then I have been using fewer insert threads.
      I have also been able to successfully compress the table when retrying after the failure occurred (shut down the database, then booted again and compressed).

      I'm trying to reproduce, and will post more information (like the stack trace) later.
      So far my attempts at reproducing have failed. Normally the data is generated and the compress is started without shutting down the database. My attempts thus far have consisted of doing compress on the existing database (where the failure was first seen).

      1. overflow4.diff
        2 kB
        Knut Anders Hatlen
      2. overflow3.diff
        2 kB
        Knut Anders Hatlen
      3. overflow2.diff
        2 kB
        Knut Anders Hatlen
      4. overflow.diff
        0.6 kB
        Knut Anders Hatlen

        Activity

        Kristian Waagan added a comment -

        I reran the application/workload where I saw the reported issue. I haven't observed the problem after the patch went in, so I'm closing the issue.

        Knut Anders Hatlen added a comment -

        Committed to 10.5 with revision 760809.

        Kristian Waagan added a comment -

        Thanks for following up, Knut Anders

        I have started a test-run for trunk, and will do at least a few runs.
        When RC2 is out, I'll do some testing of that one too.

        Knut Anders Hatlen added a comment -

        Committed overflow4.diff to trunk with revision 760422.
        I also plan to backport it to 10.5.

        Knut Anders Hatlen added a comment -

        Thanks again for looking at the patch.

        It's true that catching the OOME means that we end up with an array that's a lot smaller than Integer.MAX_VALUE (it doesn't have to be a lot smaller, but with the current value of GROWTH_MULTIPLIER (2) it will be about half the size). But even if it is smaller than ideal, it will still be better than the current situation, where it'll crash. And it will also fix problems that may arise because the amount of available memory has become smaller than it was when we checked it in MergeInserter.insert() and decided to increase the maximum size (which was the problem the original check for newArray==null was supposed to fix).

        If we had a way to find the exact maximum array size supported by the JVM, I agree that we should use that limit instead of MAX_VALUE, but it doesn't sound that appealing to guess a limit that may be suboptimal for JVMs without the limitation, and possibly too large for JVMs we haven't tested. If it turns out to be a problem, we could improve this further at a later point by successively trying to allocate a slightly smaller array on OOME until we are able to allocate it. Hopefully most JVMs will have removed this limitation before it becomes common for Derby to run into it.

        As far as I can see, your assumption is correct that failing to increase the size of the array will lead to more stages in the sort, and not cause the sort to fail.

        Kristian Waagan added a comment -

        My only comment on catching OOME instead of guessing the largest supported array size is that, for JVMs that don't support arrays of size Integer.MAX_VALUE, you may end up with a much smaller array, right?
        Again, I haven't checked the code, but I assume this will lead to more stages in the sort, and not to failing to do the sort itself?
        If so, the approach sounds good to me.

        Another nit is to add a comment where you catch the OOME, stating it was done to resolve the issue that some (older?) JVMs don't support arrays of size Integer.MAX_VALUE (or just point to DERBY-4119).
        Code catching OOME tends to raise suspicion.

        Knut Anders Hatlen added a comment -

        Attaching a new patch which catches OutOfMemoryError and returns null if the allocation fails, instead of checking that the return value from the allocation was null.

        Knut Anders Hatlen added a comment -

        Thanks for looking at the patch.

        I think you're right that the float->int cast in grow() in the first patch did the right thing, so the extra changes to that method in the second patch should not change the result in any way.

        Good point that some JVMs don't allow vectors/arrays whose size is close to Integer.MAX_VALUE. Instead of guessing the largest supported array size, perhaps we could catch OutOfMemoryError and stop growing if that happens. The old code did this:

        // Attempt to allocate a new array. If the allocation
        // fails, tell the caller that there are no more
        // nodes available.
        Node[] newArray = new Node[array.length * GROWTH_MULTIPLIER];
        if (newArray == null)
            return null;

        The checking for newArray==null didn't make any sense, since the array allocation is not allowed to return null. My guess is that catching an OutOfMemoryError here, and returning null, would do what was originally intended. And it would also solve the problem with JVMs that don't support that large arrays, as it seems like an OutOfMemoryError is what's being thrown in this situation.
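
        The idea above can be sketched as follows. This is a hypothetical illustration, not the actual Derby patch: the class and method names and the stand-in Node type are invented, and the real NodeAllocator has more state around this allocation.

```java
public class NodeAllocatorSketch {
    static class Node {}  // stand-in for Derby's sort node type

    // Try to grow the node array. Returns the new array, or null if the
    // allocation fails, so the caller can stop growing gracefully.
    static Node[] growArray(Node[] array, int growthMultiplier) {
        try {
            // An array allocation either succeeds or throws; it can never
            // return null, which is why the old newArray==null check was
            // dead code.
            return new Node[array.length * growthMultiplier];
        } catch (OutOfMemoryError oome) {
            // Covers both a genuine heap shortage and JVMs that refuse
            // arrays close to Integer.MAX_VALUE (see DERBY-4119). The
            // caller falls back to more merge stages instead of crashing.
            return null;
        }
    }
}
```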

        Kristian Waagan added a comment -

        Regarding your first fix, I thought Derby ended up casting a float to int, which would result in Integer.MAX_VALUE if the float was bigger than that.
        In any case, the way you solved it in the second patch is more readable.

        When it comes to the second patch, I haven't had the time to review it, but I do have one comment.
        Does the JVM really handle a Vector of size Integer.MAX_VALUE? At least it used to be Integer.MAX_VALUE - X.
        I tested quickly with a byte array, and for various JVMs I found X to be 0, 18 and 39. Would it make sense to subtract a small value from Integer.MAX_VALUE?

        This is an edge case, as you need quite a few gigs of heap to support an Object array with a capacity close to Integer.MAX_VALUE, but many machines these days do have enough memory for this. It is not clear to me what it takes for Derby to actually grow the vector to such sizes though.
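
        The quick byte-array test described above can be sketched like this. The class and method names are invented for illustration; with a multi-gigabyte heap it reports the JVM's practical array-length cap (the X of 0, 18 or 39 mentioned above), while under a small default heap every attempt simply fails with OOME.

```java
public class MaxArrayProbe {
    // Returns the largest byte[] length <= limit that can be allocated
    // within maxTries attempts, or -1 if none succeeded.
    static int largestAllocatable(int limit, int maxTries) {
        for (int x = 0; x < maxTries; x++) {
            try {
                byte[] probe = new byte[limit - x];
                return probe.length;
            } catch (OutOfMemoryError oome) {
                // Either a real heap shortage or a JVM array-length cap;
                // retry with a slightly smaller size.
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Result depends on the JVM and the configured heap size.
        System.out.println(largestAllocatable(Integer.MAX_VALUE, 64));
    }
}
```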

        Knut Anders Hatlen added a comment -

        All the regression tests passed with the overflow2 patch, and it's ready for review. Thanks.

        Knut Anders Hatlen added a comment -

        Thanks for testing. I'll run the regression tests and check in a fix if they pass.

        The suggested fix only reduces the chance of an overflow caused by an overflow in the intermediate result. It does not prevent an overflow if the final result is too large. This new patch (overflow2.diff) also addresses that problem by calculating the new max size as a long, and setting it to Integer.MAX_VALUE if the value doesn't fit in an int. A similar fix is applied to the newNode() method, because it may currently allocate an array larger than maxSize and therefore can overflow even if we have maxSize under control.
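
        The long-then-clamp calculation described above could look like the following. This is a sketch of the approach, not the actual overflow2.diff; the class and method names are invented.

```java
public class MaxSizeClamp {
    // Grow maxSize by the given percentage, computing in long so the
    // intermediate product cannot overflow, and clamping the final result
    // to Integer.MAX_VALUE if it no longer fits in an int.
    static int growMaxSize(int maxSize, int percent) {
        long newSize = (long) maxSize * (100 + percent) / 100;
        return (int) Math.min(newSize, Integer.MAX_VALUE);
    }
}
```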

        Kristian Waagan added a comment -

        I managed to reproduce the error and get the stack trace, but this time it happened during the creation of an index. Larry reported seeing the error during index creation as well.
        It happens in the same code area as suggested by Knut Anders. I also observed the error once during compress in a different run, but a reboot of the database had overwritten the log before I could look at it.

        I believe Knut's suggested fix will allow the vector to grow until it reaches Integer.MAX_VALUE, given there is enough heap memory to do so.
        I have only tested the calculation itself, as each of my test runs with the repro takes many hours.
        If we get this into the next release candidate, I can do some more test runs with it.

        2009-03-30 01:23:21.340 GMT Thread[DRDAConnThread_8,5,main] (XID = 12059858), (SESSIONID = 421), (DATABASE = db), (DRDAID = NF000001.F2EB-4196790664386933061{211}), Failed Statement is: CREATE INDEX my_index ON my_table(my_col)
        java.lang.IllegalArgumentException: Illegal Capacity: -18547753
        at java.util.Vector.<init>(Vector.java:109)
        at java.util.Vector.<init>(Vector.java:124)
        at org.apache.derby.impl.store.access.sort.MergeSort.multiStageMerge(Unknown Source)
        at org.apache.derby.impl.store.access.sort.MergeSort.openSortRowSource(Unknown Source)
        at org.apache.derby.impl.store.access.RAMTransaction.openSortRowSource(Unknown Source)
        at org.apache.derby.impl.sql.execute.CreateIndexConstantAction.loadSorter(Unknown Source)
        at org.apache.derby.impl.sql.execute.CreateIndexConstantAction.executeConstantAction(Unknown Source)
        at org.apache.derby.impl.sql.execute.MiscResultSet.open(Unknown Source)
        at org.apache.derby.impl.sql.GenericPreparedStatement.executeStmt(Unknown Source)
        at org.apache.derby.impl.sql.GenericPreparedStatement.execute(Unknown Source)
        at org.apache.derby.impl.jdbc.EmbedStatement.executeStatement(Unknown Source)
        at org.apache.derby.impl.jdbc.EmbedStatement.execute(Unknown Source)
        at org.apache.derby.impl.jdbc.EmbedStatement.executeUpdate(Unknown Source)
        at org.apache.derby.impl.drda.DRDAConnThread.parseEXCSQLIMM(Unknown Source)
        at org.apache.derby.impl.drda.DRDAConnThread.processCommands(Unknown Source)
        at org.apache.derby.impl.drda.DRDAConnThread.run(Unknown Source)
        Cleanup action completed

        Larry Hartsook added a comment -

        Hi Kristian,

        We have a lot of trouble reproducing it, too. In fact, we've only seen it at customer sites and haven't been able to construct a simple test case. What I've noticed is that, when the problem occurs, we are usually able to shut down the database, restart it and run the alter table statement successfully. We run Derby in embedded mode and often have multiple different databases open simultaneously.

        Here are the stats you requested:
        page cache size 6250
        page size is 16384
        JVM max heap is anywhere from 8 GB to 20 GB.

        The most recent times I've seen the error have been at a customer site. They were running Red Hat on a server with, I think, 64 GB RAM and 4 dual-core CPUs. The JVM was configured with a max heap of 20 GB.

        Kristian Waagan added a comment -

        Thanks for the feedback, Larry!

        I expect we'll investigate further and have a fix ready for the (expected) next 10.5 release candidate.
        How easily can you reproduce the bug?
        Can you post the values for the page cache size, and the minimum and maximum heap set for Java?

        I'm having problems reproducing the bug with the application used when I saw the bug, and due to the size of the data it takes a while to run it through.

        Larry Hartsook added a comment -

        We've seen this problem in 10.4.2.0 when creating indexes and/or reestablishing foreign key constraints on large tables (in excess of some tens of millions of rows). A modification similar to the one in the diff seems to have solved the problem we were seeing.

        Knut Anders Hatlen added a comment -

        If you manage to reproduce it, you could see if this patch makes any difference.

        Knut Anders Hatlen added a comment -

        One likely suspect is this vector allocation in impl.store.access.sort.MergeSort:

        leftovers = new Vector(mergeRuns.size() - maxMergeRuns);

        Now, this is protected by "while (mergeRuns.size() > maxMergeRuns)", so one should assume that the capacity argument is always positive. However, an integer overflow may make it negative if maxMergeRuns is negative. maxMergeRuns should of course never be negative, but I think there is a possibility that it can end up negative. Its value is taken from SortBuffer.capacity() which returns NodeAllocator.maxSize - 1. NodeAllocator has this method which changes the value of maxSize:

        /**
         * Expand the node allocator's capacity by a certain percent.
         */
        public void grow(int percent)
        {
            if (percent > 0)    // cannot shrink
                maxSize = maxSize * (100+percent)/100;
        }

        This method is always called with percent=100, and it will suffer from an integer overflow when the original maxSize is greater than about 10.7 million (Integer.MAX_VALUE/200), because the intermediate product (maxSize * 200) will be greater than Integer.MAX_VALUE. It should be easy to change this calculation so that intermediate results are less likely to overflow.
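
        The overflow can be demonstrated in isolation. This sketch reproduces the arithmetic from grow() above in a standalone method (the class and method names are invented for illustration):

```java
public class GrowOverflowDemo {
    // Same calculation as the grow() method quoted above: the intermediate
    // product maxSize * (100 + percent) is computed in int and wraps around
    // once it exceeds Integer.MAX_VALUE, producing a negative "capacity".
    static int growOld(int maxSize, int percent) {
        return maxSize * (100 + percent) / 100;
    }

    public static void main(String[] args) {
        // Below the ~10.7 million threshold the result is correct;
        // above it, the product wraps and the result goes negative.
        System.out.println(growOld(10_000_000, 100));
        System.out.println(growOld(16_000_000, 100));
    }
}
```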


  People

  • Assignee: Knut Anders Hatlen
  • Reporter: Kristian Waagan
  • Votes: 1
  • Watchers: 1