Derby
DERBY-3788

Provide a zero-admin way of updating the statistics of an index

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 10.5.1.1
    • Fix Version/s: 10.8.1.2
    • Component/s: Store
    • Labels:
      None
    • Bug behavior facts:
      Performance

      Description

      DERBY-269 provided a manual way of updating the statistics using the new system stored procedure SYSCS_UTIL.SYSCS_UPDATE_STATISTICS. It would be good for Derby to provide an automatic way of updating the statistics without requiring the user to run the stored procedure manually. There was some discussion on DERBY-269 about providing a zero-admin way; I have copied it here for reference.

      *********************
      Kathey Marsden - 22/May/05 03:53 PM
      Some sort of zero-admin solution for updating statistics would be preferable to the manual 'update statistics'
      *********************

      *********************
      Mike Matrigali - 11/Jun/08 12:37 PM
      I have not seen any other suggestions; how about the following zero-admin solution? It is not perfect - suggestions welcome.

      Along with the stored statistics, save how many rows were in the table when exact statistics were last calculated. This number is 0 if none have been calculated because index creation happened on an empty table. At query compile time, when we look up statistics, we automatically recalculate them at certain thresholds - say, when the row count grows past the next threshold: 10, 100, 1000, 100000 - with the upper limit being somewhere around how many rows we can process in a small amount of time, like 1 second on a modern laptop. If we are worried about response time, maybe we queue the stat gathering in the background rather than waiting, with perhaps a quick load if no stats have ever been gathered. The background gathering could be optimized not to interfere with locks by using read uncommitted.

      I think it would be useful to also have the manual call, just to make it easy to support customers and debug issues in the field. There is probably always some dynamic data-distribution change that in some cases won't be picked up by the automatic algorithm. It is also very useful for those who have complete control of the create-DDL, load-data, run-stats, deliver-application process.
      *********************
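
      As a rough illustration of the threshold idea sketched above, the check could look something like the following Java fragment. The class and method names here are hypothetical and are not part of Derby's code.

      // Hypothetical sketch of the growth-threshold check described above;
      // none of these names come from Derby.
      public final class IndexStatisticsSketch {
          // Row-count thresholds at which the stored statistics are considered stale.
          private static final long[] THRESHOLDS = {10L, 100L, 1000L, 100000L};

          private final long rowCountAtLastGather; // 0 if statistics were never gathered

          public IndexStatisticsSketch(long rowCountAtLastGather) {
              this.rowCountAtLastGather = rowCountAtLastGather;
          }

          /** True if the table has grown past the next threshold since the last gather. */
          public boolean needsRegather(long currentRowCount) {
              if (rowCountAtLastGather == 0 && currentRowCount > 0) {
                  return true; // never gathered: index was created on an empty table
              }
              for (long threshold : THRESHOLDS) {
                  if (rowCountAtLastGather < threshold && currentRowCount >= threshold) {
                      return true;
                  }
              }
              return false;
          }
      }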

      1. DERBY_3788_Mgr.java
        0.8 kB
        Mamta A. Satoor
      2. DERBY_3788_Repro.java
        0.3 kB
        Mamta A. Satoor
      3. DERBY3788_patch1_diff.txt
        14 kB
        Mamta A. Satoor
      4. DERBY3788_patch1_stat.txt
        1.0 kB
        Mamta A. Satoor
      5. DERBY3788_patch2_diff.txt
        19 kB
        Mamta A. Satoor
      6. DERBY3788_patch2_stat.txt
        1 kB
        Mamta A. Satoor
      7. DERBY3788_patch3_diff.txt
        17 kB
        Mamta A. Satoor
      8. DERBY3788_patch3_stat.txt
        1 kB
        Mamta A. Satoor
      9. DERBY3788_update_all_missing_stats_after_bind_patch4_stat.txt
        0.5 kB
        Mamta A. Satoor
      10. DERBY3788_update_all_missing_stats_after_bind_patch4_diff.txt
        20 kB
        Mamta A. Satoor

        Issue Links

          Activity

          Knut Anders Hatlen added a comment -

          [bulk update] Close all resolved issues that haven't been updated for more than one year.

          Kristian Waagan added a comment -

          Resolving as duplicate (see comment above for the relevant JIRAs).

          Kristian Waagan added a comment -

          There hasn't been work on this issue for a while. I'm adding a link to DERBY-4934, which supersedes this issue.
          Investigation and exploration were done under DERBY-4771, whereas DERBY-4934 tracks the actual addition of the feature (called istat during development for brevity).

          Kristian Waagan added a comment -

          Clearing patch available flag, as it is my impression that there are no patches that are ready for commit.

          Kristian Waagan added a comment -

          Thanks for the writeup, Mamta

          I logged DERBY-4771 to track some work I did on the same problem.

          Mamta A. Satoor added a comment -

          Just wanted to summarize what has been tried for this jira so far, along with a proposal for a possible solution worth trying.
          1) First I tried to see if the missing stats could be created inline when the SELECT query compilation finds them missing, but that failed because SELECT query compilation marks the data dictionary read-only, whereas update statistics needs to update the data dictionary (this marking happens in GenericStatement.prepMinion). Because of this read-only marking, we can't create the statistics inline during the compilation phase of the SELECT query.
          2) The second approach was, during the SELECT compile phase, if we come across missing stats, to schedule a task to update the stats later, then finish the SELECT compilation and execution after the update-stats task is queued (the scheduling mechanism used is only available with JDK 1.5 and higher; see the executor sketch after this list). When the scheduled task was later fired, it ran into a "no current connection" exception, because there was no context set up for update statistics to run in. I tried resolving this by creating a connection context through InternalDriver.activeDriver().connect(url, info), having the SELECT compilation pass the needed information to the background task so I could create the connection. This is a pretty crude way of getting a connection to update the statistics. Besides the code to create the connection context not being very modular, some of the existing tests failed with these changes because of locking issues, probably because the locks held by the background update-stats thread interfered with locks required by the rest of the test.
          3) The third approach I tried was: when the SELECT query detects missing stats, abort the query compilation (thus taking the data dictionary out of read-only mode), create the stats, and then restart the original SELECT query compilation and execution. The hope was that this would avoid locking issues. If we decide to pursue this approach, we need to account for the possibility that stats are still missing when the original query is compiled the second time. This can happen, say, because the user does not have privileges to update the stats, or because the update ran into locking issues. If we don't account for the possibility of missing stats the second time through the original query, we can end up in an infinite loop. Another problem with this approach was that the update stats happened in the same transaction as the original SELECT query, which means the locks acquired by the update stats are not released until the transaction is committed. I came across locking issues with this when I ran the existing junit and derbyall tests. One way to fix this could be to start a nested user transaction, do the update stats in that transaction, and commit it (I think something like this is done for identity columns; if so, we might find pointers there for running update stats in a nested user transaction). That way any locks acquired by the update-stats work would be released, and after that we would go back to the original query for compilation and execution.
          4) With any approach we take, we should detect the case where the table is empty and hence the stats are missing. In that case we should not fire the stats-creation task. To rephrase: during the compile phase of the SELECT query, when we look for statistics, first check whether the table is empty, and if so skip the code for collecting stats. From what I recall, one can check whether the table is empty by opening a scan controller and asking it for the number of rows in the table. If we do not check for an empty table, we will end up in an infinite loop.
          5) For a read-only db we should detect that we are read-only and not try to update statistics.
          6) Also, we probably want to create statistics only for user tables? Haven't thought enough about it.
          7) What if the user executing the query is not the owner of the table; should statistics still be created? I think if the user does not have enough privileges, we should skip creating the stats, since the attempt is going to fail anyway.
          8) Once statistics are created, notify the TableDescriptor about it, because the TableDescriptor caches the available statistics and would otherwise never know of the new statistics.
          9) Once we have update stats working, another improvement could be to identify the cached compiled queries that may benefit from the updated stats and have those queries recompiled the next time they are executed. In other words, the relevant queries should be invalidated as part of the statistics-collection task.
          10) Look at other scenarios (in addition to step 7 above) where we know that update stats might fail. In those cases, don't spend time trying to create the stats; just let the optimizer work with the information it has to come up with the query plan.
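
          A minimal sketch of the background queueing idea from point 2, using the java.util.concurrent classes available from JDK 1.5. The class name and the shape of the queued task are hypothetical, not the actual patch code.

          import java.util.concurrent.LinkedBlockingQueue;
          import java.util.concurrent.ThreadPoolExecutor;
          import java.util.concurrent.TimeUnit;

          // Illustrative only: a small bounded queue of update-statistics tasks.
          // If the queue is full, the submitting (compiling) thread runs the task itself.
          public class StatsUpdateQueueSketch {
              private final ThreadPoolExecutor executor = new ThreadPoolExecutor(
                      1, 1, 0L, TimeUnit.MILLISECONDS,
                      new LinkedBlockingQueue<Runnable>(5),
                      new ThreadPoolExecutor.CallerRunsPolicy());

              /** Queue an update-statistics task; the caller continues compiling the SELECT. */
              public void schedule(Runnable updateStatsTask) {
                  executor.execute(updateStatsTask);
              }
          }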

          Bryan Pendleton added a comment -

          Yes, that's the technique I remembered. Thanks for tracking it down! That seems fine to me.

          Mamta A. Satoor added a comment -

          Bryan, thanks for taking the time to look at this jira entry. Yes, I was planning on using nowait so that we do not wait and possibly run into locking issues with the parent transaction. This is exactly what we do in InsertResultSet.getSetAutoincrementValue, as shown below:
          try
          {
              /* If tcToUse == tc, then we are using parent xaction --
               * this can happen if for some reason we couldn't start
               * a nested transaction */
              newValue = dd.getSetAutoincrementValue(
                      constants.autoincRowLocation[index],
                      tcToUse, true, aiCache[index], (tcToUse == tc));
          }
          catch (StandardException se)
          {
              if (tcToUse == tc)
              {
                  /* we're using the parent xaction and we've timed out;
                   * just throw an error and exit. */
                  throw se;
              }
              if (se.getMessageId().equals(SQLState.LOCK_TIMEOUT))
              {
                  // if we couldn't do this with a nested xaction, retry with
                  // parent -- we need to wait this time!
                  newValue = dd.getSetAutoincrementValue(
                          constants.autoincRowLocation[index],
                          tc, true, aiCache[index], true);
              }
          Bryan Pendleton added a comment -

          Seems like the nested transactions can have locking problems, too; if the nested transaction's
          lock requests block, you can get an un-detected deadlock, I believe. I think we recently
          had to change some code which took locks in a nested transaction to take those locks
          "nowait", and if they couldn't be immediately satisfied, to retry the locks in the user transaction.
          Unfortunately I don't recall exactly which issue had this multi-step lock retry behavior.

          Mamta A. Satoor added a comment -

          I have an intermediate patch ready for review. The goal of the patch is to detect, early in the compile phase of a SELECT query, whether any tables involved have missing statistics. If so, we abandon the SELECT query compilation, create the missing statistics, and then try to compile the original SELECT query again. This logic is driven from GenericStatement.prepMinion, which is responsible for the compile phase of a SQL statement. If the SQL involved is a SELECT statement, we may go through its compile process twice; if it is a non-SELECT statement, compilation finishes in the first pass. The details of the code flow for SELECT statement compilation are as follows.

          For a SELECT statement, we may have to go through the query compilation phase twice. Whether we go through one or two compilation cycles depends on whether any tables involved in the query have missing statistics. During the first compile attempt, we make a list of all the tables involved in the query during the bind phase. After the bind phase, during the first pass through the SELECT query compilation, we check whether the query involves any tables whose statistics are missing. If so, we quit the SELECT compile phase by throwing a missing-statistics exception. We handle the missing-statistics exception by trying to update the missing statistics, and after that we go through a second pass of the original query compilation. During this second pass, we do not worry about whether the statistics are missing; we just work with whatever statistics are available. (One example where the statistics may still be missing after trying to create them is a user who only has select privileges on a table: the missing stats are detected, but statistics creation fails because the user does not have sufficient privileges to create them. There can be other reasons for statistics creation failure too; another instance is that we can't get the necessary locks to update the stats. In such cases, we just quit updating the stats and move on to the second pass of the original query compilation.)

          One of the issues with this patch that I still need to work on, and on which I would appreciate any feedback: the update statistics work happens in the user transaction. That means any locks required to update statistics stay in place until the user transaction is committed or rolled back. This can be an issue for user applications (these are locks the user was not expecting to acquire as part of a SELECT query), and it definitely causes problems in our tests, where quite a handful of tests run into locking issues because of the additional locks acquired on behalf of a SELECT query. I think one way to fix this would be to run the statistics update in a nested transaction which can be committed once the statistics are created; if we run into locking issues in the nested transaction, we could then run the statistics update in the user transaction. This is what we do for GENERATED columns in InsertResultSet.getSetAutoincrementValue. One of the problems I need to address before I can use a nested transaction is to change the code that fires the statistics update during SELECT compilation. Currently, in my patch, I execute the stored procedure SYSCS_UTIL.SYSCS_UPDATE_STATISTICS, which internally executes ALTER TABLE .. to update the statistics. I am at too high a level to be able to use a nested transaction for updating the statistics (let me know if I am wrong here; maybe there is a way to use a nested transaction while firing SYSCS_UTIL.SYSCS_UPDATE_STATISTICS that I am unaware of). I think I need to reach the update statistics code in AlterTableConstantAction directly, without going through the stored procedure -> ALTER TABLE .. sql path. If I can call the update statistics code in AlterTableConstantAction directly, then I can try using a nested transaction, and fall back to the user transaction if that does not succeed.

          I will appreciate any input you might have on my overall approach to this jira entry and then how to address the locking issue.
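
          A rough Java sketch of the two-pass flow described above. None of these names come from Derby; compileOnce() and updateStatistics() merely stand in for the real compile and update-statistics entry points.

          // Hypothetical sketch: the first pass fails fast on missing statistics,
          // the stats are created on a best-effort basis, and the second pass never
          // throws for missing stats, so the retry cannot loop forever.
          final class TwoPassCompileSketch {
              static final class MissingStatsException extends Exception {}

              Object compile(String sql) throws Exception {
                  try {
                      return compileOnce(sql, true);   // pass 1: throw if stats are missing
                  } catch (MissingStatsException mse) {
                      try {
                          updateStatistics();          // may fail: privileges, lock timeout, ...
                      } catch (Exception ignored) {
                          // fall through and compile with whatever statistics exist
                      }
                      return compileOnce(sql, false);  // pass 2: ignore missing stats
                  }
              }

              Object compileOnce(String sql, boolean failOnMissingStats)
                      throws MissingStatsException {
                  return sql; // placeholder for the real compile phase
              }

              void updateStatistics() throws Exception {
                  // placeholder for the real update-statistics work
              }
          }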

          Mamta A. Satoor added a comment -

          Currently, I am pursuing creating the missing statistics inline in the compile phase of a SELECT query. The compile phase code is driven by GenericStatement.prepMinion.

          My first attempt was to run the update statistics inline in the optimize phase of a SELECT sql (because it is in the optimize phase of a query that we look for statistics to determine the query plan, and that is when we know the statistics are missing), but there I ran into the data dictionary being in read-only mode. In the GenericStatement.prepMinion method, at line 307, we mark the data dictionary (DD) as read-only for the duration of the bind and optimize phases, and hence statistics can't be updated because the DD has been marked read-only at that time.

          At line 422, after the optimize phase, we mark the DD as done reading, and hence the DD is not in read-only mode during the generate phase. Based on this, I tried running the statistics update inline in the generate phase of a SELECT query, to establish that the statistics can indeed be gathered inline. I ran into only one coding issue, DERBY-4048 (about AlterTableConstantAction not using the execute transaction for updating the statistics). After resolving DERBY-4048, I am able to run the update statistics inline in the generate phase of a SELECT query for a simple test case, as shown below:
          create table t1 (c11 int, c12 char);
          create index i1 on t1(c11);
          insert into t1 values (1,'1'),(2,'2'),(3,'3');
          -- no statistics will be found for t1.i1 because table was empty when index i1 was created
          select * from sys.sysstatistics;
          -- during the generate phase of following query, we will run update statistics inline for t1.i1
          select * from t1 where c11=1;
          -- this time, there will be statistics for t1.i1
          select * from sys.sysstatistics;

          Note that there were no locking or privilege issues involved in the simple test case above. But when update statistics runs into some problem, the question is how we revert the update statistics work and let the original SELECT query proceed with execution. When I ran the junit tests, right at the beginning I saw a lot of problems related to insufficient privileges, context stack issues, etc., caused by the update statistics work.

          I think the right approach would be to start with the compile phase of the SELECT query and, during optimization, when we find that the statistics are missing, quit the compilation of the SELECT query there, run the update statistics, and then refire the compilation of the SELECT query. That way, update statistics will not run into any locking issues with the SELECT query. I am going to spend some time working on this approach and see how it goes. I would appreciate any feedback.

          Mamta A. Satoor added a comment -

          I am attaching a new patch which incorporates the following comments from Knut.
          2) Are the calls to EmbedConnection30.setupContextStack() and restoreContextStack() needed around the call to execute()? I thought execute() would call setup/restoreContextStack() itself.
          I removed the calls to set up and restore the stack since, as Knut pointed out, they are unnecessary.
          3) Creating an EmbedConnection30 object directly breaks the modularity. Unless the calls to the internal methods are necessary, it may be better to use InternalDriver.activeDriver().connect(url, info) instead.
          I removed the direct creation of EmbedConnection30 and instead use InternalDriver.activeDriver().connect(url, info).
          5) There's a comment in DDImpl5.updateStatisticsInBackGround() saying that "cm is null the very first time, and whenever we aren't actually nested." I'm not sure I understand that comment. Why is it null the first time? And isn't the method always called in a nested context? And if it is null, wouldn't that cause a NullPointerException in EmbedConnection's constructor when url=null is passed in?
          In the case where the ContextManager is null, I do not fire statistics creation in the background, because that would cause an NPE later on since url would be null.
          6) DDImpl5.updateStatisticsInBackGround() updates the shared variable executorForUpdateStatistics if it is null. But it is not protected by synchronization, so race conditions are possible.
          I now have synchronization around the code that could result in race conditions.
          7) DDImpl5.stop() should call super.stop().
          Did this.
          8) In BackgroundUpdateStatisticsTask, using a prepared statement with the table name and the index name parametrized would be better because it would handle quoting special characters correctly (not handled in the current patch) and it would reduce the number of entries in the statement cache.
          Now using a PreparedStatement rather than a Statement, as sketched below.
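
          For illustration, the parametrized call from point 8 might look like the sketch below, assuming an open embedded connection and schema, table and index names obtained elsewhere.

          import java.sql.Connection;
          import java.sql.PreparedStatement;
          import java.sql.SQLException;

          // Sketch only: call the update-statistics procedure with parameter markers
          // so that names containing special characters are quoted correctly.
          void updateIndexStatistics(Connection conn, String schemaName,
                  String tableName, String indexName) throws SQLException {
              PreparedStatement ps = conn.prepareStatement(
                      "CALL SYSCS_UTIL.SYSCS_UPDATE_STATISTICS(?, ?, ?)");
              try {
                  ps.setString(1, schemaName);
                  ps.setString(2, tableName);
                  ps.setString(3, indexName);
                  ps.execute();
              } finally {
                  ps.close();
              }
          }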

          The two comments from Knut that are not addressed in this patch are as follows.
          4) I think that the creation of a new connection will reboot the database if it has been shut down in the user thread, which may lead to unpredictable behaviour. It also seems like it will preserve all connection attributes, like attributes to reencrypt the database or to start replication master.
          I think there might be an issue here, as raised by Knut's comment 4). I still have to trace down exactly why some of the junit tests fail with my patch saying "Failed to start(/create) database". It might be because of the properties that are used to create the new Connection object that runs the statistics in the background. I am not sure yet whether it happens during every run of jdbcapi.DriverMgrAuthenticationTest, but I have seen that it is one of the tests that fails with "Failed to create database". I will work on figuring out at what point the test fails with these errors. If anyone has any ideas on the failures, or on how we should create the new Connection object, I will highly appreciate it.
          9) As to the locking issues, I would have tried to call Connection.setTransactionIsolation(Connection.TRANSACTION_READ_UNCOMMITTED) in the background thread to see if that solved/reduced the issues.
          I am not 100% sure about this, but I think using isolation level Connection.TRANSACTION_READ_UNCOMMITTED slowed down test runs on my machine. I would like to address number 4 first so that at least the test failures go down, and then I can focus on the use of isolation level Connection.TRANSACTION_READ_UNCOMMITTED. There ARE a few tests that fail because of lock timeouts, and maybe using the isolation level recommended by Knut will help.

          So, at this point, I will focus on 4) to try to narrow down where the failures are coming from. In the meantime, if the community has time to run the junit tests with my patch and see whether they notice any performance problems, I would appreciate that feedback. Thanks.

          Mamta A. Satoor added a comment -

          Knut, here are a couple more responses to some of your comments.
          3) Creating an EmbedConnection30 object directly breaks the
          modularity. Unless the calls to the internal methods are necessary, it
          may be better to use InternalDriver.activeDriver().connect(url, info)
          instead.
          There was no specific need to create an EmbedConnection30 object directly, so I have used your recommendation to get the Connection object.

          5) There's a comment in DDImpl5.updateStatisticsInBackGround() saying
          that "cm is null the very first time, and whenever we aren't actually
          nested." I'm not sure I understand that comment. Why is it null the
          first time? And isn't the method always called in a nested context?
          And if it is null, wouldn't that cause a NullPointerException in
          EmbedConnection's constructor when url=null is passed in?
          I get the ContextManager the same way we get it in jdbc.InternalDriver.getConnectionContext, and there we do check for a null ContextManager. Looking at that code, I thought there might be a case where the ContextManager could be null. To avoid the NullPointerException that would result from url being null when the ContextManager is null, I have changed DDImpl5.updateStatisticsInBackGround() to fire the background update statistics only if the ContextManager is not null. The new code looks as follows:
          public void updateStatisticsInBackGround(String schemaName,
                  String tableName, String indexName) throws StandardException {
              String url = null;
              Properties info = null;
              if (executorForUpdateStatistics == null)
              {
                  executorForUpdateStatistics = new ThreadPoolExecutor(5, 5, 0L,
                          TimeUnit.MILLISECONDS, new LinkedBlockingQueue(5));
                  executorForUpdateStatistics.setRejectedExecutionHandler(
                          new ThreadPoolExecutor.CallerRunsPolicy());
              }

              ContextService csf = ContextService.getFactory();
              ContextManager cm = csf.getCurrentContextManager();
              ConnectionContext localCC = null;

              /*
               * cm is null the very first time, and whenever
               * we aren't actually nested.
               */
              if (cm != null)
              {
                  localCC = (ConnectionContext)
                          (cm.getContext(ConnectionContext.CONTEXT_ID));
                  TransactionResourceImpl tr = localCC.getTR();
                  url = tr.getUrl();
                  info = tr.getInfo();
                  executorForUpdateStatistics.execute(
                          new BackgroundUpdateStatisticTask(schemaName, tableName,
                                  indexName, url, info));
              }

              return;
          }

          Will work on addressing the remaining comments.

          Mamta A. Satoor added a comment -

          My understanding of DERBY-3892 is as follows:
          1) A query is executed and the best plan for it is picked based on the current data in the database.
          2) The data changes quite a bit, which makes the plan chosen in step 1) no longer optimal, but during the next execution of the query we continue to use that plan.

          Based on the above understanding of DERBY-3892, the current work going on for DERBY-3788 is not going to fix DERBY-3892. This jira entry will build the statistics if, during query compilation, it is found that the required statistics do not exist. Hope this answers your question about DERBY-3892, Dag.

          Dag H. Wanvik added a comment -

          Would a solution to this issue solve DERBY-3892? If so, it might be good to say so there and link it to this issue.

          Mamta A. Satoor added a comment -

          Thanks again for your time, Knut. I will try to address some of the questions in this comment. I am planning on taking next week as vacation, so I will get back to this task after that.
          1) In this jira entry, missing statistics will be detected when the user executes a select. Originally, just for the sake of trying, I attempted to create the statistics in the select thread when I found that the statistics were missing. But executing an update inside the select processing caused problems because the data dictionary was not in write mode. Because of that, I couldn't update the statistics during the select processing. One very simple example where one will find missing statistics is as follows:
          create table t1 (c11 int, c12 char);
          create index i1 on t1(c11);
          insert into t1 values (1,'1'),(2,'2'),(3,'3');
          select * from sys.sysstatistics;

          My goal is to fire the statistics update in the background when the user executes, say, the following sql:
          select * from t1 where c11=1;

          2) As for the calls to set up and restore the context stack, I think you are right. I will remove them and see how it goes.

          Knut Anders Hatlen added a comment -

          Thanks for the updated patch, Mamta.

          I'm afraid I cannot offer very much guidance, but I have some
          questions and comments:

          1) This patch addresses the problem with non-existing statistics, not
          the problem with outdated statistics. In which situations don't the
          statistics exist? If it doesn't happen very often, it might be fine to
          update the statistics in the same thread.

          2) Are the calls to EmbedConnection30.setupContextStack() and
          restoreContextStack() needed around the call to execute()? I thought
          execute() would call setup/restoreContextStack() itself.

          3) Creating an EmbedConnection30 object directly breaks the
          modularity. Unless the calls to the internal methods are necessary, it
          may be better to use InternalDriver.activeDriver().connect(url, info)
          instead.

          4) I think that the creation of a new connection will reboot the
          database if it has been shut down in the user thread, which may lead
          to unpredictable behaviour. It also seems like it will preserve all
          connection attributes, like attributes to reencrypt the database or to
          start replication master.

          5) There's a comment in DDImpl5.updateStatisticsInBackGround() saying
          that "cm is null the very first time, and whenever we aren't actually
          nested." I'm not sure I understand that comment. Why is it null the
          first time? And isn't the method always called in a nested context?
          And if it is null, wouldn't that cause a NullPointerException in
          EmbedConnection's constructor when url=null is passed in?

          6) DDImpl5.updateStatisticsInBackGround() updates the shared variable
          executorForUpdateStatistics if it is null. But it is not protected by
          synchronization, so race conditions are possible.

          7) DDImpl5.stop() should call super.stop().

          8) In BackgroundUpdateStatisticsTask, using a prepared statement with
          the table name and the index name parametrized would be better because
          it would handle quoting special characters correctly (not handled in
          the current patch) and it would reduce the number of entries in the
          statement cache.

          9) As to the locking issues, I would have tried to call
          Connection.setTransactionIsolation(Connection.TRANSACTION_READ_UNCOMMITTED)
          in the background thread to see if that solved/reduced the issues.
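
          As a minimal illustration of suggestion 9), lowering the isolation level on the background connection could look like the sketch below; the connection and the actual update-statistics call are assumed to come from elsewhere.

          import java.sql.Connection;
          import java.sql.SQLException;

          // Sketch only: run the background statistics work at read uncommitted
          // to reduce lock conflicts, then restore the previous isolation level.
          void runAtReadUncommitted(Connection bgConn, Runnable updateStatsWork)
                  throws SQLException {
              int previous = bgConn.getTransactionIsolation();
              bgConn.setTransactionIsolation(Connection.TRANSACTION_READ_UNCOMMITTED);
              try {
                  updateStatsWork.run();
              } finally {
                  bgConn.setTransactionIsolation(previous);
              }
          }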

          Mamta A. Satoor added a comment -

          Knut, thanks for taking the time to review the patch. I took the day off yesterday, so I couldn't reply sooner. I have added a new patch which includes DataDictionaryImpl5.java. Please let me know what you think of the code, especially the part where I create an EmbedConnection using the different objects collected before the background thread was fired.

          As for locking, I am not using any special isolation level for the update statistics when it is fired through the background thread. The background update statistics thread uses the same isolation level that update statistics currently uses when a user fires it manually through CALL SYSCS_UTIL.SYSCS_UPDATE_STATISTICS(..,..,..).

          Knut Anders Hatlen added a comment -

          Hi Mamta,

          Sorry for posting the comments so late. I don't see the DataDictionaryImpl5 class in the patch. Did you forget to do svn add?

          Which isolation level does the background thread use when it updates the statistics? I see that Mike has suggested read uncommitted in order to minimize the impact on locking.

          Mamta A. Satoor added a comment -

          I was wondering if anyone has had a chance to look at the patch and has any comments about how I am getting an EmbedConnection in the background thread to update the statistics. Thanks.

          Mamta A. Satoor added a comment -

          First of all, let me start out by saying that this patch may not be ready for commit yet, depending on what the community thinks about the approach taken to establish a Connection in a background thread to fire update statistics jobs.

          Background information
          There can be situations in Derby where the statistics might not exist for some of the indexes. The absence of statistics will hamper the optimizer in picking the best available path for query execution. We added a manual way of updating the statistics through DERBY-269, but that is a manual process rather than a zero-admin way of maintaining statistics. I am attempting to have Derby fire the update statistics job in the background when it finds that statistics are not available for an index. Derby can detect the unavailability of statistics during the query optimization phase, where it queries the SYSSTATISTICS table to see if statistics are available for an index. With my patch, if the statistics are missing, then we will schedule a background task to have the statistics available the next time the optimizer needs to look at them. To schedule these tasks, rather than inventing our own way of managing threads, I have relied on new classes available in jdk1.5. This means that the zero-admin way of maintaining statistics will be available in jdk 1.5 and higher versions. In order to make the update statistics functionality available in jdk1.5 and higher, I had to subclass a new version of DataDictionaryImpl called DataDictionaryImpl5, and that class will be used only for jdk1.5 and higher. The background task creates an EmbedConnection of its own and executes the following SQL on the JDBC Statement obtained from the EmbedConnection.
          CALL SYSCS_UTIL.SYSCS_UPDATE_STATISTICS(schemaName, tableName, indexName);

          This is where I would appreciate it if the Derby community could check whether my approach, i.e. creating an EmbedConnection, is the best way to execute update statistics. What's even trickier is the way I get the correct instance of InternalDriver, and the url and properties that were used to make the original JDBC connection which required us to fire this update statistics task. The code may not look very clean in how it gets these objects, which are required to make an EmbedConnection. I tried looking through the existing Derby code to see if there is any other place where we execute our own SQL in an independent thread, but I had no luck with that. So, please let me know if you agree with the approach, or if you have ideas on how to achieve this in a different way; I will really appreciate it. The work of getting the required objects to create an EmbedConnection happens in DataDictionaryImpl5.
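
          (Purely as a hypothetical sketch of an alternative, not what this patch does: instead of constructing an EmbedConnection directly, the background task could ask the engine's active InternalDriver for a nested connection. Here url and info are assumed to be the connection URL and Properties captured from the original user connection.)

          // Hypothetical alternative sketch only; url and info are assumed to be the
          // values collected from the original connection before the thread was fired.
          InternalDriver driver = InternalDriver.activeDriver();
          if (driver != null) {
              Connection conn = driver.connect(url, info);
              // ... execute CALL SYSCS_UTIL.SYSCS_UPDATE_STATISTICS(?, ?, ?) on conn ...
              conn.close();
          }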

          Please let me know if I can provide more information to make the patch clearer. I haven't put many comments in the new code in the patch because I wanted to be sure that we as a community agree on the approach taken to do the actual work of updating statistics by creating an EmbedConnection object.

          I did try running the existing junit tests with my patch and found a handful of tests failing. I analyzed a few of those failures and I think they are locking failures. I have a feeling that when we try to update the statistics, we hold locks which conflict with what the test might be doing at the same time that statistics are getting updated in a different thread. I wonder what we should do about these failures.

          Mamta A. Satoor added a comment -

          I now have the basic framework of using the ThreadPoolExecutor in the new extended class DataDictionaryImpl5 (DD5). I have called the method in DD5 updateStatisticsInBackGround rather than scheduleUpdateStatisticstask. This method in DD5 right now just queues println tasks every time an update statistics request is sent its way. This is what the method code in DD5 looks like right now:

          public void updateStatisticsInBackGround(String schemaName,
                  String tableName, String indexName) {
              System.out.println("came to jdk 1.5 implementation");
              if (executorForUpdateStatistics == null) {
                  executorForUpdateStatistics = new ThreadPoolExecutor(5, 5, 0L,
                          TimeUnit.MILLISECONDS, new LinkedBlockingQueue(5));
                  executorForUpdateStatistics.setRejectedExecutionHandler(
                          new ThreadPoolExecutor.CallerRunsPolicy());
              }
              executorForUpdateStatistics.execute(new BackgroundUpdateStatisticTask(
                      schemaName, tableName, indexName));
              return;
          }

          The new class BackgroundUpdateStatisticTask right now just does println as shown below
          class BackgroundUpdateStatisticTask implements Runnable {
              private String schemaName;
              private String tableName;
              private String indexName;

              public BackgroundUpdateStatisticTask(String schemaName,
                      String tableName, String indexName) {
                  this.schemaName = schemaName;
                  this.tableName = tableName;
                  this.indexName = indexName;
              }

              public void run() {
                  System.out.println("Hello World : Updating statistics for "
                          + schemaName + ":" + tableName + ":" + indexName);
              }
          }

          As the next step, I now want to actually do update statistics in BackgroundUpdateStatisticTask rather than just a println. I thought I would be able to do something like the following in the run method of BackgroundUpdateStatisticTask:
          public void run() {
              System.out.println("Hello World : Updating statistics for "
                      + schemaName + ":" + tableName + ":" + indexName);
              try {
                  SystemProcedures.SYSCS_UPDATE_STATISTICS(schemaName, tableName, indexName);
              } catch (SQLException e) {
                  System.out.println("got exception");
                  e.printStackTrace();
              }
          }

          But a call to SystemProcedures.SYSCS_UPDATE_STATISTICS inside the run method results in:

          java.sql.SQLException: No current connection.
          at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(SQLExceptionFactory.java:45)
          at org.apache.derby.impl.jdbc.Util.newEmbedSQLException(Util.java:87)
          at org.apache.derby.impl.jdbc.Util.newEmbedSQLException(Util.java:103)
          at org.apache.derby.impl.jdbc.Util.noCurrentConnection(Util.java:207)
          at org.apache.derby.catalog.SystemProcedures.getDefaultConn(SystemProcedures.java:185)
          at org.apache.derby.catalog.SystemProcedures.SYSCS_UPDATE_STATISTICS(SystemProcedures.java:737)
          at org.apache.derby.impl.sql.catalog.BackgroundUpdateStatisticTask.run(DataDictionaryImpl5.java:90)
          at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:650)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:675)
          at java.lang.Thread.run(Thread.java:595)

          So, it appears that I do not have the context set up correctly to run the stored procedure implementation in SystemProcedures.SYSCS_UPDATE_STATISTICS. I am currently trying to see how we do the required setup when a JDBC connection is made, but at this point I am not clear on how I might be able to reuse some of that connection-time code in the BackgroundUpdateStatisticTask.run method. I will appreciate any help I can get with the context setup so I can run the update statistics. Thanks.
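
          (For illustration only: one way to sidestep the missing context, assuming the embedded driver is already loaded and the database is booted, would be to open a plain JDBC connection from the background task and issue the CALL through it. The database name dbName below is a hypothetical placeholder; this is a sketch of an alternative, not the approach the later patch takes, and opening a fresh connection has drawbacks of its own.)

          // Hypothetical sketch using only public JDBC; dbName is a placeholder for
          // the name of the already-booted database.
          public void run() {
              try {
                  Connection conn = DriverManager.getConnection("jdbc:derby:" + dbName);
                  CallableStatement cs = conn.prepareCall(
                          "CALL SYSCS_UTIL.SYSCS_UPDATE_STATISTICS(?, ?, ?)");
                  cs.setString(1, schemaName);
                  cs.setString(2, tableName);
                  cs.setString(3, indexName);
                  cs.execute();
                  cs.close();
                  conn.close();
              } catch (SQLException e) {
                  e.printStackTrace();
              }
          }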

          Mamta A. Satoor added a comment -

          As per my comment on this jira entry yesterday, I am trying to use ThreadPoolExecutor in the Derby codeline so it can be used to fire the update statistics tasks in the background. ThreadPoolExecutor was added as part of jdk1.5. When I include the import of this class into DataDictionary.java (I will refer to it as DD), I of course need to make sure that DD.java gets compiled with 1.5 or higher. But at run time, requiring the user to have 1.5 or higher is not going to work.

          At this point, I am thinking of addressing this by having a subclass of DD, called say DD5, which will be loaded by the monitor code if we know that we are dealing with jdk 1.5 or higher. For the earlier versions, for now, trying to schedule update statistics tasks in the background will be a no-op. This way, I will be able to have the new code run (or be a no-op for jdk1.4 and lower) in all supported versions of the JDK. If anyone has any feedback on my approach, please let me know.

          Some background information: iapi.sql.dictionary.TableDescriptor has an existing method called statisticsExist which will return true if the statistics exist. This method gets called by the query optimization phase and by AlterTableConstantAction. For the query optimization phase, we want to be able to schedule update statistics tasks in the background if TableDescriptor.statisticsExist() returns false. Following is the pseudo-code I have in mind:

          during the query optimization:
              if (TableDescriptor.statisticsExist() == false) {
                  TableDescriptor.createStatisticsInBackGround(schemaname, tablename, indexname)   <----- new method in TableDescriptor
              }

          TableDescriptor with new method:
              createStatisticsInBackGround(.....) {
                  DD.scheduleUpdateStatisticstask(...)   <------ new method in DD
              }

          DD5 will do the real work of scheduling the tasks in the background. In DD (i.e. when running with jdk1.4 and lower), this method will be a no-op.

          DD5 with new method (for jdk1.5 and higher):
              scheduleUpdateStatisticstask(...) {
                  create ThreadPoolExecutor if it does not already exist
                  queue update statistics task on the ThreadPoolExecutor
              }

          DD with new method (for jdk1.4 and lower):
              scheduleUpdateStatisticstask(...) {
                  no-op
              }
          Mamta A. Satoor added a comment -

          Based on the feedback from Mike, I am thinking of pursuing option 3)"give up on updating during compile and just use default, but schedule update after compile and somehow mark query to be recompiled again the next time (this might automatically happen if update of statistic already causes dependency to change on compiled query - i don't know. "

          I will concentrate on the first part of option 3 which is to schedule update statistics after query compile if the query compile finds that the required statistics are not available. As recommended by Dan on Derby dev list (http://www.nabble.com/higher-level-background-work-in-the-derby-server,-where-should-it-go--td18950929.html), I am trying to use classes in java.util.concurrent rather than some home-built background jobs mechanism.

          To break the first part of option 3) further into mini-steps, I have written a "Hello World" program using the java.util.concurrent package. This is completely outside the Derby code, just a stand-alone Java program. There are 2 physical java files (attached to this jira entry for reference). One (DERBY_3788_Mgr) creates the ThreadPoolExecutor and then adds 30 Hello World background tasks (DERBY_3788_Repro). In order to execute it, use the following:
          java org.apache.derbyTesting.functionTests.tests.jdbc4.DERBY_3788_Mgr
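
          (The attachments themselves are not reproduced here; as a rough, assumed approximation of what the two files do - the class name HelloWorldExecutorDemo and the details below are illustrative and may differ from the actual attachments - the program looks something like this:)

          import java.util.concurrent.LinkedBlockingQueue;
          import java.util.concurrent.ThreadPoolExecutor;
          import java.util.concurrent.TimeUnit;

          // Illustrative approximation only: a small fixed-size pool that runs
          // 30 "Hello World" background tasks and then shuts down.
          public class HelloWorldExecutorDemo {
              public static void main(String[] args) throws InterruptedException {
                  ThreadPoolExecutor executor = new ThreadPoolExecutor(5, 5, 0L,
                          TimeUnit.MILLISECONDS, new LinkedBlockingQueue<Runnable>(5));
                  executor.setRejectedExecutionHandler(
                          new ThreadPoolExecutor.CallerRunsPolicy());
                  for (int i = 0; i < 30; i++) {
                      final int taskNumber = i;
                      executor.execute(new Runnable() {
                          public void run() {
                              System.out.println("Hello World from task " + taskNumber);
                          }
                      });
                  }
                  executor.shutdown();
                  executor.awaitTermination(60, TimeUnit.SECONDS);
              }
          }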

          One thing to note is that java.util.concurrent.ThreadPoolExecutor was introduced in jdk1.5, and hence you need jdk1.5 or a higher version to run this program. We will have to discuss what we want to do for jdks prior to 1.5.

          As the next mini-step, I am going to see how I can put this ThreadPoolExecutor framework into our codeline when we detect that statistics are not available. At this step, I will just do some sort of println in the background thread for the update statistics requirement. Once that is in place, I will try to see how I can actually fire the update statistics code from the background threads.

          Any comments/feedback on what I have so far or what the plan is?

          Mike Matrigali added a comment -

          I don't know if any of these are good ideas but maybe someone with knowledge of the code can discuss. They are options that came to mind:
          1) explore why the error is happening and see if updating stats can be special cased. It is different from a "real" DDL that might cause havoc changing during a compile.
          2) rather than update the stat, just inline calculate and use that for the current compile and schedule a later task after compile to do the update.
          3) give up on updating during compile and just use the default, but schedule an update after compile and somehow mark the query to be recompiled again the next time (this might automatically happen if the update of the statistic already causes a dependency change on the compiled query - I don't know).
          4) when you find a missing stat, give up on the current compile, update the stat, and then retry the compile.

          As I understood it, you were looking at inline update of stats as an incremental dev step - maybe it is not worth it given the problems and you should jump to the long-term strategy of delaying the update.

          Mamta A. Satoor added a comment -

          I am looking at the case where statistics do not exist for an index during the query compile phase, and trying to create them while compiling the query. But trying to create statistics during query compilation throws the following exception.
          ERROR XCL21: You are trying to execute a Data Definition statement (CREATE, DROP, or ALTER) while preparing a different statement. This is not allowed. It can happen if you execute a Data Definition statement from within a static initializer of a Java class that is being used from within a SQL statement.

          The exception is being thrown from DataDictionaryImpl.startWriting(LanguageConnectionContext) through the following piece of code:

          /*
           * Don't allow DDL if we're binding a SQL statement.
           */
          if (lcc.getBindCount() != 0) {
              throw StandardException.newException(SQLState.LANG_DDL_IN_BIND);
          }

          My question is: can we disable this check, say, when statistics are getting updated while a query is getting executed? Is there any other way of creating statistics on the fly when we know that we are going to need them during the execution of the current query? I would appreciate any feedback.

          Mamta A. Satoor added a comment -

          Currently, iapi.sql.dictionary.TableDescriptor has a method called statisticsExist which checks whether statistics are available for the passed conglomerate. This method gets called
          1) during the optimization phase of a query by impl.sql.compile.FromBaseTable.estimateCost
          2) during the optimization phase of a query by impl.sql.compile.PredicateList.selectivity
          3) by impl.sql.execute.AlterTableConstantAction to determine if the statistics already exist and, if yes, then drop and recreate them. This happens during the compress table request (not sure if anything other than compress table in AlterTableConstantAction calls it, but we do not need to worry about it because we only want to create the statistics if they don't already exist during the compile phase of a sql query)

          What I am considering is adding another method to TableDescriptor, called say createStatistics, which will be called during the optimization phase of a query by the 2 classes impl.sql.compile.FromBaseTable and impl.sql.compile.PredicateList. The only awkward thing I am finding is that the code required to update the statistics needs many objects, like LanguageConnectionContext, DataDictionary, TransactionController, etc. (there may be more), and they are not available to the TableDescriptor class, so I will have to pass these objects when a call is made to the new method in TableDescriptor from impl.sql.compile. Does this sound like an awkward way of getting the objects? Maybe this new method should be defined in DataDictionary rather than TableDescriptor. I will work more on what the right place for the new method is. In the meantime, if anyone has any thoughts, please post them here.
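
          (Purely as a hypothetical sketch of what such a method might look like if it were placed on DataDictionary instead, with the compile-time objects passed in; the name and parameter list below are illustrative only, not from an actual patch:)

          /* Hypothetical signature sketch; parameters are illustrative, based on the
             objects mentioned above. */
          public void createStatistics(LanguageConnectionContext lcc,
                                       TransactionController tc,
                                       TableDescriptor td,
                                       String indexName)
              throws StandardException;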

          Once we have this new method to create the statistics, we can hopefully remove the redundant code that already exists to create the statistics in 3 different classes, namely AlterTableConstantAction, CreateIndexConstantAction and InsertResultSet.

          Mamta A. Satoor added a comment -

          I am looking at creating the statistics automatically if they are found to be absent when needed. As a later step, we can look at updating the statistics automatically, under a certain threshold, if they appear to be out of sync.

          Mamta A. Satoor added a comment -

          Added a new test case with revision 681085 which shows that updating the statistics makes a query pick up a better index compared to before the statistics were available.


            People

            • Assignee:
              Mamta A. Satoor
              Reporter:
              Mamta A. Satoor
            • Votes:
              0
              Watchers:
              3
