Derby
  1. Derby
  2. DERBY-4121

Documentation: more UPDATE_STATISTICS fixes needed for Reference Manual and Tuning Derby

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 10.5.1.1
    • Fix Version/s: 10.5.1.1, 10.6.1.0
    • Component/s: Documentation
    • Labels:
      None

      Description

      Kathey Marsden comments on DERBY-3787:

      Not a show stopper for the release, but in beginning my buddy testing,
      I noticed the examples should use CALL e.g.
      CALL SYSCS_UTIL.SYSCS_UPDATE_STATISTICS('SAMP','EMPLOYEE','PAY_DESC');
      CALL SYSCS_UTIL.SYSCS_UPDATE_STATISTICS('SAMP', 'EMPLOYEE', null);

      Also in the Tuning Guide under Working with Cardinality Statistics -> When Cardinality Statistics Go Stale we should refer users to the SYSCS_UTIL.SYSCS_UPDATE_STATISTICS stored procedure to update their statistics and cross reference the reference guide.

      Myrna van Lunteren suggests a new issue for this, so I'm filing it.

      1. DERBY-4121-2.zip
        4 kB
        Kim Haase
      2. DERBY-4121-2.diff
        4 kB
        Kim Haase
      3. DERBY-4121.zip
        4 kB
        Kim Haase
      4. DERBY-4121.stat
        0.1 kB
        Kim Haase
      5. DERBY-4121.diff
        3 kB
        Kim Haase

        Issue Links

          Activity

          Hide
          Kathey Marsden added a comment -

          Thank you Kim for filing this issue. I was holding off because I am still buddy testing and have more comments on the documentation. Specifically I wanted to include documentation that once statistics have been created they cannot be dropped and need to be subsequently updated regularly as the number of rows or distribution of the data changes.

          As soon as my testing is complete I will post any additional comments here. I should be done by the end of this week.

          Show
          Kathey Marsden added a comment - Thank you Kim for filing this issue. I was holding off because I am still buddy testing and have more comments on the documentation. Specifically I wanted to include documentation that once statistics have been created they cannot be dropped and need to be subsequently updated regularly as the number of rows or distribution of the data changes. As soon as my testing is complete I will post any additional comments here. I should be done by the end of this week.
          Hide
          Kim Haase added a comment -

          Thanks, Kathey. When the information is ready, either Laura can pick the issue up or I can – I have left it unassigned for the time being.

          Show
          Kim Haase added a comment - Thanks, Kathey. When the information is ready, either Laura can pick the issue up or I can – I have left it unassigned for the time being.
          Hide
          Mike Matrigali added a comment -

          cardinality statistics only track's the average number of duplicate values per value in an index. So it is unnecessary to update this if that value is not changing. Derby uses the indexes directly to
          track the distribution of values in given ranges and update statistics has no affect on this.
          So it is not necessary to update as the number of rows grows.

          The system has a zero admin way to track the number of rows in the system. This method is an
          estimate and a proposed change to update statistics will improve this estimate, but don't want users to think that they have to update statistics to get valid row counts.

          Show
          Mike Matrigali added a comment - cardinality statistics only track's the average number of duplicate values per value in an index. So it is unnecessary to update this if that value is not changing. Derby uses the indexes directly to track the distribution of values in given ranges and update statistics has no affect on this. So it is not necessary to update as the number of rows grows. The system has a zero admin way to track the number of rows in the system. This method is an estimate and a proposed change to update statistics will improve this estimate, but don't want users to think that they have to update statistics to get valid row counts.
          Hide
          Knut Anders Hatlen added a comment -

          My understanding of the code is that we store the number of unique values, not the average number of duplicates. The selectivity is calculated this way (from StatisticsImpl.java):

          public double selectivity(Object[] predicates)

          { if (numRows == 0.0) return 0.1; /* xxxSTATresolve: for small values of numRows, should we do something * special? */ return (double)(1/(double)numUnique); }

          If the number of rows is doubled, and the average number of duplicates per value doesn't change, this method will return a selectivity that is two times as high as it should be if we don't update the statistics.

          Show
          Knut Anders Hatlen added a comment - My understanding of the code is that we store the number of unique values, not the average number of duplicates. The selectivity is calculated this way (from StatisticsImpl.java): public double selectivity(Object[] predicates) { if (numRows == 0.0) return 0.1; /* xxxSTATresolve: for small values of numRows, should we do something * special? */ return (double)(1/(double)numUnique); } If the number of rows is doubled, and the average number of duplicates per value doesn't change, this method will return a selectivity that is two times as high as it should be if we don't update the statistics.
          Hide
          Kathey Marsden added a comment -

          Based on Knut's comment. is it correct then to say that SYSCS_UTIL.SYSCS_UPDATE_STATISTICS should be run if there is a significant change in the number of rows or the number of unique values in an index?

          Show
          Kathey Marsden added a comment - Based on Knut's comment. is it correct then to say that SYSCS_UTIL.SYSCS_UPDATE_STATISTICS should be run if there is a significant change in the number of rows or the number of unique values in an index?
          Hide
          Knut Anders Hatlen added a comment -

          That sounds correct to me.

          Show
          Knut Anders Hatlen added a comment - That sounds correct to me.
          Hide
          Mike Matrigali added a comment -

          Actually given the code that knut posted it seems to only matter if you change the number of unique values, or go from 0 to a non-zero number of rows. I knew the stat code remembered
          both numRows and numUnique - I didn't realize that we almost did nothing with numRows.

          Does anyone else think that the current way this number used is bad for a zero admin db. It seems like we should be giving back very different information is we run the stat gatherer and we find 1000 unique values and there were 1000 unique rows vs. 1000 unique values and 1,000,000 unique rows. If the assumption is that in general we are not going to require user intervention to gather stats what is a better guess :
          1) as number of rows changes we won't change number of unique values
          2) as number of rows changes the number of unique values will stay in the same proportion

          Obviously either assumption can be made wrong for a specific application, but what would a good default assumption be. It seems option 1 is the current default. I guess part of this question is what the resolve comment is referring to as this is a much bigger deal the smaller the sample size and thus the bigger the selectivity.

          Show
          Mike Matrigali added a comment - Actually given the code that knut posted it seems to only matter if you change the number of unique values, or go from 0 to a non-zero number of rows. I knew the stat code remembered both numRows and numUnique - I didn't realize that we almost did nothing with numRows. Does anyone else think that the current way this number used is bad for a zero admin db. It seems like we should be giving back very different information is we run the stat gatherer and we find 1000 unique values and there were 1000 unique rows vs. 1000 unique values and 1,000,000 unique rows. If the assumption is that in general we are not going to require user intervention to gather stats what is a better guess : 1) as number of rows changes we won't change number of unique values 2) as number of rows changes the number of unique values will stay in the same proportion Obviously either assumption can be made wrong for a specific application, but what would a good default assumption be. It seems option 1 is the current default. I guess part of this question is what the resolve comment is referring to as this is a much bigger deal the smaller the sample size and thus the bigger the selectivity.
          Hide
          Kathey Marsden added a comment -

          So given the current behavior and assuming we won't change it for the release, what is the best advice to offer users regarding maintaining statistics?

          Show
          Kathey Marsden added a comment - So given the current behavior and assuming we won't change it for the release, what is the best advice to offer users regarding maintaining statistics?
          Hide
          Knut Anders Hatlen added a comment -

          To Kathey's question, I think the most accurate advice is: "SYSCS_UTIL.SYSCS_UPDATE_STATISTICS should be invoked if the number of distinct values in the index has changed significantly."

          Show
          Knut Anders Hatlen added a comment - To Kathey's question, I think the most accurate advice is: "SYSCS_UTIL.SYSCS_UPDATE_STATISTICS should be invoked if the number of distinct values in the index has changed significantly."
          Hide
          Knut Anders Hatlen added a comment -

          To Mike's question, I think that (2) will be a better guess in more cases than (1). Of course, if you have a limited number of possible/likely values, the average number of duplicates will increase as the number of rows increases, but as long as there's a relatively high number of possible values (2) is more likely to hold. For columns like gender, city and state, I'd assume that (1) in many cases is a much better guess than (2), though.

          Show
          Knut Anders Hatlen added a comment - To Mike's question, I think that (2) will be a better guess in more cases than (1). Of course, if you have a limited number of possible/likely values, the average number of duplicates will increase as the number of rows increases, but as long as there's a relatively high number of possible values (2) is more likely to hold. For columns like gender, city and state, I'd assume that (1) in many cases is a much better guess than (2), though.
          Hide
          Kim Haase added a comment -

          Looks like this is ready to document, so I may as well pick it up. I have attached DERBY-4121.diff, DERBY-4121.stat, and DERBY-4121.zip, with changes to two files:

          M src/tuning/ctunstats849505.dita
          M src/ref/rrefupdatestatsproc.dita

          The Tuning change adds a paragraph to the suggested topic on statistics.

          The Ref change corrects the two examples; includes the information about when to call the procedure; and inserts the conrefs.dita phrase for Derby.

          I have thought of one more item for the Reference topic, though. The syntax has no indication of the data types of the three arguments:

          SYSCS_UTIL.SYSCS_UPDATE_STATISTICS(SCHEMANAME, TABLENAME, INDEXNAME)

          Shouldn't this be changed to something like the following?

          SYSCS_UTIL.SYSCS_UPDATE_STATISTICS(IN SCHEMANAME VARCHAR(128), IN TABLENAME VARCHAR(128), IN INDEXNAME VARCHAR(128))

          Compare http://db.apache.org/derby/docs/dev/ref/rrefimportproc.html and other procedure syntax statements.

          Thanks for any comments.

          Show
          Kim Haase added a comment - Looks like this is ready to document, so I may as well pick it up. I have attached DERBY-4121 .diff, DERBY-4121 .stat, and DERBY-4121 .zip, with changes to two files: M src/tuning/ctunstats849505.dita M src/ref/rrefupdatestatsproc.dita The Tuning change adds a paragraph to the suggested topic on statistics. The Ref change corrects the two examples; includes the information about when to call the procedure; and inserts the conrefs.dita phrase for Derby. I have thought of one more item for the Reference topic, though. The syntax has no indication of the data types of the three arguments: SYSCS_UTIL.SYSCS_UPDATE_STATISTICS(SCHEMANAME, TABLENAME, INDEXNAME) Shouldn't this be changed to something like the following? SYSCS_UTIL.SYSCS_UPDATE_STATISTICS(IN SCHEMANAME VARCHAR(128), IN TABLENAME VARCHAR(128), IN INDEXNAME VARCHAR(128)) Compare http://db.apache.org/derby/docs/dev/ref/rrefimportproc.html and other procedure syntax statements. Thanks for any comments.
          Hide
          Kathey Marsden added a comment -

          Thanks Kim for picking up this issue. I think you are right about the types being missing and the types as you describe them seem correct to me.

          The changes look good, but if you are making another round anyway, I think it would be good to preface the following statement:

          It is a good idea to call the SYSCS_UTIL.SYSCS_UPDATE_STATISTICS procedure when the number of distinct values in an index is likely to have changed significantly.

          with:

          Once statistics have been created, they cannot be dropped and should be maintained.

          Show
          Kathey Marsden added a comment - Thanks Kim for picking up this issue. I think you are right about the types being missing and the types as you describe them seem correct to me. The changes look good, but if you are making another round anyway, I think it would be good to preface the following statement: It is a good idea to call the SYSCS_UTIL.SYSCS_UPDATE_STATISTICS procedure when the number of distinct values in an index is likely to have changed significantly. with: Once statistics have been created, they cannot be dropped and should be maintained.
          Hide
          Kim Haase added a comment -

          Thanks very much, Kathey! I've attached a second patch (DERBY-4121-2.diff and DERBY-4121-2.zip) with the changes you approve and suggest to the Reference Manual topic.

          Show
          Kim Haase added a comment - Thanks very much, Kathey! I've attached a second patch ( DERBY-4121 -2.diff and DERBY-4121 -2.zip) with the changes you approve and suggest to the Reference Manual topic.
          Hide
          Kathey Marsden added a comment -

          Changes look good Kim. +1 to commit.

          Show
          Kathey Marsden added a comment - Changes look good Kim. +1 to commit.
          Hide
          Kim Haase added a comment -

          Thanks very much, Kathey.

          Committed patch DERBY-4121-2.diff to documentation trunk at revision 763965.
          Merged to 10.5 branch at revision 763968.

          Show
          Kim Haase added a comment - Thanks very much, Kathey. Committed patch DERBY-4121 -2.diff to documentation trunk at revision 763965. Merged to 10.5 branch at revision 763968.
          Hide
          Kim Haase added a comment -

          Closing, since fixes now appear in Latest Alpha Manuals.

          Show
          Kim Haase added a comment - Closing, since fixes now appear in Latest Alpha Manuals.

            People

            • Assignee:
              Kim Haase
              Reporter:
              Kim Haase
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development