Issue Details (XML | Word | Printable)

Key: DERBY-4121
Type: Bug Bug
Status: Closed Closed
Resolution: Fixed
Priority: Minor Minor
Assignee: Kim Haase
Reporter: Kim Haase
Votes: 0
Watchers: 0
Operations

If you were logged in you would be able to see more operations.
Derby

Documentation: more UPDATE_STATISTICS fixes needed for Reference Manual and Tuning Derby

Created: 26/Mar/09 02:28 PM   Updated: 04/May/09 06:22 PM
Component/s: Documentation
Affects Version/s: 10.5.1.1
Fix Version/s: 10.5.1.1, 10.6.0.0

Time Tracking:
Not Specified

File Attachments:
  Size
File Licensed for inclusion in ASF works DERBY-4121-2.diff 2009-04-09 01:40 PM Kim Haase 4 kB
Zip Archive Licensed for inclusion in ASF works DERBY-4121-2.zip 2009-04-09 01:40 PM Kim Haase 4 kB
File Licensed for inclusion in ASF works DERBY-4121.diff 2009-04-08 04:57 PM Kim Haase 3 kB
File Licensed for inclusion in ASF works DERBY-4121.stat 2009-04-08 04:57 PM Kim Haase 0.1 kB
Zip Archive Licensed for inclusion in ASF works DERBY-4121.zip 2009-04-08 04:57 PM Kim Haase 4 kB
Issue Links:
Dependants
 

Resolution Date: 10/Apr/09 04:14 PM


 Description  « Hide
Kathey Marsden comments on DERBY-3787:

Not a show stopper for the release, but in beginning my buddy testing,
 I noticed the examples should use CALL e.g.
CALL SYSCS_UTIL.SYSCS_UPDATE_STATISTICS('SAMP','EMPLOYEE','PAY_DESC');
CALL SYSCS_UTIL.SYSCS_UPDATE_STATISTICS('SAMP', 'EMPLOYEE', null);

Also in the Tuning Guide under Working with Cardinality Statistics -> When Cardinality Statistics Go Stale we should refer users to the SYSCS_UTIL.SYSCS_UPDATE_STATISTICS stored procedure to update their statistics and cross reference the reference guide.

Myrna van Lunteren suggests a new issue for this, so I'm filing it.


 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Kathey Marsden added a comment - 26/Mar/09 04:26 PM
Thank you Kim for filing this issue. I was holding off because I am still buddy testing and have more comments on the documentation. Specifically I wanted to include documentation that once statistics have been created they cannot be dropped and need to be subsequently updated regularly as the number of rows or distribution of the data changes.

As soon as my testing is complete I will post any additional comments here. I should be done by the end of this week.


Kim Haase added a comment - 26/Mar/09 04:47 PM
Thanks, Kathey. When the information is ready, either Laura can pick the issue up or I can -- I have left it unassigned for the time being.

Mike Matrigali added a comment - 26/Mar/09 05:15 PM
cardinality statistics only track's the average number of duplicate values per value in an index. So it is unnecessary to update this if that value is not changing. Derby uses the indexes directly to
track the distribution of values in given ranges and update statistics has no affect on this.
So it is not necessary to update as the number of rows grows.

The system has a zero admin way to track the number of rows in the system. This method is an
estimate and a proposed change to update statistics will improve this estimate, but don't want users to think that they have to update statistics to get valid row counts.

Knut Anders Hatlen added a comment - 27/Mar/09 10:54 AM
My understanding of the code is that we store the number of unique values, not the average number of duplicates. The selectivity is calculated this way (from StatisticsImpl.java):

public double selectivity(Object[] predicates)
{
if (numRows == 0.0)
return 0.1;

/* xxxSTATresolve: for small values of numRows, should we do something
* special?
*/
return (double)(1/(double)numUnique);
}

If the number of rows is doubled, and the average number of duplicates per value doesn't change, this method will return a selectivity that is two times as high as it should be if we don't update the statistics.

Kathey Marsden added a comment - 31/Mar/09 08:00 PM
Based on Knut's comment. is it correct then to say that SYSCS_UTIL.SYSCS_UPDATE_STATISTICS should be run if there is a significant change in the number of rows or the number of unique values in an index?

Knut Anders Hatlen added a comment - 01/Apr/09 09:49 AM
That sounds correct to me.

Mike Matrigali added a comment - 01/Apr/09 05:00 PM
Actually given the code that knut posted it seems to only matter if you change the number of unique values, or go from 0 to a non-zero number of rows. I knew the stat code remembered
both numRows and numUnique - I didn't realize that we almost did nothing with numRows.

Does anyone else think that the current way this number used is bad for a zero admin db. It seems like we should be giving back very different information is we run the stat gatherer and we find 1000 unique values and there were 1000 unique rows vs. 1000 unique values and 1,000,000 unique rows. If the assumption is that in general we are not going to require user intervention to gather stats what is a better guess :
1) as number of rows changes we won't change number of unique values
2) as number of rows changes the number of unique values will stay in the same proportion

Obviously either assumption can be made wrong for a specific application, but what would a good default assumption be. It seems option 1 is the current default. I guess part of this question is what the resolve comment is referring to as this is a much bigger deal the smaller the sample size and thus the bigger the selectivity.

Kathey Marsden added a comment - 03/Apr/09 12:25 AM
So given the current behavior and assuming we won't change it for the release, what is the best advice to offer users regarding maintaining statistics?

Knut Anders Hatlen added a comment - 03/Apr/09 07:32 AM
To Kathey's question, I think the most accurate advice is: "SYSCS_UTIL.SYSCS_UPDATE_STATISTICS should be invoked if the number of distinct values in the index has changed significantly."

Knut Anders Hatlen added a comment - 03/Apr/09 11:11 AM
To Mike's question, I think that (2) will be a better guess in more cases than (1). Of course, if you have a limited number of possible/likely values, the average number of duplicates will increase as the number of rows increases, but as long as there's a relatively high number of possible values (2) is more likely to hold. For columns like gender, city and state, I'd assume that (1) in many cases is a much better guess than (2), though.

Kim Haase added a comment - 08/Apr/09 04:57 PM
Looks like this is ready to document, so I may as well pick it up. I have attached DERBY-4121.diff, DERBY-4121.stat, and DERBY-4121.zip, with changes to two files:

M src/tuning/ctunstats849505.dita
M src/ref/rrefupdatestatsproc.dita

The Tuning change adds a paragraph to the suggested topic on statistics.

The Ref change corrects the two examples; includes the information about when to call the procedure; and inserts the conrefs.dita phrase for Derby.

I have thought of one more item for the Reference topic, though. The syntax has no indication of the data types of the three arguments:

SYSCS_UTIL.SYSCS_UPDATE_STATISTICS(SCHEMANAME, TABLENAME, INDEXNAME)

Shouldn't this be changed to something like the following?

SYSCS_UTIL.SYSCS_UPDATE_STATISTICS(IN SCHEMANAME VARCHAR(128), IN TABLENAME VARCHAR(128), IN INDEXNAME VARCHAR(128))

Compare http://db.apache.org/derby/docs/dev/ref/rrefimportproc.html and other procedure syntax statements.

Thanks for any comments.

Kathey Marsden added a comment - 08/Apr/09 11:22 PM
Thanks Kim for picking up this issue. I think you are right about the types being missing and the types as you describe them seem correct to me.

The changes look good, but if you are making another round anyway, I think it would be good to preface the following statement:

It is a good idea to call the SYSCS_UTIL.SYSCS_UPDATE_STATISTICS procedure when the number of distinct values in an index is likely to have changed significantly.

with:

Once statistics have been created, they cannot be dropped and should be maintained.

Kim Haase added a comment - 09/Apr/09 01:40 PM
Thanks very much, Kathey! I've attached a second patch (DERBY-4121-2.diff and DERBY-4121-2.zip) with the changes you approve and suggest to the Reference Manual topic.

Kathey Marsden added a comment - 09/Apr/09 11:00 PM
Changes look good Kim. +1 to commit.

Kim Haase added a comment - 10/Apr/09 04:14 PM
Thanks very much, Kathey.

Committed patch DERBY-4121-2.diff to documentation trunk at revision 763965.
Merged to 10.5 branch at revision 763968.

Kim Haase added a comment - 13/Apr/09 01:52 PM
Closing, since fixes now appear in Latest Alpha Manuals.