|
cardinality statistics only track's the average number of duplicate values per value in an index. So it is unnecessary to update this if that value is not changing. Derby uses the indexes directly to
track the distribution of values in given ranges and update statistics has no affect on this. So it is not necessary to update as the number of rows grows. The system has a zero admin way to track the number of rows in the system. This method is an estimate and a proposed change to update statistics will improve this estimate, but don't want users to think that they have to update statistics to get valid row counts. My understanding of the code is that we store the number of unique values, not the average number of duplicates. The selectivity is calculated this way (from StatisticsImpl.java):
public double selectivity(Object[] predicates) { if (numRows == 0.0) return 0.1; /* xxxSTATresolve: for small values of numRows, should we do something * special? */ return (double)(1/(double)numUnique); } If the number of rows is doubled, and the average number of duplicates per value doesn't change, this method will return a selectivity that is two times as high as it should be if we don't update the statistics. Based on Knut's comment. is it correct then to say that SYSCS_UTIL.SYSCS_UPDATE_STATISTICS should be run if there is a significant change in the number of rows or the number of unique values in an index?
That sounds correct to me.
Actually given the code that knut posted it seems to only matter if you change the number of unique values, or go from 0 to a non-zero number of rows. I knew the stat code remembered
both numRows and numUnique - I didn't realize that we almost did nothing with numRows. Does anyone else think that the current way this number used is bad for a zero admin db. It seems like we should be giving back very different information is we run the stat gatherer and we find 1000 unique values and there were 1000 unique rows vs. 1000 unique values and 1,000,000 unique rows. If the assumption is that in general we are not going to require user intervention to gather stats what is a better guess : 1) as number of rows changes we won't change number of unique values 2) as number of rows changes the number of unique values will stay in the same proportion Obviously either assumption can be made wrong for a specific application, but what would a good default assumption be. It seems option 1 is the current default. I guess part of this question is what the resolve comment is referring to as this is a much bigger deal the smaller the sample size and thus the bigger the selectivity. So given the current behavior and assuming we won't change it for the release, what is the best advice to offer users regarding maintaining statistics?
To Kathey's question, I think the most accurate advice is: "SYSCS_UTIL.SYSCS_UPDATE_STATISTICS should be invoked if the number of distinct values in the index has changed significantly."
To Mike's question, I think that (2) will be a better guess in more cases than (1). Of course, if you have a limited number of possible/likely values, the average number of duplicates will increase as the number of rows increases, but as long as there's a relatively high number of possible values (2) is more likely to hold. For columns like gender, city and state, I'd assume that (1) in many cases is a much better guess than (2), though.
Looks like this is ready to document, so I may as well pick it up. I have attached
M src/tuning/ctunstats849505.dita M src/ref/rrefupdatestatsproc.dita The Tuning change adds a paragraph to the suggested topic on statistics. The Ref change corrects the two examples; includes the information about when to call the procedure; and inserts the conrefs.dita phrase for Derby. I have thought of one more item for the Reference topic, though. The syntax has no indication of the data types of the three arguments: SYSCS_UTIL.SYSCS_UPDATE_STATISTICS(SCHEMANAME, TABLENAME, INDEXNAME) Shouldn't this be changed to something like the following? SYSCS_UTIL.SYSCS_UPDATE_STATISTICS(IN SCHEMANAME VARCHAR(128), IN TABLENAME VARCHAR(128), IN INDEXNAME VARCHAR(128)) Compare http://db.apache.org/derby/docs/dev/ref/rrefimportproc.html and other procedure syntax statements. Thanks for any comments. Thanks Kim for picking up this issue. I think you are right about the types being missing and the types as you describe them seem correct to me.
The changes look good, but if you are making another round anyway, I think it would be good to preface the following statement: It is a good idea to call the SYSCS_UTIL.SYSCS_UPDATE_STATISTICS procedure when the number of distinct values in an index is likely to have changed significantly. with: Once statistics have been created, they cannot be dropped and should be maintained. Thanks very much, Kathey! I've attached a second patch (
Changes look good Kim. +1 to commit.
Thanks very much, Kathey.
Committed patch Merged to 10.5 branch at revision 763968. |
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
As soon as my testing is complete I will post any additional comments here. I should be done by the end of this week.