|
[
Permalink
| « Hide
]
Knut Anders Hatlen added a comment - 02/Jan/09 10:14 AM
Attaching a new JUnit performance test which tests SELECT DISTINCT performance. There is one test case which performs SELECT DISTINCT on a table with a non-indexed CHAR(5) column which contains all permutations of 'ABCDE'. This is the worst case for the current hash function, as all the values will end up in the same bucket. The other test case performs the same test on a table with a CHAR(5) FOR BIT DATA column with the same distribution.
Committed the performance test with revision 730689.
Here's a patch which makes SQLBinary.hashCode() and SQLChar.hashCode() calculate the hash codes in a way more similar to java.lang.String.hashCode(). In fact, SQLChar.hashCode() should now return the same hash code as getString().substring(0,lastNonPadChar+1).hashCode().
With these changes, the average time reported by SelectDistinctTest on my laptop went down from 6.6 seconds to 0.95 seconds on the CHAR(5) test, and from 2.7 seconds to 0.75 seconds on the CHAR(5) FOR BIT DATA test. I haven't run the regression tests on the patch yet. Derbyall and suites.All ran cleanly with the patch.
The patch looks like a good improvement to me.
+1 to commit. A few questions: 1) The hashcode isn't saved. Is this because of the reuse of the SQLChar object? 2) Is the hashCode method invoked as part of normal Derby usage for Clobs? I'm asking because SQLClob inherits the hashCode method of SQLChar, which causes the value to be materialized. 3) Not a question, but we are overriding both equals and hashCode. For SQLChar I believe the relationship between the two methods is correct (i.e. two equal objects must have the same hashcode, pad characters are "ignored" in both methods). Thanks for looking at the patch, Kristian. Please see my answers to
your questions below. > 1) The hashcode isn't saved. Is this because of the reuse of the > SQLChar object? It should be possible to save the hash code in a field and reset it when the SQLChar is reused. (Same goes for SQLBinary, I believe.) I don't think caching it will have any effect on the performance of the SELECT DISTINCT test, since hashCode() should only be called once on each object when they are inserted into the hash table. There may be other cases where hashCode() is called many times on the same object, though, so it might have a positive effect for some queries. I didn't change this since the old code also had to loop through the entire underlying array, so the need for caching shouldn't be much changed by the patch. Another thing to consider is that adding a field to a data type class increases the object size as reported by org.apache.derby.iapi.services.cache.ClassSize and makes it (slightly) less likely that the optimizer chooses a hash join strategy. > 2) Is the hashCode method invoked as part of normal Derby usage for > Clobs? I'm asking because SQLClob inherits the hashCode method > of SQLChar, which causes the value to be materialized. No, I don't think so. When I looked at 17/Dec/08) I concluded that clob (and long varchar) columns couldn't be compared without casting them to another data type (like char or varchar) first, so hashCode(), equals() and compareTo() are not called on SQLClob. > 3) Not a question, but we are overriding both equals and > hashCode. For SQLChar I believe the relationship between the two > methods is correct (i.e. two equal objects must have the same > hashcode, pad characters are "ignored" in both methods). Right. And SQLBinary.equals() uses SQLBinary.compare() which also ignores trailing pad bytes (0x20). Note however that the original SQLBinary.hashCode() ignored all occurrences of pad bytes, whereas the new implementation only ignores trailing pad bytes. Committed revision 731929.
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||