Issue Details (XML | Word | Printable)

Key: DERBY-3981
Type: Improvement Improvement
Status: Closed Closed
Resolution: Fixed
Priority: Minor Minor
Assignee: Knut Anders Hatlen
Reporter: Knut Anders Hatlen
Votes: 0
Watchers: 0
Operations

If you were logged in you would be able to see more operations.
Derby

Improve distribution of hash codes in SQLBinary and SQLChar

Created: 09/Dec/08 11:52 AM   Updated: 30/Jun/09 04:06 PM
Component/s: SQL
Affects Version/s: 10.4.2.0
Fix Version/s: 10.5.1.1

Time Tracking:
Not Specified

File Attachments:
  Size
File Licensed for inclusion in ASF works d3981.diff 2009-01-02 11:55 AM Knut Anders Hatlen 3 kB
File Licensed for inclusion in ASF works distinct-test.diff 2009-01-02 10:14 AM Knut Anders Hatlen 7 kB

Issue & fix info: Newcomer
Bug behavior facts: Performance
Resolution Date: 06/Jan/09 12:52 PM


 Description  « Hide
SQLBinary.hashCode() and SQLChar.hashCode() use a very simple algorithm that just takes the sum of the values in the array. This gives a poor distribution of hash values because similar values will have a higher probability of mapping to the same hash code, and the higher bits won't be used unless the array is very long. We should change these methods so that they use an algorithm similar to the one used in java.lang.String.hashCode(), described here: <URL:http://java.sun.com/javase/6/docs/api/java/lang/String.html#hashCode()>. This may have a positive effect on the performance of hash scans as it will reduce the likelihood of collisions in the hash table.

 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Knut Anders Hatlen added a comment - 02/Jan/09 10:14 AM
Attaching a new JUnit performance test which tests SELECT DISTINCT performance. There is one test case which performs SELECT DISTINCT on a table with a non-indexed CHAR(5) column which contains all permutations of 'ABCDE'. This is the worst case for the current hash function, as all the values will end up in the same bucket. The other test case performs the same test on a table with a CHAR(5) FOR BIT DATA column with the same distribution.

Knut Anders Hatlen added a comment - 02/Jan/09 10:23 AM
Committed the performance test with revision 730689.

Knut Anders Hatlen added a comment - 02/Jan/09 11:55 AM
Here's a patch which makes SQLBinary.hashCode() and SQLChar.hashCode() calculate the hash codes in a way more similar to java.lang.String.hashCode(). In fact, SQLChar.hashCode() should now return the same hash code as getString().substring(0,lastNonPadChar+1).hashCode().

With these changes, the average time reported by SelectDistinctTest on my laptop went down from 6.6 seconds to 0.95 seconds on the CHAR(5) test, and from 2.7 seconds to 0.75 seconds on the CHAR(5) FOR BIT DATA test.

I haven't run the regression tests on the patch yet.

Knut Anders Hatlen added a comment - 02/Jan/09 08:24 PM
Derbyall and suites.All ran cleanly with the patch.

Kristian Waagan added a comment - 06/Jan/09 10:34 AM
The patch looks like a good improvement to me.
+1 to commit.

A few questions:
 1) The hashcode isn't saved. Is this because of the reuse of the SQLChar object?
 2) Is the hashCode method invoked as part of normal Derby usage for Clobs?
    I'm asking because SQLClob inherits the hashCode method of SQLChar, which causes the value to be materialized.
 3) Not a question, but we are overriding both equals and hashCode. For SQLChar I believe the relationship between the two methods is correct (i.e. two equal objects must have the same hashcode, pad characters are "ignored" in both methods).

Knut Anders Hatlen added a comment - 06/Jan/09 11:24 AM
Thanks for looking at the patch, Kristian. Please see my answers to
your questions below.

> 1) The hashcode isn't saved. Is this because of the reuse of the
> SQLChar object?

It should be possible to save the hash code in a field and reset it
when the SQLChar is reused. (Same goes for SQLBinary, I believe.) I
don't think caching it will have any effect on the performance of the
SELECT DISTINCT test, since hashCode() should only be called once on
each object when they are inserted into the hash table. There may be
other cases where hashCode() is called many times on the same object,
though, so it might have a positive effect for some queries. I didn't
change this since the old code also had to loop through the entire
underlying array, so the need for caching shouldn't be much changed by
the patch. Another thing to consider is that adding a field to a data
type class increases the object size as reported by
org.apache.derby.iapi.services.cache.ClassSize and makes it (slightly)
less likely that the optimizer chooses a hash join strategy.

> 2) Is the hashCode method invoked as part of normal Derby usage for
> Clobs? I'm asking because SQLClob inherits the hashCode method
> of SQLChar, which causes the value to be materialized.

No, I don't think so. When I looked at DERBY-3975 (see comment made on
17/Dec/08) I concluded that clob (and long varchar) columns couldn't
be compared without casting them to another data type (like char or
varchar) first, so hashCode(), equals() and compareTo() are not called
on SQLClob.

> 3) Not a question, but we are overriding both equals and
> hashCode. For SQLChar I believe the relationship between the two
> methods is correct (i.e. two equal objects must have the same
> hashcode, pad characters are "ignored" in both methods).

Right. And SQLBinary.equals() uses SQLBinary.compare() which also
ignores trailing pad bytes (0x20). Note however that the original
SQLBinary.hashCode() ignored all occurrences of pad bytes, whereas the
new implementation only ignores trailing pad bytes.

Knut Anders Hatlen added a comment - 06/Jan/09 12:52 PM
Committed revision 731929.