As a quick comment on the implementation, i notice that it is possible (with reasonable probability) for hash collisions to result in re-interning a pair of strings multiple times. For example, a codepath that traverses across 32 unique string datapoints (say, in an inner loop somewhere), would have a minimum 3% probability of colliding and re-interning 2 strings every time through the loop. Because of the birthday paradox, it becomes likely to have such a situation (50% probability with ~150 unique values).
These are the probabilities of collision, assuming random distribution and perfect hashing. In real life the distribution will not be so random (string.hashCode() & MASK) so these will be "best case".
2 datapoints: collision prob = 0.006104%
4 datapoints: collision prob = 0.036617%
8 datapoints: collision prob = 0.170779%
16 datapoints: collision prob = 0.729976%
32 datapoints: collision prob = 2.983863%
64 datapoints: collision prob = 11.591861%
128 datapoints: collision prob = 39.188158%
256 datapoints: collision prob = 86.501947%
Practically this may or may not matter. My thought is that some sort of fast LRU structure would be better, but of course creating something like this without locking may be tricky. Another idea might be to support some form of limited hash-chaining or probing in the table, which would mitigate the damage of a collision significantly.
for reference, python code for calculating birthday/hash collisions, shamelessly stolen:
def bp(n, d):
v = 1.0
for i in range:
v = v * (1 - float/d)
return 1 - v
for n in [2, 4, 8, 16, 32, 64, 128, 256]:
print "%i datapoints: collision prob = %f%%" % (n, bp(n, 16*1024)*100)