Details
Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Description
The KiWi triple store currently generates unique IDs for nodes and triples using a kind of sequence generator. This is generally very fast, but ensuring that the same object always gets the same ID requires a lot of synchronization (immediate commit for nodes, triple registry for triples). That synchronization has a considerable performance impact, particularly in clustered environments.
A much faster approach would be to compute the ID from the objects themselves, e.g. using an efficient hash function with good distribution. With a 64-bit hash, the collision probability becomes serious at around 2 billion objects (about 10%), so it might make sense to switch to 128-bit keys as well.
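As a rough illustration of the idea, the sketch below derives a 128-bit ID from a canonical string form of a node. Everything in it is an assumption made for illustration: the class name `ContentHashIds`, the method `id128`, and the use of the N-Triples-like string as input are hypothetical and not part of the KiWi API. It also uses SHA-256 truncated to 128 bits only because it ships with every JDK; a faster non-cryptographic hash would probably be the better fit for the "efficient" requirement above.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

public class ContentHashIds {

    /**
     * Derives a 128-bit ID (returned as two longs) from a canonical string
     * form of a node. Hypothetical helper, not an existing KiWi method.
     */
    static long[] id128(String canonicalForm) {
        try {
            // SHA-256 is a stand-in for "an efficient and good hashing function".
            MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
            byte[] digest = sha256.digest(canonicalForm.getBytes(StandardCharsets.UTF_8));
            // Keep only the first 16 of the 32 digest bytes.
            ByteBuffer buf = ByteBuffer.wrap(digest);
            return new long[] { buf.getLong(), buf.getLong() };
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        long[] first = id128("<http://example.org/alice>");
        long[] second = id128("<http://example.org/alice>");
        long[] other = id128("\"alice\"@en");

        // The same object yields the same ID without any shared state.
        System.out.println("same input, same ID: " + Arrays.equals(first, second));
        System.out.println("different input, same ID: " + Arrays.equals(first, other));
    }
}
```

Because the ID depends only on the object's content, two cluster members can compute it independently and still agree, which is the property the current approach has to enforce through synchronization.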
A good overview of collision probabilities is given in:
http://preshing.com/20110504/hash-collision-probabilities/
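The figures can be checked with the usual birthday approximation, p ≈ 1 - exp(-n(n-1) / (2 · 2^bits)). The sketch below is only a back-of-the-envelope estimate; the class and method names are made up for this example, and it assumes the hash output is uniformly distributed.

```java
public class CollisionEstimate {

    /**
     * Approximate probability that at least two of n objects share a hash,
     * assuming a uniformly distributed hash of the given bit width.
     */
    static double collisionProbability(double n, int bits) {
        double space = Math.pow(2.0, bits);
        double exponent = -n * (n - 1.0) / (2.0 * space);
        // -expm1(x) equals 1 - exp(x) but stays accurate for tiny probabilities.
        return -Math.expm1(exponent);
    }

    public static void main(String[] args) {
        double n = 2.0e9; // 2 billion objects
        System.out.printf("64-bit:  %.4f%n", collisionProbability(n, 64));
        System.out.printf("128-bit: %.3e%n", collisionProbability(n, 128));
    }
}
```

For 2 billion objects this gives roughly 0.10 with 64-bit keys, while with 128-bit keys the estimate drops to the order of 10^-21.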
Changes would affect the API for ID generation (IDGenerator) as well as the value factory. In addition, database inserts would need to ignore duplicate IDs, e.g. using triggers or a merge statement. Finally, we would need to rethink the behaviour of deleted/non-deleted triples.