Description
At now all tuple versions are placed inside index trees. CacheDataTree is used to link versions each to other (using their order inside an index page).
Despite the fact that this approach is easy to implement and preferable at the first point, it brings several disadvantages:
1) We need to iterate over tuple versions at update time under a read (or even write) lock on an index page which blocks other write (read) operations for a relatively long period of time.
2) Write amplification suffers not only Data Store layer, but indexes as well, which makes read/lookup ops into indexes much slower.
3) We cannot implement several important improvements (data streamer optimizations) because having several versions of one key in an index page doesn't allow using of Invoke operations.
Using versions linking at the Data Store only (like it do other vendors) solves or decreases impact of that issues.
So, the proposed changes:
1) Change data page layout adding two fields into its header: link (a link to the next tuple in a versions chain) and lock (a tx, which holds a write lock on the HEAD of the chain) There are several possible optimizations: 1) leave lock as is (in the cache index item) 2) use max version as lock version as well
2) Do not save all versions of a tuple in indexes; this mean removing version from key - newest version will overwrite an existing entry
There are two approaches with some pros and cons of how to link versions:
1) N2O (newer to older) - a reader (writer) gets the newest tuple version first and iterates over tuple versions from newer to older until it gets a position where it's snapshot placed between min and max versions of the examined tuple. Approach implies faster reads (more actual versions are get first) and necessity of updating all involved indexes on each write operation - slower writes in other words (may be optimized using logical pointers to the head of tuple versions chain). Cooperative VAC (update operations remove invisible for all readers tuple versions) is possible.
2) O2N (older to newer) - a reader gets the oldest visible tuple version and iterates over versions until it gets visible version. It allows not to update all indexes (except the case when an index value is changed), write operations become lighter. Cooperative VAC almost impossible.
We need to decide which approach to use depending on that load profile is preferable (OLTP/OLAP)
Attachments
Issue Links
- is duplicated by
-
IGNITE-9592 MVCC: Use linked lists to store multiple versions.
- Closed