[CASSANDRA-7447] New sstable format with support for columnar layout - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Normal
Resolution: Unresolved
Fix Version/s: 5.x
Component/s: Legacy/Local Write-Read Paths
Labels:
- performance
- storage

Description

Storage Format Proposal

C* has come a long way over the past few years, and unfortunately our storage format hasn't kept pace with the data models we are now encouraging people to utilise. This ticket proposes a collections of storage primitives that can be combined to serve these data models more optimally.

It would probably help to first state the data model at the most abstract level. We have a fixed three-tier structure: We have the partition key, the clustering columns, and the data columns. Each have their own characteristics and so require their own specialised treatment.

I should note that these changes will necessarily be delivered in stages, and that we will be making some assumptions about what the most useful features to support initially will be. Any features not supported will require sticking with the old format until we extend support to all C* functionality.

Partition Key

This really has two components: the partition, and the value. Although the partition is primarily used to distribute across nodes, it can also be used to optimise lookups for a given key within a node
Generally partitioning is by hash, and for the moment I want to focus this ticket on the assumption that this is the case
Given this, it makes sense to optimise our storage format to permit O(1) searching of a given partition. It may be possible to achieve this with little overhead based on the fact we store the hashes in order and know they are approximately randomly distributed, as this effectively forms an immutable contiguous split-ordered list (see Shalev/Shavit, or CASSANDRA-7282), so we only need to store an amount of data based on how imperfectly distributed the hashes are, or at worst a single value per block.
This should completely obviate the need for a separate key-cache, which will be relegated to supporting the old storage format only

Primary Key / Clustering Columns

Given we have a hierarchical data model, I propose the use of a cache-oblivious trie
The main advantage of the trie is that it is extremely compact and supports optimally efficient merges with other tries so that we can support more efficient reads when multiple sstables are touched
The trie will be preceded by a small amount of related data; the full partition key, a timestamp epoch (for offset-encoding timestamps) and any other partition level optimisation data, such as (potentially) a min/max timestamp to abort merges earlier
Initially I propose to limit the trie to byte-order comparable data types only (the number of which we can expand through translations of the important types that are not currently)
Crucially the trie will also encapsulate any range tombstones, so that these are merged early in the process and avoids re-iterating the same data
Results in true bidirectional streaming without having to read entire range into memory

Values

There are generally two approaches to storing rows of data: columnar, or row-oriented. The above two data structures can be combined with a value storage scheme that is based on either. However, given the current model we have of reading large 64Kb blocks for any read, I am inclined to focus on columnar support first, as this delivers order-of-magnitude benefits to those users with the correct workload, while for most workloads our 64Kb blocks are large enough to store row-oriented data in a column-oriented fashion without any performance degradation (I'm happy to consign very large row support to phase 2).

Since we will most likely target both behaviours eventually, I am currently inclined to suggest that static columns, sets and maps be targeted for a row-oriented release, as they don't naturally fit in a columnar layout without secondary heap-blocks. This may be easier than delivering heap-blocks also, as it keeps both implementations relatively clean. This is certainly open to debate, and I have no doubt there will be conflicting opinions here.

Focusing on our columnar layout, the goals are:

Support sparse and dense column storage
Efficient compression of tombstones, timestamps, ttls, etc. into near-zero space based on offset/delta/bitmap encoding
Normalisation of column names once per file
Per-file block-layout index, defining how each block's data is encoded, so we can index directly within a block for dense fields (permitting more efficient page cache utilisation)
Configurable grouping of fields per block, so that we can get closer to row-oriented or column-oriented performance, based on user goals

I have attached my slides from the ngcc for reference.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

storage_format.pdf
25/Jan/15 19:22
359 kB
Benedict Elliott Smith
ngcc-storage.odp
25/Jun/14 00:01
705 kB
Benedict Elliott Smith

Issue Links

relates to

CASSANDRA-5019 Still too much object allocation on reads

Resolved

CASSANDRA-7443 Extend Descriptor to include a format value and refactor reader/writer apis

Resolved

CASSANDRA-6709 Changes to KeyCache

Resolved

supercedes

CASSANDRA-6810 SSTable and Index Layout Improvements/Modifications

Resolved

CASSANDRA-6806 In AtomicBTreeColumns, construct list of unwinds after a race lazily

Resolved

CASSANDRA-6830 Changes to SSTable Index file

Resolved

(1 supercedes)

New sstable format with support for columnar layout

Details

Description

Storage Format Proposal

Partition Key

Primary Key / Clustering Columns

Values

Attachments

Attachments

Issue Links

Activity

People

Dates