[KUDU-2263] Consider removing PB descriptors from PBC header - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 1.7.0
Fix Version/s: None
Component/s: util
Labels: None
Target Version/s: 1.8.0

Description

Looking at a cmeta file on disk, it seems the vast majority of the bytes are in the supplemental header. We currently serialize the entire descriptor set of the referenced .proto file and its dependencies. This means that each cmeta file ends up carrying even things like the definition of SchemaPB, which is quite large and is not needed to serialize the type at hand.

At a minimum, we can prune the serialized descriptors to include only those that are transitively referenced by the PB type stored in the file. I think we should also consider doing away with this information entirely and instead allowing 'kudu pbc dump' to take a descriptor set as external input. It's easy enough to generate a descriptor set from the source tree of any Kudu version using the protoc command line.
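To make the pruning idea concrete, here is a minimal sketch of computing the set of types transitively referenced by a root type. It works on a plain dictionary rather than real protobuf descriptors, and the type names and edges in the example graph are illustrative assumptions, not the actual Kudu message definitions. The helper name `transitively_referenced` is hypothetical.

```python
from collections import deque


def transitively_referenced(deps, root):
    """Return the set of type names reachable from `root`.

    `deps` maps a type name to the type names its fields reference.
    Types absent from `deps` are treated as having no references.
    """
    seen = {root}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for referenced in deps.get(current, ()):
            if referenced not in seen:
                seen.add(referenced)
                queue.append(referenced)
    return seen


# Illustrative graph only (assumed edges, not taken from the Kudu sources).
example_deps = {
    "ConsensusMetadataPB": ["RaftConfigPB"],
    "RaftConfigPB": ["RaftPeerPB"],
    "RaftPeerPB": ["HostPortPB"],
    "SchemaPB": ["ColumnSchemaPB"],  # unrelated to the root type below
}

needed = transitively_referenced(example_deps, "ConsensusMetadataPB")
pruned = sorted(t for t in example_deps if t not in needed)
```

In this toy graph, `SchemaPB` lands in `pruned` because nothing reachable from the root refers to it, which is the kind of saving described above.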

One potential major improvement, if we can get these files down to under 4 KB, is that we could atomically rewrite them in a single disk IO using O_DIRECT rather than doing a rewrite-rename-fsync dance.
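As a rough sketch of what a single-block rewrite could look like on Linux: the payload is padded to one block and written with a single `pwrite` on a file descriptor opened with `O_DIRECT`. The 4096-byte block size, the zero padding, and the function names are assumptions made for illustration. A real format would also need to record the payload length, and whether such a write is actually atomic depends on the device and filesystem.

```python
import mmap
import os

BLOCK_SIZE = 4096  # assumed block size, taken from the "<4kb" target above


def pad_to_block(payload: bytes, block_size: int = BLOCK_SIZE) -> bytes:
    """Zero-pad `payload` to exactly one block; reject oversized payloads."""
    if len(payload) > block_size:
        raise ValueError("payload does not fit in a single block")
    return payload + b"\x00" * (block_size - len(payload))


def write_block_direct(path: str, payload: bytes) -> None:
    """Overwrite the first block of `path` with one O_DIRECT write (Linux only)."""
    block = pad_to_block(payload)
    # O_DIRECT requires an aligned buffer; an anonymous mmap is page-aligned.
    buf = mmap.mmap(-1, BLOCK_SIZE)
    buf.write(block)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o644)
    try:
        written = os.pwrite(fd, buf, 0)
        if written != BLOCK_SIZE:
            raise OSError("short write")
    finally:
        os.close(fd)
        buf.close()
```

Some filesystems (tmpfs on older kernels, for example) reject `O_DIRECT` at open time, so a caller would need a fallback to the existing rewrite-and-rename path.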

People

Assignee: Unassigned

Reporter: Todd Lipcon

Votes: 0

Watchers: 3

Dates

Created: 17/Jan/18 07:11

Updated: 22/Feb/18 20:44