Details
Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Description
This suggestion was brought up by omalley during ApacheCon 2022 (https://www.apachecon.com/acna2022/).
When mutating or creating a large table, applications could see a large performance boost if they could bring in data from other existing objects, or from older versions of the same object, instead of rewriting it. In effect, the same copy of the data would be transparently addressable from multiple objects, and would be carried over when an object is updated.
This capability can take many forms from an implementation standpoint, but we must design the API surface for applications first.
To make progress, we need to work through the following:
- Identify the API surface that needs to be exposed so that applications such as Iceberg or ORC writers can leverage this feature. This could be done either by exposing the underlying blocks, or by abstracting the blocks away and addressing shared data only as ranges in a file to be sourced from other files (and their corresponding ranges, similar to a scatter-gather list).
- Should this be an extension of the vectored IO APIs?
- Is there a need to expose the layout of sharable content?
- Model the backend of the API and how Ozone will make it work. This needs to be reasoned about across both EC and replication.
- How would this be made available as an extension to the S3 APIs in addition to OFS?
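To make the range-based option above more concrete, here is a minimal sketch of what a scatter-gather style request might look like from the application side. Everything in it is hypothetical: `SourceRange`, `compose`, and the in-memory `store` are illustrative names only, not existing Ozone, Hadoop, or S3 APIs. A real backend would presumably link the existing blocks rather than copy bytes; the copy below only shows which content the new object would end up addressing.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class SourceRange:
    """One entry of the scatter-gather list (hypothetical type)."""
    source_key: str  # existing object to source data from
    offset: int      # start offset within the source object
    length: int      # number of bytes to source


def compose(store: Dict[str, bytes], ranges: List[SourceRange]) -> bytes:
    """Resolve a list of source ranges into the content of a new object.

    The store is a plain dict standing in for the object store. Bytes are
    copied here purely for illustration; sharing blocks is the actual goal.
    """
    parts = []
    for r in ranges:
        data = store[r.source_key]
        if r.offset < 0 or r.length < 0 or r.offset + r.length > len(data):
            raise ValueError(f"range out of bounds for {r.source_key!r}: {r}")
        parts.append(data[r.offset:r.offset + r.length])
    return b"".join(parts)


# An older version of a table file and a small file holding only new data.
store = {
    "table/v1": b"HEADER|row-group-1|row-group-2|FOOTER",
    "table/delta": b"row-group-3|",
}

# New version: reuse header and row groups from v1, append the new
# row group, then reuse the footer from v1.
new_object = compose(store, [
    SourceRange("table/v1", 0, 31),
    SourceRange("table/delta", 0, 12),
    SourceRange("table/v1", 31, 6),
])
```

In this model the application never sees blocks, only (object, offset, length) triples, which leaves open how the backend maps ranges onto replicated or EC block layouts.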
HDDS-7288 (https://issues.apache.org/jira/browse/HDDS-7288) is a duplicate of this one. This issue is filed to capture the full context of the discussion.