Details
Type: Improvement
Status: Closed
Priority: Major
Resolution: Information Provided
Description
Today, we auto-generate string keys in the format produced by HoodieRecord#generateSequenceId. These keys are highly compressible, especially compared to UUIDv1, when stored as a string column inside a Parquet file.
public static String generateSequenceId(String instantTime, int partitionId, long recordIndex) {
  return instantTime + "_" + partitionId + "_" + recordIndex;
}
As part of this task, we'd love to understand:
- Can UUIDv6 or UUIDv7 provide a similar compressed storage footprint when written as a column in a Parquet file?
- Can the current format be represented as a 160-bit number, i.e. 2 longs and 1 int, in storage? Would that save us further on storage costs?
(An orthogonal consideration is the memory needed to hold the key string, which can be higher than 160 bits. We can discuss this later, once we understand the storage footprint.)
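To make the 160-bit question concrete, here is a minimal sketch of packing the three key components into 20 bytes (2 longs and 1 int). The class name PackedSequenceId and its pack/unpack methods are hypothetical and not part of Hudi. The sketch assumes instantTime is a purely numeric string without leading zeros that fits in a signed long; otherwise the round trip would not reproduce the original key.

```java
import java.nio.ByteBuffer;

// Hypothetical helper, not an existing Hudi class.
final class PackedSequenceId {
    // 8 bytes (instant time) + 4 bytes (partition id) + 8 bytes (record index) = 20 bytes = 160 bits
    static final int PACKED_SIZE = Long.BYTES + Integer.BYTES + Long.BYTES;

    // Assumption: instantTime is numeric and fits in a signed long.
    static byte[] pack(String instantTime, int partitionId, long recordIndex) {
        return ByteBuffer.allocate(PACKED_SIZE)
                .putLong(Long.parseLong(instantTime))
                .putInt(partitionId)
                .putLong(recordIndex)
                .array();
    }

    // Rebuilds the string form that generateSequenceId would have produced.
    static String unpack(byte[] packed) {
        ByteBuffer buf = ByteBuffer.wrap(packed);
        long instantTime = buf.getLong();
        int partitionId = buf.getInt();
        long recordIndex = buf.getLong();
        return instantTime + "_" + partitionId + "_" + recordIndex;
    }
}
```

Such a value could be written as a fixed-length 20-byte binary column, or as three separate numeric columns. Whether either layout actually compresses better than the current string column is exactly what this task needs to measure.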
Resources: