Details
Type: Task
Status: Resolved
Priority: Major
Resolution: Fixed
Description
https://github.com/apache/hudi/pull/2263#discussion_r533653930 has the context. When sorting is specified as part of clustering, we use the custom partitioner RDDCustomColumnsSortPartitioner, which deserializes the schema to get the values of the sort columns. Check whether it's possible to avoid this deserialization and implement the suggestion from the PR.
We tried another approach by adding SerializableSchema, but it does not work for nested schemas. See the failing test here. Fix this serialization and use it in RDDCustomColumnsSortPartitioner.
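For illustration only, here is a minimal sketch of the general wrapper pattern the name SerializableSchema suggests. It is not the code from the PR. FakeSchema is a hypothetical stand-in for a schema class that is not java.io.Serializable. With Avro one would presumably use schema.toString() and new Schema.Parser().parse(json) in its place. The wrapper ships the schema as its JSON text and re-parses it once per deserialized wrapper, so the parsed schema can be reused across records instead of being rebuilt for each one.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.charset.StandardCharsets;

final class SerializableSchemaSketch {

    // Hypothetical stand-in for a non-Serializable schema class (assumption, not Avro).
    static final class FakeSchema {
        private final String json;
        private FakeSchema(String json) { this.json = json; }
        static FakeSchema parse(String json) { return new FakeSchema(json); }
        @Override public String toString() { return json; }
    }

    // Wrapper that serializes the schema as UTF-8 JSON text and re-parses it on read.
    static final class SerializableSchema implements Serializable {
        private static final long serialVersionUID = 1L;
        private transient FakeSchema schema; // excluded from default serialization

        SerializableSchema(FakeSchema schema) { this.schema = schema; }
        FakeSchema get() { return schema; }

        private void writeObject(ObjectOutputStream out) throws IOException {
            out.defaultWriteObject();
            byte[] bytes = schema.toString().getBytes(StandardCharsets.UTF_8);
            out.writeInt(bytes.length); // length prefix avoids the 64 KB limit of writeUTF
            out.write(bytes);
        }

        private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
            in.defaultReadObject();
            byte[] bytes = new byte[in.readInt()];
            in.readFully(bytes);
            schema = FakeSchema.parse(new String(bytes, StandardCharsets.UTF_8));
        }
    }

    // Serializes and deserializes the wrapper in memory, as a shuffle or closure would.
    static SerializableSchema roundTrip(SerializableSchema original) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
                out.writeObject(original);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray()))) {
                return (SerializableSchema) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("round trip failed", e);
        }
    }

    public static void main(String[] args) {
        // A nested record, since nested schemas are the case called out above.
        String json = "{\"type\":\"record\",\"name\":\"Outer\",\"fields\":[{\"name\":\"inner\","
            + "\"type\":{\"type\":\"record\",\"name\":\"Inner\",\"fields\":[{\"name\":\"id\",\"type\":\"long\"}]}}]}";
        SerializableSchema copy = roundTrip(new SerializableSchema(FakeSchema.parse(json)));
        System.out.println(copy.get().toString().equals(json));
    }
}
```

Because the stand-in only carries the JSON text, this sketch cannot reproduce the nested-schema failure described above. It only shows the shape of the wrapper that a fix would need to preserve.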