Details
-
Improvement
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
None
Description
GeometryUDT currently uses ArrayType(ByteType()) as the serialized data type for geometries. The array type in Spark is an array of objects and not primitive types. Every byte is boxed into a Byte object and the object reference is stored in the array. This adds a significant overhead. The more specialized BinaryType is an array of primitive bytes.
I did a quick benchmark chaining a bunch of st-functions, no joins. With BinaryType the performance increased by roughly 30%.
The old Apache commons-codec bundled with sernetcdf needs to be fixed first. Otherwise Spark fails when calling encodeHexString() as seen in https://github.com/apache/incubator-sedona/pull/704
Attachments
Issue Links
- is blocked by
-
SEDONA-194 Merge org.datasyslab.sernetcdf into Sedona
- Resolved
- links to