[SEDONA-205] Use BinaryType in GeometryUDT in Sedona Spark - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.4.0
Labels:
- pull-request-available

Description

GeometryUDT currently uses ArrayType(ByteType()) as the serialized data type for geometries. An array type in Spark is an array of objects, not of primitive values: every byte is boxed into a Byte object, and the array stores the object reference. This adds significant overhead. The more specialized BinaryType is an array of primitive bytes.
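The difference can be illustrated outside Spark with a minimal JVM sketch. It is only an analogy and does not reproduce Spark's internal row format: the class name `BoxedVsPrimitiveBytes`, its helper methods, and the sample bytes are hypothetical. `Object[]` stands in for the ArrayType(ByteType()) representation described above, and `byte[]` for BinaryType.

```java
import java.util.Arrays;

// Hypothetical illustration, not Sedona or Spark code.
class BoxedVsPrimitiveBytes {

    // Analogue of ArrayType(ByteType()): each element is a reference to a Byte object.
    static Object[] toBoxed(byte[] raw) {
        Object[] boxed = new Object[raw.length];
        for (int i = 0; i < raw.length; i++) {
            boxed[i] = Byte.valueOf(raw[i]); // box every byte individually
        }
        return boxed;
    }

    // Converting back means unboxing every element again.
    static byte[] toPrimitive(Object[] boxed) {
        byte[] raw = new byte[boxed.length];
        for (int i = 0; i < boxed.length; i++) {
            raw[i] = (Byte) boxed[i]; // cast to Byte, then unbox to byte
        }
        return raw;
    }

    public static void main(String[] args) {
        // Arbitrary sample payload standing in for a serialized geometry.
        byte[] serialized = {0x01, 0x02, 0x03, 0x04, 0x05};

        Object[] boxed = toBoxed(serialized);
        byte[] restored = toPrimitive(boxed);

        System.out.println("element is a Byte object: " + (boxed[0] instanceof Byte));
        System.out.println("round trip preserves bytes: " + Arrays.equals(serialized, restored));
    }
}
```

With the primitive `byte[]` form, the payload can be handed to a serializer as-is, with no per-element conversion loop in either direction.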

I ran a quick benchmark that chains a number of ST_ functions, with no joins. With BinaryType, performance improved by roughly 30%.

The old Apache commons-codec version bundled with sernetcdf needs to be fixed first. Otherwise, Spark fails when calling encodeHexString(), as seen in https://github.com/apache/incubator-sedona/pull/704

Issue Links

is blocked by: SEDONA-194 Merge org.datasyslab.sernetcdf into Sedona (Resolved)

links to: GitHub Pull Request #734

People

Assignee: Unassigned
Reporter: Martin Andersson
Votes: 0
Watchers: 1

Dates

Created: 30/Nov/22 19:51
Updated: 13/Mar/23 21:38
Resolved: 04/Jan/23 22:37

Time Tracking

Estimated: Not Specified
Remaining:
Logged: 50m