Apache Sedona / SEDONA-205

Use BinaryType in GeometryUDT in Sedona Spark


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.4.0

    Description

      GeometryUDT currently uses ArrayType(ByteType()) as the serialized data type for geometries. An array type in Spark is an array of objects, not of primitive values: every byte is boxed into a Byte object, and the array stores a reference to that object. This adds significant overhead. The more specialized BinaryType is backed by an array of primitive bytes.
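
      The difference between the two representations can be sketched in plain Python, as a rough analogy only. This is not Sedona's or Spark's code. It relies on the fact that PySpark represents ByteType values as Python ints in the signed range -128..127 and BinaryType values as byte strings; the helper names and the payload are hypothetical, and the sizes printed are those of the Python containers, not of JVM objects.

```python
import sys


def to_signed_byte_list(raw: bytes) -> list:
    """Array-of-bytes form: one Python int per byte, in the signed range -128..127."""
    return [b - 256 if b > 127 else b for b in raw]


def from_signed_byte_list(values: list) -> bytes:
    """Inverse of to_signed_byte_list: map signed ints back to 0..255 and pack them."""
    return bytes(v & 0xFF for v in values)


# Hypothetical payload standing in for a serialized geometry (not real Sedona output).
payload = bytes(range(256)) * 4

as_list = to_signed_byte_list(payload)  # shape needed for ArrayType(ByteType())
as_binary = payload                     # shape needed for BinaryType

# Both forms carry the same information.
assert from_signed_byte_list(as_list) == payload

# The list stores one object reference per element; the bytes object stores raw bytes.
list_size = sys.getsizeof(as_list)
binary_size = sys.getsizeof(as_binary)
print(f"list container: {list_size} bytes, bytes object: {binary_size} bytes")
```

      The list form needs a per-element conversion in each direction and holds a reference per byte, whereas the binary form is passed through unchanged. That mirrors the overhead described above, although the actual cost on the JVM side would have to be measured in Spark itself.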

       

      I ran a quick benchmark that chained a number of ST_ functions, with no joins. With BinaryType, performance improved by roughly 30%.

       

      The old Apache commons-codec bundled with sernetcdf needs to be fixed first. Otherwise, Spark fails when calling encodeHexString(), as seen in https://github.com/apache/incubator-sedona/pull/704

            People

              Assignee: Unassigned
              Reporter: Martin Andersson (umartin)

                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 50m