  1. Apache Sedona
  2. SEDONA-227

Python SerDe Performance Degradation

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.4.0

    Description

      With the new geometry serde in Sedona, there appears to be a fairly significant performance regression on the Python side. The PR's author acknowledged a regression in the PR, so some slowdown is expected. However, my trials show a regression that is sometimes far higher than the 2x noted in the PR.

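      For serialization, I'm seeing points and short linestrings slow down by a factor of roughly 2 (as expected). Unfortunately, small polygons are slowing down by a factor of about 7, while long linestrings and large polygons are slowing down by a factor of 11-12. The factor here is the relative increase over Shapely, (Sedona - Shapely) / Shapely, so a factor of 2 means three times the runtime.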

      The news isn't all bad though. For me, short linestrings are consistently deserializing faster (about 25-30% faster) and points are deserializing at roughly the same rate as before. Small polygons deserialize about 60% slower, which is a much milder regression than for serialization. Long linestrings and large polygons show deserialization factors of about 13 and 9, more or less in line with the serialization results.

      To test this, I'm strictly comparing Sedona's new serialize and deserialize functions against Shapely's WKB dumps and loads functions. Below you will find my most recent results (which have been fairly consistent) as well as the Python code I used to generate them. I'm very open to critiques of my approach to measuring performance, and hope that some of this performance loss is due to my own error.

      Serialization results:

      short line serialize trial:
              Total Time (seconds):
                      Shapely: 1.7364926
                      Sedona: 5.4626863
                      Factor: 2.145816054730092        
              Average Time (nanoseconds):
                      Shapely: 8682.463
                      Sedona: 27313.4315
                      Factor: 2.145816054730092
      
      long line serialize trial:
              Total Time (seconds):
                      Shapely: 4.0879395
                      Sedona: 50.1508444
                      Factor: 11.268000639441949
              Average Time (nanoseconds):
                      Shapely: 40879.395
                      Sedona: 501508.444
                      Factor: 11.268000639441949
      
      point serialize trial:
              Total Time (seconds):
                      Shapely: 4.7864782
                      Sedona: 13.0319586
                      Factor: 1.7226612251153677
              Average Time (nanoseconds):
                      Shapely: 9572.9564
                      Sedona: 26063.9172
                      Factor: 1.7226612251153677
      
      small polygon serialize trial:
              Total Time (seconds):
                      Shapely: 1.8339082
                      Sedona: 14.9376628
                      Factor: 7.145262014750793
              Average Time (nanoseconds):
                      Shapely: 9169.541
                      Sedona: 74688.314
                      Factor: 7.145262014750793
      
      large polygon serialize trial:
              Total Time (seconds):
                      Shapely: 2.3705298
                      Sedona: 30.4154897
                      Factor: 11.830671734225826
              Average Time (nanoseconds):
                      Shapely: 23705.298
                      Sedona: 304154.897
                      Factor: 11.830671734225826 

      Deserialization results:

      short line deserialize trial:
              Total Time (seconds):
                      Shapely: 2.5166469
                      Sedona: 1.7909991
                      Factor: -0.28833913887562057
              Average Time (nanoseconds):
                      Shapely: 12583.2345
                      Sedona: 8954.9955
                      Factor: -0.28833913887562057
      
      long line deserialize trial:
              Total Time (seconds):
                      Shapely: 3.1818201
                      Sedona: 45.1792348
                      Factor: 13.199179519923204
              Average Time (nanoseconds):
                      Shapely: 31818.201
                      Sedona: 451792.348
                      Factor: 13.199179519923204
      
      point deserialize trial:
              Total Time (seconds):
                      Shapely: 5.7874722
                      Sedona: 5.3168965
                      Factor: -0.08130936680784402
              Average Time (nanoseconds):
                      Shapely: 11574.9444
                      Sedona: 10633.793
                      Factor: -0.08130936680784402
      
      small polygon deserialize trial:
              Total Time (seconds):
                      Shapely: 2.5079775
                      Sedona: 4.0216245
                      Factor: 0.6035329264317563
              Average Time (nanoseconds):
                      Shapely: 12539.8875
                      Sedona: 20108.1225
                      Factor: 0.6035329264317563
      
      large polygon deserialize trial:
              Total Time (seconds):
                      Shapely: 1.9952702
                      Sedona: 19.909025
                      Factor: 8.978109731704508
              Average Time (nanoseconds):
                      Shapely: 19952.702
                      Sedona: 199090.25
                      Factor: 8.978109731704508 
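
      A note on reading the Factor lines above: the benchmark script prints (sedona_time - shapely_time) / shapely_time, which is a relative increase rather than a ratio. The sketch below recomputes both quantities from the short line serialize totals reported above. The helper names relative_increase and ratio are mine for illustration and are not part of Sedona or Shapely.

```python
def relative_increase(baseline: float, candidate: float) -> float:
    """Same formula the benchmark script prints as "Factor"."""
    return (candidate - baseline) / baseline


def ratio(baseline: float, candidate: float) -> float:
    """How many times the baseline runtime the candidate takes."""
    return candidate / baseline


# Totals in seconds, copied from the "short line serialize trial" above.
shapely_s = 1.7364926
sedona_s = 5.4626863

print(f"Factor (relative increase): {relative_increase(shapely_s, sedona_s):.3f}")
print(f"Ratio (times the runtime):  {ratio(shapely_s, sedona_s):.3f}")
```

      The ratio is always the Factor plus one, so a Factor near 2.15 corresponds to a little over three times Shapely's runtime, and the negative deserialization Factors mean Sedona was faster in those trials.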

      Python code used to generate results:

      from sedona.utils.geometry_serde import serialize, deserialize
      from shapely.geometry import LineString, Point, Polygon
      from shapely.wkb import dumps, loads
      
      import time
      
      def run_serialize_trial(geom, number_iterations, name):
          """Time Shapely's WKB dumps against Sedona's serialize for one geometry."""
          print(f"{name} serialize trial:")
      
          # Baseline: Shapely WKB serialization.
          start_time = time.perf_counter_ns()
          for _ in range(number_iterations):
              dumps(geom)
          shapely_time = time.perf_counter_ns() - start_time
      
          # Candidate: Sedona's geometry serde.
          start_time = time.perf_counter_ns()
          for _ in range(number_iterations):
              serialize(geom)
          sedona_time = time.perf_counter_ns() - start_time
      
          # Factor is the relative increase over Shapely: 0 is parity, negative means Sedona is faster.
          factor = (sedona_time - shapely_time) / shapely_time
          print("\tTotal Time (seconds):")
          print(f"\t\tShapely: {shapely_time / 1e9}\n\t\tSedona: {sedona_time / 1e9}\n\t\tFactor: {factor}\n")
          print("\tAverage Time (nanoseconds):")
          print(f"\t\tShapely: {shapely_time / number_iterations}\n\t\tSedona: {sedona_time / number_iterations}\n\t\tFactor: {factor}\n")
      
      def run_deserialize_trial(geom, number_iterations, name):
          """Time Shapely's WKB loads against Sedona's deserialize for one geometry."""
          print(f"{name} deserialize trial:")
      
          # Serialize once up front so only deserialization is timed.
          shapely_serialized_geom = dumps(geom)
          sedona_serialized_geom = serialize(geom)
      
          # Baseline: Shapely WKB deserialization.
          start_time = time.perf_counter_ns()
          for _ in range(number_iterations):
              loads(shapely_serialized_geom)
          shapely_time = time.perf_counter_ns() - start_time
      
          # Candidate: Sedona's geometry serde.
          start_time = time.perf_counter_ns()
          for _ in range(number_iterations):
              deserialize(sedona_serialized_geom)
          sedona_time = time.perf_counter_ns() - start_time
      
          # Factor is the relative increase over Shapely: 0 is parity, negative means Sedona is faster.
          factor = (sedona_time - shapely_time) / shapely_time
          print("\tTotal Time (seconds):")
          print(f"\t\tShapely: {shapely_time / 1e9}\n\t\tSedona: {sedona_time / 1e9}\n\t\tFactor: {factor}\n")
          print("\tAverage Time (nanoseconds):")
          print(f"\t\tShapely: {shapely_time / number_iterations}\n\t\tSedona: {sedona_time / number_iterations}\n\t\tFactor: {factor}\n")
      
      short_line_iterations = 200_000
      short_line = LineString([(10.0, 10.0), (20.0, 20.0)])
      
      long_line_iterations = 100_000
      long_line = LineString([(float(n), float(n)) for n in range(1000)])
      
      point_iterations = 500_000
      point = Point(12.3, 45.6)
      
      small_polygon_iterations = 200_000
      small_polygon = Polygon([(10.0, 10.0), (20.0, 10.0), (20.0, 20.0), (10.0, 20.0), (10.0, 10.0)])
      
      large_polygon_iterations = 100_000
      large_polygon = Polygon(
          [(0.0, float(n * 10)) for n in range(100)]
          + [(float(n * 10), 990.0) for n in range(100)]
          + [(990.0, float(n * 10)) for n in reversed(range(100))]
          + [(float(n * 10), 0.0) for n in reversed(range(100))]
      )
      
      run_serialize_trial(short_line, short_line_iterations, "short line")
      run_serialize_trial(long_line, long_line_iterations, "long line")
      run_serialize_trial(point, point_iterations, "point")
      run_serialize_trial(small_polygon, small_polygon_iterations, "small polygon")
      run_serialize_trial(large_polygon, large_polygon_iterations, "large polygon")
      
      run_deserialize_trial(short_line, short_line_iterations, "short line")
      run_deserialize_trial(long_line, long_line_iterations, "long line")
      run_deserialize_trial(point, point_iterations, "point")
      run_deserialize_trial(small_polygon, small_polygon_iterations, "small polygon")
      run_deserialize_trial(large_polygon, large_polygon_iterations, "large polygon")
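
      Since the description invites critiques of the measurement approach, here is one possible refinement, offered only as a sketch. The script above times each side once in a single loop, so a burst of background activity during either loop can skew the Factor. Repeating each timing with timeit.repeat and keeping the minimum is a common way to damp that noise. The helper names per_call_ns and compare are hypothetical, and the pickle and json calls are stand-in workloads so the sketch runs without Sedona or Shapely installed. With the script above, one would presumably pass lambda: dumps(geom) and lambda: serialize(geom) instead.

```python
import json
import pickle
import timeit
from typing import Callable, Dict


def per_call_ns(func: Callable[[], object], number: int, repeat: int) -> float:
    """Best-of-`repeat` time per call, in nanoseconds."""
    # timeit.repeat returns one total (in seconds) per repetition;
    # the minimum is the run least disturbed by other activity.
    totals = timeit.repeat(func, number=number, repeat=repeat)
    return min(totals) / number * 1e9


def compare(
    baseline: Callable[[], object],
    candidate: Callable[[], object],
    number: int = 10_000,
    repeat: int = 5,
) -> Dict[str, float]:
    """Time both callables and report the same relative-increase Factor."""
    base_ns = per_call_ns(baseline, number, repeat)
    cand_ns = per_call_ns(candidate, number, repeat)
    return {
        "baseline_ns": base_ns,
        "candidate_ns": cand_ns,
        "factor": (cand_ns - base_ns) / base_ns,
    }


# Stand-in workloads (assumption: not the Sedona/Shapely calls themselves).
coords = [(float(n), float(n)) for n in range(1000)]
result = compare(
    lambda: pickle.dumps(coords),
    lambda: json.dumps(coords),
    number=200,
    repeat=3,
)
print(result)
```

      Taking the minimum rather than the mean is a design choice: it estimates the cost of the code itself and discards slower runs as interference, at the price of hiding any genuine run-to-run variance.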

            People

              Assignee: Unassigned
              Reporter: Doug Dennis (dougdennis)
              Votes: 0
              Watchers: 3

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 5h 50m