Uploaded image for project: 'Apache Sedona'
  1. Apache Sedona
  2. SEDONA-231

Redundant Serde Removal

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • None
    • 1.4.0

    Description

      Currently, Geometry objects are deserialized and reserialized during every evaluation of a function on a row in Spark. This amounts to a great deal of redundant serde during query execution. At times, objects are serialized just to be immediately deserialized.

      To demonstrate this in action, I placed print statements in the GeometrySerializer serialize and deserialize methods, the GeometryUDT serialize and deserialize methods, and in the eval methods of several functions. When the following is executed:
       

      val columns = Seq("input", "blade")
      val data = Seq(
        ("GEOMETRYCOLLECTION ( LINESTRING (0 0, 1.5 1.5, 2 2), LINESTRING (3 3, 4.5 4.5, 5 5))", "MULTIPOINT (0.5 0.5, 1 1, 3.5 3.5, 4 4)")
      )
      var df = spark.createDataFrame(data).toDF(columns:_*)     
      println(
        df.selectExpr("ST_Normalize(ST_Split(ST_GeomFromWKT(input), ST_GeomFromWKT(blade))) AS result").collect()(0).get(0).asInstanceOf[Geometry].toText()
      )

      I get the following output:

       **org.apache.spark.sql.sedona_sql.expressions.ST_Normalize**
       **org.apache.spark.sql.sedona_sql.expressions.ST_Split**
       **org.apache.spark.sql.sedona_sql.expressions.ST_GeomFromWKT**
      Inside GeometrySerializer.serialize
      Inside GeometrySerializer.serialize
      Inside GeometrySerializer.serialize
      Inside GeometrySerializer.deserialize
       **org.apache.spark.sql.sedona_sql.expressions.ST_GeomFromWKT**
      Inside GeometrySerializer.serialize
      Inside GeometrySerializer.deserialize
      Inside GeometrySerializer.serialize
      Inside GeometrySerializer.deserialize
      Inside GeometrySerializer.serialize
      Inside UDT deserialize.
      Inside GeometrySerializer.deserialize
      MULTILINESTRING ((0 0, 0.5 0.5), (0.5 0.5, 1 1), (1 1, 1.5 1.5, 2 2), (3 3, 3.5 3.5), (3.5 3.5, 4 4), (4 4, 4.5 4.5, 5 5))

      To explain what is happening:

      1. ST_Normalize.eval is called.
      2. ST_Normalize.eval calls ST_Split.eval.
      3. ST_Split.eval first calls the ST_GeomFromWKT that had the GEOMETRYCOLLECTION wkt.
      4. ST_GeomFromWKT processes the wkt string and generates a Geometry object.
      5. The Geometry object is passed to GeometrySerializer.serialize. This is the first call to serialize.
      6. This object is a GeometryCollection and the GeometrySerializer uses recursion to handle them so you see two more additional calls to serialize.
      7. The GeometryCollection is then immediately deserialized and returned to ST_Split.
      8. The second ST_GeomFromWKT is called (this one has a MULTIPOINT wkt).
      9. ST_GeomFromWKT processes the WKT and then serializes the geometry.
      10. That geometry is immediately deserialized and returned to ST_Split.
      11. ST_Split performs its operation and then serializes the geometry.
      12. That geometry is then immediately deserialized and returned to ST_Normalize.
      13. ST_Normalize normalizes the geometry object and then serializes it for good.
      14. Then the GeometryUDT.deserialize is called to handle the collect call which of course calls GeometrySerializer.deserialize.

      There are multiple instances here where geometry objects are serialized and then immediately deserialized to be further operated on. That is obviously pretty wasteful.

       

      I propose eliminating this redundancy through the following steps.

      • Create a trait called SerdeAware which has a single method called evalWithoutSerialization.
      • This trait is then added to the InferredUnaryExpression, InferredBinaryExpression, InferredTernaryExpression, UnaryGeometryExpression, and BinaryGeometryExpression abstract classes.
      • When a Sedona expression is evaluating its children expressions, it first checks the child for the SerdeAware trait. If the trait is detected then the parent expression calls the child's evalWithoutSerialization method. This method returns an actual geometry object without the child having serialized it.

      In the test implementation I created I was able to get the following output:

       **org.apache.spark.sql.sedona_sql.expressions.ST_Normalize**
       **org.apache.spark.sql.sedona_sql.expressions.ST_Split**
       **org.apache.spark.sql.sedona_sql.expressions.ST_GeomFromWKT**
       **org.apache.spark.sql.sedona_sql.expressions.ST_GeomFromWKT**
      Inside GeometrySerializer.serialize
      Inside UDT deserialize.
      Inside GeometrySerializer.deserialize
      MULTILINESTRING ((0 0, 0.5 0.5), (0.5 0.5, 1 1), (1 1, 1.5 1.5, 2 2), (3 3, 3.5 3.5), (3.5 3.5, 4 4), (4 4, 4.5 4.5, 5 5))

      You can see that only a single serialization was called and only at the very end of the computation.
       

      Edit: I updated the proposed method with Adam's suggestion. I also extended the proposal to include the other expression types.

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              dougdennis Doug Dennis
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 1h 50m
                  1h 50m