Apache Sedona / SEDONA-318

SerDe for RasterUDT performs poorly

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.5.0

    Description

      The SerDe for RasterUDT is barely usable. This is not a big problem for simple queries such as RS_Envelope(RS_FromGeoTiff(content)), because the serde-aware expressions eliminate all serialization. However, we run into problems with queries that do serialize rasters, such as a join:

      df_geotiff.alias("a").join(df_geotiff2.alias("b"), col("a.id") == col("b.id")).show()
      

      Or simply collect a raster dataset:

      dfGeoTiff.collect()
      

      Each time we run such a query, the executors spawn several new threads. When processing large raster datasets, the job may hang or raise strange exceptions. The attached screenshot (image-2023-07-05-23-06-50-328.png) shows a thread dump captured in the Spark UI after running several such queries.
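
The thread growth can also be checked without the Spark UI. The following is a minimal sketch, assuming it runs inside the JVM being inspected; the class name ThreadCensus is hypothetical, and the grouping rule (strip a trailing number from the thread name) is only a convenience, not something the issue prescribes.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration; not part of Sedona, Spark, or JAI.
class ThreadCensus {
    // Counts live threads grouped by name, with any trailing number
    // (and the separator before it) stripped, so "worker-1" and
    // "worker-2" are both counted under "worker".
    static Map<String, Integer> countByNamePrefix() {
        Map<String, Integer> counts = new TreeMap<>();
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            String prefix = t.getName().replaceAll("[-#\\s]*\\d+$", "");
            counts.merge(prefix, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Print one line per thread group; call this before and after
        // running a query and compare the counts.
        countByNamePrefix().forEach((name, n) -> System.out.println(n + "\t" + name));
    }
}
```

Comparing the output before and after a query shows which thread groups keep growing.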

      These threads were created by SerializableRenderedImage. When a SerializableRenderedImage object is serialized, it launches a TCP server in a newly spawned thread, and the deserialized copy connects to that server to fetch the raster data. This avoids copying the raster data when serializing the GridCoverage2D object, but it is the worst way to implement raster serialization when we have to process a large number of rasters in batches.
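
To make the cost of this pattern concrete, here is a minimal sketch of a "serve the data over TCP" serializer built only on the JDK. It is not JAI's code: the class SocketBackedPayload, its thread names, and the wire layout are all hypothetical, and it serves a plain byte array instead of raster tiles.

```java
import java.io.*;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical class that mimics the pattern described above; not JAI code.
class SocketBackedPayload implements Serializable {
    private static final long serialVersionUID = 1L;
    private transient byte[] data; // never written to the object stream

    SocketBackedPayload(byte[] data) { this.data = data; }

    byte[] data() { return data; }

    // Serialization writes only a port number and starts a server thread
    // that waits for one client and sends it the data.
    private void writeObject(ObjectOutputStream out) throws IOException {
        out.defaultWriteObject();
        ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
        byte[] snapshot = data;
        Thread t = new Thread(() -> {
            try (ServerSocket s = server;
                 Socket client = s.accept(); // blocks until a copy is deserialized
                 DataOutputStream dos = new DataOutputStream(client.getOutputStream())) {
                dos.writeInt(snapshot.length);
                dos.write(snapshot);
            } catch (IOException ignored) {
                // A real implementation would have to report this somewhere.
            }
        }, "payload-server-" + server.getLocalPort());
        t.setDaemon(true);
        t.start();
        out.writeInt(server.getLocalPort());
    }

    // Deserialization connects back to the server to fetch the data.
    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        int port = in.readInt();
        try (Socket socket = new Socket(InetAddress.getLoopbackAddress(), port);
             DataInputStream dis = new DataInputStream(socket.getInputStream())) {
            data = new byte[dis.readInt()];
            dis.readFully(data);
        }
    }

    static byte[] toBytes(SocketBackedPayload p) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(p);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    static SocketBackedPayload fromBytes(byte[] bytes) {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (SocketBackedPayload) ois.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

In this sketch every call to toBytes costs one thread and one listening socket, and a serialized copy that is never read back leaves its server thread blocked in accept() indefinitely. That is tolerable for a handful of images but scales poorly when rasters are serialized in bulk.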

      SerializableRenderedImage is also buggy. It tracks the reference count of serialized objects in remoteReferenceCount, but the reference counting mechanism is not implemented correctly, so it leaks memory.

      We could create SerializableRenderedImage objects with useDeepCopy = true to avoid these problems, but that introduces a new one: the finalizer of SerializableRenderedImage always tries to connect to the server to decrement the remote reference count, even though there is no "server" in deep copy mode. The finalizer then raises tons of exceptions like the following, which is quite annoying.

      INFO: IOException occurs when open the streams of the socket.
      javax.media.jai.util.ImagingException: IOException occurs when open the streams of the socket.
      	at javax.media.jai.remote.SerializableRenderedImage.closeClient(SerializableRenderedImage.java:1117)
      	at javax.media.jai.remote.SerializableRenderedImage.dispose(SerializableRenderedImage.java:1314)
      	at javax.media.jai.remote.SerializableRenderedImage.finalize(SerializableRenderedImage.java:1259)
      	at java.base/java.lang.System$2.invokeFinalize(System.java:2125)
      	at java.base/java.lang.ref.Finalizer.runFinalizer(Finalizer.java:87)
      	at java.base/java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:171)
      Caused by: java.net.SocketException: Connection reset
      	at java.base/java.net.SocketInputStream.read(SocketInputStream.java:186)
      	at java.base/java.net.SocketInputStream.read(SocketInputStream.java:140)
      	at java.base/java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2893)
      	at java.base/java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2909)
      	at java.base/java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3406)
      	at java.base/java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:932)
      	at java.base/java.io.ObjectInputStream.<init>(ObjectInputStream.java:375)
      	at javax.media.jai.remote.SerializableRenderedImage.closeClient(SerializableRenderedImage.java:1115)
      	... 5 more
      Caused by:
      java.net.SocketException: Connection reset
      	at java.base/java.net.SocketInputStream.read(SocketInputStream.java:186)
      	at java.base/java.net.SocketInputStream.read(SocketInputStream.java:140)
      	at java.base/java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2893)
      	at java.base/java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2909)
      	at java.base/java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3406)
      	at java.base/java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:932)
      	at java.base/java.io.ObjectInputStream.<init>(ObjectInputStream.java:375)
      	at javax.media.jai.remote.SerializableRenderedImage.closeClient(SerializableRenderedImage.java:1115)
      	at javax.media.jai.remote.SerializableRenderedImage.dispose(SerializableRenderedImage.java:1314)
      	at javax.media.jai.remote.SerializableRenderedImage.finalize(SerializableRenderedImage.java:1259)
      	at java.base/java.lang.System$2.invokeFinalize(System.java:2125)
      	at java.base/java.lang.ref.Finalizer.runFinalizer(Finalizer.java:87)
      	at java.base/java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:171)
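
For contrast, a serializer that copies the pixel data into the byte stream needs no server, no reference counting, and no finalizer. The following is a minimal sketch of that idea using only java.awt.image from the JDK. It is not the fix that was applied for this issue, which is not described here; the class name RasterBytesSketch and the byte layout are hypothetical, and the sketch assumes images of a predefined BufferedImage type with integer samples.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical helper for illustration; not Sedona, GeoTools, or JAI code.
class RasterBytesSketch {
    // Layout (an assumption of this sketch): width, height, image type,
    // then every sample as a 4-byte int in row-major, band-interleaved order.
    static byte[] serialize(BufferedImage image) {
        int width = image.getWidth();
        int height = image.getHeight();
        int[] samples = image.getRaster().getPixels(0, 0, width, height, (int[]) null);
        ByteArrayOutputStream bos = new ByteArrayOutputStream(12 + 4 * samples.length);
        try (DataOutputStream out = new DataOutputStream(bos)) {
            out.writeInt(width);
            out.writeInt(height);
            out.writeInt(image.getType()); // TYPE_CUSTOM (0) is not supported here
            for (int sample : samples) {
                out.writeInt(sample);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    // Rebuilds an independent image from the bytes; no connection to the
    // original object is needed, so nothing has to be cleaned up later.
    static BufferedImage deserialize(byte[] bytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            int width = in.readInt();
            int height = in.readInt();
            int type = in.readInt();
            BufferedImage image = new BufferedImage(width, height, type);
            int bands = image.getRaster().getNumBands();
            int[] samples = new int[width * height * bands];
            for (int i = 0; i < samples.length; i++) {
                samples[i] = in.readInt();
            }
            image.getRaster().setPixels(0, 0, width, height, samples);
            return image;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The trade-off is the one named above: the raster data is copied on every serialization, in exchange for a payload that is self-contained and safe to ship between executors.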
      

      Attachments

        1. image-2023-07-05-23-06-50-328.png
          625 kB
          Kristin Cowalcijk

          People

            Assignee: Unassigned
            Reporter: Kristin Cowalcijk (kontinuation)

              Time Tracking

                Original Estimate: Not Specified
                Remaining Estimate: 0h
                Time Spent: 40m