Apache Sedona / SEDONA-495

Raster data source uses shared FileSystem connections, which leads to a race condition

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4.0, 1.4.1, 1.5.0, 1.5.1
    • Fix Version/s: 1.5.2

    Description

      The raster data source's OutputWriter obtains its Hadoop FileSystem through `new Path(savePath).getFileSystem(context.getConfiguration)`, and one OutputWriter is instantiated per task. This call returns a cached FileSystem instance, so all tasks on an executor end up sharing the same connection.

       

      https://github.com/apache/sedona/blob/master/spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/raster/RasterFileFormat.scala#L85

       

      A multi-core executor commonly runs multiple tasks concurrently (one task per core). In the current implementation, when one task completes it closes the shared connection, and every other task still using that connection fails with an IOException.

       

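      The best practice is to call `FileSystem.newInstance` in each task, so that every task gets its own FileSystem instance and closing it does not affect the other tasks.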
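
The difference between the two lookups can be sketched as follows. This is a minimal illustration, not the actual patch: the object name `FileSystemPerTask`, the helpers `sharedFs` and `taskLocalFs`, and the local `file:///tmp/sedona-raster-out` path are all hypothetical, and it assumes Hadoop's FileSystem cache is left enabled (the default).

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical helper object, for illustration only.
object FileSystemPerTask {

  // Cached lookup, as in the current OutputWriter: with the FileSystem
  // cache enabled, repeated calls for the same scheme/authority/user
  // return the same shared instance.
  def sharedFs(savePath: String, conf: Configuration): FileSystem =
    new Path(savePath).getFileSystem(conf)

  // Per-task lookup: FileSystem.newInstance bypasses the shared cache
  // entry, so the caller owns the instance and may close it safely.
  def taskLocalFs(savePath: String, conf: Configuration): FileSystem =
    FileSystem.newInstance(new Path(savePath).toUri, conf)

  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    val savePath = "file:///tmp/sedona-raster-out" // assumed example path

    val a = sharedFs(savePath, conf)
    val b = sharedFs(savePath, conf)
    println(s"cached lookups share one instance: ${a eq b}")

    val c = taskLocalFs(savePath, conf)
    val d = taskLocalFs(savePath, conf)
    println(s"newInstance lookups share one instance: ${c eq d}")

    // Each task closes only the instance it created.
    c.close()
    d.close()
  }
}
```

In a writer, the per-task instance would typically be created when the writer is constructed and closed in its `close()` method, which leaves instances held by other tasks untouched.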

       

            People

              Assignee: Unassigned
              Reporter: Jia Yu
                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 20m