  1. Apache Sedona
  2. SEDONA-497

SpatialRDD read from multiple Shapefiles has incorrect fieldName property


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.5.1
    • Fix Version/s: 1.5.2

    Description

      A user reported this issue on Discord. It can be easily reproduced using the following shapefiles provided by the user: debug_shapefiles.zip

      The following code loads a directory containing multiple shapefiles into a SpatialRDD at once, and then uses Adapter.toDf to convert the SpatialRDD to a Spark DataFrame:

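      from sedona.core.formatMapper.shape_parser import ShapefileReader
      from sedona.utils.adapter import Adapter

      # Assumed setup: sc is the SparkContext, sedona is the Spark session, parcel_path is the shapefile directory
      parcel_rdd = ShapefileReader.readToGeometryRDD(sc, parcel_path)
      parcel_df = Adapter.toDf(parcel_rdd, sedona)
      parcel_df.printSchema()
      parcel_df.show()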
      

      The above code yields the following output:

      root
       |-- geometry: geometry (nullable = true)
       |-- id: string (nullable = true)
       |-- name id: string (nullable = true)
       |-- name id: string (nullable = true)
       |-- name: string (nullable = true)
      
      24/01/31 14:09:24 WARN TaskSetManager: Lost task 0.0 in stage 32.0 (TID 43) (172.20.0.130 executor 0): org.apache.spark.SparkRuntimeException: Error while encoding: java.lang.ArrayIndexOutOfBoundsException: Index 3 out of bounds for length 3
      if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else newInstance(class org.apache.spark.sql.sedona_sql.UDT.GeometryUDT).serialize AS geometry#275
      if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 1, id), StringType, ObjectType(class java.lang.String)), true, false, true) AS id#276
      if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 2, name id), StringType, ObjectType(class java.lang.String)), true, false, true) AS name id#277
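
One plausible reading of the exception above is that the encoder walks a schema with five columns (as printed by printSchema) while each row only carries three values. The sketch below is plain Python standing in for that behavior, not Spark's actual encoder; the row values and the encode_row helper are hypothetical and exist only for illustration.

```python
# Illustrative only: plain Python standing in for Spark's row encoder.
# Column names are copied from the printSchema() output in the report.
schema_columns = ["geometry", "id", "name id", "name id", "name"]

# Assumption: each record holds the geometry plus the two real attributes,
# which matches "length 3" in the exception message. Values are made up.
row = ("POINT (0 0)", "1", "parcel-a")


def encode_row(row, columns):
    """Pair every schema column with the row value at the same index."""
    encoded = []
    for index, column in enumerate(columns):
        if index >= len(row):
            # Mirrors the wording of the Java exception in the log.
            raise IndexError(
                f"Index {index} out of bounds for length {len(row)}"
            )
        encoded.append((column, row[index]))
    return encoded


try:
    encode_row(row, schema_columns)
    error_message = None
except IndexError as exc:
    error_message = str(exc)

print(error_message)
```

Under these assumptions the first three columns encode without trouble and the failure appears at the fourth, which is consistent with the log listing the geometry, id and first "name id" expressions before the error.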
      

      Adapter.toDf returns a DataFrame with a malformed schema because the fieldNames property of parcel_rdd is incorrect:

      >>> parcel_rdd.fieldNames
      ['id', 'name id', 'name id', 'name']
      

      The field names of the shapefiles should be ['id', 'name'], but they were strangely duplicated 3 times.
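
A quick way to notice this kind of problem before calling Adapter.toDf is to sanity-check the field name list. The following is a minimal sketch in plain Python; check_field_names is a hypothetical helper, not part of Sedona, and the expected list is taken from the description above.

```python
from collections import Counter


def check_field_names(field_names, expected=None):
    """Return a list of human-readable problems found in field_names."""
    problems = []
    counts = Counter(field_names)
    # A name that occurs more than once cannot map to a unique column.
    duplicates = sorted(name for name, n in counts.items() if n > 1)
    if duplicates:
        problems.append(f"duplicate field names: {duplicates}")
    # Optionally compare against the names the caller expects.
    if expected is not None and list(field_names) != list(expected):
        problems.append(f"expected {list(expected)}, got {list(field_names)}")
    return problems


# Field names as reported in this issue.
reported = ["id", "name id", "name id", "name"]
problems = check_field_names(reported, expected=["id", "name"])
for problem in problems:
    print(problem)
```

In a Sedona session the same helper could be given parcel_rdd.fieldNames directly; an empty result means the list has no duplicates and matches the expected names.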

          People

            Assignee: Unassigned
            Reporter: Kristin Cowalcijk (kontinuation)

              Time Tracking

                Estimated: Not Specified
                Remaining: 0h
                Logged: 20m
