[SEDONA-497] SpatialRDD read from multiple Shapefiles has incorrect fieldName property - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 1.5.1
Fix Version/s: 1.5.2
Labels:
- pull-request-available

Description

A user reported this issue on Discord. It can be easily reproduced with the following shapefiles provided by the user: debug_shapefiles.zip

The following code loads a directory containing multiple shapefiles into a SpatialRDD in one call, then uses Adapter.toDf to convert the SpatialRDD to a Spark DataFrame:

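from sedona.core.formatMapper.shapefileParser import ShapefileReader
from sedona.utils.adapter import Adapter

# sc: SparkContext, sedona: SparkSession, parcel_path: a directory of shapefiles
parcel_rdd = ShapefileReader.readToGeometryRDD(sc, parcel_path)
parcel_df = Adapter.toDf(parcel_rdd, sedona)
parcel_df.printSchema()
parcel_df.show()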

The above code yields the following output:

root
 |-- geometry: geometry (nullable = true)
 |-- id: string (nullable = true)
 |-- name id: string (nullable = true)
 |-- name id: string (nullable = true)
 |-- name: string (nullable = true)

24/01/31 14:09:24 WARN TaskSetManager: Lost task 0.0 in stage 32.0 (TID 43) (172.20.0.130 executor 0): org.apache.spark.SparkRuntimeException: Error while encoding: java.lang.ArrayIndexOutOfBoundsException: Index 3 out of bounds for length 3
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else newInstance(class org.apache.spark.sql.sedona_sql.UDT.GeometryUDT).serialize AS geometry#275
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 1, id), StringType, ObjectType(class java.lang.String)), true, false, true) AS id#276
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 2, name id), StringType, ObjectType(class java.lang.String)), true, false, true) AS name id#277
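
The exception message is consistent with a width mismatch between the schema and the rows: the printed schema has five columns (geometry plus four attribute names), while "length 3" suggests each row carries only three values. The plain-Python sketch below illustrates that reading. It is an interpretation of the error message, not a confirmed diagnosis; `first_missing_column` is a hypothetical helper that does not exist in Sedona or Spark, and the three-value row is an assumption.

```python
from typing import Optional, Sequence


def first_missing_column(schema_columns: Sequence[str],
                         row_values: Sequence[object]) -> Optional[int]:
    """Return the first schema index with no matching row value, or None."""
    if len(row_values) >= len(schema_columns):
        return None
    return len(row_values)


# Columns exactly as printed by printSchema() in the report.
schema_columns = ["geometry", "id", "name id", "name id", "name"]

# Assumed row shape: one geometry plus the two real attributes.
# The values are placeholders, not data from the attached shapefiles.
row_values = ["<geometry>", "<id value>", "<name value>"]

print(first_missing_column(schema_columns, row_values))  # 3
```

Under these assumptions the first column without a value is index 3, which matches the "Index 3 out of bounds for length 3" text in the log.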

Adapter.toDf returns a DataFrame with a malformed schema because the fieldNames property of parcel_rdd is incorrect:

>>> parcel_rdd.fieldNames
['id', 'name id', 'name id', 'name']

The schema of the shapefiles should be ['id', 'name'], but the field names were duplicated three times and merged into malformed entries such as 'name id'.
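
The expected behaviour can be sketched in plain Python: when every shapefile in the directory reports the same field list, the merged list should contain each name once. This is only an illustration and not the fix from the linked pull request. `merge_field_names` and `has_duplicates` are hypothetical helpers, and the three per-file lists are an assumption based on the three repetitions seen in the reported fieldNames.

```python
from typing import Iterable, List


def merge_field_names(per_file_names: Iterable[List[str]]) -> List[str]:
    """Merge per-file field lists, keeping each name once in first-seen order."""
    merged: List[str] = []
    seen = set()
    for names in per_file_names:
        for name in names:
            if name not in seen:
                seen.add(name)
                merged.append(name)
    return merged


def has_duplicates(field_names: List[str]) -> bool:
    """Return True when any field name occurs more than once."""
    return len(set(field_names)) != len(field_names)


# Assumed input: three shapefiles that each declare the same two fields.
per_file = [["id", "name"], ["id", "name"], ["id", "name"]]
print(merge_field_names(per_file))  # ['id', 'name']

# The value reported in this issue fails the duplicate check.
reported = ["id", "name id", "name id", "name"]
print(has_duplicates(reported))  # True
```

A check like `has_duplicates` could be run on the fieldNames of a SpatialRDD before converting it to a DataFrame, so the problem surfaces earlier than the encoding error.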

Attachments

debug_shapefiles.zip
19/Feb/24 03:16
5 kB
Kristin Cowalcijk

Issue Links

links to

GitHub Pull Request #1243

People

Assignee: Unassigned

Reporter: Kristin Cowalcijk

Votes: 0

Watchers: 1

Dates

Created: 19/Feb/24 03:22

Updated: 28/Apr/24 08:13

Resolved: 28/Apr/24 08:13
