Spark / SPARK-33184

Spark doesn't read a data source column if it is used as an index to an array under a struct


Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 3.0.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None

    Description

      from pyspark.sql import functions as F

      df = spark.createDataFrame([[1, [[1, 2]]]], schema='x:int,y:struct<a:array<int>>')
      df.write.mode('overwrite').parquet('test')
      
      # This causes an error "Caused by: java.lang.RuntimeException: Couldn't find x#720 in [y#721]"
      spark.read.parquet('test').select(F.expr('y.a[x]')).show()
      
      # explain() itself succeeds; note that x is missing from ReadSchema
      spark.read.parquet('test').select(F.expr('y.a[x]')).explain()
      
      == Physical Plan ==
      *(1) !Project [y#713.a[x#712] AS y.a AS `a`[x]#717]
      +- FileScan parquet [y#713] Batched: false, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<y:struct<a:array<int>>>
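      The plan above shows the likely cause: the scan's required columns are derived from the projection, but the attribute used as the array index is dropped, so only y reaches ReadSchema. A toy sketch of that failure mode (this is a simplified illustrative model, not Spark's actual optimizer code):

      ```python
      # Expression tree for y.a[x]: an array-index node whose array operand is a
      # struct-field access and whose index operand is a plain column reference.
      expr = ('get_item', ('field', 'y', 'a'), ('col', 'x'))

      def referenced_columns(e):
          """Correct pruner: collect every column referenced anywhere in the tree."""
          kind = e[0]
          if kind in ('col', 'field'):
              return {e[1]}
          if kind == 'get_item':
              return referenced_columns(e[1]) | referenced_columns(e[2])
          return set()

      def buggy_referenced_columns(e):
          """Buggy pruner: only descends into the array operand of get_item,
          silently ignoring the index operand."""
          kind = e[0]
          if kind in ('col', 'field'):
              return {e[1]}
          if kind == 'get_item':
              return buggy_referenced_columns(e[1])  # index operand ignored
          return set()

      print(sorted(referenced_columns(expr)))        # ['x', 'y']
      print(sorted(buggy_referenced_columns(expr)))  # ['y'] -- x is missing
      ```

      With the buggy collection, the scan reads only y, and the projection later fails to resolve x, matching the "Couldn't find x#720 in [y#721]" error.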
      

      The query works if I do either of the following:

      # manually select the column it misses
      spark.read.parquet('test').select(F.expr('y.a[x]'), F.col('x')).show()
      
      # use element_at function
      spark.read.parquet('test').select(F.element_at('y.a', F.col('x') + 1)).show()
      

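      Note on the second workaround: the `+ 1` is needed because Spark's element_at uses 1-based indexing for arrays, while bracket indexing (y.a[x]) is 0-based. A minimal pure-Python sketch of that equivalence (illustrative only, not Spark code):

      ```python
      def bracket_index(arr, i):
          """0-based access, like y.a[x] in Spark SQL."""
          return arr[i]

      def element_at(arr, i):
          """1-based access, like Spark's element_at(y.a, x + 1)."""
          return arr[i - 1]

      a = [1, 2]
      x = 0
      # Both forms return the same element once the index is shifted by one.
      assert bracket_index(a, x) == element_at(a, x + 1) == 1
      ```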

          People

            Assignee: Unassigned
            Reporter: colin fang (colinfang)
