Spark / SPARK-33184

Spark doesn't read a data source column if it is used as an index to an array under a struct


Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 3.0.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None

    Description

      from pyspark.sql import functions as F

      df = spark.createDataFrame([[1, [[1, 2]]]], schema='x:int,y:struct<a:array<int>>')
      df.write.mode('overwrite').parquet('test')
      
      # This causes an error "Caused by: java.lang.RuntimeException: Couldn't find x#720 in [y#721]"
      spark.read.parquet('test').select(F.expr('y.a[x]')).show()
      
      # explain() itself succeeds; note that x is missing from ReadSchema
      spark.read.parquet('test').select(F.expr('y.a[x]')).explain()
      
      == Physical Plan ==
      *(1) !Project [y#713.a[x#712] AS y.a AS `a`[x]#717]
      +- FileScan parquet [y#713] Batched: false, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<y:struct<a:array<int>>>
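      The plan above shows the likely cause: the scan's required columns are derived from the projection, but the attribute used as the array index is dropped, so only y reaches ReadSchema. A toy sketch of that failure mode (this is a simplified illustrative model, not Spark's actual optimizer code):

      ```python
      # Expression tree for y.a[x]: an array-index node whose array operand is a
      # struct-field access and whose index operand is a plain column reference.
      expr = ('get_item', ('field', 'y', 'a'), ('col', 'x'))

      def referenced_columns(e):
          """Correct pruner: collect every column referenced anywhere in the tree."""
          kind = e[0]
          if kind in ('col', 'field'):
              return {e[1]}
          if kind == 'get_item':
              return referenced_columns(e[1]) | referenced_columns(e[2])
          return set()

      def buggy_referenced_columns(e):
          """Buggy pruner: only descends into the array operand of get_item,
          silently ignoring the index operand."""
          kind = e[0]
          if kind in ('col', 'field'):
              return {e[1]}
          if kind == 'get_item':
              return buggy_referenced_columns(e[1])  # index operand ignored
          return set()

      print(sorted(referenced_columns(expr)))        # ['x', 'y']
      print(sorted(buggy_referenced_columns(expr)))  # ['y'] -- x is missing
      ```

      With the buggy collection, the scan reads only y, and the projection later fails to resolve x, matching the "Couldn't find x#720 in [y#721]" error.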
      

      The query works if I do either of the following:

      # manually select the column it misses
      spark.read.parquet('test').select(F.expr('y.a[x]'), F.col('x')).show()
      
      # use element_at function
      spark.read.parquet('test').select(F.element_at('y.a', F.col('x') + 1)).show()
      

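      Note on the second workaround: the `+ 1` is needed because Spark's element_at uses 1-based indexing for arrays, while bracket indexing (y.a[x]) is 0-based. A minimal pure-Python sketch of that equivalence (illustrative only, not Spark code):

      ```python
      def bracket_index(arr, i):
          """0-based access, like y.a[x] in Spark SQL."""
          return arr[i]

      def element_at(arr, i):
          """1-based access, like Spark's element_at(y.a, x + 1)."""
          return arr[i - 1]

      a = [1, 2]
      x = 0
      # Both forms return the same element once the index is shifted by one.
      assert bracket_index(a, x) == element_at(a, x + 1) == 1
      ```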

          People

            Assignee: Unassigned
            Reporter: colin fang (colinfang)
