SPARK-35386: parquet read with schema should fail on non-existing columns


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.0.1
    • Fix Version/s: None
    • Component/s: Input/Output, PySpark
    • Labels: None

    Description

      When a read schema is specified, as a user I would prefer/expect Spark to fail on missing columns.

      from pyspark.sql import SparkSession
      from pyspark.sql.types import DoubleType, StructField, StructType
      
      spark: SparkSession = ...
      
      spark.read.parquet("/tmp/data.snappy.parquet")
      # inferred schema, includes 3 columns: col1, col2, new_col
      # DataFrame[col1: bigint, col2: bigint, new_col: bigint]
      
      # let's specify a custom read schema, with a **non-nullable** col3 (which is not present):
      read_schema = StructType(fields=[StructField("col3", DoubleType(), False)])
      
      df = spark.read.schema(read_schema).parquet("/tmp/data.snappy.parquet")
      
      df.schema
      # we get a DataFrame with a **nullable** col3:
      # StructType(List(StructField(col3,DoubleType,true)))
      
      df.count()
      # 0
      

      Is this a feature or a bug? In this case there is just a single parquet file; I have also tried option("mergeSchema", "true"), which doesn't help.
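
      As a possible workaround (not part of the original report), the requested columns can be validated against the file's own schema before reading. A minimal sketch, assuming the same file and the read_schema defined above:

      # Hedged workaround sketch: fail fast when requested columns are missing
      # from the parquet file's inferred schema.
      def read_parquet_strict(spark, path, schema):
          # Infer the on-disk schema first, then check that every requested column exists.
          available = set(spark.read.parquet(path).columns)
          missing = set(schema.fieldNames()) - available
          if missing:
              raise ValueError(f"Columns not found in {path}: {sorted(missing)}")
          return spark.read.schema(schema).parquet(path)

      # read_parquet_strict(spark, "/tmp/data.snappy.parquet", read_schema)
      # -> ValueError: Columns not found in /tmp/data.snappy.parquet: ['col3']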

      A similar read pattern would fail in pandas (and likely dask); see the sketch below.
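
      For comparison, a rough sketch of the pandas behavior mentioned above (assuming the pyarrow engine; the exact exception type and message may differ):

      import pandas as pd

      # Asking for a column that does not exist in the file raises an error
      # instead of silently returning a null-filled column.
      pd.read_parquet("/tmp/data.snappy.parquet", columns=["col3"])
      # raises: "col3" not found in the parquet schema (illustrative, engine-dependent)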

          People

            Assignee: Unassigned
            Reporter: Rafal Wojdyla
