SPARK-35386: parquet read with schema should fail on non-existing columns


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.0.1
    • Fix Version/s: None
    • Component/s: Input/Output, PySpark
    • Labels: None

    Description

      When a read schema is specified, as a user I would prefer/expect Spark to fail on missing columns.

      from pyspark.sql import SparkSession
      from pyspark.sql.types import DoubleType, StructField, StructType
      
      spark: SparkSession = ...
      
      spark.read.parquet("/tmp/data.snappy.parquet")
      # inferred schema, includes 3 columns: col1, col2, new_col
      # DataFrame[col1: bigint, col2: bigint, new_col: bigint]
      
      # let's specify a custom read schema, with a **non-nullable** col3 (which is not present):
      read_schema = StructType(fields=[StructField("col3", DoubleType(), False)])
      
      df = spark.read.schema(read_schema).parquet("/tmp/data.snappy.parquet")
      
      df.schema
      # we get a DataFrame with a **nullable** col3:
      # StructType(List(StructField(col3,DoubleType,true)))
      
      df.count()
      # 0
      

      Is this a feature or a bug? In this case there is just a single parquet file; I have also tried option("mergeSchema", "true"), which doesn't help.
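
      As a possible workaround (not part of the original report), the requested columns can be validated against the file's own schema before reading. A minimal sketch, assuming the same file and the read_schema defined above:

      # Hedged workaround sketch: fail fast when requested columns are missing
      # from the parquet file's inferred schema.
      def read_parquet_strict(spark, path, schema):
          # Infer the on-disk schema first, then check that every requested column exists.
          available = set(spark.read.parquet(path).columns)
          missing = set(schema.fieldNames()) - available
          if missing:
              raise ValueError(f"Columns not found in {path}: {sorted(missing)}")
          return spark.read.schema(schema).parquet(path)

      # read_parquet_strict(spark, "/tmp/data.snappy.parquet", read_schema)
      # -> ValueError: Columns not found in /tmp/data.snappy.parquet: ['col3']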

      A similar read pattern would fail in pandas (and likely dask); see the sketch below.
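
      For comparison, a rough sketch of the pandas behavior mentioned above (assuming the pyarrow engine; the exact exception type and message may differ):

      import pandas as pd

      # Asking for a column that does not exist in the file raises an error
      # instead of silently returning a null-filled column.
      pd.read_parquet("/tmp/data.snappy.parquet", columns=["col3"])
      # raises: "col3" not found in the parquet schema (illustrative, engine-dependent)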

          People

            Assignee: Unassigned
            Reporter: Rafal Wojdyla
