Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version: 3.2.2
Fix Version: None
Environment:
Kubernetes v1.23.6
Spark 3.2.2
Java 1.8.0_312
Python 3.9.13
AWS dependencies: aws-java-sdk-bundle-1.11.901.jar and hadoop-aws-3.3.1.jar
Description
I'm creating a Dataset with a column of type date and saving it to S3. When I read it back and apply a where() clause on that column, no rows are returned even though the data is there.
Below is the code snippet I'm running:
from pyspark.sql.types import Row
from pyspark.sql.functions import *

ds = spark.range(10) \
    .withColumn("date", lit("2022-01-01")) \
    .withColumn("date", col("date").cast("date"))
ds.where("date = '2022-01-01'").show()

ds.write.mode("overwrite").parquet("s3a://bucket/test")

df = spark.read.format("parquet").load("s3a://bucket/test")
df.where("date = '2022-01-01'").show()
The first show() returns data, while the second one returns nothing.
I've noticed that it is related to the Kubernetes master: the same code snippet works fine with master set to "local".
UPD: if the column is used as a partition column and has type "date", there is no filtering problem.
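One diagnostic worth trying (my suggestion, not something confirmed in this report) is to disable Parquet filter pushdown before the second read. `spark.sql.parquet.filterPushdown` is a standard Spark SQL configuration; with it off, Spark evaluates the date predicate itself on the rows it reads instead of pushing the filter into the Parquet reader, which helps isolate whether the pushed-down filter is dropping the rows:

```python
# Diagnostic sketch (assumption: the pushed-down date predicate is implicated).
# Turn off Parquet filter pushdown so the where() clause is evaluated by Spark
# after reading, rather than inside the Parquet reader.
spark.conf.set("spark.sql.parquet.filterPushdown", "false")

df = spark.read.format("parquet").load("s3a://bucket/test")
df.where("date = '2022-01-01'").show()
```

If rows appear with pushdown disabled but not with it enabled, that points at the Parquet date filter pushdown path rather than the data itself.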