SPARK-39993: Spark on Kubernetes doesn't filter data by date


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.2.2
    • Fix Version/s: None
    • Component/s: Kubernetes
    • Environment:
      Kubernetes v1.23.6
      Spark 3.2.2
      Java 1.8.0_312
      Python 3.9.13
      AWS dependencies: aws-java-sdk-bundle-1.11.901.jar and hadoop-aws-3.3.1.jar

    Description

      I'm creating a Dataset with a date-typed column and saving it to S3. When I read it back and filter it with a where() clause, no rows are returned even though the data is there.

      Below is the code snippet I'm running:

      from pyspark.sql.functions import col, lit

      # Build a 10-row dataset with a single date-typed column
      ds = spark.range(10).withColumn("date", lit("2022-01-01").cast("date"))
      ds.where("date = '2022-01-01'").show()

      # Round-trip the data through Parquet on S3, then apply the same filter
      ds.write.mode("overwrite").parquet("s3a://bucket/test")
      df = spark.read.parquet("s3a://bucket/test")
      df.where("date = '2022-01-01'").show()

      The first show() returns the rows, while the second one returns nothing.

      I've noticed that it's related to the Kubernetes master: the same code snippet works fine with master "local".
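
      For illustration, a minimal sketch of the only difference between the two runs, i.e. how the SparkSession is built; the API server address and container image below are placeholders, not taken from this report:

      from pyspark.sql import SparkSession

      # Failing case: session built against a Kubernetes master
      # (endpoint and image are placeholders)
      spark = (
          SparkSession.builder
          .master("k8s://https://<api-server>:6443")
          .config("spark.kubernetes.container.image", "<spark-image>")
          .getOrCreate()
      )

      # Working case: the identical snippet run with .master("local[*]") instead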

      UPD: if the column is used as a partition column and has the type "date", there is no filtering problem (see the sketch below).
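
      A minimal sketch of that partitioned write, for comparison; the output path is a placeholder:

      # Writing the date column as a partition column avoids the problem
      ds.write.mode("overwrite").partitionBy("date").parquet("s3a://bucket/test_by_date")

      df = spark.read.parquet("s3a://bucket/test_by_date")
      df.where("date = '2022-01-01'").show()  # returns rows as expected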



      People

        Assignee: Unassigned
        Reporter: Hanna Liashchuk (h.liashchuk)
