SPARK-39900: Issue with querying dataframe produced by 'binaryFile' format using 'not' operator

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 3.2.1, 3.3.0
    • Fix Version/s: 3.1.4, 3.3.1, 3.2.3, 3.4.0
    • Component/s: SQL
    • Labels: None

    Description

      When creating a dataframe using the binaryFile format, I get unexpected results when filtering or querying with the 'not' operator.

       

      Here's a repo that will help describe and reproduce the issue.

      https://github.com/cccs-br/spark-binaryfile-issue

      git@github.com:cccs-br/spark-binaryfile-issue.git 

       

      Here's a very simple test case that illustrates what's going on:

      https://github.com/cccs-br/spark-binaryfile-issue/blob/main/src/test/scala/BinaryFileSuite.scala

      TL;DR:

         test("binary file dataframe") {
          // Load the files directly into a DataFrame using the 'binaryFile' format.
          //
          // - src/test/resources/files/
          //  - test1.csv
          //  - test2.json
          //  - test3.txt
          val df = spark
            .read
            .format("binaryFile")
            .load("src/test/resources/files")
      
          df.createOrReplaceTempView("files")
      
          // This works as expected.
          val like_count = spark.sql("select * from files where path like '%.csv'").count()
          assert(like_count === 1)
      
          // This does not work as expected.
          val not_like_count = spark.sql("select * from files where path not like '%.csv'").count()
          assert(not_like_count === 2)
      
          // This used to work in 3.2.1
          // df.filter(col("path").endsWith(".csv") === false).show()
        }
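
The following is a minimal sketch, not part of the original report, of one way to cross-check the 'not like' count against a reference value computed on the driver from the unfiltered path column. The object name `BinaryFileNotLikeCheck`, the temporary directory, and the file contents are assumptions made for illustration; only the file names and the query come from the test above.

```scala
import java.nio.file.{Files, Path}
import org.apache.spark.sql.SparkSession

// Hypothetical stand-alone check; the object name is an assumption.
object BinaryFileNotLikeCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("binaryfile-not-like-check")
      .getOrCreate()

    // Assumed fixture: three small files with the same names as in the test above.
    val dir: Path = Files.createTempDirectory("binaryfile-check")
    Seq("test1.csv", "test2.json", "test3.txt").foreach { name =>
      Files.write(dir.resolve(name), "x".getBytes("UTF-8"))
    }

    val df = spark.read.format("binaryFile").load(dir.toString)
    df.createOrReplaceTempView("files")

    // Count produced by the SQL query that the report describes as misbehaving.
    val sqlCount = spark.sql("select * from files where path not like '%.csv'").count()

    // Reference count: collect every path without a filter, then filter locally.
    val localCount = df.select("path")
      .collect()
      .map(_.getString(0))
      .count(p => !p.endsWith(".csv"))

    println(s"sql not-like count = $sqlCount, locally computed count = $localCount")
    spark.stop()
  }
}
```

If the two numbers differ on a given Spark version, the discrepancy comes from how the query is evaluated rather than from the files themselves.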

          People

            Assignee: Zing zzzzming95
            Reporter: benoit_roy Benoit Roy
            Votes: 0
            Watchers: 3
