Description
When creating a DataFrame with the binaryFile format, I get unexpected results when filtering or querying with the 'not' operator.
Here's a repo that describes and reproduces the issue:
https://github.com/cccs-br/spark-binaryfile-issue
git@github.com:cccs-br/spark-binaryfile-issue.git
Here's a very simple test case that illustrates what's going on:
https://github.com/cccs-br/spark-binaryfile-issue/blob/main/src/test/scala/BinaryFileSuite.scala
TL;DR:
test("binary file dataframe") {
  // Load files directly into df using the 'binaryFile' format.
  //
  // - src/test/resources/files/
  //   - test1.csv
  //   - test2.json
  //   - test3.txt
  val df = spark
    .read
    .format("binaryFile")
    .load("src/test/resources/files")

  df.createOrReplaceTempView("files")

  // This works as expected.
  val like_count = spark.sql("select * from files where path like '%.csv'").count()
  assert(like_count === 1)

  // This does not work as expected.
  val not_like_count = spark.sql("select * from files where path not like '%.csv'").count()
  assert(not_like_count === 2)

  // This used to work in 3.2.1
  // df.filter(col("path").endsWith(".csv") === false).show()
}
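For anyone trying to narrow this down, here is a minimal cross-check sketch. It is not taken from the linked repo: the object name `BinaryFileCrossCheck` and the helper `countNotEndingWith` are hypothetical, and the local `SparkSession` settings are assumptions. The idea is to compute the expected count on the driver, without any SQL predicate, and compare it with what the 'not like' query returns, while printing the query plan for inspection.

```scala
import org.apache.spark.sql.SparkSession

object BinaryFileCrossCheck {

  // Hypothetical helper: counts paths that do NOT end with the given suffix,
  // using plain Scala so no Spark filter is involved.
  def countNotEndingWith(paths: Seq[String], suffix: String): Int =
    paths.count(p => !p.endsWith(suffix))

  def main(args: Array[String]): Unit = {
    // Assumed local session; adjust to match your own test setup.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("binaryfile-cross-check")
      .getOrCreate()

    // Same directory as in the test case above.
    val df = spark.read.format("binaryFile").load("src/test/resources/files")

    // Driver-side baseline: collect only the path column, then filter in Scala.
    val allPaths = df.select("path").collect().map(_.getString(0)).toSeq
    val expected = countNotEndingWith(allPaths, ".csv")

    // The query from the report.
    df.createOrReplaceTempView("files")
    val query = spark.sql("select * from files where path not like '%.csv'")

    // Print the logical and physical plans to see how the predicate was planned.
    query.explain(true)

    val actual = query.count()
    println(s"expected=$expected actual=$actual")

    spark.stop()
  }
}
```

If the two numbers differ, attaching the `explain(true)` output to the report may help whoever picks it up.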