Spark / SPARK-32672

Data corruption in some cached compressed boolean columns

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.3.4, 2.4.6, 3.0.0, 3.0.1, 3.1.0
    • Fix Version/s: 2.4.7, 3.0.1, 3.1.0
    • Component/s: SQL

    Description

      I found that when storing some boolean data in the cache, the results can change when the data is read back out.

      It needs to be a non-trivial amount of data, and it is highly dependent on the order of the data. If I disable compression in the cache, the issue goes away. I was able to make this happen in 3.0.0. I am going to try to reproduce it in other versions too.
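      The report does not say which setting was used to disable cache compression. A minimal sketch of that workaround, assuming it refers to the `spark.sql.inMemoryColumnarStorage.compressed` SQL option and a `spark-shell` session with `spark` in scope, might look like this. The variable name `uncompressed` is only illustrative.

```scala
// Assumption: cache compression is controlled by this SQL option (default: true).
// Set it before the DataFrame is cached, so the cached batches are built uncompressed.
spark.conf.set("spark.sql.inMemoryColumnarStorage.compressed", "false")

// Same attached file as in the reproduction.
val uncompressed = spark.read.parquet("./bad_order.snappy.parquet")
uncompressed.cache()

// With compression off, these counts should match the uncached counts.
uncompressed.groupBy("b").count.show
```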

      I'll attach the parquet file with boolean data in an order that causes this to happen. As you can see, after the data is cached a single null value switches to false.

      scala> val bad_order = spark.read.parquet("./bad_order.snappy.parquet")
      bad_order: org.apache.spark.sql.DataFrame = [b: boolean]                        
      
      scala> bad_order.groupBy("b").count.show
      +-----+-----+
      |    b|count|
      +-----+-----+
      | null| 7153|
      | true|54334|
      |false|54021|
      +-----+-----+
      
      
      scala> bad_order.cache()
      res1: bad_order.type = [b: boolean]
      
      scala> bad_order.groupBy("b").count.show
      +-----+-----+
      |    b|count|
      +-----+-----+
      | null| 7152|
      | true|54334|
      |false|54022|
      +-----+-----+
      
      
      scala> 
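      To make the difference between the two outputs explicit, the counts can be compared side by side. The following is a small sketch in plain Scala; `changedCounts` is a hypothetical helper (not part of Spark), and the numbers are copied by hand from the two `groupBy` outputs above, with `None` standing in for null.

```scala
// Hypothetical helper: returns the groups whose count differs, as (before, after).
def changedCounts(
    before: Map[Option[Boolean], Long],
    after: Map[Option[Boolean], Long]): Map[Option[Boolean], (Long, Long)] =
  (before.keySet ++ after.keySet).flatMap { k =>
    val b = before.getOrElse(k, 0L)
    val a = after.getOrElse(k, 0L)
    if (a != b) Some(k -> (b, a)) else None
  }.toMap

// Counts from the output before cache() and after cache().
val uncached = Map[Option[Boolean], Long](
  None -> 7153L, Some(true) -> 54334L, Some(false) -> 54021L)
val cached = Map[Option[Boolean], Long](
  None -> 7152L, Some(true) -> 54334L, Some(false) -> 54022L)

// Only the null and false groups differ, each by exactly one row.
val diff = changedCounts(uncached, cached)
println(diff)

// The total row count is unchanged, so one value was altered rather than dropped.
println(uncached.values.sum == cached.values.sum)
```

      In a live session the two maps could instead be built from `groupBy("b").count.collect()` before and after `cache()`. Hard-coded values are used here so the sketch runs without Spark.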
      
      

      Attachments

        1. small_bad.snappy.parquet
          2 kB
          Robert Joseph Evans
        2. bad_order.snappy.parquet
          8 kB
          Robert Joseph Evans

            People

              Assignee: revans2 Robert Joseph Evans
              Reporter: revans2 Robert Joseph Evans
              Votes: 0
              Watchers: 7
