SPARK-26745

Non-parsing Dataset.count() optimization causes inconsistent results for JSON inputs with empty lines


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.4.0, 3.0.0
    • Fix Version/s: 2.4.1, 3.0.0
    • Component/s: SQL
    • Labels:

      Description

      The optimization introduced by SPARK-24959 (which improves the performance of count() for DataFrames read from non-multiline JSON in PERMISSIVE mode) appears to make count() erroneously include empty lines in its total when it runs before any JSON parsing has taken place.

      For the following input:

      { "a" : 1 , "b" : 2 , "c" : 3 }
      
              { "a" : 4 , "b" : 5 , "c" : 6 }
           
      { "a" : 7 , "b" : 8 , "c" : 9 }
      
      
      

      Spark 2.3:

      scala> val df = spark.read.json("sql/core/src/test/resources/test-data/with-empty-line.json")
      df: org.apache.spark.sql.DataFrame = [a: bigint, b: bigint ... 1 more field]
      
      scala> df.count
      res0: Long = 3
      
      scala> df.cache.count
      res3: Long = 3
      

      Spark 2.4:

      scala> val df = spark.read.json("sql/core/src/test/resources/test-data/with-empty-line.json")
      df: org.apache.spark.sql.DataFrame = [a: bigint, b: bigint ... 1 more field]
      
      scala> df.count
      res0: Long = 7
      
      scala> df.cache.count
      res1: Long = 3
      

      Since the count is apparently updated and cached when the Jackson parser runs, the optimization also makes the count appear unstable across cache/persist operations, as shown above.
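
      The discrepancy can be modeled in plain Scala without Spark. The sketch below is only an illustration of the arithmetic, not Spark's actual code path: the object name and both helper functions are hypothetical, and the input is assumed to have the same seven-line shape as the sample file (three JSON records plus four blank or whitespace-only lines).

```scala
// Illustrative model only: plain Scala, no Spark dependency.
// All names here are hypothetical and do not exist in Spark.
object EmptyLineCountSketch {
  // Assumed shape of the report's input: three JSON records interleaved
  // with blank or whitespace-only lines, seven lines in total.
  val inputLines: Seq[String] = Seq(
    """{ "a" : 1 , "b" : 2 , "c" : 3 }""",
    "",
    """        { "a" : 4 , "b" : 5 , "c" : 6 }""",
    "     ",
    """{ "a" : 7 , "b" : 8 , "c" : 9 }""",
    "",
    ""
  )

  // Stand-in for a non-parsing fast path: every input line is counted
  // as one row without being examined.
  def countWithoutParsing(lines: Seq[String]): Long = lines.size.toLong

  // Stand-in for the parsing path: lines that contain only whitespace
  // yield no record, so they are skipped before counting.
  def countAfterParsing(lines: Seq[String]): Long =
    lines.count(_.trim.nonEmpty).toLong

  def main(args: Array[String]): Unit = {
    println(s"without parsing: ${countWithoutParsing(inputLines)}") // 7
    println(s"after parsing:   ${countAfterParsing(inputLines)}")   // 3
  }
}
```

      Under these assumptions the two counts are 7 and 3, matching the df.count and df.cache.count results in the Spark 2.4 transcript.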

      CSV inputs, also optimized via SPARK-24959, do not appear to be impacted by this effect.

              People

              • Assignee: Hyukjin Kwon (hyukjin.kwon)
              • Reporter: Branden Smith (sumitsu)
              • Votes: 0
              • Watchers: 3
