Details
- Type: Bug
- Status: Resolved
- Priority: Blocker
- Resolution: Fixed
- Affects Version/s: 2.4.0, 3.0.0
Description
The optimization introduced by SPARK-24959 (improving the performance of count() for DataFrames read from non-multiline JSON in PERMISSIVE mode) appears to cause count() to erroneously include empty lines in its total when count() runs before any JSON parsing has taken place.
For the following input:
{ "a" : 1 , "b" : 2 , "c" : 3 }

{ "a" : 4 , "b" : 5 , "c" : 6 }



{ "a" : 7 , "b" : 8 , "c" : 9 }
Spark 2.3:
scala> val df = spark.read.json("sql/core/src/test/resources/test-data/with-empty-line.json")
df: org.apache.spark.sql.DataFrame = [a: bigint, b: bigint ... 1 more field]

scala> df.count
res0: Long = 3

scala> df.cache.count
res3: Long = 3
Spark 2.4:
scala> val df = spark.read.json("sql/core/src/test/resources/test-data/with-empty-line.json")
df: org.apache.spark.sql.DataFrame = [a: bigint, b: bigint ... 1 more field]

scala> df.count
res0: Long = 7

scala> df.cache.count
res1: Long = 3
Because the correct count is apparently computed and cached once the Jackson parser actually runs, the optimization also makes count() unstable across cache/persist operations, as shown above.
CSV inputs, also optimized via SPARK-24959, do not appear to be impacted by this effect.
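The mechanics of the discrepancy can be illustrated without Spark: the SPARK-24959 fast path effectively counts raw input lines without invoking the parser, while the Jackson parsing path skips empty lines and emits only real records. A minimal Python simulation of that difference (the input mirrors with-empty-line.json, but the exact blank-line layout and the variable names here are illustrative, not Spark internals):

```python
import json

# Three JSON records separated by empty lines, as in the input above
# (blank-line layout is illustrative).
raw = (
    '{ "a" : 1 , "b" : 2 , "c" : 3 }\n'
    '\n'
    '{ "a" : 4 , "b" : 5 , "c" : 6 }\n'
    '\n'
    '\n'
    '\n'
    '{ "a" : 7 , "b" : 8 , "c" : 9 }\n'
)

lines = raw.splitlines()

# Fast path (SPARK-24959-style): the parser is never invoked,
# so every input line, empty or not, contributes to the count.
fast_count = len(lines)

# Parser path: each line is actually parsed; empty lines produce no record.
records = [json.loads(line) for line in lines if line.strip()]
parsed_count = len(records)

print(fast_count, parsed_count)  # 7 vs 3, matching the transcripts above
```

Caching the DataFrame forces the parser path, which is why df.cache.count returns 3 even after count() has returned 7.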
Issue Links
- is caused by: SPARK-24959 "Do not invoke the CSV/JSON parser for empty schema" (Resolved)