Description
scala> spark.range(100).selectExpr("id % 10 p", "id").write.partitionBy("p").format("json").saveAsTable("testjson")

scala> spark.table("testjson").queryExecution.optimizedPlan.statistics
res6: org.apache.spark.sql.catalyst.plans.logical.Statistics = Statistics(sizeInBytes=0, isBroadcastable=false)
sizeInBytes shouldn't be 0 here. The issue is that in DataSource.scala we do:
val fileCatalog = if (sparkSession.sqlContext.conf.manageFilesourcePartitions &&
    catalogTable.isDefined && catalogTable.get.tracksPartitionsInCatalog) {
  new CatalogFileIndex(
    sparkSession,
    catalogTable.get,
    catalogTable.get.stats.map(_.sizeInBytes.toLong).getOrElse(0L))
} else {
  new InMemoryFileIndex(sparkSession, globbedPaths, options, Some(partitionSchema))
}
We shouldn't use 0L as the fallback when the table carries no catalog statistics: a sizeInBytes of 0 makes the optimizer treat the relation as empty, so it may, for example, wrongly pick it for a broadcast join.
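One possible fix (a sketch only, not necessarily the patch that was merged) is to fall back to the session's default size, sparkSession.sessionState.conf.defaultSizeInBytes (the value of spark.sql.defaultSizeInBytes, which defaults to a value larger than the broadcast threshold), so that a table with no statistics is treated as "size unknown, assume large" rather than "empty":

// Hypothetical variant of the snippet above in DataSource.scala;
// globbedPaths, options, partitionSchema and catalogTable are as in the original code.
val defaultSize = sparkSession.sessionState.conf.defaultSizeInBytes

val fileCatalog = if (sparkSession.sqlContext.conf.manageFilesourcePartitions &&
    catalogTable.isDefined && catalogTable.get.tracksPartitionsInCatalog) {
  // When the table has no size statistics, assume the conservative default
  // instead of 0, so the optimizer does not treat it as broadcastable.
  new CatalogFileIndex(
    sparkSession,
    catalogTable.get,
    catalogTable.get.stats.map(_.sizeInBytes.toLong).getOrElse(defaultSize))
} else {
  new InMemoryFileIndex(sparkSession, globbedPaths, options, Some(partitionSchema))
}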