Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Fix Version/s: 2.6.0, 3.0.0-beta-1, 2.5.10
Description
RegionSizeCalculator only considers the size of the StoreFiles and ignores the size of the MemStore. For a new region whose data has only been written to the MemStore and has not yet been flushed, it therefore reports a size of 0.
We hit this when using TableInputFormat to read HBase table data in Spark:
spark.sparkContext.newAPIHadoopRDD(
  conf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])
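For completeness, the conf passed above is typically an HBase configuration with the input table name set. A minimal sketch (the table name "my_table" is only a placeholder):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

// Build the Hadoop configuration consumed by TableInputFormat.
// "my_table" is a placeholder table name.
val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "my_table")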
Spark ignores empty InputSplits by default; this behavior is controlled by the configuration "spark.hadoopRDD.ignoreEmptySplits":
private[spark] val HADOOP_RDD_IGNORE_EMPTY_SPLITS =
  ConfigBuilder("spark.hadoopRDD.ignoreEmptySplits")
    .internal()
    .doc("When true, HadoopRDD/NewHadoopRDD will not create partitions for empty input splits.")
    .version("2.3.0")
    .booleanConf
    .createWithDefault(true)
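A possible workaround on the Spark side, until the calculator is fixed, is to disable this behavior so that splits reported as empty are still scanned, for example:

import org.apache.spark.SparkConf

// Keep partitions for splits whose reported size is 0, so regions whose
// data is still only in the MemStore are still read.
val sparkConf = new SparkConf()
  .set("spark.hadoopRDD.ignoreEmptySplits", "false")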
Together, these two behaviors cause Spark to silently miss the data that has only been written to the MemStore. So RegionSizeCalculator should take both the StoreFile size and the MemStore size into account.
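The intended calculation could look roughly like the following sketch against the HBase 2.x client API (RegionLocator, Admin.getRegionMetrics, and the Size values exposed by RegionMetrics); it illustrates the idea rather than the actual patch:

import scala.collection.JavaConverters._
import org.apache.hadoop.hbase.Size
import org.apache.hadoop.hbase.client.{Admin, RegionLocator}
import org.apache.hadoop.hbase.util.Bytes

// Per-region size that counts the MemStore as well as the StoreFiles,
// so a freshly written, unflushed region no longer reports 0 bytes.
def regionSizesInBytes(locator: RegionLocator, admin: Admin): Map[String, Long] = {
  val tableName = locator.getName
  val bytesPerMb = 1024L * 1024L
  locator.getAllRegionLocations.asScala
    .map(_.getServerName).distinct
    .flatMap { server =>
      admin.getRegionMetrics(server, tableName).asScala.map { m =>
        val storeFileBytes = (m.getStoreFileSize.get(Size.Unit.MEGABYTE) * bytesPerMb).toLong
        val memStoreBytes  = (m.getMemStoreSize.get(Size.Unit.MEGABYTE) * bytesPerMb).toLong
        Bytes.toStringBinary(m.getRegionName) -> (storeFileBytes + memStoreBytes)
      }
    }.toMap
}

The actual change would live inside RegionSizeCalculator itself, adding the MemStore size on top of the existing StoreFile-based estimate.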