HBase / HBASE-28756

RegionSizeCalculator ignores the size of the memstore, which leads Spark to miss data


Details

    Description

RegionSizeCalculator only considers the size of StoreFiles and ignores the size of the MemStore. For a new region whose data has only been written to the MemStore and has not yet been flushed, the calculated size will be 0.

We use TableInputFormat to read HBase table data in Spark:

      spark.sparkContext.newAPIHadoopRDD(
          conf,
          classOf[TableInputFormat],
          classOf[ImmutableBytesWritable],
          classOf[Result])

Spark ignores empty InputSplits by default, which is controlled by the configuration "spark.hadoopRDD.ignoreEmptySplits":

      private[spark] val HADOOP_RDD_IGNORE_EMPTY_SPLITS =
        ConfigBuilder("spark.hadoopRDD.ignoreEmptySplits")
          .internal()
          .doc("When true, HadoopRDD/NewHadoopRDD will not create partitions for empty input splits.")
          .version("2.3.0")
          .booleanConf
          .createWithDefault(true) 

Together, these two behaviors cause Spark to miss data: a region whose data is still entirely in the MemStore is reported with size 0, so its split is dropped. RegionSizeCalculator should therefore take both the StoreFile size and the MemStore size into account.
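A minimal sketch of the proposed fix, as a plain-Java model rather than a patch against the real RegionSizeCalculator (in HBase itself the two sizes would presumably come from the region's metrics, e.g. store file size and memstore size; the class and variable names below are illustrative only):

```java
// Hypothetical sketch: estimate a region's size from both its store files
// and its memstore, instead of from store files alone.
public class RegionSizeSketch {

    /** Proposed estimate: on-disk StoreFile bytes plus in-memory MemStore bytes. */
    static long regionSizeBytes(long storeFileSizeBytes, long memStoreSizeBytes) {
        return storeFileSizeBytes + memStoreSizeBytes;
    }

    public static void main(String[] args) {
        // A freshly written region: nothing flushed yet, all data in the memstore.
        long storeFileSize = 0L;
        long memStoreSize = 64L * 1024 * 1024; // 64 MB of unflushed writes

        // Current behavior: only store files counted, so the region looks empty
        // and spark.hadoopRDD.ignoreEmptySplits drops its split.
        long oldEstimate = storeFileSize;

        // Proposed behavior: memstore included, so the split survives.
        long newEstimate = regionSizeBytes(storeFileSize, memStoreSize);

        System.out.println("old=" + oldEstimate + " new=" + newEstimate);
    }
}
```

With the memstore counted, the split for such a region is non-empty and Spark keeps it, so the unflushed rows are no longer skipped.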

Attachments

Issue Links

Activity

People

Assignee: Ddupg Sun Xin
Reporter: Ddupg Sun Xin
Votes: 0
Watchers: 3

Dates

Created:
Updated:
Resolved: