Description
When the -fixHdfsOverlaps command is run because regions of a table overlap, it moves all the HFiles of the overlapping regions into a new region whose start_key and end_key are calculated as the minimum start_key and the maximum end_key of all the overlapping regions.
When the start_key and end_key of the new region are calculated, an end_key of 'Empty' is not considered, which leads to data loss when the table is scanned with a start row (STARTROW).
For example:
1. Create table 't'.
2. Insert records {00,111,200} into table 't' and flush the data.
3. Split table 't' with split key '100'.
4. Now we have three regions (one parent and two daughter regions):
   1. Region-1('Empty','Empty') => {00,111,200}
   2. Region-2('Empty','100') => {00}
   3. Region-3('100','Empty') => {111,200}
5. Make sure the parent region is not deleted from the file system, then run the -fixHdfsOverlaps command.
The -fixHdfsOverlaps command moves all the HFiles of the three regions {Region-1, Region-2, Region-3} into a new region (Region-4) created with start_key='Empty' and end_key='100'.
This happens because it does not consider end_key='Empty' and instead treats end_key='100' as the maximum. As a result, all the HFiles of the three regions are moved into the new region even though they contain records greater than end_key='100'. One empty region, Region-5(100,Empty), is also created because the table's last region end key was not empty.
Now we have two regions:
1. Region-4(Empty,100) => {00,111,200}
2. Region-5(100,Empty) => {}
When the entire table is scanned, all the records are displayed and there is no data loss. But when a scan with a start row is done, the results are as follows:
1. scan 't', { STARTROW => '00'} => {00,111,200}
2. scan 't', { STARTROW => '100'} => {}
The second scan returns an empty result because it searches for the rows in Region-5(100,Empty), which contains no records, while records {111,200} are present in Region-4(Empty,100).
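To illustrate why the second scan comes back empty, here is a minimal sketch of the lookup in plain Java. It is a simplified model and not HBase client code: the class `ScanModel` and its methods are hypothetical, row keys are modelled as ASCII strings, and an empty string stands for an 'Empty' start key. The scan starts at the region whose start key is the greatest one not above the start row, then walks the following regions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical, simplified model of a table after the hbck repair.
final class ScanModel {
    // Region start key -> rows physically stored in that region's HFiles.
    private final TreeMap<String, List<String>> regions = new TreeMap<>();

    void addRegion(String startKey, List<String> rows) {
        regions.put(startKey, rows);
    }

    // Returns the rows >= startRow, beginning at the region that covers startRow.
    List<String> scan(String startRow) {
        List<String> out = new ArrayList<>();
        // The covering region has the greatest start key <= startRow.
        String first = regions.floorKey(startRow);
        if (first == null) {
            return out;
        }
        // Regions that sort before the covering region are never visited.
        for (List<String> rows : regions.tailMap(first, true).values()) {
            for (String row : rows) {
                if (row.compareTo(startRow) >= 0) {
                    out.add(row);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        ScanModel t = new ScanModel();
        t.addRegion("", List.of("00", "111", "200")); // Region-4(Empty,100)
        t.addRegion("100", List.of());                // Region-5(100,Empty)
        System.out.println(t.scan("00"));
        System.out.println(t.scan("100"));
    }
}
```

In this model, a scan from '100' begins at Region-5 and never visits Region-4, so rows 111 and 200 are skipped even though they are still on disk.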
The problem exists only when end_key='Empty' is present in any of the overlapping regions. I think that if end_key='Empty' is present in any of the overlapping regions, we have to consider it as the maximum end_key.
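The suggested rule could look roughly like the sketch below. This is not the hbck source: `OverlapBounds`, `Range`, `key` and `merge` are hypothetical names used only for illustration, and a zero-length byte array is assumed to represent an 'Empty' key. The point is that an empty end key sorts lowest in a byte-wise comparison, so it has to be special-cased as the maximum.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of computing the boundaries of the merged region.
final class OverlapBounds {
    // A region's [start, end) key range; a zero-length array means 'Empty'.
    static final class Range {
        final byte[] start;
        final byte[] end;

        Range(byte[] start, byte[] end) {
            this.start = start;
            this.end = end;
        }
    }

    static byte[] key(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    static Range merge(List<Range> overlaps) {
        if (overlaps.isEmpty()) {
            throw new IllegalArgumentException("no regions to merge");
        }
        byte[] minStart = overlaps.get(0).start;
        byte[] maxEnd = overlaps.get(0).end;
        for (Range r : overlaps) {
            // An empty start key is a prefix of every key, so it already sorts lowest.
            if (Arrays.compareUnsigned(r.start, minStart) < 0) {
                minStart = r.start;
            }
            // An empty end key means "end of table": once seen, it stays the maximum.
            if (maxEnd.length != 0
                    && (r.end.length == 0 || Arrays.compareUnsigned(r.end, maxEnd) > 0)) {
                maxEnd = r.end;
            }
        }
        return new Range(minStart, maxEnd);
    }

    public static void main(String[] args) {
        Range merged = merge(List.of(
                new Range(key(""), key("")),      // Region-1
                new Range(key(""), key("100")),   // Region-2
                new Range(key("100"), key(""))));  // Region-3
        System.out.println("merged end key is Empty: " + (merged.end.length == 0));
    }
}
```

With the three regions from the example, this rule yields a merged region of (Empty,Empty), so no row falls outside the region that holds its HFiles.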
Issue Links
- is duplicated by HBASE-22646 boundaries errors in the overlap after using hbck fix (Resolved)