[HBASE-21551] Memory leak when use scan with STREAM at server side - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Blocker
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 3.0.0-alpha-1, 2.2.0, 2.1.2, 2.0.4
Component/s: regionserver
Labels: None
Hadoop Flags: Reviewed
Release Note:

### Summary
HBase clusters will experience Region Server failures caused by out-of-memory errors from a memory leak under any of the following conditions:

* User initiates Scan operations set to use the STREAM reading type
* User initiates Scan operations set to use the default reading type that read more than 4 times the block size of the column families involved in the scan (e.g. by default 4 * 64KiB = 256KiB)
* Compactions run

### Root cause

When there are long-running scans, the Region Server process attempts to optimize access by using a different API geared towards sequential access. Due to an error in HBASE-20704, in HBase 2.0+ the Region Server fails to release related resources when those scans finish. That same optimization path is always used for the HBase internal file compaction process.

### Workaround

The impact of this error can be minimized by setting the config value “hbase.storescanner.pread.max.bytes” to MAX_INT to avoid the optimization for default user scans. Clients should also be checked to ensure they do not pass the STREAM read type to the Scan API. This workaround will have a severe impact on performance for long scans.

Compactions always use this sequential optimized reading mechanism, so downstream users will need to periodically restart Region Server roles after compactions have happened.
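
The arithmetic behind the threshold and the workaround can be sketched in plain Java. This is a simplified model, not HBase code: the class and method names are hypothetical, the exact boundary comparison is an assumption, and MAX_INT is assumed to mean Java's `Integer.MAX_VALUE`.

```java
// Simplified model only: names and the exact comparison are assumptions, not HBase code.
final class PreadThresholdSketch {

  // Default block size of 64 KiB, as quoted in the release note.
  static final long DEFAULT_BLOCK_SIZE_BYTES = 64L * 1024;

  /** Default threshold described in the release note: 4 times the block size. */
  static long defaultPreadMaxBytes(long blockSizeBytes) {
    return 4L * blockSizeBytes;
  }

  /** Assumed rule: a default-type scan switches to stream reads once it has read more than the threshold. */
  static boolean switchesToStream(long bytesRead, long preadMaxBytes) {
    return bytesRead > preadMaxBytes;
  }

  public static void main(String[] args) {
    long defaultThreshold = defaultPreadMaxBytes(DEFAULT_BLOCK_SIZE_BYTES);
    long oneMiB = 1024L * 1024;

    System.out.println("default threshold bytes: " + defaultThreshold);
    System.out.println("1 MiB scan switches (default): "
        + switchesToStream(oneMiB, defaultThreshold));
    // With the workaround value, a scan of this size stays on positional reads.
    System.out.println("1 MiB scan switches (workaround): "
        + switchesToStream(oneMiB, Integer.MAX_VALUE));
  }
}
```

Under these assumptions a scan that reads 1 MiB crosses the default 256 KiB threshold, while with the workaround value it would have to read roughly 2 GiB before switching.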


Description

We open the RegionServerScanner with STREAM as follows:

RegionScannerImpl#initializeScanners
      |---> HStore#getScanner
                    |----------> StoreScanner()
                                        |-------> StoreFileScanner#getScannersForStoreFiles
                                                          |------> HStoreFile#getStreamScanner      #1

At #1, we put the StoreFileReader into the concurrent hash map streamReaders, but we do not remove it from streamReaders until the store file is closed.

So if we scan with STREAM many times, the streamReaders hash map grows without bound. The attached heap-dump.jpg shows the resulting heap dump.

I found this bug while benchmarking scan performance with YCSB on a cluster (RS heap size of 50g): the RS frequently ran into long full GCs (~110 sec).
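
The growth pattern can be reproduced with a toy model in plain Java. Every class and method name below is hypothetical and stands in for the HBase classes named above. Untracking the reader when its scanner closes is shown only as one way to bound the collection. The attached patches may fix it differently.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Toy model only: every class and method name here is hypothetical, not HBase code.
final class StreamReaderLeakSketch {

  /** Stand-in for a per-scan stream reader (the real one holds buffers and streams). */
  static final class Reader {}

  /** Stand-in for a store file that tracks the stream readers it has handed out. */
  static final class StoreFile {
    // Assumed to mirror the streamReaders collection named in the description.
    final Set<Reader> streamReaders = ConcurrentHashMap.newKeySet();

    /** Creates a reader for one scan and starts tracking it. */
    Reader openStreamReader() {
      Reader reader = new Reader();
      streamReaders.add(reader);
      return reader;
    }

    /** Closes one scan's reader; when untrackOnClose is false the reader stays tracked. */
    void closeScanner(Reader reader, boolean untrackOnClose) {
      if (untrackOnClose) {
        streamReaders.remove(reader);
      }
    }
  }

  /** Runs open/close scan cycles and returns how many readers remain tracked afterwards. */
  static int trackedAfterScans(int scans, boolean untrackOnClose) {
    StoreFile file = new StoreFile();
    for (int i = 0; i < scans; i++) {
      Reader reader = file.openStreamReader();
      file.closeScanner(reader, untrackOnClose);
    }
    return file.streamReaders.size();
  }

  public static void main(String[] args) {
    System.out.println("tracked, never untracked: " + trackedAfterScans(10_000, false));
    System.out.println("tracked, untracked on close: " + trackedAfterScans(10_000, true));
  }
}
```

In the first variant the collection holds one entry per finished scan for as long as the store file stays open, which matches the behavior described above. In the second it returns to empty.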

Attachments

- HBASE-21551.v1.patch (05/Dec/18 14:24, 1 kB, Zheng Hu)
- HBASE-21551.v2.patch (05/Dec/18 15:12, 5 kB, Zheng Hu)
- HBASE-21551.v3.patch (06/Dec/18 03:46, 6 kB, Zheng Hu)
- heap-dump.jpg (05/Dec/18 13:18, 433 kB, Zheng Hu)

Issue Links

is caused by

HBASE-20704 Sometimes some compacted storefiles are not archived on region close (Closed)

links to

Review board

People

Assignee: Zheng Hu

Reporter: Zheng Hu

Votes: 0

Watchers: 16

Dates

Created: 05/Dec/18 13:16

Updated: 23/Jun/22 18:45

Resolved: 07/Dec/18 01:24