[HBASE-5416] Filter on one CF and if a match, then load and return full row (WAS: Improve performance of scans with some kind of filters) - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 0.90.4
Fix Version/s: 0.94.5, 0.95.0
Component/s: Filters, Performance, regionserver
Labels: None
Hadoop Flags: Reviewed
Release Note:

A new method is added to Filter that lets a filter specify which CFs (column families) it needs for its operation.

public boolean isFamilyEssential(byte[] name);

When a new row is considered, only the data for the essential families is loaded and the filter is applied. Only if the filter accepts the row is the rest of the data loaded.
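A minimal sketch of how a filter might answer this question, assuming a filter that checks a single family. `FamilyAwareFilter`, `SingleFamilyFilter`, and `EssentialFamilyDemo` are hypothetical stand-ins written for illustration. They are not the HBase classes; only the method signature is taken from the release note.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical stand-in for the Filter method named in the release note.
interface FamilyAwareFilter {
    boolean isFamilyEssential(byte[] name);
}

// Models a filter that inspects exactly one column family.
final class SingleFamilyFilter implements FamilyAwareFilter {
    private final byte[] checkedFamily;

    SingleFamilyFilter(String family) {
        this.checkedFamily = family.getBytes(StandardCharsets.UTF_8);
    }

    @Override
    public boolean isFamilyEssential(byte[] name) {
        // Only the family this filter reads must be loaded before filtering.
        return Arrays.equals(checkedFamily, name);
    }
}

class EssentialFamilyDemo {
    // Convenience wrapper so callers can pass family names as strings.
    static boolean essential(FamilyAwareFilter filter, String family) {
        return filter.isFamilyEssential(family.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        FamilyAwareFilter filter = new SingleFamilyFilter("flags");
        System.out.println("flags essential: " + essential(filter, "flags"));
        System.out.println("snap essential: " + essential(filter, "snap"));
    }
}
```

A filter that cannot tell which families it needs would presumably return true for every family, which falls back to loading whole rows.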

This feature is off by default. You can use Scan.setLoadColumnFamiliesOnDemand() to enable it on a per-Scan basis. If it is not set on the Scan, the boolean value of "hbase.hregion.scan.loadColumnFamiliesOnDemand" is used (default: false).
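The precedence described here (per-Scan setting first, then the configuration key, then false) can be sketched as follows. This is an illustrative model only: `OnDemandFlagDemo` and `resolve` are hypothetical names, and a plain `Map` stands in for the HBase configuration object.

```java
import java.util.HashMap;
import java.util.Map;

class OnDemandFlagDemo {
    // Configuration key quoted from the release note.
    static final String KEY = "hbase.hregion.scan.loadColumnFamiliesOnDemand";

    // perScan == null models a Scan on which the setter was never called.
    static boolean resolve(Boolean perScan, Map<String, String> conf) {
        if (perScan != null) {
            return perScan; // an explicit per-Scan choice wins
        }
        // Otherwise fall back to the configured value, defaulting to false.
        return Boolean.parseBoolean(conf.getOrDefault(KEY, "false"));
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        System.out.println(resolve(null, conf));          // false
        conf.put(KEY, "true");
        System.out.println(resolve(null, conf));          // true
        System.out.println(resolve(Boolean.FALSE, conf)); // false
    }
}
```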

Description

When a scan is performed, the whole row is loaded into the result list; only then is the filter (if any) applied to decide whether the row is needed.

But when a scan covers several CFs and the filter checks data from only a subset of them, the data from the CFs the filter does not check is not needed at the filter stage. It is needed only once we have decided to include the current row. In that case we can significantly reduce the amount of IO performed by a scan by loading only the values the filter actually checks.

For example, we have two CFs: flags and snap. Flags is quite small (a bunch of megabytes) and is used to filter large entries from snap. Snap is very large (tens of GB) and is quite costly to scan. If we need only the rows with some flag set, we use SingleColumnValueFilter to limit the result to a small subset of the region. But the current implementation loads both CFs to perform the scan, even though only a small subset is needed.

The attached patch adds one method to the Filter interface that lets a filter specify which CFs it needs for its operation. In HRegion, we separate all scanners into two groups: those needed by the filter and the rest (joined). When a new row is considered, only the needed data is loaded and the filter is applied; only if the filter accepts the row is the rest of the data loaded. On our data, this speeds up such scans 30-50 times. It also gives us a way to better normalize the data into separate columns by optimizing the scans performed.
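The two-group idea can be illustrated with a toy model that counts simulated bytes read. This is a sketch under stated assumptions, not the HRegion code: `JoinedScanDemo`, `Row`, `ScanResult`, and the sample data are all hypothetical, and each row is reduced to one small flag value plus the size of its large snap payload.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

class JoinedScanDemo {
    // Hypothetical row: a small "flags" value and the size of its "snap" payload.
    static final class Row {
        final String key;
        final String flag;
        final int snapBytes;

        Row(String key, String flag, int snapBytes) {
            this.key = key;
            this.flag = flag;
            this.snapBytes = snapBytes;
        }
    }

    static final class ScanResult {
        final List<String> keys = new ArrayList<>();
        long bytesRead; // simulated IO
    }

    // Behaviour before the patch: load both families, then apply the filter.
    static ScanResult scanWholeRows(List<Row> rows, Predicate<String> filter) {
        ScanResult result = new ScanResult();
        for (Row row : rows) {
            result.bytesRead += row.flag.length() + row.snapBytes;
            if (filter.test(row.flag)) {
                result.keys.add(row.key);
            }
        }
        return result;
    }

    // Behaviour with the patch: load the essential family, filter, then
    // load the joined family only for accepted rows.
    static ScanResult scanOnDemand(List<Row> rows, Predicate<String> filter) {
        ScanResult result = new ScanResult();
        for (Row row : rows) {
            result.bytesRead += row.flag.length();
            if (filter.test(row.flag)) {
                result.bytesRead += row.snapBytes;
                result.keys.add(row.key);
            }
        }
        return result;
    }

    // Builds sample data in which every matchEvery-th row carries flag "Y".
    static List<Row> sampleRows(int count, int matchEvery, int snapBytes) {
        List<Row> rows = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            rows.add(new Row("row" + i, i % matchEvery == 0 ? "Y" : "N", snapBytes));
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Row> rows = sampleRows(100, 10, 1000);
        Predicate<String> flagSet = "Y"::equals;
        System.out.println("whole rows: " + scanWholeRows(rows, flagSet).bytesRead + " bytes");
        System.out.println("on demand:  " + scanOnDemand(rows, flagSet).bytesRead + " bytes");
    }
}
```

Both scans return the same rows; the saving depends on how selective the filter is and on how large the non-essential families are relative to the essential ones.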

Attachments

5416-0.94-v1.txt (30/Dec/12 23:39, 33 kB, Lars Hofhansl)
5416-0.94-v2.txt (31/Dec/12 06:49, 33 kB, Lars Hofhansl)
5416-0.94-v3.txt (11/Jan/13 03:25, 33 kB, Lars Hofhansl)
5416-drop-new-method-from-filter.txt (23/Feb/13 15:22, 5 kB, Ted Yu)
5416-Filtered_scans_v6.patch (03/Jun/12 17:48, 22 kB, Ted Yu)
5416-TestJoinedScanners-0.94.txt (07/Apr/13 23:11, 7 kB, Ted Yu)
5416-v13.patch (27/Dec/12 17:22, 59 kB, Ted Yu)
5416-v14.patch (31/Dec/12 04:38, 59 kB, Ted Yu)
5416-v15.patch (31/Dec/12 05:42, 59 kB, Ted Yu)
5416-v16.patch (05/Jan/13 16:49, 59 kB, Ted Yu)
5416-v5.txt (24/Feb/12 14:46, 16 kB, Ted Yu)
5416-v6.txt (24/Feb/12 15:33, 15 kB, Ted Yu)
Filtered_scans_v2.patch (22/Feb/12 09:06, 10 kB, Max Lapan)
Filtered_scans_v3.patch (24/Feb/12 08:25, 16 kB, Max Lapan)
Filtered_scans_v4.patch (24/Feb/12 09:44, 16 kB, Max Lapan)
Filtered_scans_v5.1.patch (25/May/12 12:08, 23 kB, Max Lapan)
Filtered_scans_v5.patch (24/May/12 14:32, 22 kB, Max Lapan)
Filtered_scans_v7.patch (02/Jul/12 08:52, 32 kB, Max Lapan)
Filtered_scans.patch (20/Feb/12 14:42, 8 kB, Max Lapan)
HBASE-5416-v10.patch (19/Dec/12 00:31, 60 kB, Sergey Shelukhin)
HBASE-5416-v11.patch (19/Dec/12 23:07, 60 kB, Sergey Shelukhin)
HBASE-5416-v12.patch (21/Dec/12 03:23, 64 kB, Sergey Shelukhin)
HBASE-5416-v12.patch (21/Dec/12 01:26, 64 kB, Sergey Shelukhin)
HBASE-5416-v7-rebased.patch (14/Dec/12 03:31, 32 kB, Sergey Shelukhin)
HBASE-5416-v8.patch (15/Dec/12 00:26, 59 kB, Sergey Shelukhin)
HBASE-5416-v9.patch (17/Dec/12 18:54, 59 kB, Sergey Shelukhin)
org.apache.hadoop.hbase.regionserver.TestHRegion-output.txt (11/Jan/13 02:36, 4.23 MB, Ted Yu)

Issue Links

blocks:
HBASE-7383 create integration test for HBASE-5416 (improving scan performance for certain filters) [Closed]

is broken by:
HBASE-6499 StoreScanner's QueryMatcher not reset on store update [Closed]

is related to:
HBASE-8334 Enable essential column family support by default [Closed]

relates to:
HBASE-16731 Inconsistent results from the Get/Scan if we use the empty FilterList [Resolved]
HBASE-7920 Move isFamilyEssential(byte[] name) out of Filter interface in 0.94 [Closed]
HBASE-74 [performance] When a get or scan request spans multiple columns, execute the reads in parallel [Closed]
HBASE-16729 Define the behavior of (default) empty FilterList [Resolved]

Sub-Tasks

1. JoinedHeap for non essential column families should reseek instead of seek [Closed] (Lars Hofhansl)
