  1. HBase
  2. HBASE-5416

Filter on one CF and if a match, then load and return full row (WAS: Improve performance of scans with some kind of filters)

    Details

    • Hadoop Flags:
      Reviewed
    • Release Note:
      A new method is added to Filter that lets a filter specify which column families (CFs) are essential to its operation:

      public boolean isFamilyEssential(byte[] name);

      When a new row is considered, only the data for the essential families is loaded and the filter is applied. The rest of the row's data is loaded only if the filter accepts the row.
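As a rough illustration of the filter side of this contract, the following self-contained sketch models a filter that inspects a single column family and therefore reports only that family as essential. The `FamilyAwareFilter` interface and both filter classes are hypothetical stand-ins written for this example; they are not the real HBase `Filter` classes, and only the `isFamilyEssential(byte[] name)` signature is taken from the release note.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EssentialFamilySketch {

    /** Hypothetical stand-in for the filter contract; not the real HBase Filter class. */
    interface FamilyAwareFilter {
        boolean isFamilyEssential(byte[] name);
    }

    /** Toy filter that only ever looks at cells from one column family. */
    static final class SingleFamilyFilter implements FamilyAwareFilter {
        private final byte[] family;

        SingleFamilyFilter(String family) {
            this.family = bytes(family);
        }

        @Override
        public boolean isFamilyEssential(byte[] name) {
            // Only the family this filter inspects must be loaded up front.
            return Arrays.equals(family, name);
        }
    }

    /** Toy filter that may inspect any cell, so every family is essential. */
    static final class WholeRowFilter implements FamilyAwareFilter {
        @Override
        public boolean isFamilyEssential(byte[] name) {
            return true;
        }
    }

    static byte[] bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        FamilyAwareFilter filter = new SingleFamilyFilter("flags");
        System.out.println(filter.isFamilyEssential(bytes("flags"))); // true
        System.out.println(filter.isFamilyEssential(bytes("snap")));  // false
    }
}
```

Answering `true` for every family, as the hypothetical `WholeRowFilter` does, is the conservative choice: it gives up the IO saving but can never hide data from a filter that needs it.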

      This feature is off by default. Use Scan.setLoadColumnFamiliesOnDemand() to enable it on a per-Scan basis. If it is not set on the Scan, the boolean value of "hbase.hregion.scan.loadColumnFamiliesOnDemand" is used (default: false).
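The precedence described here (a per-Scan setting wins, otherwise the configured default applies) can be sketched as a small tri-state lookup. The `resolveLoadOnDemand` helper and its parameter names are assumptions made for illustration; this is not the code from the patch, and a `null` argument simply models "not set on the Scan".

```java
public class OnDemandFlagSketch {

    /**
     * Hypothetical helper: perScanSetting is null when the Scan did not say
     * anything, in which case the region-wide default decides.
     */
    static boolean resolveLoadOnDemand(Boolean perScanSetting, boolean regionDefault) {
        if (perScanSetting != null) {
            return perScanSetting;
        }
        return regionDefault;
    }

    public static void main(String[] args) {
        // Scan silent, configuration left at its default of false.
        System.out.println(resolveLoadOnDemand(null, false));  // false
        // Scan opts in, configuration default is false.
        System.out.println(resolveLoadOnDemand(true, false));  // true
        // Scan opts out even though the configuration enables the feature.
        System.out.println(resolveLoadOnDemand(false, true));  // false
    }
}
```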

      Description

      When a scan is performed, the whole row is loaded into the result list, and only then is the filter (if one exists) applied to decide whether the row is needed.

      But when a scan covers several CFs and the filter checks only data from a subset of them, data from the CFs not checked by the filter is not needed at the filter stage; it is needed only once we have decided to include the current row. In such cases we can significantly reduce the amount of IO performed by a scan by loading only the values actually checked by the filter.

      For example, we have two CFs: flags and snap. Flags is quite small (on the order of megabytes) and is used to filter large entries from snap. Snap is very large (tens of GB) and is quite costly to scan. If we need only rows with some flag set, we use SingleColumnValueFilter to limit the result to a small subset of the region. But the current implementation loads both CFs to perform the scan, when only a small subset is needed.

      The attached patch adds one method to the Filter interface that lets a filter specify which CFs are needed for its operation. In HRegion, we separate all scanners into two groups: those needed for the filter and the rest (joined). When a new row is considered, only the needed data is loaded and the filter is applied; only if the filter accepts the row is the rest of the data loaded. On our data, this speeds up such scans 30-50 times. It also gives us a way to better normalize the data into separate columns by optimizing the scans performed.
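The two-group idea can be illustrated with a toy in-memory model, using the flags/snap layout from the example above. This is a minimal sketch under stated assumptions, not the HRegion implementation: `FamilyStore`, `scan`, and the read counters are hypothetical names, each "read" stands in for the IO of loading one row from one family, and the data is synthetic.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

public class JoinedScanSketch {

    /** Toy store for one column family: row key -> value, counting reads. */
    static final class FamilyStore {
        final Map<String, String> rows = new LinkedHashMap<>();
        int reads = 0;

        String read(String rowKey) {
            reads++;
            return rows.get(rowKey);
        }
    }

    /**
     * Reads the essential family for every row, applies the filter, and reads
     * the joined family only for rows the filter accepts.
     */
    static List<String[]> scan(List<String> rowKeys, FamilyStore essential,
                               FamilyStore joined, Predicate<String> filter) {
        List<String[]> out = new ArrayList<>();
        for (String key : rowKeys) {
            String flag = essential.read(key);
            if (flag == null || !filter.test(flag)) {
                continue; // rejected: the joined family is never touched
            }
            out.add(new String[] {key, flag, joined.read(key)});
        }
        return out;
    }

    public static void main(String[] args) {
        FamilyStore flags = new FamilyStore();
        FamilyStore snap = new FamilyStore();
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            String key = "row" + i;
            keys.add(key);
            flags.rows.put(key, i % 50 == 0 ? "hot" : "cold");
            snap.rows.put(key, "large-payload-" + i);
        }
        List<String[]> result = scan(keys, flags, snap, "hot"::equals);
        // prints "2 rows returned, 100 flag reads, 2 snap reads"
        System.out.println(result.size() + " rows returned, "
                + flags.reads + " flag reads, " + snap.reads + " snap reads");
    }
}
```

In this toy the large family is read only for accepted rows. The counts say nothing about real-world speedups, which depend on filter selectivity and the relative sizes of the families.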

        Attachments

        1. 5416-0.94-v1.txt
          33 kB
          Lars Hofhansl
        2. 5416-0.94-v2.txt
          33 kB
          Lars Hofhansl
        3. 5416-0.94-v3.txt
          33 kB
          Lars Hofhansl
        4. 5416-drop-new-method-from-filter.txt
          5 kB
          Ted Yu
        5. 5416-Filtered_scans_v6.patch
          22 kB
          Ted Yu
        6. 5416-TestJoinedScanners-0.94.txt
          7 kB
          Ted Yu
        7. 5416-v13.patch
          59 kB
          Ted Yu
        8. 5416-v14.patch
          59 kB
          Ted Yu
        9. 5416-v15.patch
          59 kB
          Ted Yu
        10. 5416-v16.patch
          59 kB
          Ted Yu
        11. 5416-v5.txt
          16 kB
          Ted Yu
        12. 5416-v6.txt
          15 kB
          Ted Yu
        13. Filtered_scans_v2.patch
          10 kB
          Max Lapan
        14. Filtered_scans_v3.patch
          16 kB
          Max Lapan
        15. Filtered_scans_v4.patch
          16 kB
          Max Lapan
        16. Filtered_scans_v5.1.patch
          23 kB
          Max Lapan
        17. Filtered_scans_v5.patch
          22 kB
          Max Lapan
        18. Filtered_scans_v7.patch
          32 kB
          Max Lapan
        19. Filtered_scans.patch
          8 kB
          Max Lapan
        20. HBASE-5416-v10.patch
          60 kB
          Sergey Shelukhin
        21. HBASE-5416-v11.patch
          60 kB
          Sergey Shelukhin
        22. HBASE-5416-v12.patch
          64 kB
          Sergey Shelukhin
        23. HBASE-5416-v12.patch
          64 kB
          Sergey Shelukhin
        24. HBASE-5416-v7-rebased.patch
          32 kB
          Sergey Shelukhin
        25. HBASE-5416-v8.patch
          59 kB
          Sergey Shelukhin
        26. HBASE-5416-v9.patch
          59 kB
          Sergey Shelukhin
        27. org.apache.hadoop.hbase.regionserver.TestHRegion-output.txt
          4.23 MB
          Ted Yu

              People

              • Assignee:
                sershe Sergey Shelukhin
                Reporter:
                shmuma Max Lapan
              • Votes:
                0
                Watchers:
                29