HBase
HBASE-867

If millions of columns in a column family, hbase scanner won't come up

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.20.0
    • Component/s: None
    • Labels:
      None

      Description

      Our Daniel has uploaded a table that has a column family with millions of columns in it. He can get items from the table promptly when specifying row and column. Scanning is another matter. Thread dumps show we're stuck in the scanner constructor, nexting through cells.

        Issue Links

          Activity

          stack created issue -
          stack added a comment -

          It looks like we should be breaking out of the main while loop in the next method after we pass out a column match (because store keys are sorted), but then we fall into the next while loop, which just nexts until we hit the next row.

          Normally this is fine if there are only a few columns in a row, but in Daniel's case it's taking forever to move to the next row.

          Also, we won't split a region if it has only one row, so it's looking like his store files are large, around 1G.

          stack added a comment -

          To be clear, with thousands of columns or more (i.e., a canonical usage), HBase does not work. Here is some of the problem code from StoreFileScanner#next:

          ...
                    while ((keys[i] != null)
                        && (Bytes.compareTo(keys[i].getRow(), viableRow.getRow()) == 0)) {
          
                      // If we are doing a wild card match or there are multiple matchers
                      // per column, we need to scan all the older versions of this row
                      // to pick up the rest of the family members
                      if(!isWildcardScanner()
                          && !isMultipleMatchScanner()
                          && (keys[i].getTimestamp() != viableRow.getTimestamp())) {
                        break;
                      }
          
                      if (columnMatch(i)) {              
                        // We only want the first result for any specific family member
                        if(!results.containsKey(keys[i].getColumn())) {
                          results.put(keys[i].getColumn(), 
                              new Cell(vals[i], keys[i].getTimestamp()));
                          insertedItem = true;
                        }
                      } else {
                        // Content is sorted.  If column no longer matches, break.
                        break;
                      }
          
                      if (!getNext(i)) {
                        closeSubScanner(i);
                      }
                    }
          
                    // Advance the current scanner beyond the chosen row, to
                    // a valid timestamp, so we're ready next time.
                    while ((keys[i] != null) &&
                        ((Bytes.compareTo(keys[i].getRow(), viableRow.getRow()) <= 0)
                            || (keys[i].getTimestamp() > this.timestamp)
                            || (! columnMatch(i)))) {
                      getNext(i);
                    }
          ...
          

          The while loops find the next row by fetching cells until the row no longer matches. With many columns per row, that can take forever (as it's doing in Daniel's case). We need a file format with an index that records where the next row starts. An option would control whether to reach the next row by nexting or by consulting the index.
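The row-offset index proposed here could look roughly like the following. This is a hedged sketch, not HBase code: `RowIndexSketch`, `addRow`, and `seekPastRow` are hypothetical names, and a sorted in-memory map stands in for the on-disk mapfile index.

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch of the proposed fix: keep a row -> file-offset index alongside the
// store file, so that "advance past this row" is a single seek instead of
// nexting through every remaining cell of the row.
public class RowIndexSketch {
  // Maps each row key to the offset of its first cell in the store file.
  private final TreeMap<String, Long> rowOffsets = new TreeMap<>();

  public void addRow(String row, long offset) {
    rowOffsets.put(row, offset);
  }

  // Offset of the first row strictly after currentRow, or -1 at end of file.
  public long seekPastRow(String currentRow) {
    Map.Entry<String, Long> next = rowOffsets.higherEntry(currentRow);
    return next == null ? -1L : next.getValue();
  }

  public static void main(String[] args) {
    RowIndexSketch index = new RowIndexSketch();
    index.addRow("row1", 0L);
    index.addRow("row2", 5_000_000L); // row1 holds millions of cells; skip them in one hop
    System.out.println(index.seekPastRow("row1")); // prints 5000000
  }
}
```

The point of the sorted map is that finding "the first row after this one" is a logarithmic lookup, whereas the code above pays one getNext() per remaining cell of the row.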

          stack added a comment -

          Marking this a critical issue. The only fix I see is a new mapfile type that keeps an index of each row's offset.

          stack made changes -
          Field Original Value New Value
          Priority Major [ 3 ] Critical [ 2 ]
          stack added a comment -

          It's critical because this is a canonical usage pattern.

          stack added a comment -

          Going to take a look at doing this for 0.19.0, since it's embarrassing that we don't handle the canonical use case.

          stack made changes -
          Fix Version/s 0.19.0 [ 12313364 ]
          stack made changes -
          Assignee stack [ stack ]
          stack added a comment -

          Moving to 0.20.0. Need access to the MapFile index to do this fix. Currently it's private.

          stack made changes -
          Fix Version/s 0.20.0 [ 12313474 ]
          Fix Version/s 0.19.0 [ 12313364 ]
          stack added a comment -

          From #IRC, here is another case we need to be smarter about:

          01:13 < BenM>       keys[i] = new HStoreKey(HConstants.EMPTY_BYTE_ARRAY, this.store.getHRegionInfo());
          01:13 < BenM>       if (firstRow != null && firstRow.length != 0) {
          01:13 < BenM>         if (findFirstRow(i, firstRow)) {
          01:13 < BenM>           continue;
          01:13 < BenM>         }
          01:13 < BenM>       }
          01:13 < BenM>       while (getNext(i)) {
          01:13 < BenM>         if (columnMatch(i)) {
          01:13 < BenM>           break;
          01:13 < BenM>         }
          01:13 < BenM>       }
          

          It's setting up scanners after store files have been changed.

          If there are lots of entries for rows we don't care about, then these iterations will take a long time. We need to be smarter about the seek.
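The "smarter seek" amounts to replacing the linear getNext() walk above with a positioned lookup in the sorted key space. A minimal sketch under that assumption (a sorted list stands in for a store file; `SeekSketch` and `seekTo` are hypothetical names, not HBase API):

```java
import java.util.Collections;
import java.util.List;

// Sketch: instead of calling next() until we reach firstRow, binary-search the
// sorted key space to land directly on the first key >= firstRow.
public class SeekSketch {
  static int seekTo(List<String> sortedKeys, String firstRow) {
    int pos = Collections.binarySearch(sortedKeys, firstRow);
    // binarySearch returns -(insertionPoint) - 1 when the key is absent.
    return pos >= 0 ? pos : -pos - 1;
  }

  public static void main(String[] args) {
    List<String> keys = List.of("row1/a", "row1/b", "row5/a", "row9/a");
    // Jump straight past all of row1's cells to the first cell of row5.
    System.out.println(seekTo(keys, "row5")); // prints 2
  }
}
```

The cost drops from O(cells skipped) to O(log n), which is exactly what matters when the skipped rows hold millions of irrelevant cells.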

          Ben Maurer added a comment -

          I confirmed that this code is what was causing crashes for me. What happened is that I had a MR job that would launch multiple scanners on a region while making updates to the same column family they were scanning on (but not the same column). As a result, there were lots of processes that had to grep through all of the irrelevant inserts many times as flushes occurred.

          I think that this case could be fixed in 0.19.0, and furthermore I think the fix might actually clean up the code a lot:

          (10:58:18 PM) BenM: yeah
          (10:58:21 PM) BenM: was just doing that
          (10:58:30 PM) BenM: IMHO, this is a somewhat easier issue to fix
          (10:58:38 PM) BenM: i think it could be done in a way that cleans up the code
          (10:58:50 PM) BenM: right now, the code just scans through each of the map files
          (10:59:02 PM) BenM: without regard to the relative key positions
          (10:59:12 PM) BenM: i think it could use a priority queue so that it only works on the relevent files
          (11:01:22 PM) St^Ack_: BenM: please expand, I don't follow exactly
          (11:01:50 PM) BenM: lets say we have two map files
          (11:02:09 PM) BenM: one with 1/foo:bar 2/foo:bar 3/foo:bar
          (11:02:17 PM) BenM: (row/family:col)
          (11:02:31 PM) BenM: and the other with 1000/blah:blah 1001/blah:blah
          (11:02:39 PM) BenM: the curent logic is
          (11:02:44 PM) BenM: for each map file:
          (11:02:56 PM) BenM: find the first potential row in this file
          (11:03:08 PM) BenM: look at min(all potential rows)
          (11:03:34 PM) BenM: the algorith should be:
          (11:03:43 PM) BenM: q = new PriorityQueue()
          (11:04:05 PM) BenM: for each map file: insert the HStoreKey of the first key in the file
           (11:04:17 PM) BenM: while(k = q.pop()) {
           (11:04:37 PM) BenM:   if (k is interesting) break;
           (11:04:37 PM) BenM:   advance k
           (11:04:37 PM) BenM:   q.push(k)
           (11:04:38 PM) BenM: }
           (11:05:00 PM) BenM: that way, we don't try to find a matching key in the larger rows
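The priority-queue approach Ben sketches above is a standard k-way merge. Here is a self-contained illustration: sorted string lists stand in for map files, `MergeSketch` is a hypothetical name, and keys are zero-padded so lexicographic order matches the numeric order in Ben's example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Sketch of the k-way merge: instead of scanning every map file to its end,
// keep one cursor per file in a priority queue ordered by the key it points
// at, and always advance only the file holding the smallest key.
public class MergeSketch {
  static List<String> merge(List<List<String>> files) {
    // Each queue entry is a cursor {fileIndex, positionInFile}.
    PriorityQueue<int[]> q = new PriorityQueue<>(
        Comparator.comparing(c -> files.get(c[0]).get(c[1])));
    for (int i = 0; i < files.size(); i++) {
      if (!files.get(i).isEmpty()) q.add(new int[] {i, 0});
    }
    List<String> out = new ArrayList<>();
    while (!q.isEmpty()) {
      int[] c = q.poll();
      out.add(files.get(c[0]).get(c[1]));      // smallest remaining key overall
      if (c[1] + 1 < files.get(c[0]).size()) { // advance that file's cursor only
        q.add(new int[] {c[0], c[1] + 1});
      }
    }
    return out;
  }

  public static void main(String[] args) {
    // Ben's example: one file with small rows, another starting at row 1000.
    List<String> a = List.of("0001/foo:bar", "0002/foo:bar", "0003/foo:bar");
    List<String> b = List.of("1000/blah:blah", "1001/blah:blah");
    // The second file is only touched once its keys become the smallest.
    System.out.println(merge(List.of(a, b)));
  }
}
```

The win is that files whose next key is far ahead (the "1000/blah:blah" file) sit idle in the queue rather than being nexted through on every step.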

          Ben Maurer made changes -
          Link This issue is related to HBASE-1206 [ HBASE-1206 ]
          Jonathan Gray made changes -
          Link This issue is related to HBASE-1249 [ HBASE-1249 ]
          Jonathan Gray added a comment -

          Would like to get this solved as part of the HBASE-1249 work.

          Jonathan Gray added a comment -

          The idea Ben describes above is part of HBASE-1249. This issue will be resolved as part of 1249.

          Jonathan Gray made changes -
          Assignee stack [ stack ] Jonathan Gray [ streamy ]
          Jonathan Gray added a comment -

          Issue is probably resolved, but I'm going to continue testing. It's not possible to create JUnit tests because the pseudo-distributed cluster falls over under load this high.

          First simple test (gets only, not scans; scans next):

          Generated 1 Puts in 3068ms
                  1 rows, 8 bytes/row key
                  1000000 columns/row, 8 bytes/column key
                  8 bytes/column value
          Inserted Put in 23295ms
          Get0 (1000000 KVs) completed in 4031ms
          Get1 (1000000 KVs) completed in 2300ms
          Get2 (1000000 KVs) completed in 2831ms
          Get3 (1000000 KVs) completed in 1707ms
          Get4 (1000000 KVs) completed in 2588ms
          Get5 (1000000 KVs) completed in 2671ms
          Get6 (1000000 KVs) completed in 2442ms
          Get7 (1000000 KVs) completed in 2560ms
          Get8 (1000000 KVs) completed in 2462ms
          Get9 (1000000 KVs) completed in 2822ms
          
          stack added a comment -

          What's the above saying, JGray? That you added a single row with a million columns? And then you did 10 gets and they took about 2.5 seconds on average to complete? Which column were you getting?

          Scans would be good to get numbers for. That was the reason for the original filing.

          Good stuff.

          Jonathan Gray added a comment -

          I am doing tests for this issue on a 5+1 node cluster; each node is 2-core/2GB and hosts two HDFS and two HBase instances (the 0.19 cluster is still up but idle).

          Using a newer version of the HBench tool I posted in HBASE-1501, I'm able to run a number of different tests with high numbers of columns.

          My test is inserting 10 rows, each with 2M columns. I do it in 200 rounds, each round I insert 10k columns in each of the 10 rows.

          Qualifiers are incremented binary longs (1 -> 2M), so 8 bytes. Values are randomized binary data of fixed length. By varying the size of the value (have tried between 8 and 32 bytes per value), I can get different behavior.

          With not much memory to give the RS, I run into OOME problems when serializing the Result. I'm going to rerun tests at higher value sizes and get some clean logs to look at, making sure I have block caching disabled so it doesn't hog heap.

          However, with 8-byte values I'm able to import without a problem (this causes several splits; in the end we have 5 regions for the 10 rows). In addition to the import test, I'm also scanning these 10 rows in two ways: a full scan (all in family) as well as a skip scan (I'm asking for two specific columns, qualifier=1 and qualifier=1888888, so the beginning and end of each row).

          Inserted 10 rows each with 2000000 total columns in 344566ms (34456.6ms/row)
          
          Skip Scanner open
          Row [row0] Scanned, Contains 2 Columns (10155 ms)
          Row [row1] Scanned, Contains 2 Columns (9978 ms)
          Row [row2] Scanned, Contains 2 Columns (10675 ms)
          Row [row3] Scanned, Contains 2 Columns (9608 ms)
          Row [row4] Scanned, Contains 2 Columns (11703 ms)
          Row [row5] Scanned, Contains 2 Columns (12103 ms)
          Row [row6] Scanned, Contains 2 Columns (6828 ms)
          Row [row7] Scanned, Contains 2 Columns (6603 ms)
          Row [row8] Scanned, Contains 2 Columns (6331 ms)
          Row [row9] Scanned, Contains 2 Columns (6553 ms)
          Scanned 10 rows in 90551ms (9055.1ms/row)
          
          Full Scanner open
          Row [row0] Scanned, Contains 2000000 Columns (14374 ms)
          Row [row1] Scanned, Contains 2000000 Columns (14879 ms)
          Row [row2] Scanned, Contains 2000000 Columns (14053 ms)
          Row [row3] Scanned, Contains 2000000 Columns (14263 ms)
          Row [row4] Scanned, Contains 2000000 Columns (8811 ms)
          Row [row5] Scanned, Contains 2000000 Columns (10327 ms)
          Row [row6] Scanned, Contains 2000000 Columns (9757 ms)
          Row [row7] Scanned, Contains 2000000 Columns (9343 ms)
          Row [row8] Scanned, Contains 2000000 Columns (9526 ms)
          Row [row9] Scanned, Contains 2000000 Columns (10004 ms)
          Scanned 10 rows in 115342ms (11534.2ms/row)
          

          Repeated runs improve performance, and the ordering of the two types of scans makes a difference. The block cache is off, so we're seeing the effect of the Linux file cache.

          Jonathan Gray added a comment -

          Given that I can do the above on a smallish cluster without a problem (the only problems come from OOME when serializing these big rows if I grow value size), I am resolving this issue as fixed by HBASE-1304.

          To be able to scale further as far as columns in a row, we need intra-row scanning. I have filed HBASE-1537 currently targeted at 0.21 for now.

          There are still improvements to be made when not returning all columns in a row with millions of columns. That optimization is addressed in HBASE-1517 and slated for 0.20.1.

          Jonathan Gray made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          stack added a comment -

          I'm +1 on resolving this issue on the strength of the HBASE-1304 work (in spite of what the issue title says) and doing further improvements in other issues. Let me add a note to CHANGES.txt that this issue was resolved.

          stack made changes -
          Status Resolved [ 5 ] Closed [ 6 ]

            People

            • Assignee:
              Jonathan Gray
              Reporter:
              stack
            • Votes:
              3
              Watchers:
              4
