HBase / HBASE-10102

CF.VERSIONS is not enforced with timerange scans

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      Example brought up by Niels Basjes on the user list:
      If I run the following commands in the HBase shell

          create 't1', {NAME => 'c1', VERSIONS => 1}
          put 't1', 'r1', 'c1', 'One', 1000
          put 't1', 'r1', 'c1', 'Two', 2000
          put 't1', 'r1', 'c1', 'Three', 3000
          get 't1', 'r1'
          get 't1', 'r1' , {TIMERANGE => [0,1500]}
      
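      the result is this (note that the second get still returns the version at timestamp 1000, although the column family is configured with VERSIONS => 1):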
      
          get 't1', 'r1'
          COLUMN                     CELL
           c1:                       timestamp=3000, value=Three
          1 row(s) in 0.0780 seconds
      
          get 't1', 'r1' , {TIMERANGE => [0,1500]}
          COLUMN                     CELL
           c1:                       timestamp=1000, value=One
          1 row(s) in 0.1390 seconds
      

        Activity

        Lars Hofhansl added a comment -

        Currently the workflow in ScanQueryMatcher is something like this:

        1. <versions> = min(<CF versions>, <scan versions>)
        2. filter by timerange
        3. filter out columns (i.e. columns not specified in the scan)
        4. apply custom filters
        5. filter by <versions>

        Every KV is passed through this filtering process.
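
        As a rough illustration of why this ordering returns 'One' in the example from the description, here is a toy Python model. It is not ScanQueryMatcher code: the function name `current_order` and the in-memory cell list are hypothetical, and it assumes all three puts are still physically present (not yet removed by a compaction). The time range is treated as half-open, [min, max).

```python
# Toy model of the current ordering; not HBase code.
# Cells of one row/column, newest first, as (timestamp, value).
CELLS = [(3000, "Three"), (2000, "Two"), (1000, "One")]

def current_order(cells, cf_versions, scan_versions, time_range):
    """Apply the time range first, then count versions among the survivors."""
    lo, hi = time_range                              # half-open [lo, hi)
    max_versions = min(cf_versions, scan_versions)   # step 1
    result = []
    for ts, value in cells:                          # newest first
        if not (lo <= ts < hi):                      # step 2: filter by timerange
            continue
        # steps 3 and 4 (column selection, custom filters) are omitted here
        if len(result) >= max_versions:              # step 5: filter by <versions>
            break
        result.append((ts, value))
    return result

print(current_order(CELLS, 1, 1, (0, 1500)))   # [(1000, 'One')]
```

        Because the newer cells are dropped by the time range before any version counting happens, the cell at timestamp 1000 is counted as the first version.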

        What we should do is this:

        1. filter by <CF versions>
        2. filter by timerange
        3. filter out columns (i.e. columns not specified in the scan)
        4. apply custom filters
        5. filter by <scan versions>

        I have a POC patch that does this. It does not slow scanning in a measurable way.
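
        A minimal sketch of the proposed ordering, again as a toy Python model and not the POC patch itself. The name `proposed_order` and the cell list are hypothetical, and the time range is treated as half-open, [min, max).

```python
# Toy model of the proposed ordering; not the POC patch.
# Cells of one row/column, newest first, as (timestamp, value).
CELLS = [(3000, "Three"), (2000, "Two"), (1000, "One")]

def proposed_order(cells, cf_versions, scan_versions, time_range):
    """Enforce the CF version limit first, then the time range, then scan versions."""
    lo, hi = time_range                      # half-open [lo, hi)
    result = []
    seen = 0
    for ts, value in cells:                  # newest first
        seen += 1
        if seen > cf_versions:               # step 1: filter by <CF versions>
            break
        if not (lo <= ts < hi):              # step 2: filter by timerange
            continue
        # steps 3 and 4 (column selection, custom filters) are omitted here
        if len(result) >= scan_versions:     # step 5: filter by <scan versions>
            break
        result.append((ts, value))
    return result

print(proposed_order(CELLS, 1, 1, (0, 1500)))   # []
print(proposed_order(CELLS, 3, 1, (0, 2500)))   # [(2000, 'Two')]
```

        With VERSIONS => 1 only the newest cell counts as live, so a time range that excludes it yields no result. Note that this requires the scanner to see the newer cells in order to count them.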

        Lars Hofhansl added a comment -

        POC patch. Just need to park it somewhere. Not tested.

        Lars Hofhansl added a comment -

        Vasu Mariyala just pointed out that even if we fixed this issue, we'd have to remove time-based HFile selection in order to find and count the newer values even when we're only interested in the older ones. Clearly a no-go.
        Closing as "Won't fix".
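
        To illustrate the objection, here is a toy Python model of time-based file selection. It is a sketch under assumptions, not HBase code: the file layout (each put flushed into its own store file), the dictionary fields, and the function names `select_files` and `visible_cells` are all hypothetical.

```python
# Toy model of time-based store file selection; not HBase code.
# Hypothetical layout: each put from the example ended up in its own file.
FILES = [
    {"name": "file_a", "min_ts": 1000, "max_ts": 1000, "cells": [(1000, "One")]},
    {"name": "file_b", "min_ts": 2000, "max_ts": 2000, "cells": [(2000, "Two")]},
    {"name": "file_c", "min_ts": 3000, "max_ts": 3000, "cells": [(3000, "Three")]},
]

def select_files(files, time_range):
    """Keep only files whose [min_ts, max_ts] overlaps the half-open scan range."""
    lo, hi = time_range
    return [f for f in files if f["min_ts"] < hi and f["max_ts"] >= lo]

def visible_cells(files):
    """Cells the scanner can see after file selection, newest first."""
    cells = [cell for f in files for cell in f["cells"]]
    return sorted(cells, reverse=True)

selected = select_files(FILES, (0, 1500))
print([f["name"] for f in selected])   # ['file_a']
print(visible_cells(selected))         # [(1000, 'One')]
```

        Once the files holding timestamps 2000 and 3000 are skipped, the scanner has no newer versions left to count, so enforcing the CF version limit ahead of the time range would mean reading every file regardless of the requested range.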


          People

          • Assignee: Unassigned
          • Reporter: Lars Hofhansl
          • Votes: 0
          • Watchers: 5
