Hey George, thanks for the patch.
I have a question about how this improves performance over an
index layout similar to the SimpleIndexKeyGenerator. I have the same
requirements you mention above: namely I'd like to quickly find all
rows in table A which have a value for COL1 of 'X'.
I build my index keys like <col1-value><sep><base-row-id> where <sep>
is a special byte sequence that does not occur in column values or row
keys. (Actually it can occur; if so I just escape it in the
index row.) Let's say <sep> is '__' in the example below.
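
To make the key layout concrete, here is a minimal sketch of building and parsing such index keys. The escaping scheme (a backslash in front of every '_' and every backslash in the column value) is only an assumption for illustration, not necessarily what my code or the SimpleIndexKeyGenerator does, and the function names are hypothetical.

```python
from typing import Tuple

SEP = b"__"   # separator between column value and base row id
ESC = b"\\"   # hypothetical escape byte, chosen for this sketch only


def escape(value: bytes) -> bytes:
    # Escape backslashes first, then underscores, so that an
    # unescaped '__' can never appear inside the escaped value.
    return value.replace(ESC, ESC + ESC).replace(b"_", ESC + b"_")


def build_index_key(col_value: bytes, base_row: bytes) -> bytes:
    # <col-value><sep><base-row-id>; only the value needs escaping,
    # because the row id is everything after the first bare separator.
    return escape(col_value) + SEP + base_row


def split_index_key(key: bytes) -> Tuple[bytes, bytes]:
    # Walk the key, undoing escapes until the first bare separator.
    value = bytearray()
    i = 0
    while i < len(key):
        if key[i:i + 1] == ESC:
            value += key[i + 1:i + 2]
            i += 2
        elif key[i:i + 2] == SEP:
            return bytes(value), key[i + 2:]
        else:
            value += key[i:i + 1]
            i += 1
    raise ValueError("no separator found in index key")
```

Note that this kind of escaping keeps equality lookups correct but does not preserve the sort order of the raw values, which would matter for range queries over the indexed column.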
So if I have base rows:
ROW | COL_A
aaa | foo
bbb | bar
ccc | foo
ddd | zoo
Then my index would look like (just the rows are shown):
bar__bbb
foo__aaa
foo__ccc
zoo__ddd
So for the query find all rows where COL_A == foo, I do an index scan
starting at 'foo_' and ending at 'foo*' (where * is the byte after
'_'). This will scan through only the two index rows I wanted. Looks
like your patch will make it so rather than scanning two rows with one
cell each I scan one row with two cells each. I'm not 100% sure on the
specifics, but I think these two queries would generally be of the
same order of performance.
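
The range scan I describe can be sketched in a few lines. This is a toy model only: a sorted Python list stands in for the index table, and the helper names are hypothetical. It uses the full 'foo__' prefix as the start key and, as the stop key, the same prefix with its last byte incremented.

```python
import bisect


def stop_key_for_prefix(prefix: bytes) -> bytes:
    # Smallest key that sorts after every key starting with `prefix`:
    # drop trailing 0xFF bytes, then increment the last remaining byte.
    trimmed = prefix.rstrip(b"\xff")
    if not trimmed:
        raise ValueError("prefix has no upper bound")
    return trimmed[:-1] + bytes([trimmed[-1] + 1])


def scan(sorted_keys, start: bytes, stop: bytes):
    # Return keys in [start, stop), like a start/stop-row scan
    # over a table whose rows are kept in sorted order.
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, stop)
    return sorted_keys[lo:hi]


# Index rows for the example base table, kept sorted as in the index.
index_rows = sorted([b"foo__aaa", b"bar__bbb", b"foo__ccc", b"zoo__ddd"])

prefix = b"foo__"
matches = scan(index_rows, prefix, stop_key_for_prefix(prefix))

# The base row id is whatever follows the separator.
base_rows = [key[len(prefix):] for key in matches]
```

Here `matches` holds only the two 'foo' index rows, so the scan touches exactly as many entries as there are matching base rows.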
Do I understand things correctly? Is there a reason you could not use
the existing index mechanism for your needs?
I think we could do some work to make this pattern more obvious and
usable with the current infrastructure, but I'm a bit hesitant to add yet
another region/regionserver extension.
George, what do you think?
Slightly aside: When I read about AppEngine's index (a year ago or so), they said that they maintain N index rows for a single base row (1 per column being indexed). I've been wanting to rework this framework to support that as well, but it has not been a high priority as it would require a rewrite of our query stuff that uses the current indexing layer. The approach you take is the opposite: 1 index row for N base rows. Not sure that really says anything, but ...