Now that the first cut of the Coprocessors (CP) implementation has been committed to trunk (HBASE-2001 and HBASE-2002), I think this is a good opportunity to get going with this issue. I hope that a CP-based implementation of region-level indexing will confirm that the CP API is complete and has everything one might need (for now).
I revised the design/approach of the IHBase contrib and have several questions about transforming the code to be based on CPs. It would be great if someone could help me with them!
1) Are coprocessors meant to be stateless? If not, then I assume that one instance is created and "assigned" to a region, and that a CP implementation should be thread-safe (e.g. multiple scanners can be handled at the same time for the region). Otherwise, if coprocessors are meant to be stateless, I believe that CoprocessorEnvironment's get/put/remove methods are meant for storing intermediate data (aka attributes) between method calls (if we really need it). Is one CoprocessorEnvironment instance created per region? I know that I can, e.g., store scan-related data keyed by the scanId passed to the scan-related callbacks (is that safe?), but what about region-related data (no problem there if the CP env is one-per-region)?
In general, do I understand the CP API correctly (based on the assumptions I share in this point)?
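To make the "stateful, one instance per region" case in point 1 concrete, here is a minimal sketch of what thread-safe per-scanner state could look like. All names here (RegionIndexState, onScannerOpen, advance, onScannerClose) are hypothetical and are not part of the CP API; the only assumption carried over from above is that callbacks receive a scanId and may run concurrently for the same region.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch, NOT the HBase coprocessor API. It only illustrates
// keeping per-scanner state inside one shared, per-region object.
class RegionIndexState {
    // scanId -> last row key the index positioned this scanner at.
    // ConcurrentHashMap lets several scanners update their own entry safely.
    private final ConcurrentMap<Long, String> cursorByScanId = new ConcurrentHashMap<>();

    // Called when a scanner is opened on the region (startRow must be non-null).
    void onScannerOpen(long scanId, String startRow) {
        cursorByScanId.put(scanId, startRow);
    }

    // Records the new position; does nothing if the scanner was already closed.
    void advance(long scanId, String rowKey) {
        cursorByScanId.replace(scanId, rowKey);
    }

    // Returns the current position, or null if the scanner is unknown.
    String cursor(long scanId) {
        return cursorByScanId.get(scanId);
    }

    // Drops the state so closed scanners do not leak memory.
    void onScannerClose(long scanId) {
        cursorByScanId.remove(scanId);
    }

    int openScanners() {
        return cursorByScanId.size();
    }
}
```

If coprocessors turn out to be stateless instead, the same map would presumably have to live behind whatever attribute storage the environment provides, which is exactly why the one-per-region question matters.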
2) During a batch scan (something that was added in trunk but wasn't supported in previous HBase versions, and hence isn't taken into account by the current IHBase implementation) we need to return multiple rows from the scan's next() method. It looks like if we apply the current approach (from the current IHBase implementation) of "fast-forwarding" to the next value, we'll only fast-forward the scan to the first of the values to return. The others will be fetched using the "usual" scan logic without using the index, which isn't efficient. There's not a lot we can do without changing the scan (and deeper) code. Am I right here? Perhaps it's OK for the first version of CP-based IHBase to lack support for batch reads? Or might it be that we should change the approach?
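For illustration only, this is roughly what "using the index for every row of the batch" in point 2 would mean, as opposed to seeking only to the first row. It is a self-contained sketch over an in-memory sorted set, not HBase scan code: BatchIndexSeekSketch and nextBatch are hypothetical names, and a NavigableSet of row keys stands in for the region-level index.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;

// Hypothetical sketch, NOT HBase scanner code. The sorted set stands in
// for an index holding the row keys that match the scan's criteria.
class BatchIndexSeekSketch {
    // Returns up to `batch` matching row keys at or after `from`,
    // consulting the index for every row instead of only the first one.
    static List<String> nextBatch(NavigableSet<String> indexedRows, String from, int batch) {
        List<String> out = new ArrayList<>();
        // First seek: smallest indexed row >= from (the existing fast-forward).
        String candidate = indexedRows.ceiling(from);
        while (candidate != null && out.size() < batch) {
            out.add(candidate);
            // Subsequent seeks: jump straight to the next indexed row,
            // skipping all non-matching rows in between.
            candidate = indexedRows.higher(candidate);
        }
        return out;
    }
}
```

With indexed rows r02, r05 and r09, a batch of 2 starting from r03 would yield r05 and r09 without touching r06 to r08. Whether the real scan code allows a seek between rows of one batch from inside a CP callback is the open question.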
3) Is it in general a good idea for me to take this initiative (transforming the IHBase implementation into a CP-based one)? I fear that, due to the many changes in the HBase codebase (trunk versus e.g. 0.20.5), there are going to be severe changes in the approach/design of the index implementation (compared to the current one, which I could use as a base), so poking you guys (HBase devs) a lot (if really needed) to learn about these things wouldn't be a very efficient way to work on this issue. Anyway, I'd be glad to work on the issue if someone can provide the needed guidance.
4) I haven't dug into the THBase contrib (as I have into IHBase). Will these contribs (IHBase and THBase) be "transferred" to a CP-based implementation as a single effort? I believe they won't be merged, given how differently they work now. Was it really intended to put the tasks for both into a single JIRA issue?