HBASE-3247

Changes API: API for pulling edits from HBase

    Details

    • Type: Task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      Talking to Shay from Elastic Search, he was asking where the Changes API is in HBase. Talking more (there was a bit of beer involved, so apologies up front), he wants to be able to bootstrap an index and thereafter ask HBase for changes since time t. We thought he could tie into the replication stream, but he wants to pull rather than have edits pushed to him (if he crashes, on recovery he can start pulling again from the last good edit received). He could do the bootstrap with a Scan. Thereafter, requests to pull from HBase would pass a marker of some sort. HBase would then give out the edits that came in after this marker, in batches, along with an updated marker.
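
A minimal sketch of the pull loop described above, using a toy in-memory log. Everything here is an assumption made for illustration: `ChangeLog`, `pull`, and the integer offset used as the marker are hypothetical names, not an existing HBase API. A real marker would more likely be an opaque token handed out by the server.

```python
from dataclasses import dataclass, field
from typing import Any, List, Tuple


@dataclass
class ChangeLog:
    """Toy in-memory stand-in for a server-side, ordered log of edits."""
    edits: List[Tuple[str, Any]] = field(default_factory=list)

    def append(self, row: str, value: Any) -> None:
        self.edits.append((row, value))

    def pull(self, marker: int, batch_size: int = 2):
        """Return up to batch_size edits after `marker`, plus the updated marker."""
        batch = self.edits[marker:marker + batch_size]
        return batch, marker + len(batch)


log = ChangeLog()
for i in range(5):
    log.append(f"row{i}", i)

marker = 0       # hypothetical: the position reached by the bootstrap Scan
received = []
while True:
    batch, marker = log.pull(marker)
    if not batch:
        break
    # A real client would apply the batch first and only then persist `marker`,
    # so that a crash replays the batch instead of skipping it.
    received.extend(batch)
```

Because the client owns the marker, recovery after a crash is just another `pull` from the last marker it saved.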

        Activity

        Steven Noels added a comment -

        If this is really about robust (and distributed) pulling, wouldn't the RowLog mechanism as implemented in Lily be a more solid approach, to the point that RowLog would leave in-row process-status data in a non-user-visible column?

        I'm wondering, as I'm seeing a proliferation of alternative yet overlapping approaches to a certain number of issues (secondary indexes, change listening) which in the end could confuse new users.

        stack added a comment -

        @Steven Yes, we should start with RowLog (http://www.lilyproject.org/maven-site/0.1/apidocs/org/lilycms/rowlog/api/RowLog.html).

        "I'm wondering, as I'm seeing a proliferation of alternative yet overlapping approaches to a certain number of issues (secondary indexes, change listening) which in the end could confuse new users."

        -1 to proliferation of alternate yet overlapping... things

        What do you fellas suggest for bootstrapping the system: doing a fat bulk load into the search index and then cutting over to RowLog for incremental updates? Doesn't there have to be an exact transition so followers do not miss edits? Do you fellas have ideas for how to do that?

        Steven Noels added a comment -

        Well, we're doing MapReduce for initial Solr population, which might be a bit too involved compared with something like a Changes API. I reckon our Indexer could be made configurable to connect to ES as well. I'll have Evert look into this issue and comment on it; he just did a writeup on the RowLog on our blog today: http://outerthought.org/blog/449-ot.html

        The thing I would object to if I were a non-Lily person, would be that we need tracking/status data in user-visible columns.

        ryan rawson added a comment -

        Why can't timestamp-based scanning do this? Is it because of the missing deletes? Could there be a scan option to give more raw data? Not really a new API, but still kind of a half API.
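
To make the "missing deletes" concern concrete, here is a toy model rather than HBase code. It assumes deletes are stored as tombstone cells that an ordinary read hides; the `raw` flag is a hypothetical stand-in for the kind of scan option asked about above.

```python
# Toy model: each cell version is (row, timestamp, kind, value).
cells = [
    ("a", 1, "put", "v1"),
    ("b", 2, "put", "v1"),
    ("a", 5, "delete", None),   # tombstone written after the follower's last sync
    ("c", 6, "put", "v1"),
]


def scan_since(cells, since, raw=False):
    """Return cell versions with timestamp > since.

    raw=False mimics an ordinary read: tombstones are hidden.
    raw=True also returns delete markers so a follower can apply them.
    """
    out = []
    for row, ts, kind, value in cells:   # note: still walks every cell
        if ts <= since:
            continue
        if kind == "delete" and not raw:
            continue
        out.append((row, ts, kind, value))
    return out


normal = scan_since(cells, since=3)            # the delete of row "a" is invisible
raw = scan_since(cells, since=3, raw=True)     # the delete of row "a" is returned
```

With the ordinary scan, a follower that last synced at t=3 would never learn that row "a" was deleted; only the raw variant surfaces the tombstone.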

        Jonathan Gray added a comment -

        Scanning requires you to look at all the data (or at least, more than just the data you need). I think that would prove far too inefficient for something like keeping a search index up to date, which you expect to be as "realtime" as possible.

        This is about only needing to see the deltas.

        Evert Arckens added a comment -

        @stack
        With the RowLog you can register a subscription, and then all messages that are put on the rowlog will be kept for that subscription. If you then also register a listener (cf. RowLogMessageListener) on that subscription, the rowlog processor will start feeding the messages to the listener.
        If you can make a bulk load that only processes data that was changed before a certain point in time, you can let that run and in the meantime let the rowlog record all changes that are made after that point.

        Looking a bit further at how the Indexer in Lily uses the rowlog (http://docs.outerthought.org/lily-docs-current/415-lily.html):
        When the indexer receives a message, it will use the record's current data and put that data in the index (IndexUpdater is the listener that is registered on the rowlog).
        An index rebuild will use MapReduce to go over all the data again and update the index.
        The bulk index rebuild and the rowlog-driven index updater are allowed to run in parallel. Both will look at the current data of the record and put that in the index. So there is no need for a transition point from bulk to incremental.
        The indexer is written specifically to put Lily records into a Solr index. It is not yet designed to plug in another index. But it should be doable to use this same framework to have something non-Lily on the one hand and a non-Solr index on the other. If we look at the classes in the framework: the IndexUpdater is the implementation of the RowLogMessageListener which has knowledge about Lily records and decides 'what' to index. The Indexer class is responsible for mapping the Lily schema onto the Solr schema and maintains the communication with Solr.
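
The "no transition point" argument can be sketched with a toy model. This is not Lily code: `store`, `index`, `rowlog`, `write`, and `index_current` are hypothetical names, and the sketch assumes the subscription is registered before the rebuild starts and that no messages are lost.

```python
store = {"r1": "v1", "r2": "v1"}   # toy table: row -> current value
index = {}                          # toy search index
rowlog = []                         # toy message queue of changed row keys


def index_current(row):
    """Copy the row's current state into the index, or drop it if the row is gone."""
    if row in store:
        index[row] = store[row]
    else:
        index.pop(row, None)


def write(row, value):
    store[row] = value
    rowlog.append(row)              # the message carries only the row key


# Interleave a bulk rebuild with live writes and the listener.
index_current("r1")                 # bulk rebuild visits r1
write("r1", "v2")                   # an edit arrives mid-rebuild
write("r3", "v1")                   # a new row the rebuild never visits
index_current("r2")                 # bulk rebuild continues
for row in rowlog:                  # listener drains the log
    index_current(row)
index_current("r1")                 # a late, duplicate bulk visit is harmless
```

Since both paths re-read the current record instead of replaying the edit itself, applying them in any order, or more than once, ends with the index matching the store.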

        stack added a comment -

        Thanks @Evert. BTW, how do we get documentation that is as fancy as yours?


          People

          • Assignee: Unassigned
          • Reporter: stack
          • Votes: 0
          • Watchers: 4
