Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0-ALPHA
    • Component/s: search
    • Labels:
      None

      Description

      Provide a non point-in-time interface to get a document.
      For example, if you add a new document, you will be able to get it regardless of whether the searcher has been refreshed.

      1. SOLR-2656_distrib.patch
        7 kB
        Yonik Seeley
      2. SOLR-2656_test.patch
        11 kB
        Yonik Seeley
      3. SOLR-2656.patch
        22 kB
        Yonik Seeley

          Activity

          Yonik Seeley added a comment -

          This feature moves Solr a little further down the NoSQL road (of being a real datastore), and the interface itself should be useful to SolrCloud (for recovery, one node needs to get updates from another node, etc). Also, if we do versioning in the future, that's another place that will need an up-to-date view of the index internally.

          This should probably be based on SearchHandler (since it should be distributed in the future), but should avoid calling getSearcher on the request object (or it should only use getSearcher when it knows a searcher is ready and it has the requested document(s)).
          Perhaps the UpdateHandler.reopenSearcher could instead be reopenReader, and the update handler could keep a reference to the last reader it opened.

          Yonik Seeley added a comment -

          It was tricky tracking the newest reader in the update handler, so I'm going to try to track it in SolrCore and consult the update handler to see if there have been any changes.

          Yonik Seeley added a comment -

          Here's a draft patch (no tests yet) that tracks the latest reader in SolrCore and keeps track of a virtual clock and compares that to the virtual clock of the update handler.
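A minimal sketch of the clock comparison described above, as a toy in-memory model rather than the patch itself. The names `UpdateHandler`, `Core`, `clock`, and `realtime_reader` are hypothetical stand-ins chosen for illustration, and a plain dict stands in for an index reader.

```python
class UpdateHandler:
    """Toy update handler: bumps a virtual clock on every change."""

    def __init__(self):
        self.clock = 0
        self.pending = {}  # docs added since startup (never trimmed in this toy)

    def add(self, doc):
        self.pending[doc["id"]] = doc
        self.clock += 1


class Core:
    """Toy core: remembers the clock value its newest reader was opened at."""

    def __init__(self, handler):
        self.handler = handler
        self.reader = {}       # snapshot standing in for an index reader
        self.reader_clock = 0  # handler clock when the snapshot was taken
        self.reopens = 0

    def realtime_reader(self):
        # Reopen only if the update handler has advanced past our snapshot.
        if self.reader_clock != self.handler.clock:
            self.reader = {**self.reader, **self.handler.pending}
            self.reader_clock = self.handler.clock
            self.reopens += 1
        return self.reader

    def get(self, doc_id):
        return self.realtime_reader().get(doc_id)
```

In this model a get that follows an add triggers exactly one reopen, and further gets with no intervening update reuse the same snapshot.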

          To try it out, add a document without committing, and then
          http://localhost:8983/solr/get?id=SOLR1000

          You can also use the "fl" param...
          http://localhost:8983/solr/get?id=SOLR1000&fl=id,name

          <response>
            <doc name="doc">
              <str name="id">SOLR1000</str>
              <str name="name">Solr, the Enterprise Search Server</str>
            </doc>
          </response>
          

          The "id" param accepts a single id (but you can have multiple parameters)

          You can also use an "ids" param, which is a comma separated list of ids. If you use "ids" or multiple "id" parameters, then the response will look like a normal doclist.
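For illustration, the request forms described above can be assembled with a small helper. This is only a sketch: `realtime_get_url` is a hypothetical name, the base URL is the one from the example above, and ids that themselves contain commas are not handled.

```python
from urllib.parse import urlencode


def realtime_get_url(ids, fl=None, base="http://localhost:8983/solr/get"):
    """Build a /get URL: one id uses "id", several use comma-separated "ids".

    `ids` and `fl` are lists of strings (not a single string).
    """
    ids = list(ids)
    if not ids:
        raise ValueError("at least one id is required")
    if len(ids) == 1:
        params = [("id", ids[0])]
    else:
        params = [("ids", ",".join(ids))]
    if fl:
        params.append(("fl", ",".join(fl)))
    # Keep commas literal so the URL matches the form shown above.
    return base + "?" + urlencode(params, safe=",")


print(realtime_get_url(["SOLR1000"], fl=["id", "name"]))
# prints http://localhost:8983/solr/get?id=SOLR1000&fl=id,name
```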

          Michael McCandless added a comment -

          The approach here is to always reopen the reader on-demand when a RT get arrives, ie, if any changes had been made to the index with IndexWriter?

          Could you use IR.isCurrent instead of tracking your own generation?

          But, stepping back, this approach (open new NRT reader on demand) seems dangerous? Ie perf will be poor if a client has one thread constantly updating and another constantly doing RT get?

          Maybe we should use NRTManager, or its approach, here? Ie, rate limit the reopens, so that if there are too many gets, they are batched up and we only reopen "periodically" (which can still be relatively frequent).
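A rough sketch of the rate-limiting idea, assuming a simple minimum interval between reopens. This is not NRTManager's API; `RateLimitedReopener` and its parameters are hypothetical, and the `reopen` callable stands in for whatever opens a fresh reader.

```python
import time


class RateLimitedReopener:
    """Reopen at most once per min_interval seconds, however many gets arrive."""

    def __init__(self, reopen, min_interval, now=time.monotonic):
        self._reopen = reopen              # callable that opens a fresh reader
        self._min_interval = min_interval  # seconds between reopens
        self._now = now                    # injectable clock, handy for testing
        self._last = None                  # time of the most recent reopen
        self._reader = None

    def reader(self):
        t = self._now()
        # First call always opens; later calls reopen only after the interval.
        if self._last is None or t - self._last >= self._min_interval:
            self._reader = self._reopen()
            self._last = t
        return self._reader
```

The trade-off is visible in the sketch: gets that arrive inside the interval are served from the previous reader, so they may miss the newest updates unless the caller waits for the next reopen.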

          Maybe we should call this near-real-time get?

          Yonik Seeley added a comment -

          The approach here is to always reopen the reader on-demand when a RT get arrives, ie, if any changes had been made to the index with IndexWriter?

          I was thinking ahead to a more generic version where one could specify the clock (I think this will be needed for future distrib indexing support). I actually first added a version that took an explicit clock but then simplified it to always use the latest clock and marked it as experimental.

          But, stepping back, this approach (open new NRT reader on demand) seems dangerous? Ie perf will be poor if a client has one thread constantly updating and another constantly doing RT get?

          It's better than what we have today, and it can be optimized in the future.
          One way would be with a bloom filter of updates that are not yet visible. Another way will again relate to recovery in distributed indexing, when we'll need to ask another node what all the latest updates after clock x were (and since we'll have those on hand, we can check any realtime-get against that first).
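To make the bloom filter idea concrete, here is a small sketch, assuming the filter holds the ids of updates that are not yet visible and is cleared whenever a new reader is opened. The class, its sizing defaults, and `needs_reopen` are all hypothetical; nothing here is taken from the patch.

```python
import hashlib


class BloomFilter:
    """Tiny bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting the key with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

    def clear(self):
        # Call after a reopen, once every pending update has become visible.
        self.bits = bytearray(len(self.bits))


def needs_reopen(pending_ids, doc_id):
    """A get must reopen only if the id may have an update not yet visible."""
    return pending_ids.might_contain(doc_id)
```

A false positive costs only an unnecessary reopen, while a negative answer is always safe, which is why this structure suits the "can I skip the reopen?" question.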

          Maybe we should call this near-real-time get?

          That sort of defeats the purpose of the issue - it's supposed to be a 100% reliable get of the latest version of a document.

          Michael McCandless added a comment -

          Maybe we should call this near-real-time get?

          That sort of defeats the purpose of the issue - it's supposed to be a 100% reliable get of the latest version of a document.

          Right, it will always return the last added doc under that ID; I'm not
          disputing that part.

          I am disputing that it's really "real-time" given that it's built on
          top of "near-real-time". Ie calling this real-time is over-selling
          it, I think; the performance will not be great?

          Another thing to consider is NRTCachingDir; it's good for reducing
          latency when you are frequently flushing tiny segments (make the
          reopen IO-less, except for the ID lookups, unless you use MemCodec, at
          which point the NRT open is fully IO free).

          The approach here is to always reopen the reader on-demand when a RT get arrives, ie, if any changes had been made to the index with IndexWriter?

          I was thinking ahead to a more generic version where one could specify the clock (I think this will be needed for future distrib indexing support). I actually first added a version that took an explicit clock but then simplified it to always use the latest clock and marked it as experimental.

          What kind of "clocks" would one want to plug in here? Do you mean you
          could choose to accept some staleness if you wanted (plug in a clock
          that only increments periodically if there had been updates)?

          But, stepping back, this approach (open new NRT reader on demand) seems dangerous? Ie perf will be poor if a client has one thread constantly updating and another constantly doing RT get?

          It's better than what we have today, and it can be optimized in the future.

          I agree, progress not perfection.

          One way would be with a bloom filter of updates that are not yet visible. Another way will again relate to recovery in distributed indexing, when we'll need to ask another node what all the latest updates after clock x were (and since we'll have those on hand, we can check any realtime-get against that first).

          Maybe Solr should use a transaction log (like ElasticSearch)? I think
          (not certain) that ES serves a RT get directly out of its transaction
          log if the doc is in it (else falls back to the reader)? Then
          simultaneous updates + gets should really be real-time. But I
          realize that'd be a much bigger change...
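The lookup order suggested above (transaction log first, reader second) can be sketched as follows. This is a toy model under stated assumptions: `RealtimeStore` is a hypothetical name, dicts stand in for both the log and the index, deletes are not modeled, and it makes no claim about how ElasticSearch or Solr actually implement this.

```python
class RealtimeStore:
    """Toy store: serve gets from uncommitted updates before the index."""

    def __init__(self):
        self.tlog = {}   # id -> latest uncommitted doc (stand-in for a log)
        self.index = {}  # stand-in for what the current reader can see

    def add(self, doc):
        # The newest version of a doc always lands in the log first.
        self.tlog[doc["id"]] = doc

    def commit(self):
        # Make logged updates visible to the reader, then drop them from the log.
        self.index.update(self.tlog)
        self.tlog.clear()

    def get(self, doc_id):
        doc = self.tlog.get(doc_id)
        if doc is not None:
            return doc                 # found in the log: no reopen needed
        return self.index.get(doc_id)  # otherwise fall back to the reader
```

Because a get never has to open a new reader in this model, simultaneous updates and gets do not force reopens.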

          Yonik Seeley added a comment -

          I am disputing that it's really "real-time" given that it's built on top of "near-real-time". Ie calling this real-time is over-selling it, I think; the performance will not be great?

          Another way of thinking about the naming is that NRT returns you data with a low degree of staleness. Realtime means no staleness.
          no-staleness-get doesn't quite have the same catchiness as realtime-get

          Another thing to consider is NRTCachingDir

          I had sort of assumed that would become a lucene default. If not, we should definitely make it available in Solr.

          What kind of "clocks" would one want to plug in here?

          A client could use this to accept some degree of staleness (more useful if clocks have a relation to real time, or if the internal clock on updates was returned to clients). I was more thinking of future internal uses though - like if we need to retrieve the version of a doc (or other information about it), and we keep track of the last update clock that updated a block of ids, then we can use that to avoid unnecessary re-opens.

          Maybe Solr should use a transaction log

          Yep, that's the idea (we need it for both recovery and durability).

          Yonik Seeley added a comment -

          I coded up a test and then factored it out, since it's probably a good test even before we get realtime get committed (with the percent of realtime queries set to 0).

          Bad news is, I'm getting a hang for some reason (just the test w/ straight trunk). Currently looking into it further, but I thought I'd put up the patch in the meantime anyway.

          Yonik Seeley added a comment -

          Whew, it was just a test bug. A tripped assert (that I had backwards) doesn't trigger a catch(Exception e), so the read threads that decrement the counter all exited, leaving the write threads spinning forever.

          Yonik Seeley added a comment -

          I just committed the implementation attached to SOLR-2700.
          Since the transaction logging does not yet provide durability, realtime-get is the actual feature completed, and hence I used this issue number in CHANGES.

          Yonik Seeley added a comment -

          Here's a patch that enables distributed search support for realtime get.

          Tests still TBD.


            People

            • Assignee:
              Yonik Seeley
              Reporter:
              Yonik Seeley
            • Votes:
              1
              Watchers:
              5
