SOLR-445: Update Handlers abort with bad documents

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: master, 6.1
    • Component/s: update
    • Labels: None

      Description

      This issue adds a new TolerantUpdateProcessorFactory, making it possible to configure Solr updates so that they are "tolerant" of individual errors in an update request...

        <processor class="solr.TolerantUpdateProcessorFactory">
          <int name="maxErrors">10</int>
        </processor>
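
      For reference, this processor is declared as part of an update processor chain in solrconfig.xml. Here is a minimal sketch of such a chain — the surrounding Log/Distributed/Run processors are an assumption (the standard defaults), and the chain name matches the update.chain=tolerant-chain used in the curl examples below:

        <updateRequestProcessorChain name="tolerant-chain">
          <!-- placed before DistributedUpdateProcessorFactory so that forwarded
               sub-requests bypass it (see the distributed discussion below) -->
          <processor class="solr.TolerantUpdateProcessorFactory">
            <int name="maxErrors">10</int>
          </processor>
          <processor class="solr.LogUpdateProcessorFactory"/>
          <processor class="solr.DistributedUpdateProcessorFactory"/>
          <processor class="solr.RunUpdateProcessorFactory"/>
        </updateRequestProcessorChain>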
      

      When a chain with this processor is used and maxErrors isn't exceeded, here's what the response looks like...

      $ curl 'http://localhost:8983/solr/techproducts/update?update.chain=tolerant-chain&wt=json&indent=true&maxErrors=-1' -H "Content-Type: application/json" --data-binary '{"add" : { "doc":{"id":"1","foo_i":"bogus"}}, "delete": {"query":"malformed:["}}'
      {
        "responseHeader":{
          "errors":[{
              "type":"ADD",
              "id":"1",
              "message":"ERROR: [doc=1] Error adding field 'foo_i'='bogus' msg=For input string: \"bogus\""},
            {
              "type":"DELQ",
              "id":"malformed:[",
              "message":"org.apache.solr.search.SyntaxError: Cannot parse 'malformed:[': Encountered \"<EOF>\" at line 1, column 11.\nWas expecting one of:\n    <RANGE_QUOTED> ...\n    <RANGE_GOOP> ...\n    "}],
          "maxErrors":-1,
          "status":0,
          "QTime":1}}
      

      Note in the above example that:

      • maxErrors can be overridden on a per-request basis
      • an effective maxErrors==-1 (either from config, or request param) means "unlimited" (under the covers it's using Integer.MAX_VALUE)

      If/When maxErrors is reached for a request, then the first exception that the processor caught is propagated back to the user, and metadata is set on that exception with all of the same details about all the tolerated errors.

      This next example is the same as the previous except that instead of maxErrors=-1 the request param is now maxErrors=1...

      $ curl 'http://localhost:8983/solr/techproducts/update?update.chain=tolerant-chain&wt=json&indent=true&maxErrors=1' -H "Content-Type: application/json" --data-binary '{"add" : { "doc":{"id":"1","foo_i":"bogus"}}, "delete": {"query":"malformed:["}}'
      {
        "responseHeader":{
          "errors":[{
              "type":"ADD",
              "id":"1",
              "message":"ERROR: [doc=1] Error adding field 'foo_i'='bogus' msg=For input string: \"bogus\""},
            {
              "type":"DELQ",
              "id":"malformed:[",
              "message":"org.apache.solr.search.SyntaxError: Cannot parse 'malformed:[': Encountered \"<EOF>\" at line 1, column 11.\nWas expecting one of:\n    <RANGE_QUOTED> ...\n    <RANGE_GOOP> ...\n    "}],
          "maxErrors":1,
          "status":400,
          "QTime":1},
        "error":{
          "metadata":[
            "org.apache.solr.common.ToleratedUpdateError--ADD:1","ERROR: [doc=1] Error adding field 'foo_i'='bogus' msg=For input string: \"bogus\"",
            "org.apache.solr.common.ToleratedUpdateError--DELQ:malformed:[","org.apache.solr.search.SyntaxError: Cannot parse 'malformed:[': Encountered \"<EOF>\" at line 1, column 11.\nWas expecting one of:\n    <RANGE_QUOTED> ...\n    <RANGE_GOOP> ...\n    ",
            "error-class","org.apache.solr.common.SolrException",
            "root-error-class","java.lang.NumberFormatException"],
          "msg":"ERROR: [doc=1] Error adding field 'foo_i'='bogus' msg=For input string: \"bogus\"",
          "code":400}}
      

      ...the added exception metadata ensures that even in client code like the various SolrJ SolrClient implementations, which throw a (client side) exception on non-200 responses, the end user can access info on all the tolerated errors that were ignored before the maxErrors threshold was reached.
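
      To illustrate that last point, here is a minimal SolrJ sketch of the pattern. The collection, chain name, and bogus field values are assumptions taken from the examples above; the metadata key prefix is the ToleratedUpdateError class name shown in the metadata output:

      import java.util.Map;
      import org.apache.solr.client.solrj.SolrClient;
      import org.apache.solr.client.solrj.request.UpdateRequest;
      import org.apache.solr.common.SolrException;
      import org.apache.solr.common.SolrInputDocument;
      import org.apache.solr.common.util.NamedList;

      public class ToleratedErrorsExample {
        static void addWithToleratedErrorReporting(SolrClient client) throws Exception {
          UpdateRequest req = new UpdateRequest();
          for (String id : new String[] {"1", "2"}) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", id);
            doc.addField("foo_i", "bogus"); // not an int, so each add fails
            req.add(doc);
          }
          req.setParam("update.chain", "tolerant-chain");
          req.setParam("maxErrors", "1"); // two errors exceed the threshold
          try {
            req.process(client, "techproducts");
          } catch (SolrException e) { // thrown client-side for the non-200 response
            NamedList<String> metadata = e.getMetadata();
            if (metadata != null) {
              for (Map.Entry<String, String> entry : metadata) {
                // tolerated errors are keyed with the ToleratedUpdateError class name
                if (entry.getKey().startsWith("org.apache.solr.common.ToleratedUpdateError--")) {
                  System.err.println(entry.getKey() + " => " + entry.getValue());
                }
              }
            }
          }
        }
      }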


      Original Jira Request

      Has anyone run into the problem of handling bad documents / failures mid-batch? I.e.:

      <add>
      <doc>
      <field name="id">1</field>
      </doc>
      <doc>
      <field name="id">2</field>
      <field name="myDateField">I_AM_A_BAD_DATE</field>
      </doc>
      <doc>
      <field name="id">3</field>
      </doc>
      </add>

      Right now Solr adds the first doc and then aborts. It would seem it should either fail the entire batch, or log a message / return a code and then continue on to add doc 3. Option 1 seems much harder to accomplish and would possibly require more memory, while Option 2 would require more information to come back from the API. I'm about to dig into this, but I thought I'd ask to see if anyone had any suggestions, thoughts or comments.

      1. SOLR-445_3x.patch
        46 kB
        Erick Erickson
      2. SOLR-445.patch
        88 kB
        Hoss Man
      3. SOLR-445.patch
        83 kB
        Hoss Man
      4. SOLR-445.patch
        40 kB
        Anshum Gupta
      5. SOLR-445.patch
        41 kB
        Anshum Gupta
      6. SOLR-445.patch
        40 kB
        Anshum Gupta
      7. SOLR-445.patch
        44 kB
        Erick Erickson
      8. SOLR-445.patch
        45 kB
        Erick Erickson
      9. SOLR-445.patch
        18 kB
        Erick Erickson
      10. SOLR-445.patch
        42 kB
        Erick Erickson
      11. solr-445.xml
        0.7 kB
        Erick Erickson
      12. SOLR-445-3_x.patch
        20 kB
        Erick Erickson
      13. SOLR-445-alternative.patch
        36 kB
        Tomás Fernández Löbbe
      14. SOLR-445-alternative.patch
        26 kB
        Tomás Fernández Löbbe
      15. SOLR-445-alternative.patch
        23 kB
        Tomás Fernández Löbbe
      16. SOLR-445-alternative.patch
        23 kB
        Tomás Fernández Löbbe


          Activity

          Ryan McKinley added a comment -

          yup... as is, it keeps anything it had before it fails.

          Check http://wiki.apache.org/solr/UpdateRequestProcessor for a way to intercept the standard processing. One could easily write an UpdateRequestProcessor to make sure all documents are valid before trying to commit them (that is, queue everything up until finish()).

          Another approach may be to keep track of the docs that were added correctly, then remove them if something goes wrong....

          Yonik Seeley added a comment -

          Option 2 seems like a good approach... return a list of documents that failed, why they failed, etc, and that could be logged for a human to check later.

          Grant Ingersoll added a comment -

          Option 2 seems like a good approach... return a list of documents that failed, why they failed, etc, and that could be logged for a human to check later.

          +1

          And it should not just be for the XML version, right? Anything that can take a batch of documents should be able to return a list of those that failed along with the reason they failed.

          Grant Ingersoll added a comment -

          Is it reasonable to use the AddUpdateCommand to communicate out of the UpdateHandler that a given document failed? For instance, the update handler could catch any exception and add it onto the command (the next reuse would have to reset it), and then the various RequestHandlers (XML/CSV) could check whether the exception is set, add it to a list of failed docs, and add the failed docs to the response, which can then be written out as needed by the writers.

          Shalin Shekhar Mangar added a comment -

          Marking for 1.5

          Hoss Man added a comment -

          Bulk updating 240 Solr issues to set the Fix Version to "next" per the process outlined in this email...

          http://mail-archives.apache.org/mod_mbox/lucene-dev/201005.mbox/%3Calpine.DEB.1.10.1005251052040.24672@radix.cryptio.net%3E

          Selection criteria was "Unresolved" with a Fix Version of 1.5, 1.6, 3.1, or 4.0. Email notifications were suppressed.

          A unique token for finding these 240 issues in the future: hossversioncleanup20100527

          Erick Erickson added a comment -

          Here's a cut at an improvement at least.

          The attached XML file contains an <add> packet with a number of documents illustrating a number of errors. The XML file can be POSTed to Solr for indexing via the post.jar file so you can see the output.

          This patch attempts to report back to the user the following for each document that failed:
          1> the ordinal position in the file where the error occurred (e.g. the first, second, etc <doc> tag).
          2> the <uniqueKey> if available.
          3> the error.

          The general idea is to accrue the errors in a StringBuilder and eventually re-throw the error after processing as far as possible.

          Issues:
          1> the reported format in the log file is kind of hard to read. I pipe-delimited the various <doc> tags, but they run together in a Windows DOS window. What happens on Unix I'm not quite sure. Suggestions welcome.
          2> From the original post, rolling this back will be tricky. Very tricky. The autocommit feature makes it indeterminate what's been committed to the index, so I don't know how to even approach rolling back everything.
          3> The intent here is to give the user a clue where to start when figuring out what document(s) failed so they don't have to guess.
          4> Tests fail, but I have no clue why. I checked out a new copy of trunk and that fails as well, so I don't think that this patch is the cause of the errors. But let's not commit this until we can be sure.
          5> What do you think about limiting the number of docs that fail before quitting? One could imagine some ratio (say 10%) that have to fail before quitting (with some safeguards, like don't bother calculating the ratio until 20 docs had been processed or...). Or an absolute number. Should this be a parameter? Or hard-coded? The assumption here is that if 10 (or 100 or..) docs fail, there's something pretty fundamentally wrong and it's a waste to keep on. I don't have any strong feeling here, I can argue it either way....
          6> Sorry, all, but I reflexively hit the reformat keystrokes so the raw patch may be hard to read. But I'm pretty well in the camp that you have to reformat as you go or the code will be held hostage to the last person who didn't format properly. I'm pretty sure I'm using the right codestyle.xml file, but let me know if not.
          7> I doubt that this has any bearing on, say, SolrJ indexing. Should that be another bug (or is there one already)? Anybody got a clue where I'd look for that since I'm in the area anyway?

          Erick

          Lance Norskog added a comment -

          2> From the original post, rolling this back will be tricky. Very tricky. The autocommit feature makes it indeterminate what's been committed to the index, so I don't know how to even approach rolling back everything.

          Don't allow autocommits during an update. Simple. Or, rather, all update requests block at the beginning during an autocommit. If an update request has too many documents, don't do so many documents in an update.

          Erick Erickson added a comment -

          I think it's ready for review, both trunk and 3_x. Would someone look this over and commit it if they think it's ready?

          Note to self: do NOT call initCore in a test case just because you need a different schema.

          The problem I was having with running tests was because I needed a schema file with a required field so I naively called initCore with schema11.xml in spite of the fact that @BeforeClass called it with just schema.xml. Which apparently does bad things with the state of something and caused other tests to fail... I can get TestDistributedSearch to fail on unchanged source code simply by calling initCore with schema11.xml and doing nothing else in a new test case in BasicFunctionalityTest. So I put my new tests that required schema11 in a new file instead.

          The XML file attached is not intended to be committed, it is just a convenience for anyone checking out this patch to run against a Solr instance to see what is returned.

          This seems to return the data in the SolrJ case as well.

          NOTE: This does change the behavior of Solr. Without this patch, the first document that is incorrect stops processing. Now, it continues merrily on adding documents as it can. Is this desirable behavior? It would be easy to abort on first error if that's the consensus, and I could take some tedious record-keeping out. I think there's no big problem with continuing on, since the state of committed documents is indeterminate already when errors occur so worrying about this should be part of a bigger issue.

          Simon Rosenthal added a comment -

          Don't allow autocommits during an update. Simple. Or, rather, all update requests block at the beginning during an autocommit. If an update request has too many documents, don't do so many documents in an update. (Lance)

          Lance - How do you (dynamically) disable autocommits during a specific update? That functionality would also be useful in other use cases, but that's another issue.

          NOTE: This does change the behavior of Solr. Without this patch, the first document that is incorrect stops processing. Now, it continues merrily on adding documents as it can. Is this desirable behavior? It would be easy to abort on first error if that's the consensus, and I could take some tedious record-keeping out. I think there's no big problem with continuing on, since the state of committed documents is indeterminate already when errors occur so worrying about this should be part of a bigger issue.

          I think it should be an option, if possible. I can see use cases where abort-on-first-error is desirable, but also situations where you know one or two documents may be erroneous, and it's worth continuing on in order to index the other 99%.

          Erick Erickson added a comment -

          From several discussions, it sounds like my test issues are unrelated to the latest patch (I did fix the test case to not do bad things). So I guess if anyone wants to pick this up, it's ready for review and/or commit.

          Unless we want to make some parameter like <abortOnFirstIndexingFailure> or some such. If the consensus is to add such an option, it seems like it needs to go in one of the config files because this code is also reachable by SolrJ. Probably default to "true" to preserve present functionality?

          Thoughts?
          Erick

          Lance Norskog added a comment -

          Unless we want to make some parameter like <abortOnFirstIndexingFailure> or some such.

          Yes. Yes. Yes. Fail Early, Fail Often is a common system design style. Meaning, if I have a data quality problem, kick me in the head about it. For example, my online store should keep yesterday's product lineup rather than load a Halloween costume with no categories:

          <field ... required="true" .../>
          

          is there for a reason.

          Erick Erickson added a comment -

          Is there any particular preference where the tag for making this configurable should go? My first impulse would be SolrConfig.xml, <indexDefaults> but I wanted to ask first to see if I'm reading the intent correctly. Leaving the field out would be equivalent to <abortOnFirstBatchIndexError>true</...>

          Oh my, that is a clumsy name; any better suggestions?
          processAllBatchIndexDocsPossible (default to false)
          continueAddingDocsInBatchIfErrors (default to false)

          I kind of like the last one.

          Erik Hatcher added a comment -

          This is solely for the XML loader, right Erick? If so, it seems to me the setting really belongs with the /update handler configuration. [This is reminiscent of the schema.xml setting for default operator and/or, which is awkwardly placed, when it really belongs with the qparser declaration.]

          On a related note - are there comparable issues with other batch indexers like the CSV one?

          Erick Erickson added a comment -

          bq: If so, the setting really belongs with the /update handler configuration seems to me
          Thanks, this is what I was looking for, I'll try to make it so unless there are objections. This goes in <updateHandler class="solr.DirectUpdateHandler2">, right?

          bq: On a related note - are there comparable issues with other batch indexers like the CSV one?
          I assume so, I haven't looked. I'd like to raise that as a separate JIRA if we want to address it.

          Erick Erickson added a comment -

          OK, I think this is ready to go if someone wants to take a look and commit.

          This patch includes the ability to turn on continuing to process documents after the first failure, as per Erik H's comments. The default is the old behavior of stopping upon the first error.

          Changed the example solrconfig.xml to include the new parameter as false (mimicking old behavior) in both 3x and trunk.

          Lance Norskog added a comment -

          The general verb is 'update' not 'index', so... abortUpdateOnError default to true is my vote.

          The extraction handler by nature takes one document. The DIH has its own system for this that works pretty well.

          Grant Ingersoll added a comment -

          This patch looks pretty reasonable from the details of the implementation, but I don't think it's quite ready for commit yet.

          First, we should be able to extend this to everything that implements ContentStreamLoader (JSONLoader, CSVLoader) if they want it (it doesn't make sense for the SolrCell stuff).

          As I see it, we can do this by putting some base functionality into ContentStreamLoader which does what is done in this patch.
          I think we need two methods, one that handles the immediate error (takes in a StringBuilder and the info about the doc that failed) and decides whether to abort or buffer the error for later reporting depending on the configuration setting.

          I don't think the configuration of the item belongs in the UpdateHandler. Erik H. meant that it goes in the configuration of the /update RequestHandler in the config, not the DirectUpdateHandler2, as in

          <requestHandler name="/update" class="solr.XmlUpdateRequestHandler" />

          This config could be a request param just like any other (such that one could even say they want to override it via a request via the defaults, appends, invariants).
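
          For illustration only, a configuration along those lines might look like the following; the parameter name "abortOnError" is hypothetical, since the issue never settled on one:

          <requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
            <lst name="defaults">
              <!-- hypothetical parameter name; overridable per request -->
              <str name="abortOnError">true</str>
            </lst>
          </requestHandler>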

          Also, I know it is tempting to do so, but please don't reformat the code in the patch. It slows down review significantly. In general, I try to reformat right before committing as do most committers.

          Grant Ingersoll added a comment -

          Oh, one other thing. You don't need to produce a 3.x patch. We can just do an SVN merge.

          Lance Norskog added a comment -

          The SolrJ StreamingUpdateSolrServer needs updating for this. It needs a lot of other updating also.

          Erick Erickson added a comment -

          Grant:

          Thanks for the comments. I'd assumed there was a better, more general fix, but what about the near term? Does it make any sense to re-work this enough to put the parameter in the update handler and commit? I think there's value in the patch as-is that would provide better-but-not-comprehensive help to users, at least in the XML case, rather than defer any improvement at all into the indefinite future.

          As always, though, up to you.

          Erick

          Erick Erickson added a comment -

          So, Grant. How do you feel about refactorings <G>?

          I got bitten by this problem again so I decided to dust off the patch, and I re-created it. This one shouldn't have the gratuitous re-formatting. But, after I added the bookkeeping, the method got even more unwieldy, so I extracted some of the code to methods in XMLLoader. I also have the un-refactored version if this one is too painful.

          This patch incorporates the changes you suggested months ago. I'm a little uncertain whether putting a constant in UpdateParams.java was the correct place, but it seemed like a pattern used for other parameters.

          One minor issue: The behavior is the same here as it used to be if you don't start the packet with <add>. An NPE is thrown. That's because the addCmd variable isn't initialized until the <add> tag is encountered and the NPE is a result of using the addCmd variable later (I think I was seeing it at line 118). I think it would be better to fail if the first element wasn't an <add> element rather than because it just happens to cause an NPE.

          While I'm at it, though, what do you think about making this robust enough to ignore ?xml and/or !DOCTYPE entries? Or is that just not worth the bother?

          Erick

          Erick Erickson added a comment -

          Anyone willing to pick this up? It's one of those things that's not technically very interesting, but makes end-users' lives easier...

          Erick

          Grant Ingersoll added a comment -

          What about addressing the other handlers? Any progress on that?

          Erick Erickson added a comment -

          Nope, the framework is in for anything that derives from the base class, but only the XMLLoader uses it so far.

          Does it make sense to open separate JIRAs for those? Which ones do you think are most useful next?

          Grant Ingersoll added a comment -

          We should fix this for all the update handlers.

          Lance Norskog added a comment - edited

          If the DIH semantics cover all of the use cases, please follow that model: behavior, names, etc. It will be much easier on users.

          Shinichiro Abe added a comment -

          Solr Cell has the same problem: it aborts mid-batch when posting protected files (SOLR-2480). I hope the update handlers can be fixed by applying that model.

          Grant Ingersoll added a comment -

          Erick, feel free to take this one and iterate as you see fit

          Robert Muir added a comment -

          Bulk move 3.2 -> 3.3

          Robert Muir added a comment -

          3.4 -> 3.5

          Erick Erickson added a comment -

          Well, it's clear I won't get to this in the 3.6 time frame, so if someone else wants to pick it up, feel free. However, I also wonder whether with 4.0 and SolrCloud we have to approach this differently to accommodate how documents are passed around there?

          Yonik Seeley added a comment -

          I imagine a maxErrors parameter might be useful (and more readable than abortOnFirstBatchIndexError)

          maxErrors=0 (the current behavior - stop processing more updates when we hit an error)
          maxErrors=10 (allow up to 10 documents to fail before aborting the update... useful for true bulk uploading where you want to allow for an isolated failure or two, but still want to stop if every single update is failing because something is configured wrong)
          maxErrors=-1 (allow an unlimited number of documents to fail)

          Making updates transactional seems really tough in cloud mode since we don't keep old versions of documents around... although it might be possible for a short time with the transaction log. Anyway, that should definitely be a separate issue.

          A couple of other notes:

          • structured error responses were recently added in 4.0-dev that should make this issue easier in general. Example:
            {"responseHeader":{"status":400,"QTime":0},"error":{"msg":"ERROR: [doc=mydoc] unknown field 'foo'","code":400}}
            
          • Per did some error handling work that's included in a patch attached to SOLR-3178
          Steve Rowe added a comment -

          Bulk move 4.4 issues to 4.5 and 5.0

          Tomás Fernández Löbbe added a comment -

          This is a different approach for this issue. The errors are managed by an UpdateRequestProcessor that must be added before other processors in the chain. It accepts maxErrors in the configuration as a default or as a request parameter. If used, the default maxErrors value is Integer.MAX_VALUE; to get the current behavior one would set it to 0 (however, it wouldn't make sense to add the processor to the chain in this case, unless it depends on the request parameter).
          This would handle only bad documents, but not others mentioned in previous comments (like Tika parsing exceptions, etc).
          The response will look something like:

          <?xml version="1.0" encoding="UTF-8"?>
          <response>
          <lst name="responseHeader">
            <int name="numErrors">10</int>
            <lst name="errors">
              <lst name="1">
                <str name="message">ERROR: [doc=1] Error adding field 'weight'='b' msg=For input string: "b"</str>
              </lst>
              <lst name="3">
                <str name="message">ERROR: [doc=3] Error adding field 'weight'='b' msg=For input string: "b"</str>
              </lst>
          ...
            <int name="status">0</int>
            <int name="QTime">17</int>
          </lst>
          </response>
          
          Hoss Man added a comment -

          The errors are managed by an UpdateRequestProcessor that must be added before other processors in the chain.

          Off the cuff: this sounds like a great idea.

          The one piece of feedback that occurred to me, though, would be to tweak the response format so that there is a 1-to-1 correspondence of documents in the initial request to statuses in the response – even if the schema doesn't use uniqueKey...

          <lst name="responseHeader">
            <int name="numErrors">10</int>
            <lst name="results">
              <!-- if schema has uniqueKeys, they are the names of the response -->
              <lst name="42" /> <!-- success so empty -->
              <lst name="1"> <!-- 2nd doc in update, with uniqueKey of 1 had this failure -->
                <str name="message">ERROR: [doc=1] Error adding field 'weight'='b' msg=For input string: "b"</str>
              </lst>
              <lst name="60" /> <!-- success so empty -->
              <lst name="3"> <!-- 4th doc in update, with uniqueKey of 3 had this failure -->
                <str name="message">ERROR: [doc=3] Error adding field 'weight'='b' msg=For input string: "b"</str>
              </lst>
          ...
            <int name="status">0</int>
            <int name="QTime">17</int>
          </lst>
          
          

          ?

          Yonik Seeley added a comment -

          even if the schema doesn't use uniqueKey...

          That would lead to some huge responses. I think instead the notion of not having a uniqueKey should essentially be deprecated.

          Shalin Shekhar Mangar added a comment -

          That would lead to some huge responses. I think instead the notion of not having a uniqueKey should essentially be deprecated.

          +1

          Tomás Fernández Löbbe added a comment -

          I think instead the notion of not having a uniqueKey should essentially be deprecated.

          +1

          That would lead to some huge responses.

          Do you mean including the ids of good docs in the response too? I don't think that would be that big. It should be much smaller than the request.

          Yonik Seeley added a comment -

          Do you mean including the ids of good docs in the response too? I don't think that would be that big. Should be much smaller than the request

          Some people (including myself) send/load millions of docs per request - it's very unfriendly to get back megabytes of responses unless you explicitly ask.
          If this processor is not in the default chain, then I guess it doesn't matter much. But I could see adding this ability by default (regardless of if it's a separate processor or not) via a parameter like maxErrors or something.

          Tomás Fernández Löbbe added a comment -

          I see. Maybe I could then just add a "numSucceed" count, as confirmation that the rest of the docs made it in?

          Tomás Fernández Löbbe added a comment -

          added "numAdds" to header

          Tomás Fernández Löbbe added a comment -

          Any more thoughts on this patch?

          Hoss Man added a comment -

          Tomas:

          • we need better class level javadocs for the TolerantUpdateProcessorFactory - basically everything currently in the TolerantUpdateProcessor's javadocs, plus some example configuration, plus a note about how "maxErrors" can be specified as a request param or as an init param, and an explanation of the default behavior if "maxErrors" isn't specified at all
          • would you mind renaming "tolerant-chain1" and "tolerant-chain2" with more descriptive names to make the tests easier to read? perhaps "tolerate-10-failures-chain" and "tolerate-unlimited-failures-chain" ?
          • even if "maxErrors" isn't reached, we should consider carefully whether or not it makes sense to be returning a "200" status code even if every update command that's executed for a request fails. (ie: if maxErrors defaults to Integer.MAX_VALUE, and i send 100 docs and all 100 fail, should i really get a 200 status code back?)
          Uwe Schindler added a comment -

          Move issue to Solr 4.9.

          Tomás Fernández Löbbe added a comment -

          I uploaded a new patch with more javadocs and the test chains name changed.

          even if "maxErrors" isn't reached, we should consider carefully whether or not it makes sense to be returning a "200" status code even if every update command that's executed for a request fails. (ie: if maxErrors defaults to Integer.MAX_VALUE, and i send 100 docs and all 100 fail, should i really get a 200 status code back?)

          I think this would make it more confusing. Having this processor means that the client wants to manage failing docs on their side. If all the docs fail, so be it; they'll know how to manage it on their side, and I don't think that should be a special case. Plus, I think getting the 200 gives you more information: it tells you that Solr tried adding all the docs the client sent and didn't abort somewhere in the middle, as would happen if you got a 4XX/5XX.

          I was also thinking that this processor won't work together with the DistributedUpdateProcessor: it has its own error processing, plus the distribution would create multiple internal requests (chains too), right? Also, the ConcurrentUpdateSolrServer used in SolrCmdDistributor would batch docs in a non-deterministic way, right? It would be impossible to count errors at this level.

          Hoss Man added a comment -

          I think this would make it more confusing. Having this processor means that the client wants to manage failing docs on their side. If all the docs fail so be it.

          Yeah, I'm not convinced you're wrong – I just wasn't sure how I felt about it and I wanted to make sure we considered it. Even if users configure this, they might be surprised if something like a schema.xml mismatch with some update process they are using causes a 500 error on every individual update – but still results in a 200 coming back because of this component.

          But I think you are right – as long as the docs are clear that the status will always be a 200, even if all docs fail, we're fine.

          I was also thinking that this processor won’t work together with DistributedUpdateProcessor, it has its own error processing, plus the distribution would create multiple internal requests...

          As long as this processor is configured before the DistributedUpdateProcessorFactory it should work fine:

          • when the requests get forwarded to other shards, they'll bypass this processor (and any other processors that come before DistributedUpdateProcessorFactory) so it won't break the cumulative error handling in DistributedUpdateProcessorFactory
          • DistributedUpdateProcessorFactory still ultimately throws only one Exception per UpdateCommand when it forwards to multiple replicas, so your new processor will still get at most 1 error to track per doc when accumulating results to return to the client

          but it's trivial to write a distributed version of your test case to prove that you get the results you expect – probably a good idea to write one to help future-proof this processor against unforeseen changes in the distributed update processing
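
          For illustration, a minimal sketch of the chain ordering being described (the chain name here is assumed; only the relative processor order matters):

          <updateRequestProcessorChain name="tolerant-chain">
            <processor class="solr.TolerantUpdateProcessorFactory">
              <int name="maxErrors">10</int>
            </processor>
            <processor class="solr.DistributedUpdateProcessorFactory"/>
            <processor class="solr.RunUpdateProcessorFactory"/>
          </updateRequestProcessorChain>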

          Tomás Fernández Löbbe added a comment -

          My simple test to use with SolrCloud fails (not 100% of the time, but very frequently). This is my understanding of the problem:
          It works only in the case of the update arriving at the shard leader (as it would fail while adding the doc locally), but if the update needs to be forwarded to the leader, then it will not work.
          If the request is forwarded to the leader it is done asynchronously and the DistributedUpdateProcessor tracks the errors internally. Finally, after all the docs were processed, the “finish” method is called and the DistributedUpdateProcessor will add one of the exceptions to the response. This is a problem because “processAdd” never really fails the way the TolerantUpdateProcessor expects. It also can't know the total number of errors; this is counted internally in the DistributedUpdateProcessor.

          As a side note, this DistributedUpdateProcessor behavior makes it “tolerant”, but only in some cases? A request like this:

          <add>invalid-doc</add>
          <add>valid-doc</add>
          <add>valid-doc</add>

          would leave Solr in a different state depending on who is receiving the request (the shard leader or a replica/follower). Is this expected?

          Tomás Fernández Löbbe added a comment -

          As a side note, this DistributedUpdateProcessor behavior makes it “tolerant”, but only in some cases?

          I have confirmed this. Depending on which node gets the initial update request and the position of the invalid doc in the batch, the docs that end up indexed will vary from 0 to all but the invalid doc.

          Denis Shishlyannikoc added a comment -

          Question related to this JIRA.
          After a failure to index one of the documents due to a bad date value (2014-03-18K18:15:13Z), Solr kept the document in some queue and tried to reindex it again (roughly one attempt every 3-5 minutes; I did not measure the exact interval), logging the same failed-to-parse-date exception each time! After a Solr server restart the issue is gone: no more attempts to reindex the document with the problematic date.
          This does not look like correct behavior. How can it be explained? How can I avoid such reindexing? I don't mind losing some incorrect documents, but I don't want Solr to get stuck on them after a failure.

          Tomás Fernández Löbbe added a comment -

          Denis Shishlyannikoc Sorry I missed your comment. I understand the issue you are seeing is not with any of the patches in this Jira, right? If so, you should ask the question on the users list; you'll get many more eyes on your problem that way than by posting here.

          Hoss Man added a comment -

          It works only in the case of the update arriving to the shard leader (as it would fail while adding the doc locally), but if the update needs to be forwarded to the leader, then it will not work.

          ...i'm not sure if this will solve all of the problems Tomas ran into, but one thing that might help (and was added after the latest version of the patch was written) is the "UpdateRequestProcessorFactory.RunAlways" marker interface. it gives UpdateProcessorFactories a mechanism to say they want to be run as part of the chain even if the "update.distrib" logic would normally skip them for already being run on a previous node (ie: the update has already been forwarded once)

          so that interface, combined with some basic checks of "am i the leader?" could allow this processor to ensure it was always/only executing some bits of logic on the leader.

          (there might still be some problems however in terms of accurately responding/reporting aggregate failures when batch updates involve docs that go to different leaders)
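
          As a hedged sketch of what opting into that marker interface could look like (the class name and constructor wiring here are assumptions, not taken from the patch):

          import org.apache.solr.request.SolrQueryRequest;
          import org.apache.solr.response.SolrQueryResponse;
          import org.apache.solr.update.processor.UpdateRequestProcessor;
          import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

          // Opting into running on every node, even for already-forwarded updates,
          // by implementing the RunAlways marker interface.
          public class MyTolerantUpdateProcessorFactory extends UpdateRequestProcessorFactory
              implements UpdateRequestProcessorFactory.RunAlways {

            @Override
            public UpdateRequestProcessor getInstance(SolrQueryRequest req,
                                                      SolrQueryResponse rsp,
                                                      UpdateRequestProcessor next) {
              // hypothetical processor constructor; the real patch wires in maxErrors etc.
              return new TolerantUpdateProcessor(req, rsp, next);
            }
          }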

          Anshum Gupta added a comment - - edited

          Working on more tests in the next patch and support/test for deletes.

          Anshum Gupta added a comment -

          I'm seeing a few errors with the current patch and I think I know what's going on. I'll take a look at it and update the patch tomorrow.

          Noble Paul added a comment -

          I guess it would be better if we returned the whole command, instead of just the id, to the user

          Anshum Gupta added a comment -

          Updated patch

          Anshum Gupta added a comment -

          Updated patch. I'm still working on getting the correct error count for the following case:
          Client -> Node1 -> Node2 (Leader for shard 1) -> Node3 (Non-leader Replica for shard 1)

          In the above case, Node2 as of now returns an HTTP OK and doesn't throw an exception, so the StreamingSolrClient used by the DistributedUpdateProcessor doesn't register the error that was consumed by the leader of shard 1.
          I'm making progress on that but have run into dead ends with a few of the approaches I've tried so far.

          The TolerantUpdateProcessor as of now kicks in when (see the sketch after this list):
          1. It's the leader node or,
          2. It's the node that receives the request from the client, i.e. it has yet to forward the request to the shard leader.
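
          A rough illustration of such a check using the existing DistribPhase values (the variable names are mine, not from the patch, and a SolrQueryRequest "req" is assumed to be in scope):

          import org.apache.solr.update.processor.DistributedUpdateProcessor.DistribPhase;
          import org.apache.solr.update.processor.DistributingUpdateProcessorFactory;

          // NONE: the request came straight from the client (case 2 above);
          // TOLEADER: the request was forwarded to this node as a shard leader (case 1).
          DistribPhase phase = DistribPhase.parseParam(
              req.getParams().get(DistributingUpdateProcessorFactory.DISTRIB_UPDATE_PARAM));
          boolean shouldTrackErrors =
              DistribPhase.NONE == phase || DistribPhase.TOLEADER == phase;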

          Hoss Man added a comment -

          I started playing around with this patch a bit to see if I could help move it forward. I'm a little out of my depth with a lot of the details of how distributed updates work, but the more I tried to make sense of it, the more convinced I was that there were a lot of things that just weren't very well accounted for in the existing tests (which were consistently failing, but the failures themselves weren't consistent between runs).

          Here's a summary of what's new/different in the patch i'm attaching...

          • DistributedUpdateProcessor.DistribPhase
            • not sure why this enum was made non-static in earlier patches ... i reverted this unneeded change.
          • TolerantUpdateProcessor
            • processDelete
              • Method has a couple of glaringly obvious bugs that apparently don't trip under the current tests
              • added several nocommits of things that jumped out at me
          • DistribTolerantUpdateProcessorTest
            • beefed up assertion msgs in assertUSucceedsWithErrors
            • fixed testValidAdds so it's not dead code
            • testInvalidAdds
              • sanity check code wasn't passing reliably
                • details of what failed are lost depending on how the update is routed (random seed)
                • relaxed this check to be reliable with a nocommit comment to see if we can tighten it up
              • assuming the sanity check passes, assertUSucceedsWithErrors (still) fails on some seeds w/ a null error list
                • I'm guessing this is what anshum alluded to in his last comment: "Node2 as of now returns an HTTP OK and doesn't throw an exception, so the StreamingSolrClient used by the DistributedUpdateProcessor doesn't register the error that was consumed by the leader of shard 1"
          • TestTolerantUpdateProcessorCloud
            • New MiniSolrCloudCluster based test to try and demonstrate all the possible distrib code paths i could think of (see below)

          TestTolerantUpdateProcessorCloud is the real meat of what i've added here. Starting with the basic behavior/assertions currently tested in TolerantUpdateProcessorTest, I built it up to try and exercise every possible distributed update code path i could imagine (updates with docs all on one shard some of which fail, updates with docs for diff shards and some from each shard fail, updates with docs for diff shards but only one shard fails, etc...) – but only tested against a MiniSolrCloud collection that actually had 1 node, 1 shard, 1 replica and an HttpSolrClient talking directly to that node. Once all those assertions were passing, I changed it to use 5 nodes, 2 shards, 2 replicas and started testing all of those scenarios against 5 HttpSolrClients pointed at every individual node (one of which hosts no replicas) as well as a ZK aware CloudSolrClient. All 6 tests against all 6 clients currently fail (reliably) at some point in these scenarios.


          Independent of all the things i still need to make sense of in the existing code to try and help get these tests passing, I still have one big question about what the desired/expected behavior should be for clients when maxErrors is exceeded – at the moment, in single node setups, the client gets a 400 error with the top level "error" section corresponding to whatever error caused it to exceed maxErrors, but the responseHeader is still populated with the individual errors and the appropriate numAdds & numErrors, for example...

          $ curl -v -X POST 'http://localhost:8983/solr/techproducts/update?indent=true&commit=true&update.chain=tolerant' -H 'Content-Type: application/json' --data-binary '[{"id":"hoss1","foo_i":42},{"id":"bogus1","foo_i":"bogus"},{"id":"hoss2","foo_i":66},{"id":"bogus2","foo_i":"bogus"},{"id":"bogus3","foo_i":"bogus"},{"id":"hoss3","foo_i":42}]'
          * Hostname was NOT found in DNS cache
          *   Trying 127.0.0.1...
          * Connected to localhost (127.0.0.1) port 8983 (#0)
          > POST /solr/techproducts/update?indent=true&commit=true&update.chain=tolerant HTTP/1.1
          > User-Agent: curl/7.38.0
          > Host: localhost:8983
          > Accept: */*
          > Content-Type: application/json
          > Content-Length: 175
          > 
          * upload completely sent off: 175 out of 175 bytes
          < HTTP/1.1 400 Bad Request
          < Content-Type: text/plain;charset=utf-8
          < Transfer-Encoding: chunked
          < 
          {
            "responseHeader":{
              "numErrors":3,
              "errors":{
                "bogus1":{
                  "message":"ERROR: [doc=bogus1] Error adding field 'foo_i'='bogus' msg=For input string: \"bogus\""},
                "bogus2":{
                  "message":"ERROR: [doc=bogus2] Error adding field 'foo_i'='bogus' msg=For input string: \"bogus\""},
                "bogus3":{
                  "message":"ERROR: [doc=bogus3] Error adding field 'foo_i'='bogus' msg=For input string: \"bogus\""}},
              "numAdds":2,
              "status":400,
              "QTime":4},
            "error":{
              "msg":"ERROR: [doc=bogus3] Error adding field 'foo_i'='bogus' msg=For input string: \"bogus\"",
              "code":400}}
          * Connection #0 to host localhost left intact
          

          ...but because this is a 400 error, that means that if you use HttpSolrClient, you're not going to get access to any of that detailed error information at all – you'll just get a RemoteSolrException with the bare details.

          • Should the use of this processor force all "error" responses to be rewritten as HTTP 200s?
          • Should the solrj clients be updated so that RemoteSolrException still provides an accessor to get the parsed/structured SolrResponse (assuming the HTTP response body can be parsed w/o any other errors)?
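
          In the meantime, a hedged sketch of the most a SolrJ caller can currently do, pulling whatever failure details ride along in the exception metadata (the "client" and "docs" arguments are assumptions for illustration):

          import java.util.Collection;
          import java.util.Map;
          import org.apache.solr.client.solrj.SolrClient;
          import org.apache.solr.client.solrj.impl.HttpSolrClient;
          import org.apache.solr.common.SolrInputDocument;
          import org.apache.solr.common.util.NamedList;

          // add() also throws SolrServerException and IOException; the caller handles those.
          void addAndDumpToleratedErrors(SolrClient client, Collection<SolrInputDocument> docs)
              throws Exception {
            try {
              client.add(docs);
            } catch (HttpSolrClient.RemoteSolrException e) {
              // per-update details (if any) are carried in the exception metadata
              NamedList<String> meta = e.getMetadata();
              if (meta != null) {
                for (Map.Entry<String, String> entry : meta) {
                  System.err.println(entry.getKey() + " => " + entry.getValue());
                }
              }
              throw e;
            }
          }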

          Hoss Man added a comment -

          Ok, i've been going around in circles on this issue for the past few weeks, but i've finally got something that feels like progress.

          Listing everything new since the last patch would be fairly tedious, so i'll focus on the broad strokes...

          • punting on deletes (for now)
            • the previous processDelete method was totally broken in distributed updates (it was doing equality comparisons of Strings directly to enum objects, made assumptions about isLeader based on previous "add" commands in the same request, etc...)
            • i was having a hard enough time struggling with getting adds to work properly that i just removed the delete code completely
            • based on the rest of the progress (see below) re-adding support for deletes should be straightforward, but will require some refactoring of how the errors are tracked & counted to distinguish between an "add doc w/id=22" that fails and a "delete doc w/id=22" that fails in the same request
          • Dealing with forward(ing/ed) requests and async distributed failures
            • Most of the meat of the test failures in the last patch came from dealing with the async requests fired off by DistributedUpdateProcessor and how to deal with failures reported by other leaders.
            • I started down the road of trying to do a better job in SolrCmdDistributor of tracking the UpdateCommand that corresponds with each Req so that when processing the Error we could know what failed remotely – this code is still in the patch (because i think it's cleaner than the cmdString tracking currently in Solr) but proved mostly useless because of how ConcurrentUpdateSolrClient can combine multiple "requests" together.
            • The next step, which mostly worked, was to improve the error handling in DUP's finish() so that instead of aborting with info about whatever (remote) Error happened to return first, it now returns a new DistributedUpdatesAsyncException which wraps & remembers the summary info of all the remote errors encountered – or at least, all of the errors that were previously candidates to tell the user about. Stuff that was swallowed & logged before is still swallowed & logged.
              • One notable change here is that i switched DUP.finish() from directly calling SolrQueryResponse.setException() and instead made it throw the exception. Independent of this issue, the existing behavior seems like a bug / bad form – what if the caller already caught some earlier exception it wants to return and finish() is just being called in a finally?
              • This led me to discover SOLR-8633 – the patch from that issue is currently included in this patch because it's so necessary for the current code.
              • FWIW: if, for some reason, folks think calling SolrQueryResponse.setException() is better for some wacko reason, then TolerantUpdateProcessor.finish() could, in theory, go check SolrQueryResponse.getException() and treat it the same way as exceptions it catches (see below) but that's a lot more tedious (and error prone in the long run)
            • Once DUP.finish() started throwing DistributedUpdatesAsyncException, tracking all the various errors we care about from distributed requests became possible...
              • TolerantUpdateProcessor now pays attention to when the request is a TOLEADER forwarded request, and if so:
                • it acts tolerant of up to maxErrors failures (just like single node)
                • in finish(), if any failures happened, return the first error – and annotate it with info about all the failures in this request, using the existing SolrException.getMetadata() map (see the sketch after this list).
              • ConcurrentUpdateSolrClient already ensures that if a SolrException happens on the remote server, any "metadata" included in the response for that exception is copied into the local SolrException it constructs
            • So if/when TolerantUpdateProcessor.finish() catches a DistributedUpdatesAsyncException, it loops over the wrapped exceptions, and pulls out the metadata for each of those to update its list of known failures
          • which error is returned if maxErrors exceeded: now the "first" one
            • i mentioned this above in discussing how async errors from other leaders are handled, but i wanted to emphasize it and elaborate a bit
            • in the original patch, exceptions were completely ignored until maxErrors were seen, at which point the next exception was immediately (re)-thrown
            • that made sense for single node cases, but in a distributed case, with async exceptions, we can't always just "rethrow" an exception
              • it's also ambiguous what the "last" exception really is in an async situation.
            • so now, instead, TolerantUpdateProcessor always remembers the first exception it caught, and if/when maxErrors is exceeded, it re-throws that first exception
              • later, in finish() that (remembered) exception is annotated with metadata about all the failures
                • this metadata is critical for forwarded requests, but should also be useful for SolrClient users who (should) see it copied into a RemoteSolrException even if they don't get the normal UpdateResponse object and its getResponseHeader()
            • this, in my opinion, makes the behavior of using TolerantUpdateProcessor more useful (and consistent with how stock solr works) when there is a fundamental problem with your data – you get an error indicating the first problem document in your data, as opposed to seeing an error from the 11th, or 101st, or maxErrors+1 malformed document in your data.
            • if folks disagree, we just need to re-work the FirstErrTracker class to be a LastErrTracker class...
              • FirstErrTracker.throwFirst() becomes LastErrTracker.throwLast()
              • instead of checking null == first, LastErrTracker.caught(Throwable) would ignore any additional exceptions once true == thrown
          • the problem with returning numAdds (and numDeletes if we want that)
            • getting numAdds to work correctly in a distributed request is really hard
            • currently the code (specifically on the first node processing the AddUpdateCommand) just requires that super.processAdd() succeeds to do numAdds++
              • this gives a misleading number when the failures aren't reported immediately because they were forwarded async to a diff leader and we only find out about problems in finish()
            • even if we only did numAdds++ for docs where we know we are the leader, getting the count from other leaders later is tricky
              • right now, the only way TolerantUpdateProcessor learns the results of any async requests to other leaders is if DUP.finish() throws an error – we can get numAdds from the metadata of those errors, but that doesn't help the case when requests succeed
              • DUP doesn't currently do anything with successful responses, so there's no easy way to get the numAdds in that case
            • possible solutions...
              • eliminate the concept of numAdds from this feature, focus solely on being tolerant and reporting the errors (which we can do accurately)
                • if people want to know how many succeeded, they can use features like versions=true explicitly – although even then, i'm not sure if it will work reliably today ... i don't see any indication that DUP merges the remote responses when that feature is used.
              • make TolerantUpdateProcessor detect when it's not a leader for something, and in that case implicitly add versions=true to get the info from requests we forward, and prune the response down to just a simple numAdds count in finish()
                • same problem as above if i'm correct about versions=true not currently handled correctly in forward(ing/ed) requests
              • return distinct numAddsAttempted and numAddsConfirmed values – probably not as useful to end clients, but more accurate representation of what we know...
                • track a distinct "numAddsAttempted" for every shard we forward to (added together at end of request)
                  • kind of a sketchy pain in the ass to do
                • assume numAddsConfirmed = numAddsAttempted for any other shard leader we don't see an exception from
                • for shard leaders we do get exceptions from, add to our numAddsConfirmed based on the metadata (ie: numAddsConfirmed + remoteException.getMetadata(NUM_ADDS_CONFIRMED))
              • treat it as a distinct feature in a new jira
                • getting a numAdds (or numDeletes) count in the responseHeader really feels like it should be an orthogonal feature to being tolerant of maxErrors
                • we should open a distinct issue to track adding that as a feature, and ensuring that DUP aggregates correctly from all the various leaders that requests get forwarded to
                • this is what i personally think we should do – notably because it would make it trivial to know how many documents you added when streaming a bunch of data, even if you don't use this update processor.
          • the nuance of the maxErrors=N param
            • i just want to point out – in a distributed cloud setup, there is no way to truly enforce a hard limit at maxErrors
            • the async processing means that if a request includes docs destined for diff shards, we can't ensure that only that max amount of errors will be hit – we might hit more errors in async threads before we notice and stop processing
            • i don't think this is a problem, it's a feature of handling updates to diff shards/leaders in parallel, but it does mean we might want to rethink the param name? (errorToleranceThreshold=N perhaps?)
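
          To make the metadata-annotation idea above concrete, a minimal sketch (the Failure holder and the failures list are hypothetical; only SolrException.setMetadata is the real API):

          import java.util.List;
          import org.apache.solr.common.SolrException;

          // Hypothetical holder for a tolerated failure; not from the patch.
          class Failure { String type; String id; String message; }

          void rethrowAnnotated(SolrException firstErr, List<Failure> failures) {
            for (Failure f : failures) {
              // illustrative key/value scheme stashed via SolrException metadata
              firstErr.setMetadata("org.apache.solr.common.ToleratedUpdateError--"
                  + f.type + ":" + f.id, f.message);
            }
            throw firstErr; // the first caught error, now carrying all the details
          }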

          There's still a lot of work to do, in no particular order...

          • deletes
            • need to refactor the way we track errors (both locally and stashed in the SolrException metadata) so that we can distinguish "add doc w/id=22" failures from "delete doc w/id=22" failures
            • probably need to rethink the responseHeader formatting for how errors are tracked as well in order to distinguish them
            • once we have that, adding a processDeletes(...) method that works similar to processAdd should be trivial
            • need to beef up the cloud testing to include delete checks
          • CloudSolrClient
            • all of the tests using CloudSolrClient currently fail because of how that client does its own "direct to leaders" splitting/merging of docs destined for diff shards.
            • a bunch of new code needs to be written there (we may be able to refactor/share some stuff from the error list merging code in TolerantUpdateProcessor – see above about refactoring that for deletes)
          • need lots more randomized testing
          • tons of nocommits that need to be cleaned up
          Mark Miller added a comment -

          Wow, I've never seen this old beast issue.

          Goodness, with all this work if we could just fix the error handling too for 6.0...

          If we had proper error responses, oh the things we could do. I don't know how you handle it when 100,000 docs in your stream fail, though.

          Mark Miller added a comment -

          One notable change here is that i switched DUP.finish() from directly calling SolrQueryResponse.setException() and instead made it throw the exception. Independent of this issue, the existing behavior seems like a bug / bad form – what if the caller already caught some earlier exception it wants to return and finish() is just being called in a finally?

          Can we document that on the setException method?

          maxErrors

          Yeah, works for the end user. Internally, if we had a good way to track all the fails in some efficient manner (we learn about them as they happen or something), we could perhaps use a single ConcurrentUpdateSolrClient per replica and be much more connection efficient. Kind of beyond this issue, but my interest in this issue is that it seems to be the start of that path.

          Mark Miller added a comment - - edited

          I have no issue with punting on numAdds in this issue.

          If we could get this out in a major version, I'd also love to make this standard behavior and not some optional update processor.

          Mark Miller added a comment - - edited

          // nocommit: should this really be a top level exception?
          // nocommit: or should it be an HTTP:200 with the details of what failed in the body?

          For the Collections API I went with HTTP:200. The overall request to the server succeeded; here are the individual update fails. I guess I lean that way a bit. If 1 update out of 1000 fails, it seems kind of strange to fail the whole request with this new code.

          Oh wait a minute, are you only doing that when maxErrors is exceeded? In that case failing the request makes sense to me I guess.

          Mark Miller added a comment -
                if ("LeaderChanged".equals(cause)) {
                  // let's just fail this request and let the client retry? or just call processAdd again?
                  log.error("On "+cloudDesc.getCoreNodeName()+", replica "+replicaUrl+
                      " now thinks it is the leader! Failing the request to let the client retry! "+error.e);
                  errorsForClient.add(error);
                  break; // nocommit: why not continue?
                }
          

          Been thinking about this a little. Perhaps the idea is: we know the error is not from a forward request, since we skip those here. Which means they must all be leader-to-replica, and if the leader changed, we can't put anyone in LIR anyway.
          Anyhow, it does seem safer to me to not break and to process all the errors.
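
          Concretely, that would just amount to swapping the break for a continue in the snippet above:

                if ("LeaderChanged".equals(cause)) {
                  log.error("On "+cloudDesc.getCoreNodeName()+", replica "+replicaUrl+
                      " now thinks it is the leader! Failing the request to let the client retry! "+error.e);
                  errorsForClient.add(error);
                  continue; // keep processing any remaining errors rather than aborting
                }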

          Hoss Man added a comment -

          Can we document that on the setException method?

          I'm not sure what exactly it should say, but that seems orthogonal to the current issue – feel free to add whatever you think makes sense as a distinct git commit. my point was just that the current behavior is very inconsistent with the way exceptions are normally processed in solr, and doesn't give "upstream" callers the chance to catch/handle the exception.

          If we could get this out in a major version, I'd also love to make this standard behavior and not some optional update processor.

          Agreed - but for now, to minimize the invasiveness, I'd prefer to continue on the path of using a custom update processor & then later we can assess refactoring the code to make this a more integrated feature and change that update processor to a No-Op.

          (particularly since there is going to be some overhead in counting/tracking the errors)

          Oh wait a minute, are you only doing that when maxErrors is exceeded? In that case failing the request makes sense to me I guess.

          yeah, that's the context of the question ... i'm leaning towards agreeing with you, particularly since (as things stand now) the caller can access the SolrJ exception metadata to see exactly what failed (but i really wish there was an easier way to access the full response body in those cases)

          Anyhow, it does seems safer to me to not break and process all the errors.

          Yeah, that was my thinking.

          Yonik Seeley added a comment -

          I'd also love to make this standard behavior and not some optional update processor.

          +1 to that... big value in having it by default.

          Mark Miller added a comment -

          I'm not sure what exactly it should say, but that seems orthogonal to the current issue

          Seems like part of this issue to me. As you point out and fix in this issue, we really should not be calling setException explicitly; that should really be done in one place. So you fix it here, but the best way to prevent these things from creeping back in is doc. Hardly worth another issue just to add a comment to the effect of what you already wrote above, though.

          So something along the lines of: you should not generally add new calls to this method; you should throw exceptions and let the existing infra handle it.
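
          For instance, a sketch of what such a note on SolrQueryResponse.setException might look like (the wording and the field name are illustrative, not a committed change):

          /**
           * Causes the response to include this exception.
           * <p>
           * NOTE: new code should generally NOT call this method directly;
           * throw the exception instead, and let the existing request handling
           * infrastructure catch it and record it on the response.
           */
          public void setException(Exception e) {
            this.err = e; // field name is illustrative
          }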

          ASF subversion and git services added a comment -

          Commit 3a9da7ae576f35e742ec54a72da2d4224066bb63 in lucene-solr's branch refs/heads/jira/SOLR-445 from markrmiller
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=3a9da7a ]

          SOLR-445: hossman's feb 5 2016 patch

          ASF subversion and git services added a comment -

          Commit fd12a5b9f8d6319945d4445ac31e650bd1627dfc in lucene-solr's branch refs/heads/jira/SOLR-445 from markrmiller
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=fd12a5b ]

          SOLR-445: some cleanup

          Mark Miller added a comment -

          Pushed a branch of this patch with some minor cleanup I needed to remove all Eclipse errors and the comment I mentioned above.

          Hoss Man added a comment -

          and the comment I mentioned above.

I still don't understand why that's part of this issue and not SOLR-8633, where the actual bad behavior needs to be fixed (and should be done independently of, and prior to, trying to move forward here)

          Mark Miller added a comment -

          Huh? What does SOLR-8633 have to do with calling setException?

          I'd say it fits right here. Here is where it's talked about, here is where it's changed in a patch...

          Mark Miller added a comment -

          Let's not get too pedantic about adding comments to help future devs avoid bad decisions when we find bad decisions. Easier to just add the comment and make the code base a little easier to understand (which I've taken a stab at in the above branch).

          Hoss Man added a comment -

          Huh? What does SOLR-8633 have to do with calling setException?

          Sorry, nothing ... it's been a while since i looked at the specifics of that code and i spaced out on what we were actually talking about.

          ASF subversion and git services added a comment -

          Commit a58ad2a6b11077a24040810b9c6b6d84f15b055d in lucene-solr's branch refs/heads/jira/SOLR-445 from Chris Hostetter
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=a58ad2a ]

          Merge branch 'master' into jira/SOLR-445

          ASF subversion and git services added a comment -

          Commit bc5dfeeff1a182630fc3b55be3cf2f4fe164d446 in lucene-solr's branch refs/heads/jira/SOLR-445 from Chris Hostetter
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=bc5dfee ]

          SOLR-445: play nice with SOLR-8674 test changes

          ASF subversion and git services added a comment -

          Commit 2e5c5b022ed2f185b95c745ccdbd0a142e3b5794 in lucene-solr's branch refs/heads/jira/SOLR-445 from Chris Hostetter
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=2e5c5b0 ]

          SOLR-445: prune out bogus (ie: technically infeasible to be accurate) 'numAdds' code

          ASF subversion and git services added a comment -

          Commit d5fd11999e17f52939ff0a2dd54572f0a95431ca in lucene-solr's branch refs/heads/jira/SOLR-445 from Chris Hostetter
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=d5fd119 ]

          SOLR-445: loop over all distributed errors

          ASF subversion and git services added a comment -

          Commit 98e8c344b81a169f20b09742e48f423f533837f6 in lucene-solr's branch refs/heads/jira/SOLR-445 from Chris Hostetter
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=98e8c34 ]

          SOLR-445: no need to track maxErrors explicitly, that's what errors.size() is for

          ASF subversion and git services added a comment -

          Commit 08bcb769bd1e896e719ebb0b4512208c993d9c38 in lucene-solr's branch refs/heads/jira/SOLR-445 from Chris Hostetter
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=08bcb76 ]

          SOLR-445: refactor metadata key+val parsing/formatting to use a new static inner helper class (KnownErr)

          ASF subversion and git services added a comment -

          Commit 4ce376fa0f1e6acc84744582ddba6dfe9fd6f11a in lucene-solr's branch refs/heads/jira/SOLR-445 from Chris Hostetter
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=4ce376f ]

          SOLR-445: replace errors map with List<KnownErr> and tweak public so we can differentiate errors of diff types

          for example: an error on deleteById for docId1 vs an error on add for docId1

          ASF subversion and git services added a comment -

          Commit 0f0571928012c071c62fb928d2142d21fa183e2b in lucene-solr's branch refs/heads/jira/SOLR-445 from Chris Hostetter
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=0f05719 ]

          SOLR-445: basic support and tests for being tolerant of deletion failures

          the tests that send deleteByQuerys to non shard leaders are failing because the resulting error details are null, need to re-review the DBQ logic in DUP to figure out where these failures are getting lost (see testVariousDeletesViaNoCollectionClient, testVariousDeletesViaShard1NonLeaderClient, testVariousDeletesViaShard2NonLeaderClient)

          ASF subversion and git services added a comment -

          Commit 2401c9495319e1b5065b05ef3a36ee586f06b6d4 in lucene-solr's branch refs/heads/jira/SOLR-445 from Chris Hostetter
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=2401c94 ]

          SOLR-445 Merge branch 'master' into jira/SOLR-445 (pick up SOLR-8738 changes)

          ASF subversion and git services added a comment -

          Commit 0bd817d19cadf7a6c0c522ed15f75110620029b3 in lucene-solr's branch refs/heads/jira/SOLR-445 from Chris Hostetter
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=0bd817d ]

          SOLR-445: ensure maxErrors=-1 is treated as effectively unlimited

          Hoss Man added a comment -

(meant to post last Friday but was blocked by the jira outage)

          Ok ... i think things are looking pretty good on the jira/SOLR-445 branch – good enough that I'd really like some help reviewing the code & sanity checking the API (and internals for anyone who is up for it)...


          For folks who haven't been following closely, here's what the configuration looks like (from the javadocs)...

            <processor class="solr.TolerantUpdateProcessorFactory">
              <int name="maxErrors">10</int>
            </processor>
          

          When a chain with this processor is used, but maxErrors isn't exceeded, here's what the response looks like...

          $ curl 'http://localhost:8983/solr/techproducts/update?update.chain=tolerant-chain&wt=json&indent=true&maxErrors=-1' -H "Content-Type: application/json" --data-binary '{"add" : { "doc":{"id":"1","foo_i":"bogus"}}, "delete": {"query":"malformed:["}}'
          {
            "responseHeader":{
              "errors":[{
                  "type":"ADD",
                  "id":"1",
                  "message":"ERROR: [doc=1] Error adding field 'foo_i'='bogus' msg=For input string: \"bogus\""},
                {
                  "type":"DELQ",
                  "id":"malformed:[",
                  "message":"org.apache.solr.search.SyntaxError: Cannot parse 'malformed:[': Encountered \"<EOF>\" at line 1, column 11.\nWas expecting one of:\n    <RANGE_QUOTED> ...\n    <RANGE_GOOP> ...\n    "}],
              "maxErrors":-1,
              "status":0,
              "QTime":1}}
          

          Note in the above example that:

          • maxErrors can be overridden on a per-request basis
          • an effective maxErrors==-1 (either from config, or request param) means "unlimited" (under the covers it's using Integer.MAX_VALUE)
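
          As a rough sketch of that last point, the normalization amounts to something like the following (names here are illustrative, not the actual branch code):

            // Sketch: treat an effective maxErrors of -1 as "unlimited".
            static int effectiveMaxErrors(int maxErrors) {
              return (maxErrors == -1) ? Integer.MAX_VALUE : maxErrors;
            }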

          If/When maxErrors is reached for a request, then the first exception that the processor caught is propagated back to the user, and metadata is set on that exception with all of the same details about all the tolerated errors.

          This next example is the same as the previous except that instead of maxErrors=-1 the request param is now maxErrors=1...

          $ curl 'http://localhost:8983/solr/techproducts/update?update.chain=tolerant-chain&wt=json&indent=true&maxErrors=1' -H "Content-Type: application/json" --data-binary '{"add" : { "doc":{"id":"1","foo_i":"bogus"}}, "delete": {"query":"malformed:["}}'
          {
            "responseHeader":{
              "errors":[{
                  "type":"ADD",
                  "id":"1",
                  "message":"ERROR: [doc=1] Error adding field 'foo_i'='bogus' msg=For input string: \"bogus\""},
                {
                  "type":"DELQ",
                  "id":"malformed:[",
                  "message":"org.apache.solr.search.SyntaxError: Cannot parse 'malformed:[': Encountered \"<EOF>\" at line 1, column 11.\nWas expecting one of:\n    <RANGE_QUOTED> ...\n    <RANGE_GOOP> ...\n    "}],
              "maxErrors":1,
              "status":400,
              "QTime":1},
            "error":{
              "metadata":[
                "org.apache.solr.common.ToleratedUpdateError--ADD:1","ERROR: [doc=1] Error adding field 'foo_i'='bogus' msg=For input string: \"bogus\"",
                "org.apache.solr.common.ToleratedUpdateError--DELQ:malformed:[","org.apache.solr.search.SyntaxError: Cannot parse 'malformed:[': Encountered \"<EOF>\" at line 1, column 11.\nWas expecting one of:\n    <RANGE_QUOTED> ...\n    <RANGE_GOOP> ...\n    ",
                "error-class","org.apache.solr.common.SolrException",
                "root-error-class","java.lang.NumberFormatException"],
              "msg":"ERROR: [doc=1] Error adding field 'foo_i'='bogus' msg=For input string: \"bogus\"",
              "code":400}}
          

          ...the added exception metadata ensures that even in client code like the various SolrJ SolrClient implementations, which throw a (client side) exception on non-200 responses, the end user can access info on all the tolerated errors that were ignored before the maxErrors threshold was reached.

CloudSolrClient in particular – which already has logic to split UpdateRequests, route individual commands to the appropriate leaders, and merge the responses – has been updated to handle merging these responses as well.

          (The ToleratedUpdateError class for modeling these types of errors has been added to solr-common, and has static utilities that client code can use to parse the data out of the responseHeader or out of any client side SolrException metadata)
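
          For example, client code could dig the tolerated errors back out of a client-side SolrException along these lines (a hedged sketch: the metadata-key prefix is the one visible in the example response above, but this helper is illustrative rather than one of the actual static utilities):

            import java.util.Map;
            import org.apache.solr.common.SolrException;
            import org.apache.solr.common.util.NamedList;

            // Sketch only: scan SolrException metadata for the key prefix shown in
            // the example response above; the real static utilities added on
            // ToleratedUpdateError may differ in shape.
            public class ToleratedErrorsSketch {
              static final String PREFIX = "org.apache.solr.common.ToleratedUpdateError--";

              public static void printToleratedErrors(SolrException e) {
                NamedList<String> metadata = e.getMetadata();
                if (metadata == null) return;
                for (Map.Entry<String, String> entry : metadata) {
                  String key = entry.getKey();
                  if (key != null && key.startsWith(PREFIX)) {
                    // key looks like "...--ADD:1"; the value is the error message
                    System.out.println(key.substring(PREFIX.length()) + " => " + entry.getValue());
                  }
                }
              }
            }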


          There are still a bunch of nocommit comments, but they are almost all related to either:

          • adding tests
          • adding docs
          • refactoring / hardening some internal APIs
• removing suspected unnecessary "isLeader" code (once tests are final)

          I'll keep working on those, but I'd appreciate feedback from folks on how things currently stand.

          Even if you don't understand/care about the internals, thoughts on the user facing API would be appreciated.

          Timothy Potter added a comment -

          digging into this now, thanks Hoss Man

          Anshum Gupta added a comment -

          Thanks Hoss! I'll take a look at this tonight or tomorrow morning.

          David Smiley added a comment -

          The usage you explained looks nice Hoss! This will be very useful. I assume there are SolrJ hooks here. I'll leave the internal review to others.

          ASF subversion and git services added a comment -

          Commit 116ffaec6f6680c7312cb87680f8463df862f1f0 in lucene-solr's branch refs/heads/jira/SOLR-445 from Chris Hostetter
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=116ffae ]

          SOLR-445: test failures from adds mixed with deletes

          Anshum Gupta added a comment -

Hoss Man, what's the best way to look at the diff here?
          I tried the following, but it gives a ton of unrelated stuff, which I assume is because this branch isn't up to date with the current master:

          git diff master
          

          I also tried this after Shawn Heisey suggested it on irc, but it's the same result:

          # after checking out and doing a pull for both, master and jira/SOLR-445
          git diff refs/heads/master refs/heads/jira/SOLR-445
          
          Hoss Man added a comment -

          git diff master...jira/SOLR-445 should be what you are looking for. (note: three dots)

if you want to review the list of individual commits, that's git log master..jira/SOLR-445 (note: two dots)

(using two dots with diff, or three dots with log, gives you totally different and exciting and confusing behavior)
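
          In other words (using the branch name from this issue):

            # combined diff of everything unique to the branch (three dots)
            git diff master...jira/SOLR-445

            # the individual commits unique to the branch (two dots)
            git log master..jira/SOLR-445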

          ASF subversion and git services added a comment -

          Commit a0d48f873c21ca0ab5ba02748c1659a983aad886 in lucene-solr's branch refs/heads/jira/SOLR-445 from Chris Hostetter
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=a0d48f8 ]

          SOLR-445: start of a new randomized/chaosmonkey test, currently blocked by SOLR-8862 (no monkey yet)

          ASF subversion and git services added a comment -

          Commit 8cc0a38453b389bdb031d78ad638b76dfa27f2d5 in lucene-solr's branch refs/heads/jira/SOLR-445 from Chris Hostetter
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=8cc0a38 ]

          SOLR-445: Merge branch 'master' into jira/SOLR-445

          ASF subversion and git services added a comment -

          Commit aeda8dc4ae881c4ec405d70dcbf1d0b2c30871b7 in lucene-solr's branch refs/heads/jira/SOLR-445 from Chris Hostetter
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=aeda8dc ]

          SOLR-445: fix test bugs, and put in a stupid work around for SOLR-8862

          ASF subversion and git services added a comment -

          Commit 1aa1ba3b3af69cad65b7a411ca88e120a418a598 in lucene-solr's branch refs/heads/jira/SOLR-445 from Chris Hostetter
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=1aa1ba3 ]

          SOLR-445: harden & add logging to test

          also rename since chaos monkey isn't going to be involved (due to SOLR-8872)

          ASF subversion and git services added a comment -

          Commit 6ec8c635bf5853dfb229f89cb2818749c1cfe8ce in lucene-solr's branch refs/heads/jira/SOLR-445 from Chris Hostetter
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=6ec8c63 ]

          SOLR-445: cleanup some simple nocommits

          ASF subversion and git services added a comment -

          Commit 21c0fe690dc4e968e484ee906632a50bf0273786 in lucene-solr's branch refs/heads/jira/SOLR-445 from Chris Hostetter
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=21c0fe6 ]

          SOLR-445: hardent the ToleratedUpdateError API to hide implementation details

          Anshum Gupta added a comment -

          Thanks Hoss.

          I didn't look at the recent commits but from whatever I reviewed, this looks good. A bunch of nocommits but good stuff overall.
          I'll try and pitch in.

          Tomás Fernández Löbbe added a comment -

          Looks really good to me.
What is the impact of many docs failing due to missing ID? Is there a test for that? I couldn't find one, but the diff is pretty big, so I may have missed stuff.
          Don't know the answer to the "isLeader" question. I'd say the request would fail if leader changes in the middle of a request, but I'm not sure.

          ASF subversion and git services added a comment -

          Commit fe54da0b58ed18a38f3dd436dd3f30fbee9acbbf in lucene-solr's branch refs/heads/jira/SOLR-445 from Chris Hostetter
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=fe54da0 ]

          SOLR-445: remove nocommits related to OOM trapping since SOLR-8539 has concluded that this isn't a thing the java code actually needs to be defensive of

          ASF subversion and git services added a comment -

          Commit 5d93384e724b6f611270e212a4f9bd5b00c38e85 in lucene-solr's branch refs/heads/jira/SOLR-445 from Chris Hostetter
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=5d93384 ]

          SOLR-445: fix exception msg when CloudSolrClient does async updates that (cumulatively) exceed maxErrors

          I initially thought it would make sense to refactor DistributedUpdatesAsyncException into solr-common and re-use it here, but when i started down that path i realized it didn't make any sense since there aren't actual exceptions to wrap client side.

          ASF subversion and git services added a comment -

          Commit cc2cd23ca2537324dc7e4afe6a29605bbf9f1cb8 in lucene-solr's branch refs/heads/jira/SOLR-445 from Chris Hostetter
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=cc2cd23 ]

          SOLR-445: cloud test & bug fix for docs missing their uniqueKey field

          Hoss Man added a comment -

          What is the impact of many docs failing due to missing ID? Is there a test for that? I couldn't find one, but the diff is pretty big, I may have missed stuff.

good question – there were checks of this in TolerantUpdateProcessorTest (from the early days of this patch), but i added some to TestTolerantUpdateProcessorCloud, which uncovered a bug (now fixed) when checking isLeader – see: cc2cd23ca2537324dc7e4afe6a29605bbf9f1cb8

          Don't know the answer to the "isLeader" question. I'd say the request would fail if leader changes in the middle of a request, but I'm not sure.

Hmm... can you explain more what you think/expect could go wrong with the isLeader code removed that wouldn't go wrong with the code as it is today? I mean ... theoretically, even with the isLeader check as we have it right now, the leader could change between the time we do the isLeader check and the call to super.processAdd (where DUP will do its own isLeader check) ... or it could change (again) between the time super.processAdd/DUP.processAdd throws an exception and the time we decide whether to only track it, or track it and immediately re-throw.

I'm just not sure if that added code is really gaining us anything useful – but if someone can help me understand (or better still: demonstrate with a test) a concrete situation where the current code does the correct thing but removing the isLeader check is broken, then i'll be convinced.


          Where things currently stand:

          • The only remaining nocommits on the branch are questions about deleting the isLeader code, and questions about deleting DistribTolerantUpdateProcessorTest since we have other more robust cloud tests now.
• Even with the "retry after giving searchers time to reopen" logic in TestTolerantUpdateProcessorRandomCloud, i'm seeing a failure that reproduces consistently for me...
               [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestTolerantUpdateProcessorRandomCloud -Dtests.method=testRandomUpdates -Dtests.seed=ECFD2B9118A542E7 -Dtests.slow=true -Dtests.locale=bg -Dtests.timezone=Asia/Taipei -Dtests.asserts=true -Dtests.file.encoding=UTF-8
               [junit4] FAILURE 6.00s | TestTolerantUpdateProcessorRandomCloud.testRandomUpdates <<<
               [junit4]    > Throwable #1: java.lang.AssertionError: cloud client doc count doesn't match bitself cardinality expected:<22> but was:<23>
            

            ...so i'm currently working to improve the logging and trace through the test to understand that.

          ASF subversion and git services added a comment -

          Commit ae22181193dcb24707f7255f0132a2a0a85bf300 in lucene-solr's branch refs/heads/jira/SOLR-445 from Chris Hostetter
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=ae22181 ]

          SOLR-445: fix silly test bug

          ASF subversion and git services added a comment -

          Commit 956d9a592a0a6e9c9d7c8244a4289f4cbf5d5012 in lucene-solr's branch refs/heads/jira/SOLR-445 from Chris Hostetter
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=956d9a5 ]

          SOLR-445: more testing of DBQ mixed with failures

          (trying to staticly recreate a random failure i haven't fully figured out yet)

          ASF subversion and git services added a comment -

          Commit 2622eac2915ee210cfffd1969ef5dd8e2030e5cf in lucene-solr's branch refs/heads/jira/SOLR-445 from Chris Hostetter
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=2622eac ]

          SOLR-445: harden checks in random test; add isoluated cloud test demonstrating bug random test found; add nocommit hack to DUP to work around test failure for now

          (SOLR-8890 to fix a better way)

          ASF subversion and git services added a comment -

          Commit a4686553712a0d01dc2d6853038c4cca2caee63f in lucene-solr's branch refs/heads/jira/SOLR-445 from Chris Hostetter
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=a468655 ]

          SOLR-445: randomized testing of the 'doc missing unique key' code path

          ASF subversion and git services added a comment -

          Commit da3ea40e80189c7c2bbd8114a99c72a64262786b in lucene-solr's branch refs/heads/jira/SOLR-445 from Chris Hostetter
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=da3ea40 ]

          SOLR-8890: generalized whitelist of param names DUP will use when forwarding requests, usage in SOLR-445

          ASF subversion and git services added a comment -

          Commit 39884c0b0c02b4090640d6268a45a1cf5f54f3e0 in lucene-solr's branch refs/heads/jira/SOLR-445 from Chris Hostetter
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=39884c0 ]

          SOLR-445: removing questionable isLeader check; beasting the tests w/o this code didn't demonstrate any problems

          ASF subversion and git services added a comment -

          Commit 1d8cdd27993a46ae17c4ac308504513a33f01a15 in lucene-solr's branch refs/heads/jira/SOLR-445 from Chris Hostetter
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=1d8cdd2 ]

          SOLR-445: remove test - we have more complete coverage in TestTolerantUpdateProcessorCloud which uses the more robust SolrCloudTestCase model

          ASF subversion and git services added a comment -

          Commit b08c284b26b1779d03693a45e219db89839461d0 in lucene-solr's branch refs/heads/jira/SOLR-445 from Chris Hostetter
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=b08c284 ]

          SOLR-445: fix logger declaration to satisfy precommit

          Hoss Man added a comment -

          I'm still beasting the tests a bit, but i think this is pretty solid and ready for master/branch_6x

          Hoss Man added a comment -

updated summary to reflect basic information about the feature being added

          Timothy Potter added a comment -

LGTM +1. Nice test coverage of all this! This will be very useful for streaming applications (such as from Spark and Storm) where retrying individual docs after an error is less than ideal. Now we'll be able to pinpoint exactly which docs had issues!

          I'd prefer this to be baked into the default chain but can understand the rationale for leaving it out for now too. So long as we put up an example of how to enable it using the Config API in the ref guide.
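
          For reference, enabling it via the Config API might look roughly like this (a hedged sketch using the add-updateprocessor command; the collection and processor names are illustrative):

            $ curl -X POST -H 'Content-type:application/json' --data-binary '{
              "add-updateprocessor": {
                "name": "tolerant",
                "class": "solr.TolerantUpdateProcessorFactory",
                "maxErrors": 10
              }
            }' 'http://localhost:8983/solr/techproducts/config'

          Individual requests could then opt in with a processor=tolerant request param (plus a maxErrors override if desired), assuming the usual named-processor request params apply here.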

          ASF subversion and git services added a comment -

          Commit f051f56be96b12f1f3e35978ca4c840ae06a801f in lucene-solr's branch refs/heads/master from Chris Hostetter
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=f051f56 ]

          SOLR-445: new ToleranteUpdateProcessorFactory to support skipping update commands that cause failures when sending multiple updates in a single request.

          SOLR-8890: New static method in DistributedUpdateProcessorFactory to allow UpdateProcessorFactories to indicate request params that should be forwarded when DUP distributes updates.

          This commit is a squashed merge from the jira/SOLR-445 branch (as of b08c284b26b1779d03693a45e219db89839461d0)

          ASF subversion and git services added a comment -

          Commit 5b6eacb80bca5815059cd50a1646fa4ecb146e43 in lucene-solr's branch refs/heads/branch_6x from Chris Hostetter
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=5b6eacb ]

          SOLR-445: new ToleranteUpdateProcessorFactory to support skipping update commands that cause failures when sending multiple updates in a single request.

          SOLR-8890: New static method in DistributedUpdateProcessorFactory to allow UpdateProcessorFactories to indicate request params that should be forwarded when DUP distributes updates.

          This commit is a squashed merge from the jira/SOLR-445 branch (as of b08c284b26b1779d03693a45e219db89839461d0)

          ASF subversion and git services added a comment -

          Commit b8c0ff66f958e5e199874059b0427ea267778c3a in lucene-solr's branch refs/heads/branch_6x from Chris Hostetter
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=b8c0ff6 ]

          SOLR-445: Merge remote-tracking branch 'refs/remotes/origin/branch_6x' into branch_6x

          (picking up mid backport conflicts)


People

            • Assignee: Hoss Man
            • Reporter: Will Johnson
            • Votes: 6
            • Watchers: 21