SOLR-445

Update Handlers abort with bad documents

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.3
    • Fix Version/s: 4.9, 5.0
    • Component/s: update
    • Labels:
      None

      Description

      Has anyone run into the problem of handling bad documents / failures mid-batch? For example:

      <add>
      <doc>
      <field name="id">1</field>
      </doc>
      <doc>
      <field name="id">2</field>
      <field name="myDateField">I_AM_A_BAD_DATE</field>
      </doc>
      <doc>
      <field name="id">3</field>
      </doc>
      </add>

      Right now Solr adds the first doc and then aborts. It seems like it should either fail the entire batch, or log a message / return a code and then continue on to add doc 3. Option 1 would seem to be much harder to accomplish and possibly require more memory, while Option 2 would require more information to come back from the API. I'm about to dig into this, but I thought I'd ask to see if anyone had any suggestions, thoughts or comments.

      1. SOLR-445-alternative.patch
        23 kB
        Tomás Fernández Löbbe
      2. SOLR-445-alternative.patch
        23 kB
        Tomás Fernández Löbbe
      3. SOLR-445-3_x.patch
        20 kB
        Erick Erickson
      4. solr-445.xml
        0.7 kB
        Erick Erickson
      5. SOLR-445.patch
        42 kB
        Erick Erickson
      6. SOLR-445.patch
        18 kB
        Erick Erickson
      7. SOLR-445.patch
        45 kB
        Erick Erickson
      8. SOLR-445.patch
        44 kB
        Erick Erickson
      9. SOLR-445_3x.patch
        46 kB
        Erick Erickson

        Issue Links

          Activity

          Ryan McKinley added a comment -

          yup... as is, it keeps anything it had before it fails.

          Check http://wiki.apache.org/solr/UpdateRequestProcessor for a way to intercept the standard processing. One could easily write an UpdateRequestProcessor to make sure all documents are valid before trying to commit them (that is, queue everything up until finish()).

          Another approach may be to keep track of the docs that were added correctly, then remove them if something goes wrong....
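The queue-until-finish() idea could be modeled like this. This is a minimal, hedged sketch in plain Java: BufferingProcessor and its Map-based document are hypothetical stand-ins for a real Solr UpdateRequestProcessor and SolrInputDocument, not Solr's actual API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch of "validate everything, then commit": nothing is indexed until
// the whole batch has passed validation in finish().
class BufferingProcessor {
    private final List<Map<String, String>> queued = new ArrayList<>();

    // Called once per document: validate only, do not index yet.
    public void processAdd(Map<String, String> doc) {
        if (!doc.containsKey("id")) {
            throw new IllegalArgumentException("document missing required field: id");
        }
        queued.add(doc);
    }

    // Called at the end of the batch: every doc validated, so index them all.
    public List<Map<String, String>> finish() {
        return queued;  // in real code, forward each doc to the next processor
    }
}
```

The trade-off Ryan mentions is visible here: the whole batch is held in memory until finish().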

          Yonik Seeley added a comment -

          Option 2 seems like a good approach... return a list of documents that failed, why they failed, etc, and that could be logged for a human to check later.

          Grant Ingersoll added a comment -

          Option 2 seems like a good approach... return a list of documents that failed, why they failed, etc, and that could be logged for a human to check later.

          +1

          And it should not just be for the XML version, right? Anything that can take a batch of documents should be able to return a list of those that failed along with the reason they failed.

          Grant Ingersoll added a comment -

           Is it reasonable to use the AddUpdateCommand to communicate out of the UpdateHandler that a given document failed? For instance, the update handler could catch any exception and add it onto the command (the next reuse would have to reset it); the various request handlers (XML/CSV) could then check whether the exception is set, add the document to a list of failed docs, and add the failed docs to the response, which can then be written out as needed by the writers.
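This suggestion might look roughly like the following sketch. AddCommand, its failure field, and the "bad" id convention are all hypothetical stand-ins; Solr's real AddUpdateCommand differs.

```java
// A reusable command object carries any failure back out of the update
// handler instead of aborting the batch.
class AddCommand {
    String docId;
    Exception failure;        // set by the update handler on error

    void reset(String newDocId) {  // reuse for the next doc: clear old state
        docId = newDocId;
        failure = null;
    }
}

class UpdateHandlerSketch {
    // Returns the ids of docs that failed, as the request handler would
    // collect them for the response.
    static java.util.List<String> addAll(java.util.List<String> docIds) {
        java.util.List<String> failed = new java.util.ArrayList<>();
        AddCommand cmd = new AddCommand();
        for (String id : docIds) {
            cmd.reset(id);
            try {
                if (id.startsWith("bad")) {   // stand-in for a real indexing error
                    throw new RuntimeException("cannot index " + id);
                }
            } catch (Exception e) {
                cmd.failure = e;              // attach instead of aborting
            }
            if (cmd.failure != null) {
                failed.add(cmd.docId);
            }
        }
        return failed;
    }
}
```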

          Shalin Shekhar Mangar added a comment -

          Marking for 1.5

          Hoss Man added a comment -

          Bulk updating 240 Solr issues to set the Fix Version to "next" per the process outlined in this email...

          http://mail-archives.apache.org/mod_mbox/lucene-dev/201005.mbox/%3Calpine.DEB.1.10.1005251052040.24672@radix.cryptio.net%3E

           Selection criteria were "Unresolved" with a Fix Version of 1.5, 1.6, 3.1, or 4.0. Email notifications were suppressed.

          A unique token for finding these 240 issues in the future: hossversioncleanup20100527

          Erick Erickson added a comment -

          Here's a cut at an improvement at least.

           The attached XML file contains an <add> packet with several documents illustrating a variety of errors. The XML file can be POSTed to Solr for indexing via the post.jar file so you can see the output.

          This patch attempts to report back to the user the following for each document that failed:
          1> the ordinal position in the file where the error occurred (e.g. the first, second, etc <doc> tag).
          2> the <uniqueKey> if available.
          3> the error.

          The general idea is to accrue the errors in a StringBuilder and eventually re-throw the error after processing as far as possible.
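A minimal model of that accrue-and-rethrow flow, under the assumption of a simple Map-based document; everything here is illustrative, not the patch's actual code.

```java
// Keep going past bad documents, accrue "ordinal | id | error" entries in a
// StringBuilder, and re-throw once the whole batch has been attempted.
class ErrorAccruingLoader {
    static int load(java.util.List<java.util.Map<String, String>> docs) {
        StringBuilder errors = new StringBuilder();
        int added = 0;
        int ordinal = 0;
        for (java.util.Map<String, String> doc : docs) {
            ordinal++;
            try {
                if (!doc.containsKey("id")) {
                    throw new IllegalArgumentException("missing uniqueKey");
                }
                added++;                       // stand-in for the actual add
            } catch (Exception e) {
                errors.append(ordinal).append(" | ")
                      .append(doc.getOrDefault("id", "?")).append(" | ")
                      .append(e.getMessage()).append('\n');
            }
        }
        if (errors.length() > 0) {
            throw new RuntimeException(errors.toString());
        }
        return added;
    }
}
```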

          Issues:
          1> the reported format in the log file is kind of hard to read. I pipe-delimited the various <doc> tags, but they run together in a Windows DOS window. What happens on Unix I'm not quite sure. Suggestions welcome.
          2> From the original post, rolling this back will be tricky. Very tricky. The autocommit feature makes it indeterminate what's been committed to the index, so I don't know how to even approach rolling back everything.
          3> The intent here is to give the user a clue where to start when figuring out what document(s) failed so they don't have to guess.
          4> Tests fail, but I have no clue why. I checked out a new copy of trunk and that fails as well, so I don't think that this patch is the cause of the errors. But let's not commit this until we can be sure.
          5> What do you think about limiting the number of docs that fail before quitting? One could imagine some ratio (say 10%) that have to fail before quitting (with some safeguards, like don't bother calculating the ratio until 20 docs had been processed or...). Or an absolute number. Should this be a parameter? Or hard-coded? The assumption here is that if 10 (or 100 or..) docs fail, there's something pretty fundamentally wrong and it's a waste to keep on. I don't have any strong feeling here, I can argue it either way....
          6> Sorry, all, but I reflexively hit the reformat keystrokes so the raw patch may be hard to read. But I'm pretty well in the camp that you have to reformat as you go or the code will be held hostage to the last person who didn't format properly. I'm pretty sure I'm using the right codestyle.xml file, but let me know if not.
          7> I doubt that this has any bearing on, say, SolrJ indexing. Should that be another bug (or is there one already)? Anybody got a clue where I'd look for that since I'm in the area anyway?

          Erick
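Point 5> might be sketched as a small predicate; the 10% ratio and the 20-doc floor are just the illustrative numbers from the comment above, not anything Solr defines.

```java
// Abort once the failure ratio crosses a threshold, but only after enough
// docs have been seen to make the ratio meaningful.
class FailureCutoff {
    static boolean shouldAbort(int processed, int failed) {
        if (processed < 20) return false;   // too few docs to judge
        return failed * 10 > processed;     // more than 10% failed
    }
}
```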

          Lance Norskog added a comment -

          2> From the original post, rolling this back will be tricky. Very tricky. The autocommit feature makes it indeterminate what's been committed to the index, so I don't know how to even approach rolling back everything.

          Don't allow autocommits during an update. Simple. Or, rather, all update requests block at the beginning during an autocommit. If an update request has too many documents, don't do so many documents in an update.

          Erick Erickson added a comment -

          I think it's ready for review, both trunk and 3_x. Would someone look this over and commit it if they think it's ready?

          Note to self: do NOT call initCore in a test case just because you need a different schema.

          The problem I was having with running tests was because I needed a schema file with a required field so I naively called initCore with schema11.xml in spite of the fact that @BeforeClass called it with just schema.xml. Which apparently does bad things with the state of something and caused other tests to fail... I can get TestDistributedSearch to fail on unchanged source code simply by calling initCore with schema11.xml and doing nothing else in a new test case in BasicFunctionalityTest. So I put my new tests that required schema11 in a new file instead.

          The XML file attached is not intended to be committed, it is just a convenience for anyone checking out this patch to run against a Solr instance to see what is returned.

          This seems to return the data in the SolrJ case as well.

          NOTE: This does change the behavior of Solr. Without this patch, the first document that is incorrect stops processing. Now, it continues merrily on adding documents as it can. Is this desirable behavior? It would be easy to abort on first error if that's the consensus, and I could take some tedious record-keeping out. I think there's no big problem with continuing on, since the state of committed documents is indeterminate already when errors occur so worrying about this should be part of a bigger issue.

          Simon Rosenthal added a comment -

          Don't allow autocommits during an update. Simple. Or, rather, all update requests block at the beginning during an autocommit. If an update request has too many documents, don't do so many documents in an update. (Lance)

           Lance - How do you (dynamically) disable autocommits during a specific update? That functionality would also be useful in other use cases, but that's another issue.

          NOTE: This does change the behavior of Solr. Without this patch, the first document that is incorrect stops processing. Now, it continues merrily on adding documents as it can. Is this desirable behavior? It would be easy to abort on first error if that's the consensus, and I could take some tedious record-keeping out. I think there's no big problem with continuing on, since the state of committed documents is indeterminate already when errors occur so worrying about this should be part of a bigger issue.

          I think it should be an option, if possible. I can see use cases where abort-on-first-error is desirable, but also situations where you know one or two documents may be erroneous, and its worth continuing on in order to index the other 99%

          Erick Erickson added a comment -

          From several discussions, it sounds like my test issues are unrelated to the latest patch (I did fix the test case to not do bad things). So I guess if anyone wants to pick this up, it's ready for review and/or commit.

          Unless we want to make some parameter like <abortOnFirstIndexingFailure> or some such. If the consensus is to add such an option, it seems like it needs to go in one of the config files because this code is also reachable by SolrJ. Probably default to "true" to preserve present functionality?

          Thoughts?
          Erick

          Lance Norskog added a comment -

          Unless we want to make some parameter like <abortOnFirstIndexingFailure> or some such.

          Yes. Yes. Yes. Fail Early, Fail Often is a common system design style. Meaning, if I have a data quality problem kick me in the head about it. For example, my online store should keep yesterday's product lineup rather than load a Halloween costume with no categories:

          <field ... required="true" .../>
          

          is there for a reason.

          Erick Erickson added a comment -

           Is there any particular preference for where the tag for making this configurable should go? My first impulse would be solrconfig.xml, <indexDefaults>, but I wanted to ask first to see if I'm reading the intent correctly. Leaving the field out would be equivalent to <abortOnFirstBatchIndexError>true</...>

           Oh my, that is a clumsy name; any better suggestions?
          processAllBatchIndexDocsPossible (default to false)
          continueAddingDocsInBatchIfErrors (default to false)

          I kind of like the last one.

          Erik Hatcher added a comment -

           This is solely for the XML loader, right Erick? If so, it seems to me the setting really belongs with the /update handler configuration. [This is reminiscent of the schema.xml setting for the default operator and/or, which is awkwardly placed when it really belongs with the qparser declaration.]

          On a related note - are there comparable issues with other batch indexers like the CSV one?

          Erick Erickson added a comment -

          bq: If so, the setting really belongs with the /update handler configuration seems to me
          Thanks, this is what I was looking for, I'll try to make it so unless there are objections. This goes in <updateHandler class="solr.DirectUpdateHandler2">, right?

          bq: On a related note - are there comparable issues with other batch indexers like the CSV one?
          I assume so, I haven't looked. I'd like to raise that as a separate JIRA if we want to address it.

          Erick Erickson added a comment -

          OK, I think this is ready to go if someone wants to take a look and commit.

          This patch includes the ability to turn on continuing to process documents after the first failure, as per Erik H's comments. The default is the old behavior of stopping upon the first error.

           Changed the example solrconfig.xml to include the new parameter as false (mimicking the old behavior) in both 3x and trunk.
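As a sketch of what that example configuration might look like, assuming the placement Erick proposed earlier in the thread; the element name is one of the candidates floated above and should be treated as hypothetical:

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- hypothetical name from this thread; false mimics the old
       stop-on-first-error behavior -->
  <continueAddingDocsInBatchIfErrors>false</continueAddingDocsInBatchIfErrors>
</updateHandler>
```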

          Lance Norskog added a comment -

          The general verb is 'update' not 'index', so... abortUpdateOnError default to true is my vote.

          The extraction handler by nature takes one document. The DIH has its own system for this that works pretty well.

          Grant Ingersoll added a comment -

          This patch looks pretty reasonable from the details of the implementation, but I don't think it's quite ready for commit yet.

           First, we should be able to extend this to all loaders that implement ContentStreamLoader (JSONLoader, CSVLoader) if they want it (it doesn't make sense for the SolrCell stuff).

          As I see it, we can do this by putting some base functionality into ContentStreamLoader which does what is done in this patch.
          I think we need two methods, one that handles the immediate error (takes in a StringBuilder and the info about the doc that failed) and decides whether to abort or buffer the error for later reporting depending on the configuration setting.

          I don't think the configuration of the item belongs in the UpdateHandler. Erik H. meant that it goes in the configuration of the /update RequestHandler in the config, not the DirectUpdateHandler2, as in

          <requestHandler name="/update" class="solr.XmlUpdateRequestHandler" />

           This config could be a request param just like any other (such that one could even override it per request via the defaults, appends, and invariants).
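For illustration, such a request-param default might be declared like this; the param name continueOnError is hypothetical, and only the <requestHandler> line itself comes from the comment above:

```xml
<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
  <lst name="defaults">
    <!-- hypothetical param name; overridable per request like any
         other request parameter -->
    <bool name="continueOnError">true</bool>
  </lst>
</requestHandler>
```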

          Also, I know it is tempting to do so, but please don't reformat the code in the patch. It slows down review significantly. In general, I try to reformat right before committing as do most committers.

          Grant Ingersoll added a comment -

          Oh, one other thing. You don't need to produce a 3.x patch. We can just do an SVN merge.

          Lance Norskog added a comment -

          The SolrJ StreamingUpdateSolrServer needs updating for this. It needs a lot of other updating also.

          Erick Erickson added a comment -

          Grant:

          Thanks for the comments. I'd assumed there was a better, more general fix, but what about the near term? Does it make any sense to re-work this enough to put the parameter in the update handler and commit? I think there's value in the patch as-is that would provide better-but-not-comprehensive help to users, at least in the XML case, rather than defer any improvement at all into the indefinite future.

          As always, though, up to you.

          Erick

          Erick Erickson added a comment -

          So, Grant. How do you feel about refactorings <G>?

          I got bitten by this problem again so I decided to dust off the patch, and I re-created it. This one shouldn't have the gratuitous re-formatting. But, after I added the bookkeeping, the method got even more unwieldy, so I extracted some of the code to methods in XMLLoader. I also have the un-refactored version if this one is too painful.

          This patch incorporates the changes you suggested months ago. I'm a little uncertain whether putting a constant in UpdateParams.java was the correct place, but it seemed like a pattern used for other parameters.

          One minor issue: The behavior is the same here as it used to be if you don't start the packet with <add>. An NPE is thrown. That's because the addCmd variable isn't initialized until the <add> tag is encountered and the NPE is a result of using the addCmd variable later (I think I was seeing it at line 118). I think it would be better to fail if the first element wasn't an <add> element rather than because it just happens to cause an NPE.
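Failing fast could be as simple as checking the root element before doing anything else. A hedged StAX sketch, not the real XMLLoader code:

```java
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import java.io.StringReader;

// Reject a payload whose root element is not <add> with a clear message,
// instead of letting an uninitialized command produce an NPE later.
class AddRootCheck {
    static void requireAddRoot(String xml) throws XMLStreamException {
        XMLStreamReader r = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        r.nextTag();  // advance to the root element
        if (!"add".equals(r.getLocalName())) {
            throw new XMLStreamException(
                "expected <add> as the root element, found <" + r.getLocalName() + ">");
        }
    }
}
```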

           While I'm at it, though, what do you think about making this robust enough to ignore <?xml?> and/or <!DOCTYPE> entries? Or is that just not worth the bother?

          Erick

          Erick Erickson added a comment -

          Anyone willing to pick this up? It's one of those things that's not technically very interesting, but makes end-users' lives easier...

          Erick

          Grant Ingersoll added a comment -

          What about addressing the other handlers? Any progress on that?

          Erick Erickson added a comment -

          Nope. The framework is in place for anything that derives from the base class, but only XMLLoader uses it so far.

          Does it make sense to open separate JIRAs for those? Which ones do you think are most useful next?

          Grant Ingersoll added a comment -

          We should fix this for all the update handlers.

          Lance Norskog added a comment - - edited

          If the DIH semantics cover all of the use cases, please follow that model: behavior, names, etc. It will be much easier on users.

          Shinichiro Abe added a comment -

          Solr Cell has the same problem: it aborts mid-batch while posting protected files (SOLR-2480).
          I hope the update handlers can be fixed by applying that model.

          Grant Ingersoll added a comment -

          Erick, feel free to take this one and iterate as you see fit

          Robert Muir added a comment -

          Bulk move 3.2 -> 3.3

          Robert Muir added a comment -

          3.4 -> 3.5

          Erick Erickson added a comment -

          Well, it's clear I won't get to this in the 3.6 time frame, so if someone else wants to pick it up, feel free. However, I also wonder whether with 4.0 and SolrCloud we have to approach this differently to accommodate how documents are passed around there?

          Yonik Seeley added a comment -

          I imagine a maxErrors parameter might be useful (and more readable than abortOnFirstBatchIndexError)

          maxErrors=0 (the current behavior - stop processing more updates when we hit an error)
          maxErrors=10 (allow up to 10 documents to fail before aborting the update... useful for true bulk uploading where you want to allow for an isolated failure or two, but still want to stop if every single update is failing because something is configured wrong)
          maxErrors=-1 (allow an unlimited number of documents to fail)

          Making updates transactional seems really tough in cloud mode since we don't keep old versions of documents around... although it might be possible for a short time with the transaction log. Anyway, that should definitely be a separate issue.

          A couple of other notes:

          • structured error responses were recently added in 4.0-dev that should make this issue easier in general. Example:
            {"responseHeader":{"status":400,"QTime":0},"error":{"msg":"ERROR: [doc=mydoc] unknown field 'foo'","code":400}}
            
          • Per did some error handling work that's included in a patch attached to SOLR-3178
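
          For what it's worth, the maxErrors semantics described above can be sketched in a few lines (a hypothetical helper, not part of any attached patch):

          ```java
          import java.util.ArrayList;
          import java.util.List;
          import java.util.function.Consumer;

          /** Minimal sketch of tolerant batch processing with a maxErrors cutoff. */
          public class TolerantBatch {
              /**
               * Applies {@code indexer} to each doc, collecting failure messages.
               * maxErrors = 0 reproduces the current abort-on-first-error behavior,
               * maxErrors = N allows up to N failures before aborting,
               * maxErrors = -1 allows an unlimited number of failures.
               */
              static <T> List<String> process(List<T> docs, Consumer<T> indexer, int maxErrors) {
                  List<String> errors = new ArrayList<>();
                  for (T doc : docs) {
                      try {
                          indexer.accept(doc);
                      } catch (RuntimeException e) {
                          errors.add(e.getMessage());
                          if (maxErrors >= 0 && errors.size() > maxErrors) {
                              break; // too many failures: stop processing further docs
                          }
                      }
                  }
                  return errors;
              }
          }
          ```

          The returned list could then be reported back per document, along the lines of the structured error responses mentioned above.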
          Steve Rowe added a comment -

          Bulk move 4.4 issues to 4.5 and 5.0

          Tomás Fernández Löbbe added a comment -

          This is a different approach to this issue. The errors are managed by an UpdateRequestProcessor that must be added before other processors in the chain. It accepts maxErrors in the configuration as a default, or as a request parameter. If used, the default maxErrors value is Integer.MAX_VALUE; to get the current behavior one would set it to 0 (however, it wouldn't make much sense to add the processor to the chain in that case, unless the value depends on the request parameter).
          This handles only bad documents, not the other failures mentioned in previous comments (like Tika parsing exceptions, etc.).
          The response will look something like:

          <?xml version="1.0" encoding="UTF-8"?>
          <response>
          <lst name="responseHeader">
            <int name="numErrors">10</int>
            <lst name="errors">
              <lst name="1">
                <str name="message">ERROR: [doc=1] Error adding field 'weight'='b' msg=For input string: "b"</str>
              </lst>
              <lst name="3">
                <str name="message">ERROR: [doc=3] Error adding field 'weight'='b' msg=For input string: "b"</str>
              </lst>
          ...
            <int name="status">0</int>
            <int name="QTime">17</int>
          </lst>
          </response>
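
          To illustrate, a chain using such a processor might be configured along these lines (the class name follows the later review comments; exact names and defaults may differ from the final patch):

          ```xml
          <!-- sketch only: hypothetical chain layout for the tolerant processor -->
          <updateRequestProcessorChain name="tolerant-chain">
            <processor class="solr.TolerantUpdateProcessorFactory">
              <int name="maxErrors">10</int>
            </processor>
            <processor class="solr.LogUpdateProcessorFactory"/>
            <processor class="solr.RunUpdateProcessorFactory"/>
          </updateRequestProcessorChain>
          ```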
          
          Hoss Man added a comment -

          The errors are managed by an UpdateRequestProcessor that must be added before other processors in the chain.

          Off the cuff: this sounds like a great idea.

          The one piece of feedback that occurred to me, though, would be to tweak the response format so that there is a 1-to-1 correspondence between documents in the initial request and statuses in the response – even if the schema doesn't use uniqueKey...

          <lst name="responseHeader">
            <int name="numErrors">10</int>
            <lst name="results">
              <!-- if schema has uniqueKeys, they are the names of the response -->
              <lst name="42" /> <!-- success so empty -->
              <lst name="1"> <!-- 2nd doc in update, with uniqueKey of 1 had this failure -->
                <str name="message">ERROR: [doc=1] Error adding field 'weight'='b' msg=For input string: "b"</str>
              </lst>
              <lst name="60" /> <!-- success so empty -->
              <lst name="3"> <!-- 4th doc in update, with uniqueKey of 3 had this failure -->
                <str name="message">ERROR: [doc=3] Error adding field 'weight'='b' msg=For input string: "b"</str>
              </lst>
          ...
            <int name="status">0</int>
            <int name="QTime">17</int>
          </lst>
          
          

          ?

          Yonik Seeley added a comment -

          even if the schema doesn't use uniqueKey...

          That would lead to some huge responses. I think instead the notion of not having a uniqueKey should essentially be deprecated.

          Shalin Shekhar Mangar added a comment -

          That would lead to some huge responses. I think instead the notion of not having a uniqueKey should essentially be deprecated.

          +1

          Tomás Fernández Löbbe added a comment -

          I think instead the notion of not having a uniqueKey should essentially be deprecated.

          +1

          That would lead to some huge responses.

          Do you mean including the ids of good docs in the response too? I don't think that would be that big. It should be much smaller than the request.

          Yonik Seeley added a comment -

          Do you mean including the ids of good docs in the response too? I don't think that would be that big. Should be much smaller than the request

          Some people (including myself) send/load millions of docs per request - it's very unfriendly to get back megabytes of responses unless you explicitly ask.
          If this processor is not in the default chain, then I guess it doesn't matter much. But I could see adding this ability by default (regardless of if it's a separate processor or not) via a parameter like maxErrors or something.

          Tomás Fernández Löbbe added a comment -

          I see. Maybe I could then add just a "numSucceed" count as confirmation that the rest of the docs made it in?

          Tomás Fernández Löbbe added a comment -

          added "numAdds" to header

          Tomás Fernández Löbbe added a comment -

          Any more thoughts on this patch?

          Hoss Man added a comment -

          Tomas:

          • we need better class-level javadocs for the TolerantUpdateProcessorFactory - basically everything currently in the TolerantUpdateProcessor's javadocs, plus some example configuration, plus a note about how "maxErrors" can be specified as a request param or as an init param, and an explanation of the default behavior if "maxErrors" isn't specified at all
          • would you mind renaming "tolerant-chain1" and "tolerant-chain2" with more descriptive names to make the tests easier to read? perhaps "tolerate-10-failures-chain" and "tolerate-unlimited-failures-chain" ?
          • even if "maxErrors" isn't reached, we should consider carefully whether or not it makes sense to be returning a "200" status code even if every update command that's executed for a request fails. (i.e., if maxErrors defaults to Integer.MAX_VALUE and I send 100 docs and all 100 fail, should I really get a 200 status code back?)
          Uwe Schindler added a comment -

          Move issue to Solr 4.9.


            People

            • Assignee: Unassigned
            • Reporter: Will Johnson
            • Votes: 3
            • Watchers: 12