Solr
  1. Solr
  2. SOLR-2512

uima: add an ability to skip runtime error in AnalysisEngine

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 3.1
    • Fix Version/s: 3.2, 4.0-ALPHA
    • Component/s: None
    • Labels:
      None

      Description

      Currently, if AnalysisEngine throws an exception during processing a text, whole adding docs go fail. Because online NLP services are error-prone, users should be able to choose whether solr skips the text processing (but source text can be indexed) for the document or throws a runtime exception so that solr can stop adding documents entirely.

      1. SOLR-2512.patch
        18 kB
        Koji Sekiguchi
      2. SOLR-2512.patch
        17 kB
        Koji Sekiguchi
      3. SOLR-2512.patch
        15 kB
        Koji Sekiguchi
      4. SOLR-2512.patch
        4 kB
        Tommaso Teofili
      5. SOLR-2512.patch
        1 kB
        Koji Sekiguchi

        Activity

        Hide
        Koji Sekiguchi added a comment -

        A draft patch attached. It doesn't include the switch.

        Show
        Koji Sekiguchi added a comment - A draft patch attached. It doesn't include the switch.
        Hide
        Tommaso Teofili added a comment - - edited

        Hello Koji, in the first implementations (see SOLR-2129) the UIMAUpdateProcessor was ignoring errors on UIMA pipelines but me and others thought it was good to take control of what was happening if any exception was thrown rather than ignoring it; however I get your point and my opinion is that that behavior should be configurable with a parameter like <bool name="ignoreErrors">true</bool>.

        Show
        Tommaso Teofili added a comment - - edited Hello Koji, in the first implementations (see SOLR-2129 ) the UIMAUpdateProcessor was ignoring errors on UIMA pipelines but me and others thought it was good to take control of what was happening if any exception was thrown rather than ignoring it; however I get your point and my opinion is that that behavior should be configurable with a parameter like <bool name="ignoreErrors">true</bool>.
        Hide
        Tommaso Teofili added a comment -

        first patch which adds a configuration parameter <bool name="ignoreErrors"> to decide if errors in UIMA pipeliness execution should avoid document indexing or not

        Show
        Tommaso Teofili added a comment - first patch which adds a configuration parameter <bool name="ignoreErrors"> to decide if errors in UIMA pipeliness execution should avoid document indexing or not
        Hide
        Koji Sekiguchi added a comment -

        Hi Tommaso, thank you for updating the patch!

        In my patch, I try to log the first 100 chars of the target text in the error message because an online NLP service I'm using is error-prone when I post a large text. But you are using SolrInputDocument in the updated patch. I'd like my method rather than logging whole solr document.

        I think that users who set ignoreErrors=true want to know the fact that the error occurs, but don't want to see whole document in the error message.

        Show
        Koji Sekiguchi added a comment - Hi Tommaso, thank you for updating the patch! In my patch, I try to log the first 100 chars of the target text in the error message because an online NLP service I'm using is error-prone when I post a large text. But you are using SolrInputDocument in the updated patch. I'd like my method rather than logging whole solr document. I think that users who set ignoreErrors=true want to know the fact that the error occurs, but don't want to see whole document in the error message.
        Hide
        Tommaso Teofili added a comment -

        I think that users who set ignoreErrors=true want to know the fact that the error occurs, but don't want to see whole document in the error message.

        You're right Koji. Considering your comment I am wondering if it may be better to get the uniqueid so that one can easily debug the document which caused that error from that without having to see the text in the log.

        Show
        Tommaso Teofili added a comment - I think that users who set ignoreErrors=true want to know the fact that the error occurs, but don't want to see whole document in the error message. You're right Koji. Considering your comment I am wondering if it may be better to get the uniqueid so that one can easily debug the document which caused that error from that without having to see the text in the log.
        Hide
        Koji Sekiguchi added a comment -

        A new patch. I added a test case for the flag true|false.

        About the logging uniqueKey, yeah I could get the uniqueKey, but it cannot be taken from cmd without schema. So I understood the idea in your patch.

        Show
        Koji Sekiguchi added a comment - A new patch. I added a test case for the flag true|false. About the logging uniqueKey, yeah I could get the uniqueKey, but it cannot be taken from cmd without schema. So I understood the idea in your patch.
        Hide
        Koji Sekiguchi added a comment -

        If no one objects, I'll commit tomorrow.

        Show
        Koji Sekiguchi added a comment - If no one objects, I'll commit tomorrow.
        Hide
        Tommaso Teofili added a comment -

        One more thing I'd change is using StringBuilder with append() instead of String concatenation ("some string" + "another string") inside the catch block of UIMAUpdateRequestProcessor.processAdd() method (I did so in my patch) since it's more efficient.

        Still I'm not sure logging the first 100 chars of text is a good idea but you're right that we should maintain the schema information to know what field is the uniquekey and this would put unnecessary coupling between the two classes.

        Show
        Tommaso Teofili added a comment - One more thing I'd change is using StringBuilder with append() instead of String concatenation ("some string" + "another string") inside the catch block of UIMAUpdateRequestProcessor.processAdd() method (I did so in my patch) since it's more efficient. Still I'm not sure logging the first 100 chars of text is a good idea but you're right that we should maintain the schema information to know what field is the uniquekey and this would put unnecessary coupling between the two classes.
        Hide
        Koji Sekiguchi added a comment -

        Updated patch attached. I understand the requirement for logging uniqueKey. In this patch, I introduced optional parameter logField:

        <bool name="ignoreErrors">true</bool>
        <!-- This is optional. It is used for logging when text processing fails. Usually, set uniqueKey field name -->
        <str name="logField">id</str>
        

        It is effective regardless of ignoreErrors setting (see the patch).

        Show
        Koji Sekiguchi added a comment - Updated patch attached. I understand the requirement for logging uniqueKey. In this patch, I introduced optional parameter logField: <bool name= "ignoreErrors" > true </bool> <!-- This is optional. It is used for logging when text processing fails. Usually, set uniqueKey field name --> <str name= "logField" >id</str> It is effective regardless of ignoreErrors setting (see the patch).
        Hide
        Koji Sekiguchi added a comment -

        README.txt updated.

        Show
        Koji Sekiguchi added a comment - README.txt updated.
        Hide
        Koji Sekiguchi added a comment -

        I'll commit soon.

        Show
        Koji Sekiguchi added a comment - I'll commit soon.
        Hide
        Tommaso Teofili added a comment -

        +1

        Show
        Tommaso Teofili added a comment - +1
        Hide
        Koji Sekiguchi added a comment -

        trunk: Committed revision 1102785.
        3x: Committed revision 1102789.

        Show
        Koji Sekiguchi added a comment - trunk: Committed revision 1102785. 3x: Committed revision 1102789.
        Hide
        Robert Muir added a comment -

        Bulk close for 3.2

        Show
        Robert Muir added a comment - Bulk close for 3.2

          People

          • Assignee:
            Koji Sekiguchi
            Reporter:
            Koji Sekiguchi
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development