Solr / SOLR-2480

Text extraction of password protected files

    Details

      Description

      Proposal:
      Some files are password-protected, for example PDFs and Office documents in the 2007 or 97 format.
      These files are posted to Solr using SolrCell.
      Without the password, the files cannot be read,
      so no text can be extracted from them.
      My requirement is that such files be processed normally, without text extraction and without throwing an exception.

      Background:
      Currently, when you post a password-protected file, Solr returns a 500 server error.
      Solr catches the error in ExtractingDocumentLoader and throws TikaException.

      I use ManifoldCF.
      When the Solr server responds with 500, ManifoldCF decides that "this
      document should be retried because I have absolutely no idea what
      happened".
      It then retries the post many times, even though it never has the password.
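To illustrate why a 500 response leads to repeated posting, here is a minimal sketch of a client that retries on server errors. It is not ManifoldCF's actual code: the names `Action`, `classify_response`, and `post_with_retries` are hypothetical, and treating 4xx responses as a permanent rejection is an assumption made for the example.

```python
from enum import Enum
from typing import Callable, Tuple


class Action(Enum):
    ACCEPT = "accept"   # document was indexed
    REJECT = "reject"   # permanent failure, do not retry (assumed for 4xx)
    RETRY = "retry"     # unknown failure, try again later


def classify_response(status: int) -> Action:
    """Map an HTTP status code to a client-side action (illustrative policy)."""
    if 200 <= status < 300:
        return Action.ACCEPT
    if 400 <= status < 500:
        return Action.REJECT
    # 5xx and anything unexpected: the client cannot tell what happened.
    return Action.RETRY


def post_with_retries(send: Callable[[], int], max_attempts: int = 3) -> Tuple[Action, int]:
    """Call send() until it stops asking for a retry or attempts run out.

    send is any callable that posts the document and returns the HTTP status.
    Returns the final action and the number of attempts made.
    """
    for attempt in range(1, max_attempts + 1):
        action = classify_response(send())
        if action is not Action.RETRY:
            return action, attempt
    return Action.RETRY, max_attempts
```

With a server that always answers 500 for a password-protected file, every attempt is spent and the document is never accepted, which is the behavior described above.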

      In another case, my customer posts files with embedded images.
      Sometimes Solr throws a TikaException of unknown cause.
      He wants to post just the metadata without extracting text, but the exception stops him from posting.

      1. password-is-solrcell.docx
        33 kB
        Koji Sekiguchi
      2. SOLR-2480.patch
        25 kB
        Koji Sekiguchi
      3. SOLR-2480.patch
        9 kB
        Koji Sekiguchi
      4. SOLR-2480.patch
        5 kB
        Koji Sekiguchi
      5. SOLR-2480-idea1.patch
        1 kB
        Shinichiro Abe

        Activity

        Shinichiro Abe added a comment -

        Improvement ideas:
        1. TikaException is always ignored, and only the metadata is indexed.
        2. A new parameter "ignoreTikaException" is provided.
        If it is true, Solr returns a 200 response; if it is false, Solr throws TikaException.
        3. If Solr can detect that the underlying exception is an encryption error, it changes the return code depending on the exception.
        If it recognizes poi.EncryptedDocumentException, pdfbox.exceptions.CryptographyException, etc., it returns 200 or another response code; for any other exception it throws TikaException.
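The three ideas above can be compared with a small sketch. This is only an illustration, not Solr code: `ExtractionError` stands in for TikaException, `EncryptedDocumentError` stands in for the encryption-related exceptions, and `index_document` and its policy names are hypothetical.

```python
from typing import Callable, Dict

POLICIES = ("always_ignore", "flag", "by_cause")


class ExtractionError(Exception):
    """Stand-in for TikaException (hypothetical)."""


class EncryptedDocumentError(ExtractionError):
    """Stand-in for encryption-related failures (hypothetical)."""


def index_document(
    metadata: Dict[str, str],
    extract_text: Callable[[], str],
    policy: str,
    ignore_flag: bool = False,
) -> Dict[str, str]:
    """Build the document to index, applying one of the three proposed policies."""
    if policy not in POLICIES:
        raise ValueError(f"unknown policy: {policy}")

    doc = dict(metadata)
    try:
        doc["content"] = extract_text()
    except ExtractionError as err:
        if policy == "always_ignore":
            pass  # idea 1: always index metadata only
        elif policy == "flag" and not ignore_flag:
            raise  # idea 2: the caller did not ask to ignore the error
        elif policy == "by_cause" and not isinstance(err, EncryptedDocumentError):
            raise  # idea 3: only encryption errors are tolerated
    return doc
```

In each tolerated case the returned document simply lacks the `content` field, which corresponds to indexing the metadata without the extracted text.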

        Shinichiro Abe added a comment -

        There is a similar issue:
        https://issues.apache.org/jira/browse/SOLR-445
        If the same policy can be applied here, this issue is a duplicate.

        Koji Sekiguchi added a comment -

        Though I haven't yet read all the comments on SOLR-445, I don't think your requirement is the same.
        According to the description of SOLR-445, the reporter wants Solr to skip the erroneous <doc/> and continue adding the rest of the <doc/> elements in <add>...</add>. But I think you want Solr to skip the content field because tika cannot extract it for some reasons but add meta data fields, right?

        Koji Sekiguchi added a comment -

        BTW, I have a similar issue when using UIMA update processor, as sometimes UIMA annotators fail to extract meta data for some reason (eg Alchemy Web services stop). I'll open a separate ticket for it.

        Shinichiro Abe added a comment -

        But I think you want Solr to skip the content field because tika cannot extract it for some reasons but add meta data fields, right?

        Yes, I want to post the metadata without the content that causes the parse error.
        ExtractingDocumentLoader should also be fixed.
        This patch implements improvement idea (1).
        And I think SOLR-445 can resolve improvement ideas(2).

        Koji Sekiguchi added a comment -

        BTW, I have a similar issue when using UIMA update processor, as sometimes UIMA annotators fail to extract meta data for some reason (eg Alchemy Web services stop). I'll open a separate ticket for it.

        Opened SOLR-2512.

        Koji Sekiguchi added a comment -

        And I think SOLR-445 can resolve improvement ideas(2).

        No. You should consider the difference between this issue and SOLR-445 (see my comment above).

        As I understand it, your requirement as described in the Description is quite similar to SOLR-2512, which has been resolved, so I'll try a patch that adds an ignoreErrors flag for TikaException.

        In SOLR-2512 I added the ability to ignore exceptions when trying to extract metadata from text, i.e. Solr indexes the text but gives up on the metadata. The ignore flag in this ticket, on the other hand, is for giving up on the text but indexing the metadata. It cannot be resolved by SOLR-445.

        Koji Sekiguchi added a comment -

        A patch that introduces the ignoreTikaException flag.
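As a rough illustration of how a client might pass the flag, the sketch below builds a request URL for the extracting handler. The parameter name ignoreTikaException comes from this issue; the base URL, the `/update/extract` handler path, the `literal.id` parameter value, and the helper name `build_extract_url` are assumptions for the example and depend on the local Solr configuration.

```python
from urllib.parse import urlencode


def build_extract_url(base_url: str, doc_id: str, ignore_tika_exception: bool) -> str:
    """Build a URL for posting a file to the extracting handler.

    Assumes the handler is mapped to /update/extract (check solrconfig.xml).
    """
    params = {
        "literal.id": doc_id,  # assumed unique-key field name "id"
        "ignoreTikaException": "true" if ignore_tika_exception else "false",
    }
    return f"{base_url}/update/extract?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical local Solr instance; the file itself would be sent as the POST body.
    print(build_extract_url("http://localhost:8983/solr", "doc1", True))
```

With the flag set to true, the intent described in this issue is that a password-protected file is indexed with its metadata only instead of producing a 500 response.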

        Koji Sekiguchi added a comment -

        Attached the next patch and the password-protected Word file used for the test.

        I added test cases for both ignoreTikaException=true and ignoreTikaException=false.

        I think this is ready to commit.

        Koji Sekiguchi added a comment -

        New patch.

        By convention, the ExtractingRequestHandlerTest class should be in o.a.s.handler.extraction, but curiously it was in o.a.s.handler. I corrected that in this patch.

        Koji Sekiguchi added a comment -

        trunk: Committed revision 1103120.
        3x: Committed revision 1103124.

        Robert Muir added a comment -

        Bulk close for 3.2


          People

          • Assignee:
            Koji Sekiguchi
            Reporter:
            Shinichiro Abe