SOLR-7189: Allow DIH to extract content from embedded documents via Tika

      Description

      DIH's TikaEntityProcessor doesn't currently extract content from embedded documents or attachments within a file. It would be useful to let users configure whether content from embedded documents is extracted.

      Attachments

      1. SOLR-7189.patch (5 kB, Tim Allison)
      2. test_recursive_embedded.docx (26 kB, Tim Allison)

          Activity

          Tim Allison added a comment -

          Patch and test file attached.

          ASF subversion and git services added a comment -

          Commit 1665099 from shalin@apache.org in branch 'dev/trunk'
          [ https://svn.apache.org/r1665099 ]

          SOLR-7189: Allow DIH to extract content from embedded documents via Tika

          ASF subversion and git services added a comment -

          Commit 1665100 from shalin@apache.org in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1665100 ]

          SOLR-7189: Allow DIH to extract content from embedded documents via Tika

          Shalin Shekhar Mangar added a comment -

          Thanks Tim!

          Tim Allison added a comment -

          Thank you, Shalin Shekhar Mangar.

          On a related note, as of Tika 1.7, it's easy to handle embedded documents as individual documents and maintain the embedded documents' metadata via the RecursiveParserWrapper. Until Tika 1.7, the off-the-shelf handlers concatenated the content of embedded documents but didn't maintain the embedded documents' metadata.

          Do you think there would be interest in adding a parameter to DIH to create individual child documents for embedded documents and maintain their metadata? Separate issue, of course.
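
          The difference between the two handling modes described above can be illustrated with a small sketch. It does not use Tika or the RecursiveParserWrapper; the `Doc` class, the function names, and the sample file names are hypothetical stand-ins chosen only to show concatenated output versus one record per embedded document with its own metadata.

          ```python
          from dataclasses import dataclass, field
          from typing import Dict, Iterator, List


          @dataclass
          class Doc:
              """Hypothetical stand-in for a parsed file; not a Tika class."""
              name: str
              text: str
              metadata: Dict[str, str] = field(default_factory=dict)
              embedded: List["Doc"] = field(default_factory=list)


          def concatenate(doc: Doc) -> str:
              """Merge the text of a document and all of its embedded documents.

              The metadata of the embedded documents is lost in this mode.
              """
              parts = [doc.text]
              for child in doc.embedded:
                  parts.append(concatenate(child))
              return "\n".join(parts)


          def flatten(doc: Doc, path: str = "") -> Iterator[dict]:
              """Yield one record per document, depth first, keeping its metadata."""
              current = f"{path}/{doc.name}"
              yield {"path": current, "text": doc.text, **doc.metadata}
              for child in doc.embedded:
                  yield from flatten(child, current)


          # Sample data (made up): a report with one embedded spreadsheet.
          sheet = Doc("budget.xlsx", "Q1 totals", {"author": "B"})
          report = Doc("report.docx", "Summary", {"author": "A"}, [sheet])

          merged = concatenate(report)      # a single string, authors discarded
          records = list(flatten(report))   # two records, each with its own author
          ```

          In the second mode each record could, in principle, become a child document on the Solr side, which is the idea raised in the question above.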

          Shalin Shekhar Mangar added a comment -

          I can imagine uses for it, but personally I don't use much of either Tika or DIH, so I'll defer to your judgement. I'm happy to shepherd any patches, though.

          Tim Allison added a comment -

          Got it. If anyone has an interest, I'll draft a patch, but otherwise this should do for now.

          Thank you, again!

          Alexandre Rafalovitch added a comment -

          I think that if the new functionality allows looking inside zips, for example, a lot of people would be interested. It should be exposed through the inner entity mechanism, so people could start with a list of zip file names, then expand the zips, then process the individual files, and so on.

          But yes, it should be a separate issue. And I would definitely create it, so that people are aware of this new functionality.
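
          The nested flow suggested here (list the zips, expand each one, process each entry) can be sketched with the Python standard library. This is only an illustration of the shape of such a pipeline, not DIH configuration or Tika code; the function names and the throwaway archive are assumptions made for the example.

          ```python
          import os
          import tempfile
          import zipfile
          from typing import Iterable, Iterator, Tuple


          def list_archives(directory: str) -> Iterator[str]:
              """Outer step: yield the paths of the zip files in a directory."""
              for name in sorted(os.listdir(directory)):
                  if name.lower().endswith(".zip"):
                      yield os.path.join(directory, name)


          def expand(archive_path: str) -> Iterator[Tuple[str, bytes]]:
              """Middle step: yield (entry name, raw bytes) for each file in an archive."""
              with zipfile.ZipFile(archive_path) as archive:
                  for info in archive.infolist():
                      if not info.is_dir():
                          yield info.filename, archive.read(info)


          def process(archives: Iterable[str]) -> Iterator[dict]:
              """Inner step: turn every archive entry into its own record."""
              for archive_path in archives:
                  for entry_name, data in expand(archive_path):
                      yield {
                          "archive": os.path.basename(archive_path),
                          "entry": entry_name,
                          "size": len(data),
                      }


          # Build a small throwaway archive so the sketch runs on its own.
          with tempfile.TemporaryDirectory() as tmp:
              with zipfile.ZipFile(os.path.join(tmp, "batch.zip"), "w") as archive:
                  archive.writestr("a.txt", "hello")
                  archive.writestr("b.txt", "embedded")
              rows = list(process(list_archives(tmp)))
          ```

          Each level is a generator feeding the next, which loosely mirrors an outer entity supplying rows to an inner one.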

          Tim Allison added a comment -

          Just opened SOLR-7229. Will need help and input on what the default behavior should be on the Solr side.

          Timothy Potter added a comment -

          Bulk close after 5.1 release


            People

            • Assignee: Shalin Shekhar Mangar
            • Reporter: Tim Allison
            • Votes: 0
            • Watchers: 5
