SOLR-2416: Solr Cell fails to index Zip file contents

    Details

      Description

      Working with the latest Solr trunk code, it seems that the Tika handlers for Solr Cell (ExtractingDocumentLoader.java) and the Data Import Handler (TikaEntityProcessor.java) again fail to index the contents of zip files.
      Only the file names are indexed.
      This issue was addressed late last year, but it seems to have reappeared with the latest code.

      The Data Import Handler part, with a patch and a test case, is tracked in https://issues.apache.org/jira/browse/SOLR-2332

          Activity

          Jayendra Patil added a comment -

          Fix attached.

          Hoss Man added a comment -

          I'm not sure exactly what Jayendra is referring to by "was addressed some time back ... seems to have reappeared" (I couldn't find any issues that looked similar), but I just tested and confirmed that in 1.4.1 SolrCell only indexed the metadata about *.zip files, not the contents of the zip.

          The behavior in the 3.1rc1 Solr release candidate is consistent with 1.4.1: only info about the zip file itself is extracted, not the contents (although in 3.1 we actually extract more metadata than we did in 1.4.1), so this definitely isn't a 3.1 blocker (some people were wondering on IRC).

          I'm personally not even clear whether this is really a bug, or whether it should be driven by a request option. Perhaps some users only want the data about the zip file, not its contents. And what should the behavior be if the zip file contains multiple files and the request specifies a literal id?

          Jayendra Patil added a comment -

          This issue existed in Solr 1.4 packaged with Tika 0.4, which prevented us from using the stable version.

          Thread - http://lucene.472066.n3.nabble.com/Issue-Indexing-zip-file-content-in-Solr-1-4-td504914.html
          The issue was resolved with the Tika 0.5 upgrade @ https://issues.apache.org/jira/browse/SOLR-1567

          We are working on a Snapshot of Solr Trunk 4.X marked around 15 August 2010, which uses the Tika 0.8 snapshot jars, and the extraction works fine for us.
          However, with the latest Trunk upgraded to the stable release of Tika 0.8, it does not have the same behaviour.

          Robert Muir added a comment -

          Bulk move 3.2 -> 3.3

          Robert Muir added a comment -

          3.4 -> 3.5

          Jan Høydahl added a comment -

          If we add this, the behavior should probably be parameter driven. Some questions arise:
          a) What to do with metadata? Should metadata for all files in the ZIP be added to the document? What's Tika's default?
          b) How do you present the title of such a document consisting of multiple docs from a ZIP? Each individual document has its own title metadata...
          c) Do you always want to traverse all files in the ZIP, or only some types?
          d) What do you do when a ZIP contains another ZIP?

          All in all, perhaps this isn't such a useful feature after all?

          Jan Høydahl added a comment -

          Unless I get good answers to the questions above, I'll close this as "Not a problem"

          Jayendra Patil added a comment -

          Tika parses the zip file and extracts the complete content of the files inside it as well.
          It parses all the files in the zip, including any zip nested inside the zip.
          The metadata is that of the zip file rather than of the individual files.

          No special handling would be required on the Solr side.
          The metadata for the zip and its contents would be indexed as well.

          Also, Solr doesn't allow attaching multiple files to a single document.
          A zip is a nice way of associating a document with multiple files.

          And the current behavior, indexing a zip with just the file names, doesn't have much value.
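
          To make the difference concrete, here is a minimal sketch that uses only java.util.zip from the JDK. It is not Tika and not the attached patch. The class and method names (ZipContentSketch, zip, entryNames, extractText) are hypothetical, entries are assumed to be UTF-8 text, and nested archives are recognized only by a ".zip" suffix. entryNames mirrors the names-only behavior described in this issue; extractText descends into nested zips.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class ZipContentSketch {

    /** Helper: builds an in-memory zip from alternating name/content pairs.
     *  Content may be a String (stored as UTF-8) or a byte[] (e.g. a nested zip). */
    static byte[] zip(Object... nameThenContent) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(bytes)) {
            for (int i = 0; i < nameThenContent.length; i += 2) {
                out.putNextEntry(new ZipEntry((String) nameThenContent[i]));
                Object content = nameThenContent[i + 1];
                out.write(content instanceof byte[]
                        ? (byte[]) content
                        : ((String) content).getBytes(StandardCharsets.UTF_8));
                out.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Names only: collects the top-level entry names and ignores the content. */
    static List<String> entryNames(byte[] zipBytes) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            for (ZipEntry e; (e = in.getNextEntry()) != null; ) {
                names.add(e.getName());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    /** Recursive: appends the text of every entry, descending into nested zips. */
    static void extractText(byte[] zipBytes, StringBuilder text) {
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            for (ZipEntry e; (e = in.getNextEntry()) != null; ) {
                byte[] data = in.readAllBytes(); // stops at the end of the current entry
                if (e.getName().endsWith(".zip")) {
                    extractText(data, text);     // zip inside zip
                } else {
                    text.append(new String(data, StandardCharsets.UTF_8)).append('\n');
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] inner = zip("b.txt", "beta");
        byte[] outer = zip("a.txt", "alpha", "inner.zip", inner);
        System.out.println(entryNames(outer));
        StringBuilder text = new StringBuilder();
        extractText(outer, text);
        System.out.print(text);
    }
}
```

          A real extractor would detect the content type rather than trust the file suffix, and would need to decide how to merge per-entry metadata, which is the question raised earlier in this thread.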

          Jan Høydahl added a comment -

          I see. Perhaps we should make "recursive parsing" a config option, so people can choose?

          Also, according to http://wiki.apache.org/tika/RecursiveMetadata, the parser passed to the context is the parser used to parse inner files. Your patch assumes that it is always AutoDetectParser, but if someone passes stream.type=application/zip, you'll be lost. So perhaps a better way is to create a new AutoDetectParser to pass to the context.

          Would you like to attempt a new patch with this fix as well as controlling it via a config parameter, e.g. recurseContainers=true?

          Please also add a JUnit test case to the patch to verify the fix.
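
          The suggestion above can be modeled with a toy sketch. It assumes, per the wiki page cited, that embedded documents are handed to whatever parser is stored in the parse context. None of the types below are Tika's or Solr's: Doc, Parser, ContainerParser, and extract are hypothetical stand-ins, a plain Map stands in for the parse context, and recurseContainers is the parameter name proposed in this comment, not an existing Solr option.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class RecurseContainersSketch {

    /** Toy document: its own text plus embedded documents (think zip entries). */
    static final class Doc {
        final String name;
        final String text;
        final List<Doc> embedded;

        Doc(String name, String text, List<Doc> embedded) {
            this.name = name;
            this.text = text;
            this.embedded = embedded;
        }
    }

    /** Hypothetical stand-in for a parser interface; this is not Tika's API. */
    interface Parser {
        void parse(Doc doc, List<String> out, Map<Class<?>, Object> context);
    }

    /** Emits the document's own text, then hands each embedded document to the
     *  parser found in the context. Without one, only the names are emitted. */
    static final class ContainerParser implements Parser {
        @Override
        public void parse(Doc doc, List<String> out, Map<Class<?>, Object> context) {
            if (!doc.text.isEmpty()) {
                out.add(doc.text);
            }
            Parser embeddedParser = (Parser) context.get(Parser.class);
            for (Doc child : doc.embedded) {
                if (embeddedParser == null) {
                    out.add(child.name);                       // names only
                } else {
                    embeddedParser.parse(child, out, context); // recurse
                }
            }
        }
    }

    /** Extracts a document; recurseContainers decides whether the context
     *  carries a parser for embedded documents. */
    static List<String> extract(Doc doc, boolean recurseContainers) {
        Map<Class<?>, Object> context = new HashMap<>();
        if (recurseContainers) {
            // A fresh parser for embedded documents, independent of whichever
            // parser was selected for the outer stream.
            context.put(Parser.class, new ContainerParser());
        }
        List<String> out = new ArrayList<>();
        new ContainerParser().parse(doc, out, context);
        return out;
    }

    public static void main(String[] args) {
        Doc inner = new Doc("inner.zip", "", List.of(new Doc("b.txt", "beta", List.of())));
        Doc outer = new Doc("outer.zip", "", List.of(new Doc("a.txt", "alpha", List.of()), inner));
        System.out.println(extract(outer, false));
        System.out.println(extract(outer, true));
    }
}
```

          Putting a separately constructed parser into the context, rather than reusing the outer one, is the point of the sketch: the parser chosen for the outer stream may be restricted to one type, while the embedded files can be of any type.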

          Jayendra Patil added a comment -

          Sure, I will try to look into this.

          Hoss Man added a comment -

          Bulk changing fixVersion 3.6 to 4.0 for any open issues that are unassigned and have not been updated since March 19.

          Email spam suppressed for this bulk edit; search for hoss20120323nofix36 to identify all issues edited

          Hoss Man added a comment -

          Bulk fixing the version info for 4.0-ALPHA and 4.0; all affected issues have "hoss20120711-bulk-40-change" in a comment.

          Jan Høydahl added a comment -

          Moving to 5.0. If anyone wants this in as an option for Solr4, we welcome patches

          Maciej Lizewski added a comment - - edited

          I think this is a really needed feature. Also, in earlier versions of Solr it worked differently than it does now: grepping the code of org.apache.solr.handler.extraction.ExtractingDocumentLoader from version 1.4.0.1 shows that no context was created; instead, autoDetectParser::parse was called with 3 parameters (without a context), which caused a context to be created automatically with Parser=autoDetectParser...
          This is a backward compatibility violation introduced when PasswordProvider was added. Also, a comment in the current code suggests that someone was not sure about the consequences of such a change: "TODO: should we design a way to pass in parse context?"

          The patch is already attached, as I see...

          Jan Høydahl added a comment -

          As far as I can see, this behaviour has been consistent over several years and versions, and is seemingly unrelated to the PasswordProvider change for v4.0. Thus there are probably more Solr users expecting today's behavior than the pre-1.4.1 one.

          As with open source in general, features are added in response to real-world needs, by contributors who want to help. If you need this feature for your company, the first thing to do would be to test the attached patch, add a configuration param for enabling/disabling it, add JUnit tests, and work step by step towards a mature patch.


            People

            • Assignee: Unassigned
            • Reporter: Jayendra Patil
            • Votes: 4
            • Watchers: 9