Details

      Description

      SolrCell should be able to index encrypted files (pdfs, word docs).

      1. SOLR-1929.patch
        16 kB
        Jan Høydahl
      2. SOLR-1929.patch
        1 kB
        Yiannis Pericleous
      3. SOLR-1929-extra-docs.zip
        35 kB
        Jan Høydahl

          Activity

          Hoss Man added a comment -

          Solr Cell isn't something I'm actively involved with, but if the issue is just having a way to pass input metadata to Tika, then perhaps a more general input params setup should be used: meta.* SolrParams that are looped over and added to the Metadata object prior to extraction, perhaps?

          Lance Norskog added a comment -

          Can it be tika.pdf.* ? The Solr parameter namespace needs some careful management.

          Jan Høydahl added a comment -

          Tika 1.1 resolves TIKA-850, which will make it easier to add this feature to Solr Cell.

          Jan Høydahl added a comment -

          Now that we have Tika 1.1 in, we can start exploring a way to add passwords to ERH. The most flexible would perhaps be a way to specify password resolving based on regex, i.e. if the file name matches a regex for a known password you use that; if not, you fall back to a default password.
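
          The regex-based resolution described above could look roughly like the following. This is a minimal sketch using only the Java standard library; the class name PasswordResolverSketch and the method resolve are hypothetical and are not taken from the patch, Solr, or Tika. It assumes rules are checked in insertion order and that a regex must match the whole file name.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch; names are illustrative and not part of Solr or Tika.
class PasswordResolverSketch {

    /**
     * Returns the password of the first rule whose regex matches the whole
     * file name. Falls back to defaultPassword when no rule matches or when
     * no file name is known.
     */
    static String resolve(String fileName, Map<Pattern, String> rules, String defaultPassword) {
        if (fileName != null) {
            for (Map.Entry<Pattern, String> rule : rules.entrySet()) {
                if (rule.getKey().matcher(fileName).matches()) {
                    return rule.getValue();
                }
            }
        }
        return defaultPassword;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, so earlier rules win.
        Map<Pattern, String> rules = new LinkedHashMap<>();
        rules.put(Pattern.compile(".*\\.pdf"), "pdfSecret");
        rules.put(Pattern.compile("report-.*\\.docx"), "docxSecret");

        System.out.println(resolve("my-encrypted-file.pdf", rules, "fallback"));
        System.out.println(resolve("notes.txt", rules, "fallback"));
    }
}
```

          An ordered map is used so that a more specific rule can be listed before a catch-all one; whether the real implementation should behave that way is a design choice, not something stated in this issue.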

          Jan Høydahl added a comment -

          For PDFs there was a possibility of supplying the password in the metadata passed on to Tika (as in the first patch here). However, since TIKA-850 we can supply a PasswordProvider on the context, which provides the password and is future-proof for any document type that supports it.

          Jan Høydahl added a comment -

          Updated patch for trunk which utilizes the new Tika feature in TIKA-850. Contains a RegexRulesPasswordProvider backed by a regex rules file and/or an explicit password.

          New solr cell request params:

          • resource.password - explicit password for this file
          • passwordsFile - name of property file with list of known passwords based on filename regex. Loaded using ResourceLoader

          Note that Tika currently supports passwords for PDF and DOCX files, not legacy DOC files or any other type. I tried to decrypt the existing test file password-is-solrcell.docx but it fails due to an unsupported encryption method in Apache POI.

          In order to apply this patch and have tests pass, you also need to add two binary files by unzipping SOLR-1929-extra-docs.zip in project root.

          Jan Høydahl added a comment -

          Committed to trunk in r1354887. Please try it out on a few of your own files. How to invoke:

          curl "http://localhost:8983/solr/collection1/update/extract?commit=true&literal.id=123&resource.password=mypassword" \
               -H "Content-Type: application/pdf" --data-binary @my-encrypted-file.pdf
          

          or

          curl "http://localhost:8983/solr/collection1/update/extract?commit=true&literal.id=123&passwordsFile=mypass.properties&resource.name=my-encrypted-file.pdf" \
               -H "Content-Type: application/pdf" --data-binary @my-encrypted-file.pdf
          
          # contents of mypass.properties could be:
          .*\.pdf = mySecretPassword
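
          For illustration, a rules file in the "regex = password" shape shown above could be read along the following lines. This is a rough sketch under stated assumptions, not the committed implementation: the class name RulesFileSketch and the method parse are hypothetical, lines starting with # are assumed to be comments, and the first = on a line is assumed to separate regex from password. The sketch reads lines by hand instead of using java.util.Properties, because Properties.load silently drops the backslash in an unrecognized escape such as \. and would therefore change the regex.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; names are illustrative and not part of Solr or Tika.
class RulesFileSketch {

    /**
     * Parses "regex = password" lines into an ordered map of regex to password.
     * Blank lines, lines starting with '#', and lines without a usable '='
     * are skipped. Caveat: a regex that itself contains '=' is not supported,
     * and surrounding whitespace is trimmed from both regex and password.
     */
    static Map<String, String> parse(String contents) {
        Map<String, String> rules = new LinkedHashMap<>();
        for (String rawLine : contents.split("\\R")) {
            String line = rawLine.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            int sep = line.indexOf('=');
            if (sep <= 0) {
                continue; // no '=' at all, or nothing before it
            }
            String regex = line.substring(0, sep).trim();
            String password = line.substring(sep + 1).trim();
            rules.put(regex, password);
        }
        return rules;
    }

    public static void main(String[] args) {
        String contents = "# contents of mypass.properties could be:\n"
                + ".*\\.pdf = mySecretPassword\n";
        // Each entry pairs a file-name regex with the password to try.
        parse(contents).forEach((regex, password) ->
                System.out.println(regex + " -> " + password));
    }
}
```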
          

          It would of course be nice to make the PasswordProvider class pluggable through a class name as well, as we do for the CurrencyFieldType. But this is a first step and probably goes a long way.

          Will keep this open until it has baked for a while in trunk and been committed to 4.x

          Jan Høydahl added a comment -

          Merged back to 4.0 beta in r1357427

          Jan Høydahl added a comment -

          Documentation updated: http://wiki.apache.org/solr/ExtractingRequestHandler#Encrypted_files

          Hoss Man added a comment -

          hoss20120711-manual-post-40alpha-change


            People

            • Assignee: Jan Høydahl
            • Reporter: Yiannis Pericleous
            • Votes: 0
            • Watchers: 1
