Tika / TIKA-850

Consistent way to supply document passwords to parsers

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.0
    • Fix Version/s: 1.1
    • Component/s: parser
    • Labels:
      None

      Description

      Currently, PDF document passwords are supplied to the parser via a special key on the Metadata object, while the OfficeParser has a TODO and only supports the default password.

      We should update all the parsers that support encrypted documents (currently PDF, Office OLE2 and Office OOXML) to receive the password in a consistent way.

          Activity

          Nick Burch added a comment -

          Does anyone have a view on whether the password should be passed in on the Metadata object (as the PDF parser currently supports), or on the ParseContext (as other parser options are)?

          Nick Burch added a comment -

          Currently, the objects set on the ParseContext are:

          • Detector.class
          • DocumentSelector.class
          • EmbeddedDocumentExtractor.class
          • Locale.class
          • MimeConfig.class
          • Parser.class

          The ones set on the Metadata for use by parsers are:

          • RESOURCE_NAME_KEY (resourceName)
          • CONTENT_TYPE (Content-Type)
          • PASSWORD (org.apache.pdfbox.tika.password), PDF only
          • TIKA_MIME_FILE (tika.mime.file)
          • MIME_TYPE_MAGIC (mime.type.magic)
          Nick Burch added a comment -

          Based on this, I think the best option may be to add a new interface, called something like PasswordProvider, which is set on the ParseContext.

          PasswordProvider would have a single method, 'String getPassword(Metadata)', which would potentially allow you to look up the password based on the resource name and content type.

          We'd probably want a single implementation out of the box, which takes a String in its constructor and always returns that as the password, to make it easy to invoke parsing when you already know the password for your file.

          Thoughts? Better names? Alternate ways to do it?
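
          As a rough illustration of the proposal above, the interface and the two kinds of provider it mentions might look something like the sketch below. This is not the code committed to Tika: the Metadata class here is a simplified stand-in so the example compiles on its own, and the names PasswordProviderSketch, FixedPasswordProvider and ByResourceNamePasswordProvider are hypothetical.

          ```java
          import java.util.HashMap;
          import java.util.Map;

          /** Sketch only; Metadata below is a simplified stand-in, not Tika's class. */
          final class PasswordProviderSketch {

              /** Stand-in for Tika's Metadata: a plain string-to-string map. */
              static final class Metadata {
                  static final String RESOURCE_NAME_KEY = "resourceName";
                  private final Map<String, String> values = new HashMap<>();

                  void set(String name, String value) { values.put(name, value); }
                  String get(String name) { return values.get(name); }
              }

              /** The single-method interface described in the comment. */
              interface PasswordProvider {
                  String getPassword(Metadata metadata);
              }

              /** Always returns the password passed to the constructor (hypothetical name). */
              static final class FixedPasswordProvider implements PasswordProvider {
                  private final String password;

                  FixedPasswordProvider(String password) { this.password = password; }

                  @Override
                  public String getPassword(Metadata metadata) { return password; }
              }

              /** Looks the password up by resource name; returns null when none is known. */
              static final class ByResourceNamePasswordProvider implements PasswordProvider {
                  private final Map<String, String> passwordsByName;

                  ByResourceNamePasswordProvider(Map<String, String> passwordsByName) {
                      this.passwordsByName = new HashMap<>(passwordsByName);
                  }

                  @Override
                  public String getPassword(Metadata metadata) {
                      return passwordsByName.get(metadata.get(Metadata.RESOURCE_NAME_KEY));
                  }
              }
          }
          ```

          Passing the Metadata into getPassword is what makes the second variant possible: a caller handling many files can register one provider and let it pick the password per document.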

          Nick Burch added a comment -

          PasswordProvider added in r1238616, based on the above description.

          The PDFParser has also been updated to use it in preference to the metadata key. Assuming there are no changes suggested in the next few days, I'll roll it out to the POI-based parsers too.
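
          The "in preference to the metadata key" ordering can be sketched as below. This is an illustration, not the actual PDFParser code: ParseContext is a simplified stand-in keyed by class, Metadata is reduced to a plain Map, and the resolvePassword helper is hypothetical. Falling back to the metadata key when the provider returns null is also an assumption.

          ```java
          import java.util.HashMap;
          import java.util.Map;

          /** Sketch only; the types here are simplified stand-ins, not Tika's classes. */
          final class PasswordLookupSketch {

              /** Metadata key listed earlier in this issue as the PDF-only password key. */
              static final String LEGACY_PDF_PASSWORD_KEY = "org.apache.pdfbox.tika.password";

              /** Same single-method shape as the proposal, with Metadata reduced to a Map. */
              interface PasswordProvider {
                  String getPassword(Map<String, String> metadata);
              }

              /** Stand-in for ParseContext: one object per class key. */
              static final class ParseContext {
                  private final Map<Class<?>, Object> entries = new HashMap<>();

                  <T> void set(Class<T> key, T value) { entries.put(key, value); }

                  <T> T get(Class<T> key) { return key.cast(entries.get(key)); }
              }

              /** Hypothetical helper: provider on the context first, then the metadata key. */
              static String resolvePassword(Map<String, String> metadata, ParseContext context) {
                  PasswordProvider provider = context.get(PasswordProvider.class);
                  if (provider != null) {
                      String password = provider.getPassword(metadata);
                      if (password != null) {
                          return password;
                      }
                  }
                  // Assumed fallback: the older metadata key, or null if neither is set.
                  return metadata.get(LEGACY_PDF_PASSWORD_KEY);
              }
          }
          ```

          Keeping the metadata key as a fallback means existing callers that only set the key would continue to work unchanged.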

          Nick Burch added a comment -

          I've updated OfficeParser in r1244933 to use the same pattern as PDFParser, with PasswordProvider.

          I believe these are the only two parsers that currently support password-protected files, so I think this is now finished. (Well, until we add more file formats!)


            People

            • Assignee:
              Unassigned
              Reporter:
              Nick Burch
            • Votes:
              0
              Watchers:
              0