  1. Nutch
  2. NUTCH-643

ClassCastException in PdfParser on encrypted PDF with empty password

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.0.0
    • Fix Version/s: 1.0.0
    • Component/s: fetcher
    • Labels:
      None
    • Environment:

      This problem affects the current trunk too.

    • Patch Info:
      Patch Available

      Description

      Hi,

      If a PDF document is encrypted with an empty password, the PdfParser should decrypt it using the empty password.

      This behaviour is implemented with the following code:
      if (pdf.isEncrypted()) {
        DocumentEncryption decryptor = new DocumentEncryption(pdf);
        // Just try using the default password and move on
        decryptor.decryptDocument("");
      }

      This code relies on a deprecated API. The deprecated code path in PDFBox also appears to be buggy: it throws a ClassCastException inside PDFBox, as the following log shows:

      2008-08-07 19:15:56,860 WARN parse.pdf - General exception in PDF parser: org.pdfbox.pdmodel.encryption.PDEncryptionDictionary cannot be cast to org.pdfbox.pdmodel.encryption.PDStandardEncryption
      2008-08-07 19:15:56,862 WARN parse.pdf - java.lang.ClassCastException: org.pdfbox.pdmodel.encryption.PDEncryptionDictionary cannot be cast to org.pdfbox.pdmodel.encryption.PDStandardEncryption
      2008-08-07 19:15:56,862 WARN parse.pdf - at org.pdfbox.encryption.DocumentEncryption.decryptDocument(DocumentEncryption.java:197)
      2008-08-07 19:15:56,862 WARN parse.pdf - at org.apache.nutch.parse.pdf.PdfParser.getParse(PdfParser.java:98)
      2008-08-07 19:15:56,862 WARN parse.pdf - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
      2008-08-07 19:15:56,862 WARN parse.pdf - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:336)
      2008-08-07 19:15:56,862 WARN parse.pdf - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
      2008-08-07 19:15:56,874 WARN fetcher.Fetcher - Error parsing: http://www2.culture.gouv.fr/deps/fr/stateurope071.pdf: failed(2,0): Can't be handled as pdf document. java.lang.ClassCastException: org.pdfbox.pdmodel.encryption.PDEncryptionDictionary cannot be cast to org.pdfbox.pdmodel.encryption.PDStandardEncryption
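
To illustrate the kind of failure reported in the log above, here is a minimal sketch of an unchecked downcast. The classes EncryptionDictionary and StandardEncryption are hypothetical stand-ins, not the real PDFBox classes, and the subclass relationship between them is an assumption suggested by the error message rather than something confirmed from the PDFBox source.

```java
// Sketch only: hypothetical stand-ins, NOT the real PDFBox classes.
class DowncastDemo {
    // Mimics code that assumes every encryption dictionary is the standard subclass.
    static StandardEncryption uncheckedDowncast(EncryptionDictionary dict) {
        // Throws ClassCastException when dict is only the base type.
        return (StandardEncryption) dict;
    }

    // Defensive variant: inspect the runtime type before casting.
    static boolean isStandard(EncryptionDictionary dict) {
        return dict instanceof StandardEncryption;
    }

    public static void main(String[] args) {
        EncryptionDictionary base = new EncryptionDictionary();
        try {
            uncheckedDowncast(base);
        } catch (ClassCastException e) {
            System.out.println("Caught ClassCastException: " + e.getMessage());
        }
        System.out.println("Safe to cast? " + isStandard(base));
    }
}

// Plays the role of PDEncryptionDictionary (assumed base type).
class EncryptionDictionary {}

// Plays the role of PDStandardEncryption (assumed subclass).
class StandardEncryption extends EncryptionDictionary {}
```

The cast compiles because the two types are related, but it fails at run time whenever the object is an instance of the base type only, which matches the shape of the exception in the log.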

      Using the new security API, we don't have any error parsing this document and we can get its content:
      if (pdf.isEncrypted()) {
        // Just try using the default password and move on
        pdf.openProtection(new StandardDecryptionMaterial(""));
      }
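
The "try the empty password and move on" control flow can be sketched without PDFBox on the classpath. The ProtectedDocument interface, the parse method and the fake helper below are hypothetical names introduced only for this illustration; the assumption is that opening the protection throws when the password is wrong, so the parser can report a failed parse instead of crashing.

```java
import java.util.Optional;

// Sketch only: ProtectedDocument is a hypothetical interface, not a PDFBox type.
class EmptyPasswordDemo {
    interface ProtectedDocument {
        boolean isEncrypted();
        // Assumed to throw when the password does not match.
        void openProtection(String password) throws Exception;
        String extractText();
    }

    // Try the empty password on encrypted documents; give up quietly on failure.
    static Optional<String> parse(ProtectedDocument doc) {
        try {
            if (doc.isEncrypted()) {
                doc.openProtection("");
            }
            return Optional.of(doc.extractText());
        } catch (Exception e) {
            return Optional.empty();
        }
    }

    // In-memory test double standing in for a real PDF.
    static ProtectedDocument fake(boolean encrypted, String password, String text) {
        return new ProtectedDocument() {
            private boolean open = !encrypted;

            public boolean isEncrypted() {
                return encrypted;
            }

            public void openProtection(String candidate) throws Exception {
                if (!password.equals(candidate)) {
                    throw new Exception("wrong password");
                }
                open = true;
            }

            public String extractText() {
                if (!open) {
                    throw new IllegalStateException("document is still encrypted");
                }
                return text;
            }
        };
    }

    public static void main(String[] args) {
        System.out.println(parse(fake(true, "", "hello")).isPresent());
        System.out.println(parse(fake(true, "secret", "hidden")).isPresent());
    }
}
```

A document protected with a non-empty password simply yields an empty result here; how Nutch should report that case is outside the scope of this sketch.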

      I attached a patch fixing this problem: it works perfectly with the above document and gets rid of the deprecated API.

      Regards,


      Guillaume

        Attachments

        1. parse-pdf-PDFBox_upgrade.diff
          8 kB
          Guillaume Smet

            People

            • Assignee:
              Andrzej Bialecki (ab)
            • Reporter:
              Guillaume Smet (gsmet)
