NUTCH-643: ClassCastException in PdfParser on encrypted PDF with empty password

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.0.0
    • Fix Version/s: 1.0.0
    • Component/s: fetcher
    • Labels: None
    • Environment: This problem affects the current trunk too.

    • Patch Info: Patch Available

    Description

      Hi,

      If a PDF document is encrypted with an empty password, the PdfParser should decrypt it using the empty password.

      This behaviour is implemented with the following code:
      if (pdf.isEncrypted()) {
        DocumentEncryption decryptor = new DocumentEncryption(pdf);
        // Just try using the default password and move on
        decryptor.decryptDocument("");
      }

      This code uses a deprecated API. Moreover, there seems to be a bug in this deprecated PDFBox API: it throws a ClassCastException inside PDFBox, as the following log shows:

      2008-08-07 19:15:56,860 WARN parse.pdf - General exception in PDF parser: org.pdfbox.pdmodel.encryption.PDEncryptionDictionary cannot be cast to org.pdfbox.pdmodel.encryption.PDStandardEncryption
      2008-08-07 19:15:56,862 WARN parse.pdf - java.lang.ClassCastException: org.pdfbox.pdmodel.encryption.PDEncryptionDictionary cannot be cast to org.pdfbox.pdmodel.encryption.PDStandardEncryption
      2008-08-07 19:15:56,862 WARN parse.pdf - at org.pdfbox.encryption.DocumentEncryption.decryptDocument(DocumentEncryption.java:197)
      2008-08-07 19:15:56,862 WARN parse.pdf - at org.apache.nutch.parse.pdf.PdfParser.getParse(PdfParser.java:98)
      2008-08-07 19:15:56,862 WARN parse.pdf - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
      2008-08-07 19:15:56,862 WARN parse.pdf - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:336)
      2008-08-07 19:15:56,862 WARN parse.pdf - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
      2008-08-07 19:15:56,874 WARN fetcher.Fetcher - Error parsing: http://www2.culture.gouv.fr/deps/fr/stateurope071.pdf: failed(2,0): Can't be handled as pdf document. java.lang.ClassCastException: org.pdfbox.pdmodel.encryption.PDEncryptionDictionary cannot be cast to org.pdfbox.pdmodel.encryption.PDStandardEncryption
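The ClassCastException in the trace above is the usual result of an unchecked downcast: the object found at runtime is the more general dictionary type, not the specialised type the deprecated code expects. The failure mode can be sketched in isolation as follows. The classes `EncryptionDictionary` and `StandardEncryption` are hypothetical stand-ins, not the PDFBox classes, and the subclass relationship between them is an assumption made only for illustration.

```java
// Hypothetical stand-ins for illustration only; these are NOT the PDFBox classes.
final class CastDemo {
    static class EncryptionDictionary {}
    static class StandardEncryption extends EncryptionDictionary {}

    // Mimics an API that is declared to return the general type.
    static EncryptionDictionary load(boolean standard) {
        return standard ? new StandardEncryption() : new EncryptionDictionary();
    }

    // Unchecked downcast: throws ClassCastException when the runtime
    // type is the base class rather than the subclass.
    static StandardEncryption uncheckedCast(EncryptionDictionary d) {
        return (StandardEncryption) d;
    }

    // Guarded variant: returns null instead of throwing.
    static StandardEncryption guardedCast(EncryptionDictionary d) {
        return (d instanceof StandardEncryption) ? (StandardEncryption) d : null;
    }

    // Reports whether the unchecked downcast fails for the given object.
    static boolean castFails(EncryptionDictionary d) {
        try {
            uncheckedCast(d);
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(castFails(load(false)));
        System.out.println(castFails(load(true)));
        System.out.println(guardedCast(load(false)) == null);
    }
}
```

A guard such as `guardedCast` only avoids the exception. It does not decrypt anything, so it is not a substitute for moving off the deprecated API.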

      Using the new security API, we don't have any error parsing this document and we can get its content:
      if (pdf.isEncrypted()) {
        // Just try using the default password and move on
        pdf.openProtection(new StandardDecryptionMaterial(""));
      }

      I have attached a patch that fixes this problem: it works perfectly with the above document and gets rid of the deprecated API.

      Regards,


      Guillaume

      Attachments

        1. parse-pdf-PDFBox_upgrade.diff
          8 kB
          Guillaume Smet

          People

            Assignee: Andrzej Bialecki (ab)
            Reporter: Guillaume Smet (gsmet)