Description
Hi,
If a PDF document is encrypted with an empty password, the PdfParser should decrypt it using the empty password.
This behaviour is implemented with the following code:
if (pdf.isEncrypted())
This code uses a deprecated PDFBox API, and that API appears to have a bug: it throws a ClassCastException inside PDFBox, as the following log shows:
2008-08-07 19:15:56,860 WARN parse.pdf - General exception in PDF parser: org.pdfbox.pdmodel.encryption.PDEncryptionDictionary cannot be cast to org.pdfbox.pdmodel.encryption.PDStandardEncryption
2008-08-07 19:15:56,862 WARN parse.pdf - java.lang.ClassCastException: org.pdfbox.pdmodel.encryption.PDEncryptionDictionary cannot be cast to org.pdfbox.pdmodel.encryption.PDStandardEncryption
2008-08-07 19:15:56,862 WARN parse.pdf - at org.pdfbox.encryption.DocumentEncryption.decryptDocument(DocumentEncryption.java:197)
2008-08-07 19:15:56,862 WARN parse.pdf - at org.apache.nutch.parse.pdf.PdfParser.getParse(PdfParser.java:98)
2008-08-07 19:15:56,862 WARN parse.pdf - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
2008-08-07 19:15:56,862 WARN parse.pdf - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:336)
2008-08-07 19:15:56,862 WARN parse.pdf - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-08-07 19:15:56,874 WARN fetcher.Fetcher - Error parsing: http://www2.culture.gouv.fr/deps/fr/stateurope071.pdf: failed(2,0): Can't be handled as pdf document. java.lang.ClassCastException: org.pdfbox.pdmodel.encryption.PDEncryptionDictionary cannot be cast to org.pdfbox.pdmodel.encryption.PDStandardEncryption
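To illustrate the kind of failure reported in the log above, here is a minimal, self-contained sketch. The classes `EncryptionDictionary`, `StandardEncryption`, and `DowncastDemo` are hypothetical stand-ins, not the real PDFBox classes; they only assume a parent/child relationship like the one the exception message suggests. The sketch shows that a downcast to the subclass is rejected at runtime when the object is actually an instance of the parent type.

```java
// Stand-in types, NOT the real PDFBox classes: they only mirror the
// parent/child relationship suggested by the exception message.
class EncryptionDictionary { }

class StandardEncryption extends EncryptionDictionary { }

class DowncastDemo {
    // Mimics a caller that assumes every dictionary is the subclass.
    static StandardEncryption asStandard(EncryptionDictionary dict) {
        return (StandardEncryption) dict; // compiles, but may fail at runtime
    }

    // Returns true when the downcast is rejected with a ClassCastException.
    static boolean downcastFails(EncryptionDictionary dict) {
        try {
            asStandard(dict);
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // A plain parent instance cannot be cast to the subclass.
        System.out.println(downcastFails(new EncryptionDictionary()));
        // A real subclass instance can.
        System.out.println(downcastFails(new StandardEncryption()));
    }
}
```

If this is indeed what happens inside the deprecated decryption path, no change on the caller's side can make the cast succeed, which would explain why switching to a different API avoids the error.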
With the new security API, this document parses without any error and we can get its content:
if (pdf.isEncrypted())
I have attached a patch that fixes this problem: it works with the above document and gets rid of the deprecated API.
Regards,
–
Guillaume