This patch adds a check that first tests whether text extraction is allowed, and only then tries to extract text (this prevents the above-mentioned exception and a parse failure).
Note: The line ((PDStandardEncryption) encDict).setCanExtractContent(true); is IMHO open to discussion. It only sets a bit on "encrypted" documents. Since I've read in several places that many people set this to "false" for no good reason, I believe we don't really "break encryption" with this line, and so we should try to index as much data as possible. The plugin itself IMHO works fine now: it no longer throws an exception and, if extraction is allowed, outputs text correctly.
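To make the idea of the check concrete, here is a minimal sketch of "ask for permission first, extract only if allowed". The types EncryptionInfo, TextSource and GuardedExtraction are hypothetical stand-ins invented for illustration; they are not the PDFBox or Nutch classes, and the actual patch may be structured differently.

```java
// Hypothetical stand-ins for illustration only; NOT the PDFBox or Nutch types.
interface EncryptionInfo {
    // Models the "may extract content" permission bit of an encrypted document.
    boolean canExtractContent();
}

interface TextSource {
    // Models the text stripper; in the real parser this call can throw.
    String extractText();
}

final class GuardedExtraction {
    private GuardedExtraction() {}

    /**
     * Returns the document text, or "" when the permission bit forbids
     * extraction. A null EncryptionInfo models an unencrypted document.
     */
    static String textOrEmpty(EncryptionInfo enc, TextSource source) {
        if (enc != null && !enc.canExtractContent()) {
            // Skip extraction entirely instead of letting the stripper throw.
            return "";
        }
        return source.extractText();
    }
}
```

With this shape, a document that forbids extraction yields an empty body without an exception, while metadata handling elsewhere in the parser is unaffected.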
However, I still get the "garbage output" from a PDF. Could that be because, when no extraction is allowed (empty parse text returned), the parser still falls back to indexing the raw text? I deleted crawl_parse and parse_* from the segments directory, ran "nutch parse" and reindexed everything, but the raw characters in the search output (summary) remain.
If a parser throws an exception:
Fetcher, 261:

    try {
      parse = this.parseUtil.parse(content);
      parseStatus = parse.getData().getStatus();
    } catch (Exception e) {
      parseStatus = new ParseStatus(e);
    }
    if (!parseStatus.isSuccess()) {
      LOG.warning("Error parsing: " + key + ": " + parseStatus);
      parse = parseStatus.getEmptyParse(getConf());
    }

then we use the empty parse object:

    private ParseData data = null;

    public EmptyParseImpl(ParseStatus status, Configuration conf) {
      data = new ParseData(status, "", new Outlink[0], new Metadata(), new Metadata());
      data.setConf(conf);
    }

    public ParseData getData() {
      return data;
    }

    public String getText() {
      return "";
    }

But if one plugin fails in 0.8-dev, isn't the next one used? I understand that in the default config the text parser would be used as the last-resort fallback.
Also, I'm not sure where the summary text comes from if I use the patch above to prevent the exception but return empty parse data. As far as I understand the code, the next parser is only used if the previous parser returns with an unsuccessful parse status. If the parser throws an exception, that exception is not caught in ParseUtil at all.
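The behaviour described in this comment can be sketched as follows. This is only a model of the commenter's understanding, not the Nutch code: DocParser, Result and ParserChain are hypothetical names invented for the example.

```java
// Hypothetical stand-ins for illustration only; NOT the Nutch classes.
interface DocParser {
    // Returns a result with a status; may also throw a RuntimeException.
    Result parse(byte[] content);
}

final class Result {
    final boolean success;
    final String text;

    Result(boolean success, String text) {
        this.success = success;
        this.text = text;
    }
}

final class ParserChain {
    private ParserChain() {}

    /**
     * Tries the parsers in order and moves on to the next one only after an
     * unsuccessful status. Exceptions are deliberately not caught here.
     */
    static Result parse(java.util.List<DocParser> parsers, byte[] content) {
        Result last = new Result(false, "");
        for (DocParser parser : parsers) {
            last = parser.parse(content); // an exception propagates to the caller
            if (last.success) {
                return last;
            }
        }
        return last;
    }

    /**
     * Models the fetcher snippet: an exception or a failed status both end in
     * an empty parse, so no later parser sees the content after an exception.
     */
    static Result parseOrEmpty(java.util.List<DocParser> parsers, byte[] content) {
        try {
            Result result = parse(parsers, content);
            return result.success ? result : new Result(false, "");
        } catch (RuntimeException e) {
            return new Result(false, "");
        }
    }
}
```

Under these assumptions, a parser that reports a failed status would hand the raw bytes to the next parser in the list, whereas a parser that throws would end the chain with an empty parse.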
So the PDF parser should throw an exception rather than report an unsuccessful status to solve this problem, shouldn't it? But to my understanding of the plugin, it still extracts as much as possible (metadata) from the PDF. So if indexing is not allowed but this is a PDF, then returning empty text as the document body should be fine, shouldn't it? Nothing except a PDF plugin will be able to handle PDF correctly in this case.
Stefan G., can you point out why I see binary data in the summary for a PDF, and whether there is a possible fix for it in the context of this bug? I think
    } catch (Exception e) { // run time exception
      LOG.warning("General exception in PDF parser: " + e.getMessage());
      e.printStackTrace();
      return new ParseStatus(ParseStatus.FAILED,
          "Can't be handled as pdf document. " + e).getEmptyParse(getConf());
    }

The exception is:
060522 001010 General exception in PDF parser: You do not have permission to extract text
java.io.IOException: You do not have permission to extract text
at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:189)
at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:140)
at org.apache.nutch.parse.pdf.PdfParser.getParse(PdfParser.java:120)
at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:77)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:257)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:143)
Could it be that, as a fallback, when the document can't be parsed and no "description" is returned, the document itself is used as the "description" in the search output? If yes: for binary files this seems to lead to problems.