
Key: NUTCH-290
Type: Bug
Status: Open
Priority: Major
Assignee: Unassigned
Reporter: Stefan Neufeind
Votes: 1
Watchers: 0
Nutch

parse-pdf: Garbage indexed when text-extraction not allowed

Created: 28/May/06 08:34 PM   Updated: 07/Sep/06 10:49 PM
Component/s: indexer
Affects Version/s: 0.8
Fix Version/s: None

Time Tracking:
Not Specified

File Attachments:
NUTCH-290-canExtractContent.patch (text file, 1 kB, licensed for inclusion in ASF works), 2006-05-29 02:57 AM, Stefan Neufeind


Description
It seems that garbage (or undecoded text?) is indexed when text-extraction for a PDF is not allowed.

Example-PDF:
http://www.task-switch.nl/Dutch/articles/Management_en_Architectuur_v3.pdf



Comments
Stefan Neufeind added a comment - 29/May/06 01:29 AM
This catch block fires in the PDF parser:

} catch (Exception e) { // run time exception
  LOG.warning("General exception in PDF parser: " + e.getMessage());
  e.printStackTrace();
  return new ParseStatus(ParseStatus.FAILED,
      "Can't be handled as pdf document. " + e).getEmptyParse(getConf());
}

The exception is:

060522 001010 General exception in PDF parser: You do not have permission to extract text
java.io.IOException: You do not have permission to extract text
at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:189)
at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:140)
at org.apache.nutch.parse.pdf.PdfParser.getParse(PdfParser.java:120)
at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:77)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:257)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:143)

Could it be that, as a fallback, when the document can't be parsed and no "description" is returned, the document itself is used as the "description" in the search output? If so, this seems to cause problems for binary files.


Stefan Neufeind added a comment - 29/May/06 02:57 AM
This patch adds a check to first see if text-extraction is allowed - and only in that case try to extract text (prevents the above mentioned exception and a parse-fail).

Note: The line

((PDStandardEncryption) encDict).setCanExtractContent(true);

is imho up for discussion. It only sets a bit on "encrypted" documents. Since I've read in several places that many people seem to set this to "false" for no good reason, I believe we don't really "break encryption" with this line, and as such we should try to index as much data as possible.
Does anybody have "problems" with this line? If so, maybe it could be a config option that's false by default?
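
The check and the proposed config option can be sketched roughly as follows. This is a minimal, self-contained illustration and not the attached patch: PdfPermissions, PdfExtractionGuardSketch and the ignoreExtractionFlag option are hypothetical names, and the PDFBox calls are deliberately replaced by a plain value class.

```java
import java.util.function.Supplier;

/** Hypothetical stand-in for a PDF's security flags; this is not the PDFBox API. */
final class PdfPermissions {
    final boolean encrypted;
    final boolean canExtractContent;

    PdfPermissions(boolean encrypted, boolean canExtractContent) {
        this.encrypted = encrypted;
        this.canExtractContent = canExtractContent;
    }
}

/** Decides whether the body text of a PDF should be extracted. */
final class PdfExtractionGuardSketch {
    // Hypothetical config option; false means the document's flag is respected.
    private final boolean ignoreExtractionFlag;

    PdfExtractionGuardSketch(boolean ignoreExtractionFlag) {
        this.ignoreExtractionFlag = ignoreExtractionFlag;
    }

    boolean shouldExtractText(PdfPermissions p) {
        // Unencrypted documents carry no restriction to check.
        if (!p.encrypted) {
            return true;
        }
        // Encrypted: extract if the document allows it or the option overrides it.
        return p.canExtractContent || ignoreExtractionFlag;
    }

    /** Returns the extracted text, or an empty body without calling the extractor. */
    String bodyText(PdfPermissions p, Supplier<String> extractor) {
        return shouldExtractText(p) ? extractor.get() : "";
    }

    public static void main(String[] args) {
        PdfPermissions locked = new PdfPermissions(true, false);
        System.out.println(new PdfExtractionGuardSketch(false).shouldExtractText(locked));
        System.out.println(new PdfExtractionGuardSketch(true).shouldExtractText(locked));
    }
}
```

With the option left at its default, a document that forbids extraction yields an empty body and the extractor is never invoked, so the permission exception from the stack trace above would not be triggered.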


Stefan Neufeind added a comment - 30/May/06 03:34 PM
The plugin itself imho works fine now. It no longer throws an exception and, if allowed, outputs text correctly.
However, I still get the "garbage output" from a PDF. Could that be because, when no extraction is allowed (empty parse text returned), the parser still falls back to indexing the raw text?

What I did was delete crawl_parse and parse_* from the segments directory, run "nutch parse" and reindex everything. However, the raw chars in the search output (summary) remain.


Stefan Groschupf added a comment - 02/Jun/06 10:45 PM
If a parser throws an exception:
Fetcher, 261:
try {
  parse = this.parseUtil.parse(content);
  parseStatus = parse.getData().getStatus();
} catch (Exception e) {
  parseStatus = new ParseStatus(e);
}
if (!parseStatus.isSuccess()) {
  LOG.warning("Error parsing: " + key + ": " + parseStatus);
  parse = parseStatus.getEmptyParse(getConf());
}

then we use the empty parse object,
and an empty parse simply contains no text, see getText:
private static class EmptyParseImpl implements Parse {

  private ParseData data = null;

  public EmptyParseImpl(ParseStatus status, Configuration conf) {
    data = new ParseData(status, "", new Outlink[0], new Metadata(), new Metadata());
    data.setConf(conf);
  }

  public ParseData getData() { return data; }

  public String getText() { return ""; }
}
So the problem should be somewhere else.


Stefan Neufeind added a comment - 02/Jun/06 11:12 PM
But if one plugin fails in 0.8-dev, isn't the next one used? As I understand it, in the default config the text parser would be used as the last-resort fallback.

Also I'm not sure where the summary-text comes from if I use the patch above to prevent generating an exception but return empty parse-data.


Stefan Groschupf added a comment - 02/Jun/06 11:31 PM
As far as I understand the code, the next parser is only used if the previous parser returns with an unsuccessful parse status. If the parser throws an exception, that exception is not caught in ParseUtil at all.
So the PDF parser should throw an exception rather than report an unsuccessful status to solve this problem, shouldn't it?
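
The behaviour described in this comment can be modelled with a small sketch. It only illustrates the claim as stated here and has not been checked against the Nutch source: ParserChainSketch, Result and Parser are hypothetical names, and the caller mirrors the Fetcher snippet quoted earlier in this thread.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

final class ParserChainSketch {
    /** Hypothetical parse result: a success flag plus the extracted text. */
    static final class Result {
        final boolean success;
        final String text;

        Result(boolean success, String text) {
            this.success = success;
            this.text = text;
        }
    }

    /** Hypothetical parser plugin interface. */
    interface Parser {
        Result parse(byte[] content) throws Exception;
    }

    /**
     * Tries the parsers in order and moves on only when one returns an
     * unsuccessful result. Exceptions are not caught here, so they propagate.
     */
    static Result parse(List<Parser> parsers, byte[] content) throws Exception {
        Result last = new Result(false, "");
        for (Parser p : parsers) {
            last = p.parse(content);
            if (last.success) {
                return last;
            }
        }
        return last;
    }

    /** Caller modelled on the quoted Fetcher code: any failure becomes empty text. */
    static String textForIndexing(List<Parser> parsers, byte[] content) {
        try {
            Result r = parse(parsers, content);
            return r.success ? r.text : "";
        } catch (Exception e) {
            return "";
        }
    }

    public static void main(String[] args) {
        byte[] raw = "%PDF-1.4 raw bytes".getBytes(StandardCharsets.ISO_8859_1);
        // A PDF parser that reports failure, followed by a plain-text fallback.
        Parser pdfFailed = c -> new Result(false, "");
        Parser text = c -> new Result(true, new String(c, StandardCharsets.ISO_8859_1));
        System.out.println(textForIndexing(Arrays.asList(pdfFailed, text), raw));
    }
}
```

Under these assumptions, a failed status from the PDF parser lets the plain-text fallback hand the raw bytes to the indexer, whereas a thrown exception skips the fallback and ends in empty text.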

Stefan Neufeind added a comment - 02/Jun/06 11:53 PM
But as I understand the plugin, it still extracts as much as possible (metadata) from the PDF. So if text extraction is not allowed but the file is a PDF, then returning empty text as the document body should be fine, shouldn't it? Nothing except a PDF plugin will be able to handle a PDF correctly in this case.

Stefan G., can you point out why I see binary data in the summary for a PDF, and whether there is a possible fix for it in the context of this bug?


Stefan Neufeind added a comment - 07/Sep/06 10:49 PM
I think NUTCH-338 will fix this problem, since the "garbage" seems to come from the text-extractor.