I searched before, but the solution was not there. Is there an alternative to what you have suggested? If you know of any PDF properties that have to be changed, please let me know.
Thanks I think
The most common cause of this is a PDF that uses subset fonts. Short of OCR, there is nothing we can do to recover the text.
Also there is an issue for this already:
http://issues.apache.org/jira/browse/NUTCH-290
The problem that still needs to be worked around, IMHO, is that no text should be shown instead, and I'd like a clarification of why the raw binary data is currently taken as the summary.
PS: Next time please search before opening a new issue. (Meant just as information, not to make anybody angry ...)
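On the point about showing no text rather than raw binary data, here is a rough sketch of the kind of check that could be used. It is not Nutch code. The function names `looks_like_text` and `safe_summary` and the 0.9 threshold are assumptions made up for illustration. It only catches control characters and other unprintable bytes. Garbage from subset fonts that happens to be printable would still get through.

```python
def looks_like_text(extracted: str, min_ratio: float = 0.9) -> bool:
    """Return True if most characters are printable or ordinary whitespace.

    min_ratio is an arbitrary, hypothetical threshold, not a tuned value.
    """
    if not extracted:
        return False
    ok = sum(1 for ch in extracted if ch.isprintable() or ch in "\n\r\t")
    return ok / len(extracted) >= min_ratio


def safe_summary(extracted: str, max_len: int = 200) -> str:
    """Return a short summary, or an empty string if the input looks binary."""
    if not looks_like_text(extracted):
        return ""
    # Collapse runs of whitespace so the summary fits on one line.
    return " ".join(extracted.split())[:max_len]
```

With something like this, a document whose extraction failed would get an empty summary instead of a binary one.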