
Key: NUTCH-335
Type: Bug
Status: Closed
Resolution: Won't Fix
Priority: Major
Assignee: Unassigned
Reporter: Siddharudh nadgeri
Votes: 0
Watchers: 0
Nutch

Pdf summary corrupt issue

Created: 31/Jul/06 02:35 PM   Updated: 09/Oct/09 03:47 PM
Component/s: None
Affects Version/s: None
Fix Version/s: None

Time Tracking:
Not Specified

Environment: As it is a web application, it is not necessary
Issue Links:
Cloners
 
Reference
 

Resolution Date: 09/Oct/09 03:47 PM


Description
I am using the Nutch search, but for PDF files it gives garbage as the summary, like this:

"!Unable to render embedded object: File ("#"#"#"#"#"#"#") not found.$%$%$#&##'$$$$$$$$$$$$$$$$$$ ("$$$$$$$$$$$$$$$$$$$

Please provide a solution.



Stefan Neufeind added a comment - 31/Jul/06 02:44 PM
The problem is that in most cases I've come across, the PDF is protected and does not allow text extraction. Though this could theoretically be worked around, it's not really allowed afaik.

Also there is an issue for this already:
http://issues.apache.org/jira/browse/NUTCH-290

The problem that still needs to be worked around imho is that no text should be shown instead - and I'd wish for a clarification of why the raw binary data is currently taken as the summary.

PS: Next time please search before opening a new issue. (Meant just as information, not to make anybody angry ...)


Siddharudh nadgeri added a comment - 01/Aug/06 01:15 PM
I searched before, but the solution was not there, only the alternative you have given. If you know of any PDF properties that have to be changed, please let me know.

Thanks


Stefan Neufeind added a comment - 07/Sep/06 10:50 PM
I think NUTCH-338 will fix this problem, since the "garbage" seems to come from the text-extractor.

Andrzej Bialecki added a comment - 09/Oct/09 03:47 PM
The most common cause of this is a PDF that uses subset fonts. Short of OCR there is nothing we can do to recover the text.
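
For illustration only, the suggestion earlier in this thread (show no text rather than garbage) could be sketched as a simple readability check on the extracted text. This is a minimal sketch, not part of Nutch: the class name `SummaryGuard`, its methods, and the 0.7 threshold are all hypothetical, and the heuristic only counts how many characters are letters, digits, or whitespace.

```java
// Hypothetical helper, not a Nutch class: suppresses summaries that look like garbage.
public final class SummaryGuard {
    // Assumed threshold: minimum share of letters, digits and whitespace.
    private static final double MIN_READABLE_RATIO = 0.7;

    private SummaryGuard() {}

    /** Returns true if enough of the characters are letters, digits or whitespace. */
    public static boolean looksLikeText(String extracted) {
        if (extracted == null || extracted.trim().isEmpty()) {
            return false;
        }
        int readable = 0;
        int total = 0;
        // Iterate by code point so supplementary characters are counted once.
        for (int i = 0; i < extracted.length(); ) {
            int cp = extracted.codePointAt(i);
            if (Character.isLetterOrDigit(cp) || Character.isWhitespace(cp)) {
                readable++;
            }
            total++;
            i += Character.charCount(cp);
        }
        return (double) readable / total >= MIN_READABLE_RATIO;
    }

    /** Returns the text unchanged if it looks readable, otherwise an empty summary. */
    public static String safeSummary(String extracted) {
        return looksLikeText(extracted) ? extracted : "";
    }
}
```

A check like this would only catch garbage made up mostly of symbols, as in the reported summary. If a subset font maps glyphs to the wrong letters, the output can still consist of letters and would pass this test, so it does not recover or validate the actual text.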