Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Won't Fix
Description
In the attached 72083_qdf.pdf file, the following text (the large lettering at the top of the page) is not extracted by PDFTextStripper:
AGGIE NIGHT AT ENRON FIELD FRIDAY, JUNE 15, 2001 at 7:05 HOUSTON ASTROS VS. TEXAS RANGERS
Acrobat Reader does not extract it correctly either, although some other PDF viewers do extract it properly.
I also found a workaround that makes the extraction work; the steps are below.
1. Find this code block in LegacyPDFStreamEngine.java
    if(unicode == null)
    {
        if(!(font instanceof PDSimpleFont))
        {
            return;
        }
        char c = (char)code;
        unicode = new String(new char[]{c});
    }
2. Insert this code block just before the one found in step 1.
    // Needs java.lang.reflect.Method and java.lang.reflect.InvocationTargetException imported.
    if (unicode == null)
    {
        if (font instanceof PDType1CFont)
        {
            // Glyph name the font assigns to this character code.
            String name = ((PDType1CFont) font).codeToName(code);
            try
            {
                // readEncodingFromFont is not public, so it is called via reflection.
                Method method = PDType1CFont.class.getDeclaredMethod("readEncodingFromFont");
                method.setAccessible(true);
                Encoding encoding = (Encoding) method.invoke(font);
                // Map the glyph name back to a code in the font's own encoding.
                Integer newCode = encoding.getNameToCodeMap().get(name);
                if (newCode != null && newCode.intValue() != 0)
                {
                    unicode = new String(new char[]{(char) newCode.byteValue()});
                }
            }
            catch (NoSuchMethodException e)
            {
                e.printStackTrace();
            }
            catch (IllegalAccessException e)
            {
                e.printStackTrace();
            }
            catch (InvocationTargetException e)
            {
                e.printStackTrace();
            }
        }
    }
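The lookup chain in the workaround (character code, then glyph name, then code in the font's own encoding) can be tried in isolation. The following is only a minimal sketch under stated assumptions: GlyphFallbackSketch, fallbackUnicode and the two maps are hypothetical stand-ins rather than PDFBox API, and the table values are invented for illustration. It also masks the code with 0xFF instead of calling byteValue(), because a Java byte is signed and a code above 127 would be sign-extended when cast to char.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the workaround's lookup chain; not PDFBox API.
class GlyphFallbackSketch {

    // Resolves a character code via: code -> glyph name -> code in the font's
    // built-in encoding. Returns null when nothing is found, which corresponds
    // to the "unicode == null" case being left unresolved.
    static String fallbackUnicode(int code,
                                  Map<Integer, String> codeToName,
                                  Map<String, Integer> builtInNameToCode) {
        String name = codeToName.get(code);
        if (name == null) {
            return null;
        }
        Integer newCode = builtInNameToCode.get(name);
        if (newCode == null || newCode == 0) {
            return null;
        }
        // Mask rather than use byteValue(): (char) of a negative byte
        // sign-extends, e.g. 233 would become U+FFE9 instead of U+00E9.
        return String.valueOf((char) (newCode & 0xFF));
    }

    public static void main(String[] args) {
        // Invented tables: a subset font that numbers its glyphs 1, 2, ...
        Map<Integer, String> codeToName = new HashMap<>();
        codeToName.put(1, "A");
        codeToName.put(2, "eacute");

        Map<String, Integer> builtInNameToCode = new HashMap<>();
        builtInNameToCode.put("A", 65);
        builtInNameToCode.put("eacute", 233);

        String first = fallbackUnicode(1, codeToName, builtInNameToCode);
        String second = fallbackUnicode(2, codeToName, builtInNameToCode);
        System.out.println((int) first.charAt(0));   // code point of the first result
        System.out.println((int) second.charAt(0));  // code point of the second result
    }
}
```

As in the workaround itself, the sketch treats the resolved code as the character value directly, which is only meaningful where the font's encoding agrees with Unicode for that code.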