Details
- Bug
- Status: Closed
- Minor
- Resolution: Fixed
- None
- None
Description
[Issue from SourceForge]
http://sourceforge.net/tracker/index.php?func=detail&aid=1912364&group_id=78314&atid=552832
The spaces between words from the attached PDF file are removed upon text
extraction.
I traced the code and found that the cause seems to be a "division by 0"
bug in PDCIDFont.java.
In PDCIDFont.getAverageFontWidth(), the following lookup returns null:
COSArray widths = (COSArray)font.getDictionaryObject( COSName.getPDFName( "W" ) );
Because widths is null, characterCount is left at 0.
As a result, the following line
float average = totalWidths / characterCount;
evaluates to NaN, which propagates up the method calls and results in the
spaces being removed.
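To illustrate why a NaN average can silently suppress spaces, here is a minimal standalone sketch. It does not use PDFBox: naiveAverage and looksLikeSpace are hypothetical helpers, and the threshold test (a gap wider than a fraction of the average width counts as a space) is only an assumption about how an extractor might use the value. What it does show for certain is that float division by zero in Java throws no exception, and that any comparison against NaN is false.

```java
public class NaNPropagationSketch {

    // Hypothetical helper mirroring the unguarded division from the report.
    // With a float numerator, dividing by an int 0 does not throw; 0f / 0 yields NaN.
    static float naiveAverage(float totalWidths, int characterCount) {
        return totalWidths / characterCount;
    }

    // Hypothetical space test (an assumption, not PDFBox code):
    // a gap counts as a space if it exceeds a fraction of the average width.
    static boolean looksLikeSpace(float gap, float averageWidth) {
        return gap > averageWidth * 0.3f;
    }

    public static void main(String[] args) {
        float average = naiveAverage(0f, 0);

        // The average is NaN rather than an exception or zero.
        System.out.println(Float.isNaN(average));

        // Every ordered comparison with NaN is false, so no gap is ever
        // classified as a space, however wide it is.
        System.out.println(looksLikeSpace(250f, average));
    }
}
```

Under these assumptions, the failure is silent: nothing is thrown, and every gap simply fails the test.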
I suggest the following fix. Instead of:
float average = totalWidths / characterCount;
use:
float average = defaultWidth;
if (characterCount > 0) {
    average = totalWidths / characterCount;
}
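The suggested guard can be tried in isolation with the sketch below. It is a standalone illustration, not the actual PDCIDFont code: the averageWidth method and its parameters are hypothetical stand-ins for the values computed inside getAverageFontWidth(), and the default width of 1000 is only an example value.

```java
public class AverageWidthSketch {

    // Hypothetical stand-in for the guarded computation proposed in the report:
    // fall back to the default width when no character widths were counted.
    static float averageWidth(float totalWidths, int characterCount, float defaultWidth) {
        float average = defaultWidth;
        if (characterCount > 0) {
            average = totalWidths / characterCount;
        }
        return average;
    }

    public static void main(String[] args) {
        float defaultWidth = 1000f; // example value only

        // No widths counted (the "W" array was missing): the default is used.
        System.out.println(averageWidth(0f, 0, defaultWidth));

        // Widths present: the ordinary mean is returned.
        System.out.println(averageWidth(1500f, 3, defaultWidth));
    }
}
```

With the guard in place the method never returns NaN for a zero character count, so callers always receive a usable width.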
[Comment on SourceForge]
Date: 2008-03-12 03:01
Sender: choongyong
Logged In: YES
user_id=2033885
Originator: NO
I realised that I was not logged in when I raised the request.
I am sending this comment so that the developer can contact me.
[Comment on SourceForge]
Date: 2008-03-17 21:50
Sender: nobody
Logged In: NO
I have noticed that there is no space between two words if they are
separated by a new line (or if the second word is on the next line because
the first reaches the right margin).
Could you please correct this?
Attachments
Issue Links
- duplicates PDFBOX-349 Spaces between words ignored in scanned pdf files (Closed)