[PDFBOX-77] PDF-Extraction splits words by spaces - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Closed
Priority: Blocker
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.8.0-incubator
Component/s: Text extraction
Labels:
None

Description

[imported from SourceForge]
http://sourceforge.net/tracker/index.php?group_id=78314&atid=552832&aid=1251041
Originally submitted by mkrebs on 2005-08-03 06:14.

I'm currently working on an indexing-service using
jakarta lucene. Momentarily in a first step i have to
index about 80 PDF-Documents (later approximatly
1000 documents). Ca. 60 documents (of the 80) are
extracted by PDFBox without any problems. But for
around 20 documents, PDFBox generates too many
spaces, which means, that words like "hostname" are
extracted to "ho stna me". This happens by nearly all
words contained in these 20 documents.
Because i have to extract page-information i am using
PDFTextStripper with the methods setPageSeparator
(..), setStartPage(..), setEndPage(..) and writeText():

StringWriter s = new StringWriter();
PDDocument pddoc = null;
try {
pddoc = PDDocument.load(f);
int pageCount = pddoc.getPageCount();
PDFTextStripper stripper = new
PDFTextStripper();
stripper.setPageSeparator
(IndexFiles.DOCUMENT_PAGE_SEPARATOR);
stripper.setStartPage(1);
stripper.setEndPage(pageCount);
stripper.writeText(pddoc, s);
} finally {
if (pddoc != null)
pddoc.close();
}
StringBuffer contents = s.getBuffer();

In respect for my indexing-service it is impossible to
index these documents correctly.

I have tried to BugFix PDFBox
(PDFTextStripper.flushText()) and established, that the
width returned by TextPosition.getWidth() is incorrect.
When i multiply this width with TextPosition.getXScale
(), these documents are extracted correctly. But other
before correctly extracted documents loose nearly all
spaces, which means, that a complete sentence dont
contain any spaces between words.

I have tried: PDFBox-0.7.2-dev-20050730.jar, but the
problem still remains.

Example Text-Output:

3.5 SSL V erbindunge n
• JSSE (Java Sec ure So ck et E xt en sion)
• im po rt ja vax .n et .ss l. *
• W ese nt lich e ¨And erung im Cl ient Pr ogr
amm : (F act o ry
P att e rn )
Erse tze
Soc ket s = new Soc ket (ho stna me, port numb er)
du rc h
SSL Soc ketF acto ry s f = (SS LSoc ketF acto ry)
S SLS ocke tFac tory .get Def ault ();
SSL Soc ket s = (SSL Sock et) sf.c reat eSoc ket(
hos tnam e, p ortn umbe r);
• Daf ¨u r mus s SSL k o n fig u riert sein (si ehe
unten).
V o rl ¨aufige V.......

[attachment on SourceForge]
http://sourceforge.net/tracker/download.php?group_id=78314&atid=552832&aid=1251041&file_id=144276
03_2_SSL.pdf (application/pdf), 189828 bytes
university-lecture

Attachments

Issue Links

duplicates

PDFBOX-349 Spaces between words ignored in scanned pdf files

Closed

is related to

PDFBOX-444 Incorrect Diacritic Merging/Placement

Closed

Activity

People

Assignee:: Unassigned

Reporter:: Jukka Zitting

Votes:: 0 Vote for this issue

Watchers:: 0 Start watching this issue

Dates

Created:: 03/Aug/05 13:14

Updated:: 21/Oct/09 10:28

Resolved:: 01/Apr/09 14:36