PDFBox / PDFBOX-77

PDF-Extraction splits words by spaces

Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.8.0-incubator
    • Component/s: Text extraction
    • Labels: None

    Description

      [imported from SourceForge]
      http://sourceforge.net/tracker/index.php?group_id=78314&atid=552832&aid=1251041
      Originally submitted by mkrebs on 2005-08-03 06:14.

      I'm currently working on an indexing service using
      Jakarta Lucene. As a first step I have to index about
      80 PDF documents (later approximately 1000 documents).
      About 60 of the 80 documents are extracted by PDFBox
      without any problems. But for around 20 documents,
      PDFBox generates too many spaces, so that words like
      "hostname" are extracted as "ho stna me". This happens
      to nearly all words contained in these 20 documents.
      Because I have to extract page information, I am using
      PDFTextStripper with the methods setPageSeparator(..),
      setStartPage(..), setEndPage(..) and writeText():

      StringWriter s = new StringWriter();
      PDDocument pddoc = null;
      try {
          pddoc = PDDocument.load(f);
          int pageCount = pddoc.getPageCount();
          PDFTextStripper stripper = new PDFTextStripper();
          stripper.setPageSeparator(IndexFiles.DOCUMENT_PAGE_SEPARATOR);
          stripper.setStartPage(1);
          stripper.setEndPage(pageCount);
          stripper.writeText(pddoc, s);
      } finally {
          if (pddoc != null) {
              pddoc.close();
          }
      }
      StringBuffer contents = s.getBuffer();

      As a result, my indexing service cannot index these
      documents correctly.

      I have tried to fix the bug in PDFBox
      (PDFTextStripper.flushText()) and found that the
      width returned by TextPosition.getWidth() is incorrect.
      When I multiply this width by TextPosition.getXScale(),
      these documents are extracted correctly. But other
      documents that were previously extracted correctly then
      lose nearly all spaces, so that a complete sentence no
      longer contains any spaces between words.
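
      The effect described above can be illustrated with a minimal
      sketch. It assumes that a space is inserted whenever the
      horizontal gap between two consecutive text chunks exceeds a
      threshold. The Chunk class, the join method, the coordinates
      and the threshold are all hypothetical and are not taken from
      PDFBox; the sketch only shows why scaling the reported width
      can repair one kind of document and break another.

```java
import java.util.Arrays;
import java.util.List;

public class WordGapSketch {

    /** Hypothetical stand-in for a positioned run of text; not a PDFBox class. */
    static final class Chunk {
        final String text;
        final double x;      // left edge of the chunk on the page
        final double width;  // width as reported by the extractor

        Chunk(String text, double x, double width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Joins chunks left to right and inserts a space whenever the gap between
     * the (scaled) right edge of the previous chunk and the left edge of the
     * next chunk is larger than gapThreshold. This heuristic is an assumption
     * made for illustration only.
     */
    static String join(List<Chunk> chunks, double widthScale, double gapThreshold) {
        StringBuilder out = new StringBuilder();
        Chunk prev = null;
        for (Chunk c : chunks) {
            if (prev != null) {
                double prevEnd = prev.x + prev.width * widthScale;
                if (c.x - prevEnd > gapThreshold) {
                    out.append(' ');
                }
            }
            out.append(c.text);
            prev = c;
        }
        return out.toString();
    }

    /** Chunks that touch on the page, but whose reported width is half the real one. */
    static List<Chunk> underreportedWidths() {
        return Arrays.asList(
                new Chunk("ho", 0, 10),
                new Chunk("stna", 20, 20),
                new Chunk("me", 60, 10));
    }

    /** Two words with a real gap between them and correctly reported widths. */
    static List<Chunk> correctWidths() {
        return Arrays.asList(
                new Chunk("a", 0, 10),
                new Chunk("b", 15, 10));
    }

    public static void main(String[] args) {
        System.out.println(join(underreportedWidths(), 1.0, 3.0)); // ho stna me
        System.out.println(join(underreportedWidths(), 2.0, 3.0)); // hostname
        System.out.println(join(correctWidths(), 1.0, 3.0));       // a b
        System.out.println(join(correctWidths(), 2.0, 3.0));       // ab
    }
}
```

      With the unscaled widths the first list shows spurious
      spaces; with the scale factor applied it is joined correctly,
      but the second list then loses its real space.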

      I have also tried PDFBox-0.7.2-dev-20050730.jar, but the
      problem remains.

      Example Text-Output:

      3.5 SSL V erbindunge n
      • JSSE (Java Sec ure So ck et E xt en sion)
      • im po rt ja vax .n et .ss l. *
      • W ese nt lich e ¨And erung im Cl ient Pr ogr
      amm : (F act o ry
      P att e rn )
      Erse tze
      Soc ket s = new Soc ket (ho stna me, port numb er)
      du rc h
      SSL Soc ketF acto ry s f = (SS LSoc ketF acto ry)
      S SLS ocke tFac tory .get Def ault ();
      SSL Soc ket s = (SSL Sock et) sf.c reat eSoc ket(
      hos tnam e, p ortn umbe r);
      • Daf ¨u r mus s SSL k o n fig u riert sein (si ehe
      unten).
      V o rl ¨aufige V.......

      [attachment on SourceForge]
      http://sourceforge.net/tracker/download.php?group_id=78314&atid=552832&aid=1251041&file_id=144276
      03_2_SSL.pdf (application/pdf), 189828 bytes
      university-lecture

            People

              Assignee: Unassigned
              Reporter: Jukka Zitting (jukkaz)