PDFBox / PDFBOX-3782

Text extraction loses whitespace

Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 2.0.4, 2.0.5, 2.0.6
    • Fix Version/s: None
    • Component/s: Text extraction
    • Labels: None
    • Environment: Java/Tika

    Description

      I am using Tika/PDFBox to extract the content of a PDF document. In several places the extracted text loses whitespace, which causes tokenization problems for indexing and searching.

      I have attached the original document and the text output. If you search (Ctrl+F) the text output for "Another example", you will see that there is no space between "is" and the Japanese text. The same issue appears at the end of the sentence, which comes out as whichmeans“eraser” with no spaces:
      Another example is消しゴム (Rō- maji: keshigomu) whichmeans“eraser”

      During extraction I also get the warning "WARNING: No Unicode mapping for CID+0 (0) in font RGOFPX+IPAexMincho", but I have been unable to find any information about it.
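
Until the extraction itself is fixed, one possible stopgap for the indexing side is to re-insert a space wherever a Latin letter directly touches Japanese text in the extracted output. The sketch below is only an illustration of that idea, not anything PDFBox or Tika provides: the class and method names are hypothetical, it uses only the JDK, and it treats Han, Hiragana and Katakana as a single group. It addresses the "is" + Japanese case only; it cannot recover the missing spaces in whichmeans“eraser”, because there is no script change there.

```java
import java.lang.Character.UnicodeScript;

// Hypothetical post-processing helper (not part of PDFBox or Tika).
class ScriptBoundarySpacer {

    // For this sketch, Han, Hiragana and Katakana all count as "Japanese text".
    static boolean isCjk(int codePoint) {
        UnicodeScript script = UnicodeScript.of(codePoint);
        return script == UnicodeScript.HAN
                || script == UnicodeScript.HIRAGANA
                || script == UnicodeScript.KATAKANA;
    }

    // Only letters count, so digits and punctuation never trigger a split.
    static boolean isLatinLetter(int codePoint) {
        return Character.isLetter(codePoint)
                && UnicodeScript.of(codePoint) == UnicodeScript.LATIN;
    }

    // Inserts one space wherever a Latin letter is immediately followed by a
    // Japanese character, or a Japanese character by a Latin letter.
    static String insertBoundarySpaces(String text) {
        StringBuilder out = new StringBuilder(text.length() + 8);
        int previous = -1;
        for (int i = 0; i < text.length(); ) {
            int current = text.codePointAt(i);
            boolean latinThenCjk = previous != -1 && isLatinLetter(previous) && isCjk(current);
            boolean cjkThenLatin = previous != -1 && isCjk(previous) && isLatinLetter(current);
            if (latinThenCjk || cjkThenLatin) {
                out.append(' ');
            }
            out.appendCodePoint(current);
            previous = current;
            i += Character.charCount(current);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "Another example is" followed directly by the Japanese word from the
        // report, written with Unicode escapes to keep the source ASCII-only.
        String extracted = "Another example is\u6D88\u3057\u30B4\u30E0 (keshigomu)";
        System.out.println(insertBoundarySpaces(extracted));
    }
}
```

Running `main` prints the sample with a space restored between "is" and the Japanese word. This only masks the symptom for search; the extracted text would still differ from the PDF wherever whitespace is lost inside a single script.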

          People

            Assignee: Unassigned
            Reporter: Tony Bray
            Votes: 0
            Watchers: 2