PDFBox / PDFBOX-4313

PDFTextStripper groups unrelated chunks into words

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.0.11
    • Fix Version/s: None
    • Component/s: Text extraction
    • Labels: None

    Description

      I have the text "10" and "11", and they get merged into the single word "1110".

      Coordinates are:

      1 575.36 x 227.4 w 4.447998 h 5.736
      1 579.752 x 227.4 w 4.447998 h 5.736
      1 526.2 x 227.4 w 4.447998 h 5.736
      0 530.59204 x 227.4 w 4.447998 h 5.736
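
To make the layout concrete, here is a small sketch that computes the horizontal gap between consecutive glyphs. It assumes each coordinate line reads as character, x, y, width, height, and that the lines are listed in the order the glyphs are emitted. The class and method names (`GlyphGaps`, `Glyph`, `gapBefore`) are made up for this illustration and are not part of PDFBox.

```java
import java.util.Locale;

public class GlyphGaps {
    /** One glyph from the report: its character, left edge x, and width. */
    static final class Glyph {
        final String text;
        final double x;
        final double width;

        Glyph(String text, double x, double width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }

        /** Right edge of the glyph. */
        double endX() {
            return x + width;
        }
    }

    /** The four glyphs, in the order the report lists them (assumed emission order). */
    static Glyph[] reportedGlyphs() {
        double w = 4.447998;
        return new Glyph[] {
            new Glyph("1", 575.36, w),
            new Glyph("1", 579.752, w),
            new Glyph("1", 526.2, w),
            new Glyph("0", 530.59204, w),
        };
    }

    /** Distance from the right edge of glyph i-1 to the left edge of glyph i. */
    static double gapBefore(Glyph[] glyphs, int i) {
        return glyphs[i].x - glyphs[i - 1].endX();
    }

    public static void main(String[] args) {
        Glyph[] glyphs = reportedGlyphs();
        for (int i = 1; i < glyphs.length; i++) {
            System.out.printf(Locale.ROOT, "gap before glyph %d ('%s'): %.3f%n",
                    i, glyphs[i].text, gapBefore(glyphs, i));
        }
    }
}
```

Under those assumptions the first and last gaps are close to zero (the glyphs within "11" and within "10" touch), while the middle gap is roughly -58: the "10" starts far to the left of the "11" that precedes it in the listing.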

      The bug is in this PDFTextStripper chunk:

      {{
      // test if our TextPosition starts after a new word would be expected to start
      if (expectedStartOfNextWordX != EXPECTED_START_OF_NEXT_WORD_X_RESET_VALUE
              && expectedStartOfNextWordX < positionX
              // only bother adding a space if the last character was not a space
              && lastPosition.getTextPosition().getUnicode() != null
              && !lastPosition.getTextPosition().getUnicode().endsWith(" "))
      {
          line.add(LineItem.getWordSeparator());
      }

      }}

      which seems to add a word separator only if the next character starts "after" the current word. It never allows for the next character starting "before" the current word.
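
The difference between a one-sided and a two-sided check can be sketched with a simplified, standalone model. This is not PDFBox code and not a proposed patch: `WordSeparatorSketch`, `join`, and the fixed `GAP_TOLERANCE` are hypothetical, and the real stripper derives its expected start position from space and character widths rather than from a constant.

```java
public class WordSeparatorSketch {
    // Hypothetical fixed tolerance; PDFBox computes its threshold differently.
    static final double GAP_TOLERANCE = 1.0;

    /**
     * Joins glyphs in the given order, inserting a space between words.
     *
     * @param twoSided if false, a space is added only when the next glyph starts
     *                 to the right of the expected start (modelled on the quoted check);
     *                 if true, a space is also added when the next glyph starts
     *                 clearly to the left of the previous glyph
     */
    static String join(String[] text, double[] x, double[] width, boolean twoSided) {
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < text.length; i++) {
            if (i > 0) {
                double expectedStartOfNextWordX = x[i - 1] + width[i - 1] + GAP_TOLERANCE;
                boolean startsAfter = expectedStartOfNextWordX < x[i];
                boolean startsBefore = x[i] < x[i - 1] - GAP_TOLERANCE;
                if (startsAfter || (twoSided && startsBefore)) {
                    line.append(' ');
                }
            }
            line.append(text[i]);
        }
        return line.toString();
    }

    public static void main(String[] args) {
        // Glyphs from the report, in the listed order.
        String[] text = {"1", "1", "1", "0"};
        double[] x = {575.36, 579.752, 526.2, 530.59204};
        double[] width = {4.447998, 4.447998, 4.447998, 4.447998};

        System.out.println(join(text, x, width, false)); // one-sided check
        System.out.println(join(text, x, width, true));  // two-sided check
    }
}
```

With the reported coordinates, the one-sided variant reproduces the merged "1110", while the two-sided variant yields "11 10". The words still come out in emission order rather than reading order, so this only addresses the missing separator, not the ordering.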

      I guess this could also be framed as an RTL problem, but the PDF is a plain PDF; it just seems that Oracle Reports generates these chunks in reverse order.

      Attachments

        1. 1536938716546.pdf
          1 kB
          Emilian Bold
        2. crop-fisa-sintetica.png
          17 kB
          Emilian Bold
        3. details.pdf
          18 kB
          Paul Slootweg
        4. PDFBOX-4313.pdf
          0.8 kB
          Tilman Hausherr
        5. PDFBOX-4313-Test_sorted.txt
          0.3 kB
          Andreas Lehmkühler
        6. PDFBOX-4313-Test_unsorted.txt
          0.3 kB
          Andreas Lehmkühler
        7. PDFBOX4313Test.java
          7 kB
          Emilian Bold
        8. PDFBOX4313Test.java
          7 kB
          Emilian Bold
        9. PDFBOX-4313-Test.pdf
          1 kB
          Andreas Lehmkühler
        10. pdfbox-words.png
          54 kB
          Emilian Bold

          People

            Assignee: Andreas Lehmkühler
            Reporter: Emilian Bold
            Votes: 0
            Watchers: 4
