PDFBox / PDFBOX-1804

PDFTextStripper Issue related to word positions not correctly being parsed


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.8.3
    • Fix Version/s: 1.8.4, 2.0.0
    • Component/s: Text extraction
    • Labels:
      None

      Description

      While extracting text from a PDF with a custom PDFTextStripper subclass that overrides writeString(String text, List<TextPosition> textPositions), I was getting textPositions that did not line up with the text. The text positions for every word in a line always came through as the text positions of the last word in that line. I traced the issue to the PDFTextStripper class.

      Here is the code at issue:

      /**
       * Used within {@link #normalize(List, boolean, boolean)} to handle a {@link TextPosition}.
       * @return The StringBuilder that must be used when calling this method.
       */
      private StringBuilder normalizeAdd(LinkedList<WordWithTextPositions> normalized,
              StringBuilder lineBuilder, List<TextPosition> wordPositions, TextPosition text)
      {
          if (text instanceof WordSeparator)
          {
              normalized.add(createWord(lineBuilder.toString(), wordPositions));
              lineBuilder = new StringBuilder();
              wordPositions.clear();
          }
          else
          {
              lineBuilder.append(text.getCharacter());
              wordPositions.add(text);
          }
          return lineBuilder;
      }


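      When normalizeAdd encounters a WordSeparator, it creates a new word and passes wordPositions to it. The new WordWithTextPositions, which is added to the normalized linked list, stores a reference to that list rather than a copy. The next statement calls wordPositions.clear(), and because the list was passed by reference, this also clears the positions held by the WordWithTextPositions that was just created. Every word in the line ends up pointing at the same list, which is why they all report the positions of the last word.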
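The aliasing described above can be reproduced without PDFBox. The following is a minimal sketch under stated assumptions: `Word` is a hypothetical stand-in for WordWithTextPositions that is assumed to keep the list reference it receives, positions are plain strings rather than TextPosition objects, and a space plays the role of WordSeparator. It is not the PDFBox implementation.

```java
import java.util.ArrayList;
import java.util.List;

public class AliasingDemo {
    /** Hypothetical stand-in for WordWithTextPositions: keeps the list reference it is given. */
    static final class Word {
        final String text;
        final List<String> positions;

        Word(String text, List<String> positions) {
            this.text = text;
            this.positions = positions; // no copy: aliases the caller's list
        }
    }

    /** Splits on spaces while reusing one positions list, mirroring the pattern in normalizeAdd. */
    static List<Word> splitSharingList(String line) {
        List<Word> words = new ArrayList<>();
        StringBuilder builder = new StringBuilder();
        List<String> positions = new ArrayList<>();
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == ' ') {
                words.add(new Word(builder.toString(), positions));
                builder = new StringBuilder();
                positions.clear(); // also empties the list held by the word just added
            } else {
                builder.append(c);
                positions.add(c + "@" + i); // e.g. "a@0": character and its index
            }
        }
        words.add(new Word(builder.toString(), positions)); // flush the final word
        return words;
    }

    public static void main(String[] args) {
        for (Word w : splitSharingList("ab cd")) {
            System.out.println(w.text + " -> " + w.positions);
        }
        // prints:
        // ab -> [c@3, d@4]
        // cd -> [c@3, d@4]
    }
}
```

Both words report the positions of "cd" because they hold the same list object, which matches the symptom in the description.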

      I would suggest passing a copy of the list to createWord instead:

      /**
       * Used within {@link #normalize(List, boolean, boolean)} to handle a {@link TextPosition}.
       * @return The StringBuilder that must be used when calling this method.
       */
      private StringBuilder normalizeAdd(LinkedList<WordWithTextPositions> normalized,
              StringBuilder lineBuilder, List<TextPosition> wordPositions, TextPosition text)
      {
          if (text instanceof WordSeparator)
          {
              normalized.add(createWord(lineBuilder.toString(), new ArrayList<TextPosition>(wordPositions)));
              lineBuilder = new StringBuilder();
              wordPositions.clear();
          }
          else
          {
              lineBuilder.append(text.getCharacter());
              wordPositions.add(text);
          }
          return lineBuilder;
      }
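The effect of copying the list before storing it can be checked in isolation. This is a minimal sketch, not PDFBox code: `Word` is a hypothetical stand-in for WordWithTextPositions, positions are plain strings, and a space stands in for WordSeparator. The only point it illustrates is that `new ArrayList<>(positions)` gives each word its own snapshot, so a later `clear()` on the working list no longer affects it.

```java
import java.util.ArrayList;
import java.util.List;

public class DefensiveCopyDemo {
    /** Hypothetical stand-in for WordWithTextPositions. */
    static final class Word {
        final String text;
        final List<String> positions;

        Word(String text, List<String> positions) {
            this.text = text;
            this.positions = positions;
        }
    }

    /** Splits on spaces, handing each word a copy of the working positions list. */
    static List<Word> splitCopyingList(String line) {
        List<Word> words = new ArrayList<>();
        StringBuilder builder = new StringBuilder();
        List<String> positions = new ArrayList<>();
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == ' ') {
                // Copy first, so clearing the working list leaves the word untouched.
                words.add(new Word(builder.toString(), new ArrayList<>(positions)));
                builder = new StringBuilder();
                positions.clear();
            } else {
                builder.append(c);
                positions.add(c + "@" + i); // e.g. "a@0": character and its index
            }
        }
        words.add(new Word(builder.toString(), new ArrayList<>(positions))); // flush the final word
        return words;
    }

    public static void main(String[] args) {
        for (Word w : splitCopyingList("ab cd")) {
            System.out.println(w.text + " -> " + w.positions);
        }
        // prints:
        // ab -> [a@0, b@1]
        // cd -> [c@3, d@4]
    }
}
```

The copy is shallow, which is enough here because only the list itself is mutated, not the elements in it.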

            People

            • Assignee: lehmi (Andreas Lehmkühler)
            • Reporter: andyphillips404 (Andy Phillips)
