  1. PDFBox
  2. PDFBOX-1804

PDFTextStripper Issue related to word positions not correctly being parsed

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.8.3
    • Fix Version/s: 1.8.4, 2.0.0
    • Component/s: Text extraction
    • Labels: None

    Description

      While pulling text from a PDF with a custom PDFTextStripper subclass that overrides writeString(String text, List<TextPosition> textPositions), I was getting textPositions that did not line up with the text. The text positions of every “word” in a line always came through as the text positions of the last word in the line. I traced the issue to the PDFTextStripper class.

      So here is the Code Issue:

      /**
       * Used within {@link #normalize(List, boolean, boolean)} to handle a {@link TextPosition}.
       * @return The StringBuilder that must be used when calling this method.
       */
      private StringBuilder normalizeAdd(LinkedList<WordWithTextPositions> normalized,
              StringBuilder lineBuilder, List<TextPosition> wordPositions, TextPosition text)
      {
          if (text instanceof WordSeparator)
          {
              normalized.add(createWord(lineBuilder.toString(), wordPositions));
              lineBuilder = new StringBuilder();
              wordPositions.clear();
          }
          else
          {
              lineBuilder.append(text.getCharacter());
              wordPositions.add(text);
          }
          return lineBuilder;
      }


        In the normalizeAdd method, you create a new word, passing it wordPositions. A reference to wordPositions is stored in the new WordWithTextPositions in the normalized linked list, but shortly afterwards you call wordPositions.clear(). Since the list was passed by reference, this also clears the list held by the WordWithTextPositions you just created, and every word in the line ends up sharing the same list.
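
        The aliasing described above can be reproduced without PDFBox. The following is only a minimal sketch: AliasingDemo, Word and splitSharingList are hypothetical stand-ins (not PDFBox classes), and plain strings take the place of TextPosition objects.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the word/position bookkeeping in PDFTextStripper.
// Strings such as "c@3" stand in for TextPosition objects.
class AliasingDemo {
    static final class Word {
        final String text;
        final List<String> positions; // keeps the caller's list, no copy

        Word(String text, List<String> positions) {
            this.text = text;
            this.positions = positions;
        }
    }

    // Splits a line on spaces while reusing one position list for every word,
    // mirroring the pattern in the buggy normalizeAdd.
    static List<Word> splitSharingList(String line) {
        List<Word> words = new ArrayList<>();
        List<String> wordPositions = new ArrayList<>();
        StringBuilder builder = new StringBuilder();
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == ' ') {
                words.add(new Word(builder.toString(), wordPositions));
                builder = new StringBuilder();
                wordPositions.clear(); // also empties the list held by the Word just added
            } else {
                builder.append(c);
                wordPositions.add(c + "@" + i);
            }
        }
        words.add(new Word(builder.toString(), wordPositions)); // last word of the line
        return words;
    }

    public static void main(String[] args) {
        for (Word w : splitSharingList("ab cd")) {
            System.out.println(w.text + " -> " + w.positions);
        }
    }
}
```

        In this sketch both words report the positions of "cd", which matches the symptom of every word carrying the last word's positions.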

        So I would suggest the following:
        /**
         * Used within {@link #normalize(List, boolean, boolean)} to handle a {@link TextPosition}.
         * @return The StringBuilder that must be used when calling this method.
         */
        private StringBuilder normalizeAdd(LinkedList<WordWithTextPositions> normalized,
                StringBuilder lineBuilder, List<TextPosition> wordPositions, TextPosition text)
        {
            if (text instanceof WordSeparator)
            {
                normalized.add(createWord(lineBuilder.toString(), new ArrayList<TextPosition>(wordPositions)));
                lineBuilder = new StringBuilder();
                wordPositions.clear();
            }
            else
            {
                lineBuilder.append(text.getCharacter());
                wordPositions.add(text);
            }
            return lineBuilder;
        }
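
        To illustrate why copying the list is enough, here is a small self-contained sketch of the same defensive-copy idea. DefensiveCopyDemo and splitCopyingList are hypothetical names used only for illustration, and strings again stand in for TextPosition objects; this is not the code from the attached patch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the suggested fix: hand each word its own
// snapshot of the position list before the working list is cleared.
class DefensiveCopyDemo {
    // Returns one position list per space-separated word in the line.
    static List<List<String>> splitCopyingList(String line) {
        List<List<String>> perWord = new ArrayList<>();
        List<String> wordPositions = new ArrayList<>();
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == ' ') {
                // The copy is independent, so the clear() below cannot affect it.
                perWord.add(new ArrayList<>(wordPositions));
                wordPositions.clear();
            } else {
                wordPositions.add(c + "@" + i);
            }
        }
        perWord.add(new ArrayList<>(wordPositions)); // last word of the line
        return perWord;
    }

    public static void main(String[] args) {
        System.out.println(splitCopyingList("ab cd"));
    }
}
```

        The copy is shallow: each word gets its own list, but the elements themselves are shared, which is fine here because the positions are not modified after being collected.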

      Attachments

        1. PDFBOX-1804.patch
          0.7 kB
          Joe Hosteny

          People

            Assignee: Andreas Lehmkühler (lehmi)
            Reporter: Andy Phillips (andyphillips404)
            Votes: 1
            Watchers: 4