  1. PDFBox
  2. PDFBOX-1874

PDFTextStripper.isParagraphSeparation(...)

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.8.8, 2.0.0
    • Fix Version/s: 1.8.9, 2.0.0
    • Component/s: Text extraction
    • Labels:
    • Environment:
      Eclipse

      Description

      PDFTextStripper.isParagraphSeparation(...) appears to misjudge the vertical (Y) gap between lines when it decides whether a new paragraph starts.

      PROBLEM:
      I believe the issue is due to floating-point precision in the following logic:

                  // Vertical distance between this text position and the previous one.
                  float yGap = Math.abs(position.getTextPosition().getYDirAdj()
                          - lastPosition.getTextPosition().getYDirAdj());
                  // Horizontal offset relative to the start of the last line.
                  float xGap = position.getTextPosition().getXDirAdj()
                          - lastLineStartPosition.getTextPosition().getXDirAdj();

                  if (yGap > (getDropThreshold() * maxHeightForLine))
                  {
                      result = true;
      

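      yGap is precise to roughly the thousandths place, while (getDropThreshold()*maxHeightForLine) carries digits down to roughly the hundred-thousandths place. As a result, two values that are nominally equal compare as different, for example:
      16.018 versus 16.018005
      The strict ">" check then evaluates to "true" for these operands, whereas 16.018 > 16.018 would evaluate to "false".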
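      The effect can be reproduced outside PDFBox with plain float arithmetic. The following is a minimal sketch, not PDFBox code: the class and method names (FloatGapDemo, strictlyGreater, greaterWithTolerance) are hypothetical, the tolerance 1e-5 is an arbitrary choice, and the values assume that the gap is the operand carrying the extra low-order digits.

```java
public class FloatGapDemo {
    /** Strict comparison, mirroring the shape of the check in the excerpt. */
    public static boolean strictlyGreater(float gap, float threshold) {
        return gap > threshold;
    }

    /**
     * Treats gap as larger only when it exceeds threshold by more than
     * relTol, measured relative to the larger magnitude of the two values.
     */
    public static boolean greaterWithTolerance(float gap, float threshold, float relTol) {
        float scale = Math.max(Math.abs(gap), Math.abs(threshold));
        return gap - threshold > relTol * scale;
    }

    public static void main(String[] args) {
        // Illustrative values (assumption): operands that agree to three decimals.
        float gap = 16.018005f;
        float threshold = 16.018f;
        System.out.println("strict:   " + strictlyGreater(gap, threshold));
        System.out.println("tolerant: " + greaterWithTolerance(gap, threshold, 1e-5f));
    }
}
```

      With these inputs the strict check reports the gap as larger, while the tolerant check treats the two values as equal. A genuinely larger gap, such as 32.0 against 16.018, still passes the tolerant check.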

      EFFECT OF THE PROBLEM:
      Every line in the output is marked as "isParagraphStart = true" and "writeParagraphEnd() ... = true".
      For example:

      NEW_LINE
      PARAGRAPH_START PDFBox has been designed to represent PDF documents using familiar object-oriented paradigms. The data NEW_LINE

      contained in a PDF document is a collection of basic object types: arrays, booleans, dictionaries, numbers,|||NEW_LINE|||

      PARAGRAPH_END NEW_LINE
      PARAGRAPH_START strings and binary streams. PDFBox captures these basic object types in the org.pdfbox.cos package (the NEW_LINE

      COS Model). While it's possible to create any desired interactions with a PDF document using only these|||NEW_LINE|||

      PARAGRAPH_END NEW_LINE

      In the source PDF these lines appear as follows:
      "PDFBox has been designed to represent PDF documents using familiar object-oriented paradigms. The data
      contained in a PDF document is a collection of basic object types: arrays, booleans, dictionaries, numbers,
      strings and binary streams. PDFBox captures these basic object types in the org.pdfbox.cos package (the
      COS Model). While it's possible to create any desired interactions with a PDF document using only these"

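      MY WORKAROUND:
      NOTE: this workaround carries a small performance cost, because both values are formatted as strings and parsed back.

      	 // Requires java.text.DecimalFormat, java.text.DecimalFormatSymbols and java.util.Locale.
      	 float yGap = Math.abs(position.getTextPosition().getYDirAdj()
      	         - lastPosition.getTextPosition().getYDirAdj());

      	 // Reduce both sides to two decimals (DecimalFormat rounds, it does not truncate).
      	 // Locale.ROOT keeps '.' as the decimal separator so Float.valueOf can parse the result.
      	 DecimalFormat df = new DecimalFormat("#.00", DecimalFormatSymbols.getInstance(Locale.ROOT));
      	 float yGapTruncated = Float.valueOf(df.format(yGap));

      	 float newYVal = Float.valueOf(df.format(getDropThreshold()
      	         * maxHeightForLine));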
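
      The same two-decimal reduction can be done arithmetically, which avoids the string round trip. This is only a sketch of an alternative and is not the fix that was applied in PDFBox: RoundedGapCheck, roundToHundredths and isDrop are hypothetical names, and the parameters stand in for yGap, getDropThreshold() and maxHeightForLine from the excerpt above.

```java
public class RoundedGapCheck {
    /** Rounds a value to two decimal places without formatting it as text. */
    public static float roundToHundredths(float value) {
        return Math.round(value * 100f) / 100f;
    }

    /**
     * Returns true when the rounded vertical gap exceeds the rounded drop
     * threshold (dropThreshold * maxHeightForLine).
     */
    public static boolean isDrop(float yGap, float dropThreshold, float maxHeightForLine) {
        float roundedGap = roundToHundredths(yGap);
        float roundedLimit = roundToHundredths(dropThreshold * maxHeightForLine);
        return roundedGap > roundedLimit;
    }

    public static void main(String[] args) {
        // Illustrative values (assumption): 2.5 * 6.4072 is nominally 16.018.
        System.out.println(isDrop(16.018005f, 2.5f, 6.4072f));
        System.out.println(isDrop(32.0f, 2.5f, 6.4072f));
    }
}
```

      Rounding both operands to the same number of decimals keeps the comparison from depending on digits below the hundredths place. Values that sit very close to a rounding boundary can still land on different sides of it, so a tolerance-based comparison may be the more robust choice.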
      

            People

            • Assignee:
              lehmi Andreas Lehmkühler
              Reporter:
              YuriBurrows@gmail.com Yuri Burrows