PDFBox / PDFBOX-5545

PDFTextStripper - Expose a setter for the TextPositionComparator


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Workaround
    • Affects Version/s: 3.0.0 PDFBox
    • Fix Version/s: None
    • Component/s: Text extraction
    • Labels: None

    Description

      I process a lot of medical-related PDF files that contain many superscripts, subscripts, out-of-order characters, etc.

      We tend to have trouble with the sortByPosition flag in PDFTextStripper.

      If it's not enabled, we end up with characters that are out of order in some PDFs.

      If we do enable it, it sometimes messes up superscript and subscript positions.

      Can you expose a setter for the comparator instance, so that I can try to correct it? E.g.

       

      private Comparator<TextPosition> textPositionComparator = new TextPositionComparator();

      /**
       * Sets the comparator used to order text positions when sorting by position.
       *
       * @param newTextPositionComparator the comparator to use instead of the default
       */
      public void setTextPositionComparator(final Comparator<TextPosition> newTextPositionComparator) {
          this.textPositionComparator = newTextPositionComparator;
      }
       

      Then the writePage() method could simply use that comparator when it sorts?

       

      Users could then inject their own comparator implementation.
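
      To make the request concrete, here is a minimal, self-contained sketch of the pattern in Java. Glyph and SketchStripper are hypothetical stand-ins for TextPosition and PDFTextStripper (they are not PDFBox classes), and the default ordering shown is a simplification, not the real TextPositionComparator.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for org.apache.pdfbox.text.TextPosition (not the real class).
final class Glyph {
    final String text;
    final float x;
    final float y; // y grows downwards, as with the direction-adjusted coordinates

    Glyph(String text, float x, float y) {
        this.text = text;
        this.x = x;
        this.y = y;
    }
}

// Hypothetical stand-in for a stripper; only the sorting step is modelled.
class SketchStripper {
    // Simplified default: strictly top to bottom, then left to right.
    private Comparator<Glyph> comparator =
            Comparator.<Glyph>comparingDouble(g -> g.y).thenComparingDouble(g -> g.x);

    // The kind of setter this issue asks for.
    void setComparator(Comparator<Glyph> newComparator) {
        this.comparator = newComparator;
    }

    // Sorts a copy of the glyphs with the current comparator and joins their text.
    String render(List<Glyph> glyphs) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(comparator);
        StringBuilder sb = new StringBuilder();
        for (Glyph g : sorted) {
            sb.append(g.text);
        }
        return sb.toString();
    }
}
```

      With glyphs H at (0, 10), a subscript 2 at (5, 12) and O at (10, 10), the strict default in this sketch yields "HO2", while injecting a comparator that orders by x only yields "H2O".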

      I want to try to implement a comparator that fixes sorting by applying a tolerance for subscripts and superscripts, e.g. something like this (in Kotlin):

       

      import mu.KLogging
      import org.apache.pdfbox.text.TextPosition
      import kotlin.math.abs
      
      class TextPositionSubscriptComparator : Comparator<TextPosition>, KLogging() {

          override fun compare(pos1: TextPosition, pos2: TextPosition): Int {
              // Positions with different text directions are ordered by direction first.
              val textDir = pos1.dir.compareTo(pos2.dir)
              if (textDir != 0) {
                  return textDir
              }

              val x1 = pos1.xDirAdj
              val x2 = pos2.xDirAdj
              val pos1YBottom = pos1.yDirAdj
              val pos2YBottom = pos2.yDirAdj
              val yDifference = abs(pos1YBottom - pos2YBottom)

              val result = if (yDifference < 0.1f) {
                  // Effectively the same y position: order left to right.
                  x1.compareTo(x2)
              } else {
                  val range1 = Pair(pos1YBottom - OUT_OF_LINE_TOLERANCE, pos1YBottom + pos1.heightDir + OUT_OF_LINE_TOLERANCE)
                  val range2 = Pair(pos2YBottom - OUT_OF_LINE_TOLERANCE, pos2YBottom + pos2.heightDir + OUT_OF_LINE_TOLERANCE)

                  // overlap() is symmetric, so a single check is enough.
                  if (range1.overlap(range2)) {
                      // Vertical ranges overlap (e.g. a superscript or subscript): treat as the same line.
                      x1.compareTo(x2)
                  } else {
                      if (pos1YBottom < pos2YBottom) -1 else 1
                  }
              }

      //        logger.info { "result = $result, [${pos1.unicode}], x1=${pos1.x}, y1=${pos1.y} ---- [${pos2.unicode}], x2=${pos2.x}, y2=${pos2.y}" }

              return result
          }

          companion object {
              private const val OUT_OF_LINE_TOLERANCE = 2f
          }
      }
      
      /**
       * Checks whether a numeric range overlaps with another
       */
      fun Pair<Float, Float>.overlap(other: Pair<Float, Float>) =
              !(first > other.second || second < other.first)
      
      

       

      It would help greatly if the sorting comparator were configurable.
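
      For anyone who wants to try the tolerance idea without PDFBox on the classpath, here is a rough Java sketch of the same comparison. Mark is a hypothetical stand-in for TextPosition, and the ranges mirror the ones in the Kotlin above (y - tolerance to y + height + tolerance); whether that range matches the actual glyph box depends on how yDirAdj and heightDir relate, which I have not verified here.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for TextPosition: text, x, y and height only.
final class Mark {
    final String text;
    final float x;
    final float y;
    final float height;

    Mark(String text, float x, float y, float height) {
        this.text = text;
        this.x = x;
        this.y = y;
        this.height = height;
    }
}

final class ToleranceComparator implements Comparator<Mark> {
    static final float OUT_OF_LINE_TOLERANCE = 2f;

    // Closed ranges [lo1, hi1] and [lo2, hi2] overlap unless one ends before the other starts.
    static boolean overlap(float lo1, float hi1, float lo2, float hi2) {
        return !(lo1 > hi2 || hi1 < lo2);
    }

    @Override
    public int compare(Mark a, Mark b) {
        // Effectively the same y position: order left to right.
        if (Math.abs(a.y - b.y) < 0.1f) {
            return Float.compare(a.x, b.x);
        }
        boolean sameLine = overlap(
                a.y - OUT_OF_LINE_TOLERANCE, a.y + a.height + OUT_OF_LINE_TOLERANCE,
                b.y - OUT_OF_LINE_TOLERANCE, b.y + b.height + OUT_OF_LINE_TOLERANCE);
        if (sameLine) {
            // Vertical ranges overlap: treat both marks as part of one line.
            return Float.compare(a.x, b.x);
        }
        return a.y < b.y ? -1 : 1;
    }

    // Sorts a copy of the marks with this comparator and joins their text.
    static String render(List<Mark> marks) {
        List<Mark> sorted = new ArrayList<>(marks);
        sorted.sort(new ToleranceComparator());
        StringBuilder sb = new StringBuilder();
        for (Mark m : sorted) {
            sb.append(m.text);
        }
        return sb.toString();
    }
}
```

      One caveat: a tolerance-based comparison like this is not guaranteed to be transitive (A can share a line with B, and B with C, while A and C do not overlap), so on some inputs the JDK sort may throw "Comparison method violates its general contract!". Any real implementation would need to handle that case.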

       

      regards,

      Owen

       

          People

            Assignee: Andreas Lehmkühler (lehmi)
            Reporter: Owen McGovern (omcgovern)