  1. PDFBox
  2. PDFBOX-599

PDFBox performance issue: TextPosition performance tweak

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.8.0-incubator, 1.0.0
    • Fix Version/s: 1.0.0
    • Component/s: Text extraction
    • Labels: None
    • Environment: All

    Description

      During text extraction, the TextPosition.getX() and TextPosition.getY() methods are invoked multiple times on each TextPosition object.

      The current code recalculates these values each time an accessor is invoked, even though the underlying state from which the values are derived has not changed.

      This is slow.

      The getters (getX() and getY()) should be changed to retain the X and Y attributes in instance fields and only calculate their values once.

      Specifically, the following two fields should be added:

      private float x = Float.NEGATIVE_INFINITY;
      private float y = Float.NEGATIVE_INFINITY;

      And the two methods changed to look like so:

      public float getX()
      {
          if (x == Float.NEGATIVE_INFINITY)
          {
              x = getXRot(rot);
          }
          return x;
      }

      public float getY()
      {
          if (y == Float.NEGATIVE_INFINITY)
          {
              if ((rot == 0) || (rot == 180))
              {
                  y = pageHeight - getYLowerLeftRot(rot);
              }
              else
              {
                  y = pageWidth - getYLowerLeftRot(rot);
              }
          }
          return y;
      }

      This change provides a very noticeable speedup in text extraction.
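
      For illustration only, here is a minimal, self-contained sketch of the same compute-once pattern. CachedPosition, rawX, computeX() and the call counter are hypothetical stand-ins invented for this example; they are not part of PDFBox, and computeX() merely plays the role that getXRot(rot) plays in the proposal above.

```java
public class CachedPosition {
    // Hypothetical stand-in for TextPosition; not the PDFBox class.
    private final float rawX;

    // Sentinel meaning "not computed yet", as in the proposed fields.
    private float x = Float.NEGATIVE_INFINITY;

    // Counts how often the derivation actually runs (for demonstration only).
    private int computeCount = 0;

    public CachedPosition(float rawX) {
        this.rawX = rawX;
    }

    // Stand-in for the derivation that the issue wants to avoid repeating.
    private float computeX() {
        computeCount++;
        return rawX * 2.0f;
    }

    // Computes the value on first access and returns the cached copy afterwards.
    public float getX() {
        if (x == Float.NEGATIVE_INFINITY) {
            x = computeX();
        }
        return x;
    }

    public int getComputeCount() {
        return computeCount;
    }

    public static void main(String[] args) {
        CachedPosition p = new CachedPosition(21.0f);
        float first = p.getX();
        float second = p.getX();
        // prints "42.0 42.0 computed 1 time(s)"
        System.out.println(first + " " + second
                + " computed " + p.getComputeCount() + " time(s)");
    }
}
```

      The sketch relies on the same assumptions as the proposal: the state the value is derived from does not change after the first call, and the derived value is never negative infinity itself. If either assumption failed, the cache would need to be invalidated or the sentinel replaced with an explicit flag.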

      I'll attach a version of the TextPosition.java class that includes this mod.

      Attachments

        1. TextPosition.java
          20 kB
          Mel Martinez

          People

            Assignee: Jukka Zitting (jukkaz)
            Reporter: Mel Martinez (m.martinez)