Uploaded image for project: 'PDFBox'
  1. PDFBox
  2. PDFBOX-3498

Unexpected spaces in text extraction

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.2, 2.0.3
    • Fix Version/s: 2.0.4, 3.0.0 PDFBox
    • Component/s: Text extraction
    • Labels:
      None

      Description

      In the tests by Tim Allison regressions were found with files from Delaware courts, see reduced file attached.

      The extracted text with 2.0.2 and 2.0.3 is

      IN THE  COUR T OF  CHAN CER Y O F TH E STA TE OF  D ELA WARE
      

      in 2.0.1 and 1.8 it was

      IN THE COURT OF CHANCERY OF THE STATE OF DELAWARE
      

      The cause is the /W ranges table:

      /W [1 1 0 2 3 250 4 10 0 11
      12 333 13 14 0 15 15 250 16 16
      333 17 17 250 18 18 277 19 19 0
      20 23 500 24 35 0 36 36 722 37
      37 666 38 39 722 40 40 666 41 41
      610 42 43 777 44 44 389 45 45 0
      46 46 777 47 47 666 48 48 943 49
      49 722 50 50 777 51 51 610 52 52
      0 53 53 722 54 54 556 55 55 666
      56 57 722 59 59 0 60 60 722 61
      67 0 68 68 500 69 69 556 70 70
      443 71 71 556 72 72 443 73 73 333
      74 74 500 75 75 556 76 76 277 77
      77 0 78 78 556 79 79 277 80 80
      833 81 81 556 82 82 500 83 84 556
      85 85 443 86 86 389 87 87 333 88
      88 556 89 89 0 90 90 722 91 92
      500 93 178 0 179 180 500 181 181 0
      182 182 333 183 751 0 752 752 198 753
      794 0 795 795 612 796 1126 0 1127 1127
      125 1129 1129 2000 1130 65534 0]
      

      For text extraction, the width of a space is caculated by taking an average of all widths. However these files have a lot (over 60000) of widths that are 0. So I'll just ignore widths <= 0, as it is already done in PDFont.

        Attachments

          Activity

            People

            • Assignee:
              tilman Tilman Hausherr
              Reporter:
              tilman Tilman Hausherr
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: