PDFBox / PDFBOX-3435

Text extraction - words on same line detection failing in 2.x

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.0, 2.0.1, 2.0.2, 2.0.3, 3.0.0 PDFBox
    • Fix Version/s: 2.0.3, 3.0.0 PDFBox
    • Component/s: Text extraction
    • Labels:
      None

      Description

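      In the 2.x versions of PDFBox, text extraction no longer returns the words of a line together as they appear in the PDF. With -sort, each item's description is emitted on a separate line, ahead of the rest of its row.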

      java -jar pdfbox-app-1.8.4.jar ExtractText -console -sort ~/Desktop/text-extraction-issues.pdf

      results in:

      . . .
      Your Code        Our Code                            Description                                              Qty    Price Ex   Total Ex  
      11SP             100129630       IRWIN VICE-GRIP 11 C-CLAMP SWIVEL PAD           4         00.00      000.00
      IR-0352          100094584       IRWIN 600MM TOOL BAG                            1         00.00       00.00
      EM81.9           100088913       EMPIRE TORPEDO LEVEL ALUMINIUM                  1         00.00       00.00
      20566-618R       100023443       LENOX RECIPRO BLADE 150X20X0.9MM 18TPI 5P        3          0.00       00.00
      . . .
      

      while
      java -jar pdfbox-app-2.0.2.jar ExtractText -console -sort ~/Desktop/text-extraction-issues.pdf

      results in:

      . . .
      Your Code        Our Code                            Description                                              Qty    Price Ex   Total Ex  
      IRWIN VICE-GRIP 11 C-CLAMP SWIVEL PAD    
      11SP             100129630              4         00.00      000.00
      IRWIN 600MM TOOL BAG                     
      IR-0352          100094584              1         00.00       00.00
      EMPIRE TORPEDO LEVEL ALUMINIUM           
      EM81.9           100088913              1         00.00       00.00
      LENOX RECIPRO BLADE 150X20X0.9MM 18TPI 5P
      20566-618R       100023443              3          0.00       00.00
      . . .
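
      The difference between the two outputs can be checked mechanically. The following is a minimal sketch, not part of PDFBox: the class name SameLineCheck and the method onSameLine are hypothetical helpers introduced here for illustration, and the sample strings are copied from the first item row of the two outputs above. It only tests whether some line of the extracted text contains all of the given tokens.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of the PDFBox API.
class SameLineCheck {

    /** Returns true if at least one line of the extracted text contains every token. */
    static boolean onSameLine(String extractedText, String... tokens) {
        return Arrays.stream(extractedText.split("\\R"))
                .anyMatch(line -> Arrays.stream(tokens).allMatch(line::contains));
    }

    public static void main(String[] args) {
        // First item row as printed by pdfbox-app-1.8.4 in this report.
        String from184 =
            "11SP             100129630       IRWIN VICE-GRIP 11 C-CLAMP SWIVEL PAD           4         00.00      000.00\n";
        // The same item as printed by pdfbox-app-2.0.2 in this report.
        String from202 =
            "IRWIN VICE-GRIP 11 C-CLAMP SWIVEL PAD    \n"
          + "11SP             100129630              4         00.00      000.00\n";

        System.out.println(onSameLine(from184, "11SP", "IRWIN VICE-GRIP")); // true
        System.out.println(onSameLine(from202, "11SP", "IRWIN VICE-GRIP")); // false
    }
}
```

      Run against the full ExtractText output, a check like this would flag any item whose code and description end up on different lines.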
      

        Attachments

        1. text-extraction-issues.pdf
          227 kB
          Lee van Hooff
        2. PDFBOX-3435-20.txt
          3 kB
          Tilman Hausherr
        3. PDFBOX-3435-18.txt
          3 kB
          Tilman Hausherr


            People

            • Assignee: Tilman Hausherr (tilman)
            • Reporter: Lee van Hooff (leevanh)
            • Votes: 0
            • Watchers: 3
