PDFBox / PDFBOX-3435

Text extraction - words on same line detection failing in 2.x

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.0, 2.0.1, 2.0.2, 2.0.3, 3.0.0 PDFBox
    • Fix Version/s: 2.0.3, 3.0.0 PDFBox
    • Component/s: Text extraction
    • Labels: None

    Description

      Extracting a line of text as it appears in the PDF no longer works in PDFBox 2.x: with -sort, words that sit on the same line in the PDF are split across separate output lines.

      java -jar pdfbox-app-1.8.4.jar ExtractText -console -sort ~/Desktop/text-extraction-issues.pdf

      results in:

      . . .
      Your Code        Our Code                            Description                                              Qty    Price Ex   Total Ex  
      11SP             100129630       IRWIN VICE-GRIP 11 C-CLAMP SWIVEL PAD           4         00.00      000.00
      IR-0352          100094584       IRWIN 600MM TOOL BAG                            1         00.00       00.00
      EM81.9           100088913       EMPIRE TORPEDO LEVEL ALUMINIUM                  1         00.00       00.00
      20566-618R       100023443       LENOX RECIPRO BLADE 150X20X0.9MM 18TPI 5P        3          0.00       00.00
      . . .
      

      while
      java -jar pdfbox-app-2.0.2.jar ExtractText -console -sort ~/Desktop/text-extraction-issues.pdf

      results in:

      . . .
      Your Code        Our Code                            Description                                              Qty    Price Ex   Total Ex  
      IRWIN VICE-GRIP 11 C-CLAMP SWIVEL PAD    
      11SP             100129630              4         00.00      000.00
      IRWIN 600MM TOOL BAG                     
      IR-0352          100094584              1         00.00       00.00
      EMPIRE TORPEDO LEVEL ALUMINIUM           
      EM81.9           100088913              1         00.00       00.00
      LENOX RECIPRO BLADE 150X20X0.9MM 18TPI 5P
      20566-618R       100023443              3          0.00       00.00
      . . .
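
      For readers unfamiliar with what "same line detection" means here, the following is a minimal Java sketch of one common approach: treat two text fragments as belonging to the same line when their vertical extents overlap, then order each line left to right. It is an illustration only and is not PDFBox's implementation; the class SameLineGrouping, the Fragment type, the method groupIntoLines, and all coordinates are hypothetical, chosen to mimic a table row whose description column sits slightly higher than the code columns.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SameLineGrouping {

    /** Hypothetical text fragment: its string, left x, and vertical extent (y grows downward). */
    public static final class Fragment {
        final String text;
        final float x;
        final float top;
        final float bottom;

        public Fragment(String text, float x, float top, float bottom) {
            this.text = text;
            this.x = x;
            this.top = top;
            this.bottom = bottom;
        }
    }

    /** Groups fragments into lines by vertical overlap, then joins each line left to right. */
    public static List<String> groupIntoLines(List<Fragment> fragments) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.top));

        List<String> lines = new ArrayList<>();
        List<Fragment> current = new ArrayList<>();
        float lineBottom = 0f;
        for (Fragment f : sorted) {
            // A fragment that starts at or below the bottom of the line's
            // topmost fragment does not overlap it, so it opens a new line.
            if (!current.isEmpty() && f.top >= lineBottom) {
                lines.add(joinLeftToRight(current));
                current.clear();
            }
            if (current.isEmpty()) {
                lineBottom = f.bottom;
            }
            current.add(f);
        }
        if (!current.isEmpty()) {
            lines.add(joinLeftToRight(current));
        }
        return lines;
    }

    private static String joinLeftToRight(List<Fragment> line) {
        List<Fragment> byX = new ArrayList<>(line);
        byX.sort(Comparator.comparingDouble((Fragment f) -> f.x));
        StringBuilder sb = new StringBuilder();
        for (Fragment f : byX) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(f.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Fragment> row = new ArrayList<>();
        // The description is offset by 2 units vertically, as a different font might cause.
        row.add(new Fragment("IRWIN 600MM TOOL BAG", 200f, 98f, 108f));
        row.add(new Fragment("IR-0352", 10f, 100f, 110f));
        row.add(new Fragment("100094584", 100f, 100f, 110f));
        row.add(new Fragment("1", 400f, 100f, 110f));
        for (String line : groupIntoLines(row)) {
            System.out.println(line);
        }
    }
}
```

      A comparison that required identical vertical positions instead of overlap would put the description on a line of its own, which resembles the 2.0.2 output above; whether that is the actual cause of this issue is not established here.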
      

      Attachments

        1. PDFBOX-3435-20.txt
          3 kB
          Tilman Hausherr
        2. PDFBOX-3435-18.txt
          3 kB
          Tilman Hausherr
        3. text-extraction-issues.pdf
          227 kB
          Lee van Hooff

          People

            Assignee: Tilman Hausherr
            Reporter: Lee van Hooff