  1. PDFBox
  2. PDFBOX-5090

Missing text extraction under certain conditions starting with apache pdfbox 2.0.18


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.18, 2.0.19, 2.0.20, 2.0.21, 2.0.22
    • Fix Version/s: 2.0.23, 3.0.0 PDFBox
    • Component/s: Text extraction
    • Labels:
    • Environment:
      JDK 1.8; Apache PDFBox and FontBox 2.0.18 or later; Windows 10

      Description

      When calling the PDFTextStripper.getText() method on PDFBox 2.0.18 or later, some text is not extracted under certain conditions.

      I suspect the missing text is related to the font type, the font size, or the width and height of the text.

      I have attached the text extraction results from versions 2.0.17 and 2.0.18, along with the sample file used for the test.
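
To narrow down which text goes missing, the two extraction results can be compared line by line. The following is only a minimal sketch using the plain JDK: the class name `ExtractionDiff`, the method `missingLines`, and the sample strings are hypothetical and are not taken from the attachments. It lists the non-blank lines that appear in the older output but nowhere in the newer one.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of PDFBox: compares two text extraction results.
public class ExtractionDiff {

    // Returns the non-blank lines of `before` that do not appear anywhere in `after`.
    // Lines are compared after trimming leading and trailing whitespace.
    static List<String> missingLines(String before, String after) {
        Set<String> present = new HashSet<>();
        for (String line : after.split("\\R")) {
            present.add(line.trim());
        }
        List<String> missing = new ArrayList<>();
        for (String line : before.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty() && !present.contains(trimmed)) {
                missing.add(trimmed);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for the 2.0.17 and 2.0.18 results.
        String from2017 = "Header\nSmall footnote\nBody";
        String from2018 = "Header\nBody";
        System.out.println(missingLines(from2017, from2018)); // prints [Small footnote]
    }
}
```

In practice the two strings would be read from the saved extraction results. A set-based comparison ignores line order, so it reports only text that is absent, not text that has moved.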

      code

       

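      try (PDDocument pdDocument = PDDocument.load(new File(path))) {
          PDFTextStripper stripper = new PDFTextStripper();
          String text = stripper.getText(pdDocument); // some text is missing here on 2.0.18 or later
      }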
      

      dependencies

       

      <properties>
          <apache.pdfbox.version>2.0.18</apache.pdfbox.version>
      </properties>
      
      <dependencies>
          <dependency>
              <groupId>org.apache.pdfbox</groupId>
              <artifactId>pdfbox</artifactId>
              <version>${apache.pdfbox.version}</version>
          </dependency>
          <dependency>
              <groupId>org.apache.pdfbox</groupId>
              <artifactId>fontbox</artifactId>
              <version>${apache.pdfbox.version}</version>
          </dependency>
          <dependency>
              <groupId>org.apache.pdfbox</groupId>
              <artifactId>xmpbox</artifactId>
              <version>${apache.pdfbox.version}</version>
          </dependency>
      </dependencies>
      

       


              People

              • Assignee: Andreas Lehmkühler (lehmi)
              • Reporter: sungwon kim
              • Votes: 0
              • Watchers: 5
