PDFBox / PDFBOX-5090

Text missing from extraction under certain conditions starting with Apache PDFBox 2.0.18


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.18, 2.0.19, 2.0.20, 2.0.21, 2.0.22
    • Fix Version/s: 2.0.23, 3.0.0 PDFBox
    • Component/s: Text extraction
    • Environment: JDK 1.8; Apache PDFBox and FontBox 2.0.18 and later; Windows 10

    Description

      When calling PDFTextStripper.getText() with PDFBox 2.0.18 or later, some text is missing from the extracted output under certain conditions.

      I suspect the missing text is related to the font type, the font size, or the width and height of the text.

      I have attached the text extraction results from versions 2.0.17 and 2.0.18, together with the sample data used for the test.
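
One way to pinpoint which text went missing is to compare the two extraction results line by line. The following is only a minimal sketch using the JDK alone: the class name `ExtractionDiff`, the method `missingLines`, and the sample strings are hypothetical and are not part of PDFBox or of the attached files. It assumes each result has already been read into a string.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of PDFBox.
class ExtractionDiff {

    /**
     * Returns the non-blank lines of {@code before} (trimmed, in original order)
     * that do not occur anywhere in {@code after}.
     */
    static List<String> missingLines(String before, String after) {
        // Collect every trimmed line of the newer result for fast lookup.
        Set<String> present = new HashSet<>();
        for (String line : after.split("\\R")) {
            present.add(line.trim());
        }
        // Keep each line of the older result that the newer result lacks.
        List<String> missing = new ArrayList<>();
        for (String line : before.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty() && !present.contains(trimmed)) {
                missing.add(trimmed);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for the 2.0.17 and 2.0.18 results.
        String from2017 = "Title\nFirst paragraph\nSmall footnote\n";
        String from2018 = "Title\nFirst paragraph\n";
        System.out.println(missingLines(from2017, from2018));
    }
}
```

The comparison is deliberately coarse: it matches whole trimmed lines, so a line that changed by a single character is reported as missing. That is usually enough to narrow down which fonts or text sizes are affected.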

      code

       

      PDDocument pdDocument = PDDocument.load(new File(path));
      PDFTextStripper stripper = new PDFTextStripper();
      String text = stripper.getText(pdDocument);
      pdDocument.close();
      

      dependencies

       

      <properties>
          <apache.pdfbox.version>2.0.18</apache.pdfbox.version>
      </properties>
      
      <dependencies>
          <dependency>
              <groupId>org.apache.pdfbox</groupId>
              <artifactId>pdfbox</artifactId>
              <version>${apache.pdfbox.version}</version>
          </dependency>
          <dependency>
              <groupId>org.apache.pdfbox</groupId>
              <artifactId>fontbox</artifactId>
              <version>${apache.pdfbox.version}</version>
          </dependency>
          <dependency>
              <groupId>org.apache.pdfbox</groupId>
              <artifactId>xmpbox</artifactId>
              <version>${apache.pdfbox.version}</version>
          </dependency>
      </dependencies>
      

       

            People

              Assignee: Andreas Lehmkühler (lehmi)
              Reporter: sungwon kim
              Votes: 0
              Watchers: 5
