  1. PDFBox
  2. PDFBOX-5290

ClassCastException during Text Extraction


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Not A Bug
    • Affects Version/s: 2.0.20, 2.0.24
    • Fix Version/s: None
    • Component/s: Text extraction
    • Labels: None
    • Important

    Description

      I am getting:

      java.lang.ClassCastException: org.apache.pdfbox.cos.COSDictionary cannot be cast to org.apache.pdfbox.cos.COSArray

      when executing the following code:

      public byte[] extractTextPDFBox(String fileNamePath) throws PQException {

          String UTF_8 = "UTF-8";

          PDFLibraryProperties pdfLibraryProperties = PDFLibraryProperties.getInstance();
          String regex = pdfLibraryProperties.getAsString(PDFLibraryConstants.REGEX_TO_REMOVE_FROM_EXTRACTED_TEXT);

          byte[] bytesToReturn;
          try {
              FileInputStream fis = new FileInputStream(new File(fileNamePath));
              PDDocument pdfDoc = PDDocument.load(fis);
              PDFTextStripper pdfStripper = new PDFTextStripper();
              String textFromPDF = pdfStripper.getText(pdfDoc);
              pdfDoc.close();
              bytesToReturn = textFromPDF.getBytes(UTF_8);
              String textStr = new String(bytesToReturn).replaceAll(regex, PDFLibraryConstants.BLANK_SPACE);
              bytesToReturn = textStr.getBytes();
              fis.close();
          } catch (IOException e) {
              pqUtilityLogger.logError(e.getMessage());
              throw new PQException(e.getMessage());
          }

          return bytesToReturn;
      }

       

      The exception is thrown at the line String textFromPDF = pdfStripper.getText(pdfDoc);
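
      A side note on the post-processing step in the snippet above, which is unrelated to the exception itself: the text is encoded as UTF-8, decoded again with new String(bytes) using the platform default charset, and then re-encoded with getBytes() using the default charset. On a JVM whose default charset is not UTF-8, that round trip can corrupt non-ASCII characters. The following is only a minimal sketch of one way to avoid it, by applying the regex to the String first and encoding once. The class name ExtractedTextCleaner and the method clean are hypothetical and are not part of PDFBox or of the reporter's code base.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of PDFBox.
final class ExtractedTextCleaner {

    private ExtractedTextCleaner() {
    }

    // Applies the cleanup regex to the extracted text, then encodes the
    // result exactly once as UTF-8, independent of the platform default charset.
    // The replacement string follows the same rules as String.replaceAll
    // ('$' and '\' are treated specially).
    static byte[] clean(String text, String regex, String replacement) {
        String cleaned = Pattern.compile(regex).matcher(text).replaceAll(replacement);
        return cleaned.getBytes(StandardCharsets.UTF_8);
    }
}
```

      In the reported method, this would presumably replace the three lines after getText(pdfDoc), for example bytesToReturn = ExtractedTextCleaner.clean(textFromPDF, regex, PDFLibraryConstants.BLANK_SPACE);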

       

      Attachments

        1. newBroke.pdf
          2.74 MB
          Eric R Manzitti
        2. newBroke.txt
          271 kB
          Maruan Sahyoun


          People

            Assignee: Unassigned
            Reporter: Eric R Manzitti (eric292)
            Votes: 0
            Watchers: 3
