PDFBox / PDFBOX-4539

Cache CharsetDecoder

    Details

      Description

      We were using PDFBox to parse and process a large number of PDFs, which could contain thousands of pages in total, so performance mattered to us.

      We therefore suggest caching the CharsetDecoder, which is currently instantiated on every call to `isValidUTF8(byte[])`.

      Our suggestion for BaseParser.java:

      private static final CharsetDecoder csUTF_8 = Charsets.UTF_8.newDecoder();
      
      /**
       * Returns true if a byte sequence is valid UTF-8.
       */
      private boolean isValidUTF8(byte[] input)
      {
          try
          {
              csUTF_8.decode(ByteBuffer.wrap(input));
              return true;
          }
          catch (CharacterCodingException e)
          {
              return false;
          }
      }
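
      One point to weigh with a cached decoder: the JDK documents CharsetDecoder instances as not safe for use by multiple concurrent threads, so a static decoder would be shared by every parser in the JVM. A minimal sketch of a per-instance variant follows. The class name Utf8Validator is hypothetical and stands in for BaseParser, and StandardCharsets is used here in place of the Charsets helper from the snippet above; neither is taken from the PDFBox source.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-alone class for illustration only; not part of PDFBox.
final class Utf8Validator
{
    // One decoder per instance: created once and reused across calls,
    // but never shared between instances that may run on different threads.
    // A decoder from newDecoder() reports malformed input by throwing.
    private final CharsetDecoder utf8Decoder = StandardCharsets.UTF_8.newDecoder();

    /**
     * Returns true if a byte sequence is valid UTF-8.
     */
    boolean isValidUTF8(byte[] input)
    {
        try
        {
            // decode(ByteBuffer) resets the decoder before decoding,
            // so state from an earlier (even failed) call does not leak in.
            utf8Decoder.decode(ByteBuffer.wrap(input));
            return true;
        }
        catch (CharacterCodingException e)
        {
            return false;
        }
    }
}
```

      This keeps the saving of one allocation per call while assuming each parser instance is used by a single thread at a time. If instances were shared across threads, a ThreadLocal holding the decoder would be another option.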
      

       

              People

              • Assignee: Tilman Hausherr (tilman)
              • Reporter: Jonathan (Rahn2)
              • Votes: 0
              • Watchers: 5
