  1. PDFBox
  2. PDFBOX-4507

OutOfMemoryError - tika1.19.1.jar

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Not A Bug
    • Affects Version/s: 2.0.12, 2.0.14
    • Fix Version/s: None
    • Component/s: Parsing
    • Labels: None

    Description

      I am trying to parse a PDF file and I am getting an OutOfMemoryError.

      The stack trace is below. I was facing a similar issue with DOCX files as well, but that is working now with the changes suggested in the following ticket:

      https://issues.apache.org/jira/browse/TIKA-2847

      PS: This happens only when I have -Xmx512m configured; if I change it to 1g, parsing works fine.

      Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
      at java.nio.HeapCharBuffer.<init>(HeapCharBuffer.java:57)
      at java.nio.CharBuffer.allocate(CharBuffer.java:335)
      at java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:795)
      at org.apache.pdfbox.pdfparser.BaseParser.isValidUTF8(BaseParser.java:782)
      at org.apache.pdfbox.pdfparser.BaseParser.parseCOSName(BaseParser.java:762)
      at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionaryNameValuePair(BaseParser.java:278)
      at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionary(BaseParser.java:212)
      at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:862)
      at org.apache.pdfbox.pdfparser.PDFObjectStreamParser.parse(PDFObjectStreamParser.java:84)
      at org.apache.pdfbox.pdfparser.COSParser.parseObjectStream(COSParser.java:994)
      at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:880)
      at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:794)
      at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:754)
      at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:185)
      at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:220)
      at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1160)
      at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1133)
      at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:154)
      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
      at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
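
      Since the failure depends on the -Xmx value, it may help to confirm which heap limit the JVM is actually running with before parsing. The following is a minimal diagnostic sketch, not a fix and not part of Tika or PDFBox: the class name HeapCheck and the helper toMiB are hypothetical, and only the standard Runtime.maxMemory() and File.length() calls are used. The reported maximum can be somewhat lower than the -Xmx value, depending on the garbage collector.

```java
import java.io.File;

public class HeapCheck {
    // Converts a byte count to whole mebibytes (rounded down).
    static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // Maximum heap the JVM will attempt to use; reflects -Xmx when it is set.
        long maxHeap = Runtime.getRuntime().maxMemory();
        System.out.println("Max heap: " + toMiB(maxHeap) + " MiB");

        // Optional: print the size of the input file given as the first argument.
        if (args.length > 0) {
            File input = new File(args[0]);
            System.out.println("Input size: " + toMiB(input.length()) + " MiB");
        }
    }
}
```

      Run it with the same options as the failing process, for example: java -Xmx512m HeapCheck testCmplData.pdf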
      

      Attachments

        1. testCmplData.pdf
          28.63 MB
          Ashish Tiwari

            People

              Assignee: Unassigned
              Reporter: Ashish Tiwari (ashishdch)
              Votes: 0
              Watchers: 2
