  1. PDFBox
  2. PDFBOX-895

Infinite recursion when trying to extract text from specific types of PDFs

Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.3.1
    • Fix Version/s: 1.7.0
    • Component/s: Text extraction
    • Labels: None

    Description

      Hello and thanks for PDFBox.

      We just started using PDFBox for text extraction (through Tika),
      and the extraction never finishes: it falls into an infinite loop
      and never returns the text.

      Please note that this happens only for a specific type of PDF
      document (used for handwriting recognition), such as the one attached.
      I am not sure whether this is a bug in PDFBox or due to the nature of the PDFs,
      but I think PDFBox should at least break out if extraction is not possible.

      I wish I could give you more information, but I know nothing about the PDF format, parsing, etc.
      Please let me know if you need any more information or if I can help in any way.

      Thanks a lot for your time.
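
      Until a fix is available, a caller can bound the extraction time on its own side. The following is a minimal sketch, not PDFBox or Tika code: the class `BoundedExtraction` and the method `runWithTimeout` are hypothetical names, and the task passed in is assumed to wrap whatever extraction call the application already makes. It only stops the caller from waiting; a worker stuck in a loop that ignores interruption keeps running in the background as a daemon thread.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: runs a text-extraction task with an upper time limit.
class BoundedExtraction {

    // Returns the task's result, or null if it did not finish within the limit.
    // Failures inside the task (including a StackOverflowError caused by
    // runaway recursion) surface as an ExecutionException.
    static String runWithTimeout(Callable<String> task, long timeout, TimeUnit unit)
            throws Exception {
        // Daemon thread, so a stuck worker cannot keep the JVM alive.
        ExecutorService executor = Executors.newSingleThreadExecutor(runnable -> {
            Thread worker = new Thread(runnable, "bounded-extraction");
            worker.setDaemon(true);
            return worker;
        });
        Future<String> future = executor.submit(task);
        try {
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            // Ask the worker to stop; this only helps if the task checks
            // for interruption, which a tight loop may not do.
            future.cancel(true);
            return null;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

      A caller would pass its own extraction call as the task, for example `runWithTimeout(() -> extract(file), 30, TimeUnit.SECONDS)`, where `extract` stands in for the application's existing code, and treat a null result as "extraction not possible".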

      Attachments

        1. test.pdf
          542 kB
          Panayiotis Vlissidis

            People

              Assignee: Andreas Lehmkühler (lehmi)
              Reporter: Panayiotis Vlissidis (pvlissidis)