Tika / TIKA-1671

Wrapped lines in PDF files not processed correctly

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 1.9
    • Fix Version/s: None
    • Component/s: parser

    Description

      Text that wraps over multiple lines in PDF documents is not extracted correctly by Tika. The expected behaviour would be for it to be extracted as a single line, but instead a line break is inserted at each wrap point.

      This makes it hard, if not impossible, to reassemble the text into its intended form, because there is no way to tell whether a line break in the extracted text appeared in the original document or was inserted by Tika at a wrap point.
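
      The ambiguity described above can be illustrated with a small sketch. It does not call Tika; simulate_extraction is a hypothetical stand-in that mimics the reported behaviour (a newline at every wrap point), and rejoin_wrapped_lines is an assumed, naive post-processing heuristic, not anything Tika provides. The wrap width of 15 is arbitrary.

```python
import textwrap


def simulate_extraction(paragraphs, width):
    """Hypothetical stand-in for the reported behaviour: each wrap point
    and each real paragraph break both come out as a plain newline."""
    lines = []
    for paragraph in paragraphs:
        lines.extend(textwrap.wrap(paragraph, width=width))
    return "\n".join(lines)


def rejoin_wrapped_lines(text):
    """Naive heuristic (an assumption, not a Tika feature): merge a line
    into the previous one unless the previous line ends a sentence."""
    out = []
    for line in text.split("\n"):
        previous_open = (
            out
            and out[-1]
            and line
            and not out[-1].rstrip().endswith((".", "!", "?", ":"))
        )
        if previous_open:
            out[-1] = out[-1].rstrip() + " " + line.lstrip()
        else:
            out.append(line)
    return "\n".join(out)


# One paragraph that wraps, versus two genuinely separate lines.
wrapped_doc = ["The quick brown fox jumps"]
two_line_doc = ["The quick brown", "fox jumps"]

# Both documents produce the same extracted text, so the original
# structure cannot be recovered from the output alone.
same = simulate_extraction(wrapped_doc, 15) == simulate_extraction(two_line_doc, 15)
```

      Because both inputs yield identical output, any heuristic such as rejoin_wrapped_lines will be wrong for one of them: here it merges the two lines, which suits the wrapped paragraph but not the document that really had two lines.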

      Attachments

        1. Test Document.pdf
          179 kB
          James Baker

            People

              Assignee: Unassigned
              Reporter: James Baker
              Votes: 0
              Watchers: 2