Tika / TIKA-548

PDF content extracted as single line

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.8
    • Fix Version/s: 0.9
    • Component/s: parser
    • Labels: None

      Description

      Rev 1029510 introduces a regression in PDF content parsing, which is now present in the 0.8 RC. Paragraphs from the PDF are no longer separated by newlines. This is a problem both for reading and for indexing. See the attached test.

      Note that PDFBox 1.3.1 seems to extract the text correctly, at least from the command line. Here is the output for a sample file with a headline followed by a one-word paragraph:
      $> java -jar pdfbox-app-1.3.1.jar ExtractText -console docs/shortpdf.pdf
      1 - untitled 3 - 2010-02-13 09:52 - Staffan Olsson
      PDF Title For Short Document
      veryshortpdfcontents

      But Tika prints:
      $> java -jar tika-app-0.9-20101110.175016-3.jar docs/shortpdf.pdf
      ...
      <p>1 - untitled 3 - 2010-02-13 09:52 - Staffan OlssonPDF
      Title For Short Documentveryshortpdfcontents</p>
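
      The difference between the two outputs above can be expressed as a simple check: every expected paragraph should appear as a whole line of the extracted text. The following is only a minimal sketch of that idea using plain Java strings. The class name ParagraphSeparationCheck and the method paragraphsOnSeparateLines are hypothetical, the sample strings are copied from the outputs above with the <p> markup left out, and this is not the attached patch or part of Tika.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Tika or PDFBox.
public class ParagraphSeparationCheck {

    // Returns true only if every expected paragraph appears as a
    // complete line (ignoring surrounding whitespace) in the extracted text.
    static boolean paragraphsOnSeparateLines(String extracted, List<String> paragraphs) {
        Set<String> lines = new HashSet<>();
        // \R matches any line terminator (\n, \r\n, \r, ...).
        for (String line : extracted.split("\\R")) {
            lines.add(line.trim());
        }
        return lines.containsAll(paragraphs);
    }

    public static void main(String[] args) {
        // Paragraphs as printed by PDFBox 1.3.1 in the report above.
        List<String> expected = Arrays.asList(
            "1 - untitled 3 - 2010-02-13 09:52 - Staffan Olsson",
            "PDF Title For Short Document",
            "veryshortpdfcontents");

        // Text with one paragraph per line, as in the PDFBox output.
        String separated = String.join("\n", expected);

        // Text with the paragraphs run together, as in the Tika output
        // (markup omitted for this sketch).
        String runTogether =
            "1 - untitled 3 - 2010-02-13 09:52 - Staffan OlssonPDF\n"
            + "Title For Short Documentveryshortpdfcontents";

        System.out.println(paragraphsOnSeparateLines(separated, expected));
        System.out.println(paragraphsOnSeparateLines(runTogether, expected));
    }
}
```

      In a real regression test the extracted string would come from the parser under test rather than from a literal, and the expected paragraphs would be taken from the test document.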

        Attachments

        1. tika-PDF-content-regression-test.patch
          1 kB
          Staffan Olsson
        2. test.pdf
          37 kB
          Reinhard Schwab

              People

              • Assignee:
                jukkaz Jukka Zitting
                Reporter:
                solsson Staffan Olsson