TIKA-548: PDF content extracted as single line

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.8
    • Fix Version/s: 0.9
    • Component/s: parser
    • Labels: None

    Description

      Rev 1029510 introduced a regression in PDF content parsing that is now present in the 0.8 RC. Paragraphs from the PDF are no longer separated by newlines, which is a problem both for reading and for indexing. See the attached test.

      Note that PDFBox 1.3.1 itself seems to extract the text correctly, at least from the command line. Here is the output for a sample file with a headline followed by a one-word paragraph:
      $> java -jar pdfbox-app-1.3.1.jar ExtractText -console docs/shortpdf.pdf
      1 - untitled 3 - 2010-02-13 09:52 - Staffan Olsson
      PDF Title For Short Document
      veryshortpdfcontents

      But Tika prints:
      $> java -jar tika-app-0.9-20101110.175016-3.jar docs/shortpdf.pdf
      ...
      <p>1 - untitled 3 - 2010-02-13 09:52 - Staffan OlssonPDF
      Title For Short Documentveryshortpdfcontents</p>
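
      The difference between the two outputs can be expressed as a simple check: every expected paragraph should occupy its own line in the extracted text. The sketch below is only an illustration of that check, not the test from the attached patch. The class and method names are hypothetical, and the strings are copied from the outputs above with the <p> wrapper removed.

```java
import java.util.Arrays;
import java.util.List;

public class ParagraphSeparationCheck {

    // Hypothetical helper: returns true only if every expected paragraph
    // appears as a complete line of its own in the extracted text.
    public static boolean eachOnOwnLine(String extracted, List<String> expected) {
        // \R matches any line break sequence (Java 8+).
        List<String> lines = Arrays.asList(extracted.split("\\R"));
        for (String paragraph : expected) {
            boolean found = false;
            for (String line : lines) {
                if (line.trim().equals(paragraph)) {
                    found = true;
                    break;
                }
            }
            if (!found) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> expected = Arrays.asList(
            "1 - untitled 3 - 2010-02-13 09:52 - Staffan Olsson",
            "PDF Title For Short Document",
            "veryshortpdfcontents");

        // Output reported for pdfbox-app-1.3.1 ExtractText.
        String pdfboxOutput =
            "1 - untitled 3 - 2010-02-13 09:52 - Staffan Olsson\n"
            + "PDF Title For Short Document\n"
            + "veryshortpdfcontents\n";

        // Output reported for the Tika snapshot, without the <p> tags.
        String tikaOutput =
            "1 - untitled 3 - 2010-02-13 09:52 - Staffan OlssonPDF\n"
            + "Title For Short Documentveryshortpdfcontents";

        System.out.println(eachOnOwnLine(pdfboxOutput, expected)); // prints "true"
        System.out.println(eachOnOwnLine(tikaOutput, expected));   // prints "false"
    }
}
```

      In a real regression test the extracted string would come from the parser under test rather than from a literal.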

      Attachments

        1. tika-PDF-content-regression-test.patch
          1 kB
          Staffan Olsson
        2. test.pdf
          37 kB
          Reinhard Pötz

            People

              jukkaz Jukka Zitting
              solsson Staffan Olsson