Tika / TIKA-548

PDF content extracted as single line

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.8
    • Fix Version/s: 0.9
    • Component/s: parser
    • Labels: None

      Description

      Rev 1029510 introduced a regression in PDF content parsing that is now present in the 0.8 RC. Paragraphs from the PDF are no longer separated by newlines, which is a problem both for reading and for indexing. See the attached test.

      Note that PDFBox 1.3.1 seems to extract the text correctly, at least from the command line. Here is the output for a sample file with a headline followed by a one-word paragraph:
      $> java -jar pdfbox-app-1.3.1.jar ExtractText -console docs/shortpdf.pdf
      1 - untitled 3 - 2010-02-13 09:52 - Staffan Olsson
      PDF Title For Short Document
      veryshortpdfcontents

      But Tika prints:
      $> java -jar tika-app-0.9-20101110.175016-3.jar docs/shortpdf.pdf
      ...
      <p>1 - untitled 3 - 2010-02-13 09:52 - Staffan OlssonPDF
      Title For Short Documentveryshortpdfcontents</p>
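
The difference between the two outputs above can be checked mechanically. The following is only a minimal sketch, not the attached regression test: the class `ParagraphSeparationCheck` and its helper `eachOnOwnLine` are hypothetical names, and the sketch works on already-extracted strings rather than calling Tika or PDFBox. It assumes that a correct extraction puts each expected paragraph on a line of its own.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper for illustration; not part of Tika or PDFBox.
final class ParagraphSeparationCheck {

    // Returns true if every expected paragraph appears as a complete
    // (trimmed) line of the extracted text.
    static boolean eachOnOwnLine(String extracted, List<String> paragraphs) {
        List<String> lines = Arrays.stream(extracted.split("\\R"))
                .map(String::trim)
                .collect(Collectors.toList());
        return lines.containsAll(paragraphs);
    }

    public static void main(String[] args) {
        // Headline and one-word paragraph from the sample file in this report.
        List<String> expected = Arrays.asList(
                "PDF Title For Short Document",
                "veryshortpdfcontents");

        // Output copied from the PDFBox command above.
        String pdfboxOutput = "1 - untitled 3 - 2010-02-13 09:52 - Staffan Olsson\n"
                + "PDF Title For Short Document\n"
                + "veryshortpdfcontents\n";

        // Output copied from the Tika command above (paragraphs run together).
        String tikaOutput = "<p>1 - untitled 3 - 2010-02-13 09:52 - Staffan OlssonPDF\n"
                + "Title For Short Documentveryshortpdfcontents</p>\n";

        System.out.println("PDFBox output separated: " + eachOnOwnLine(pdfboxOutput, expected));
        System.out.println("Tika output separated:   " + eachOnOwnLine(tikaOutput, expected));
    }
}
```

A whole-line comparison is deliberately strict. It would also reject output in which a long paragraph is wrapped across several lines, so a real test would probably need to normalize whitespace inside paragraphs first.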

      Attachments

      1. test.pdf (37 kB, Reinhard Schwab)
      2. tika-PDF-content-regression-test.patch (1 kB, Staffan Olsson)

            People

            • Assignee: Jukka Zitting
            • Reporter: Staffan Olsson
            • Votes: 0
            • Watchers: 0
