Tika / TIKA-548

PDF content extracted as single line

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.8
    • Fix Version/s: 0.9
    • Component/s: parser
    • Labels:
      None

      Description

      Rev 1029510 introduces a regression in PDF content parsing that is now present in the 0.8 RC: paragraphs from the PDF are no longer separated by newlines. This is a problem both for reading and for indexing. See the attached test.

      Note that PDFBox 1.3.1 seems to extract the text correctly, at least from the command line. Here is the output for a sample file with a headline followed by a one-word paragraph:
      $> java -jar pdfbox-app-1.3.1.jar ExtractText -console docs/shortpdf.pdf
      1 - untitled 3 - 2010-02-13 09:52 - Staffan Olsson
      PDF Title For Short Document
      veryshortpdfcontents

      But Tika prints:
      $> java -jar tika-app-0.9-20101110.175016-3.jar docs/shortpdf.pdf
      ...
      <p>1 - untitled 3 - 2010-02-13 09:52 - Staffan OlssonPDF
      Title For Short Documentveryshortpdfcontents</p>
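
      The symptom above can be checked mechanically: in correctly extracted text, each paragraph sits on its own line. The following is a minimal sketch of such a check, not the attached patch. It does not call Tika or PDFBox; the class and method names are hypothetical, and the two sample strings are copied from the outputs quoted in this report.

      ```java
      import java.util.Arrays;
      import java.util.List;
      import java.util.stream.Collectors;

      // Hypothetical helper, for illustration only (not part of Tika or PDFBox).
      public class ParagraphSeparationCheck {

          // Returns true if every expected paragraph occupies a line of its own
          // in the extracted text, ignoring leading and trailing whitespace.
          public static boolean onSeparateLines(String extracted, String... paragraphs) {
              List<String> lines = Arrays.stream(extracted.split("\\R"))
                      .map(String::trim)
                      .collect(Collectors.toList());
              for (String paragraph : paragraphs) {
                  if (!lines.contains(paragraph)) {
                      return false;
                  }
              }
              return true;
          }

          public static void main(String[] args) {
              // Text as printed by the PDFBox command line in this report.
              String expected = "1 - untitled 3 - 2010-02-13 09:52 - Staffan Olsson\n"
                      + "PDF Title For Short Document\n"
                      + "veryshortpdfcontents\n";
              // Paragraph content as printed by Tika in this report.
              String regressed = "1 - untitled 3 - 2010-02-13 09:52 - Staffan OlssonPDF\n"
                      + "Title For Short Documentveryshortpdfcontents";

              System.out.println(onSeparateLines(expected,
                      "PDF Title For Short Document", "veryshortpdfcontents"));  // true
              System.out.println(onSeparateLines(regressed,
                      "PDF Title For Short Document", "veryshortpdfcontents"));  // false
          }
      }
      ```

      A regression test along these lines would feed the text produced by the parser into the same check, so that run-together paragraphs fail the test.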

      1. tika-PDF-content-regression-test.patch
        1 kB
        Staffan Olsson
      2. test.pdf
        37 kB
        Reinhard Schwab

          Activity

          Staffan Olsson created issue
          Staffan Olsson made changes
            Attachment: tika-PDF-content-regression-test.patch [ 12459373 ]
          Jukka Zitting made changes
            Status: Open [ 1 ] -> Resolved [ 5 ]
            Assignee: Jukka Zitting [ jukkaz ]
            Fix Version/s: 0.9 [ 12315488 ]
            Resolution: Fixed [ 1 ]
          Reinhard Schwab made changes
            Attachment: test.pdf [ 12464820 ]
          Jukka Zitting made changes
            Link: This issue is duplicated by TIKA-583 [ TIKA-583 ]
          Jukka Zitting made changes
            Link: This issue is duplicated by TIKA-584 [ TIKA-584 ]
          Jukka Zitting made changes
            Status: Resolved [ 5 ] -> Closed [ 6 ]

            People

            • Assignee: Jukka Zitting
            • Reporter: Staffan Olsson
            • Votes: 0
            • Watchers: 0
