Tika / TIKA-548

PDF content extracted as single line

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.8
    • Fix Version/s: 0.9
    • Component/s: parser
    • Labels:
      None

      Description

      Rev 1029510 introduces a regression in PDF content parsing, now present in the 0.8 RC. Paragraphs from the PDF are no longer separated by newlines. This is a problem both for reading and for indexing. See the attached test.

      Note that PDFBox 1.3.1 seems to extract correctly, at least from the command line. Here is the output for a sample file with a headline followed by a one-word paragraph:
      $> java -jar pdfbox-app-1.3.1.jar ExtractText -console docs/shortpdf.pdf
      1 - untitled 3 - 2010-02-13 09:52 - Staffan Olsson
      PDF Title For Short Document
      veryshortpdfcontents

      But Tika prints:
      $> java -jar tika-app-0.9-20101110.175016-3.jar docs/shortpdf.pdf
      ...
      <p>1 - untitled 3 - 2010-02-13 09:52 - Staffan OlssonPDF
      Title For Short Documentveryshortpdfcontents</p>
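
The difference between the two outputs above can be turned into a small automated check. The following is a minimal sketch, not the attached patch: the class name `ParagraphBreakCheck` and the method `onSeparateLines` are hypothetical, and the sample strings are copied by hand from the two outputs in this report with the `<p>` markup removed. In a real test the extracted string would presumably come from Tika itself, for example as plain text from the `Tika` facade, which is an assumption and is not shown here.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class ParagraphBreakCheck {

    /**
     * Returns true if every expected paragraph occupies a line of its own
     * in the extracted text (hypothetical helper, not part of Tika).
     */
    public static boolean onSeparateLines(String extracted, String... expectedParagraphs) {
        Set<String> lines = new HashSet<>();
        // \R matches any line break sequence (Java 8+)
        for (String line : extracted.split("\\R")) {
            lines.add(line.trim());
        }
        return lines.containsAll(Arrays.asList(expectedParagraphs));
    }

    public static void main(String[] args) {
        // Text as printed by PDFBox in this report
        String fromPdfBox = "1 - untitled 3 - 2010-02-13 09:52 - Staffan Olsson\n"
                + "PDF Title For Short Document\n"
                + "veryshortpdfcontents\n";
        // Text as printed by Tika in this report, with the <p> tags stripped by hand
        String fromTika = "1 - untitled 3 - 2010-02-13 09:52 - Staffan OlssonPDF\n"
                + "Title For Short Documentveryshortpdfcontents";

        String headline = "PDF Title For Short Document";
        String paragraph = "veryshortpdfcontents";

        System.out.println(onSeparateLines(fromPdfBox, headline, paragraph)); // true
        System.out.println(onSeparateLines(fromTika, headline, paragraph));   // false
    }
}
```

The check compares whole lines rather than using a substring match, so it fails as soon as a paragraph is glued to its neighbour, which is the symptom reported here.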

      1. tika-PDF-content-regression-test.patch
        1 kB
        Staffan Olsson
      2. test.pdf
        37 kB
        Reinhard Schwab

          Activity

          Paul Pearcy added a comment -

          Just wanted to say that I don't believe there is a stable version of TIKA available because of this issue. This issue is fixed on the trunk, but the trunk has a file handle leak problem that prevents large scale usage of this fix:
          https://issues.apache.org/jira/browse/TIKA-567

          Thanks

          Paul Pearcy added a comment -

          +1 for a 0.8.1 release, unless 0.9 is imminent.

          Thanks!

          Jukka Zitting added a comment -

          Good point, thanks! I fixed the problem with missing word separators in revision 1042338.

          Reinhard Schwab added a comment -

          I generated this document with OpenOffice's PDF export.
          A tab character is missing.

          Reinhard Schwab added a comment -

          This is a sample PDF document to reproduce the regression.

          Reinhard Schwab added a comment -

          There is still a regression: comparing today's trunk with an earlier Tika snapshot from August, and with the output of the PDF text stripper, some whitespace is missing.
          I cannot provide my sample PDF file, but maybe I will find another.
          For now I can only give an example line of text.

          Tika 0.8 snapshot from August, PDF text stripper:
          Familienstand: ledig

          trunk:
          Familienstand:ledig

          Chris A. Mattmann added a comment -

          +1 to a patch release if we need one. Jukka, let me know...

          Staffan Olsson added a comment -

          Verified to work with Solr. Thanks for the fix.

          Jukka Zitting added a comment -

          Fixed in revision 1036562. We may want to do a 0.8.1 patch release with this and perhaps some other fixes.


            People

            • Assignee: Jukka Zitting
            • Reporter: Staffan Olsson
            • Votes: 0
            • Watchers: 0
