Tika / TIKA-583

Tika 0.8 line break removal is faulty (misses space when concatenating lines) for PDF file

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 0.8
    • Fix Version/s: None
    • Component/s: parser
    • Labels: None
    • Environment: Win Pro 7, x64, jdk1.6.0_22, jre 6.0.220.4

    Description

      When parsed by Tika 0.7, the included PDF (a legal filing from the web) yields the following as its first several lines of plain text:
      ------- start ---------------
      IN THE COURT OF APPEALS OF THE STATE OF WASHINGTON
      DIVISION ONE
      SERGEY SAVCHUK, )
      ) No. 64269-3-I
      Appellant, )
      v. )
      ) UNPUBLISHED OPINION
      STEVEN G. JERDE and )
      DARLYCE J. JERDE, husband and wife )
      )
      Respondents. )
      _______________________________ ) FILED: November 1, 2010
      --------------- end ---------------------

      Tika 0.8 has this instead:
      -------------- start ---------------------
      IN THE COURT OF APPEALS OF THE STATE OF WASHINGTONDIVISION ONESERGEYSAVCHUK,))No. 64269-3-IAppellant,)v.))UNPUBLISHED OPINIONSTEVENG. JERDE and )DARLYCE J. JERDE, husband and wife))Respondents.)_______________________________ )FILED: November 1, 2010schindler, j
      --------------- end ---------------------

      Note that, as part of the improved paragraph breaking for PDF files, lines in the "header" of the document were concatenated without any space between them, creating run-on words (e.g. "WASHINGTONDIVISION" and "ONESERGEYSAVCHUK"). Compare the original PDF with the extracted text for more details.
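
      To make the symptom concrete, here is a minimal, self-contained sketch of what joining lines without a separator does, and of a simple check that flags the resulting run-on words. It is not Tika's or PDFBox's actual code; the class name `RunOnWordCheck` and its helper methods are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; this is not how Tika extracts PDF text.
public class RunOnWordCheck {

    // Joins extracted lines with the given separator.
    // An empty separator reproduces the run-on symptom described above.
    static String joinLines(List<String> lines, String separator) {
        return String.join(separator, lines);
    }

    // Returns the whitespace-delimited tokens of `candidate` that never
    // occur as tokens in `reference` (e.g. the Tika 0.7 output).
    static List<String> unknownTokens(String reference, String candidate) {
        Set<String> known = new HashSet<>(Arrays.asList(reference.trim().split("\\s+")));
        List<String> unknown = new ArrayList<>();
        for (String token : candidate.trim().split("\\s+")) {
            if (!known.contains(token)) {
                unknown.add(token);
            }
        }
        return unknown;
    }

    public static void main(String[] args) {
        // First lines of the document as extracted by Tika 0.7.
        List<String> lines = Arrays.asList(
                "IN THE COURT OF APPEALS OF THE STATE OF WASHINGTON",
                "DIVISION ONE",
                "SERGEY SAVCHUK, )");

        String expected = joinLines(lines, " "); // line breaks replaced by a space
        String faulty = joinLines(lines, "");    // line breaks simply dropped

        // Prints [WASHINGTONDIVISION, ONESERGEY]
        System.out.println(unknownTokens(expected, faulty));
    }
}
```

      A check like `unknownTokens` could be run against the 0.7 and 0.8 outputs of the same file to list the words affected by the regression. Note that the real 0.8 output also loses some spaces inside a line ("SERGEYSAVCHUK"), which this sketch does not model.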

          People

            Assignee: Jukka Zitting
            Reporter: Dennis Adler