Tika / TIKA-583

Tika 0.8 line break removal is faulty (misses space when concatenating lines) for PDF file

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 0.8
    • Fix Version/s: None
    • Component/s: parser
    • Labels: None
    • Environment: Win Pro 7, x64, jdk1.6.0_22, jre 6.0.220.4

    Description

      When parsed by Tika 0.7, the attached PDF (a legal filing from the web) yields the following as its first several lines of plain text:
      ------- start ---------------
      IN THE COURT OF APPEALS OF THE STATE OF WASHINGTON
      DIVISION ONE
      SERGEY SAVCHUK, )
      ) No. 64269-3-I
      Appellant, )
      v. )
      ) UNPUBLISHED OPINION
      STEVEN G. JERDE and )
      DARLYCE J. JERDE, husband and wife )
      )
      Respondents. )
      _______________________________ ) FILED: November 1, 2010
      --------------- end ---------------------

      Tika 0.8 has this instead:
      -------------- start ---------------------
      IN THE COURT OF APPEALS OF THE STATE OF WASHINGTONDIVISION ONESERGEYSAVCHUK,))No. 64269-3-IAppellant,)v.))UNPUBLISHED OPINIONSTEVENG. JERDE and )DARLYCE J. JERDE, husband and wife))Respondents.)_______________________________ )FILED: November 1, 2010schindler, j
      --------------- end ---------------------

      Notice that, as part of the improved paragraph breaking for PDF files, the lines of the document "header" were concatenated without spaces in between, creating run-on words (e.g. "WASHINGTONDIVISION" and "ONESERGEYSAVCHUK"). Compare the extracted text with the attached PDF for more details.
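
The symptom described above can be illustrated with a minimal, self-contained Java sketch. This is not Tika's or PDFBox's actual code; the class `LineJoinSketch` and both method names are hypothetical and only contrast joining lines with no separator (the reported behaviour) against joining them with a single space (the expected behaviour).

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not taken from the Tika code base.
public class LineJoinSketch {

    // Reported behaviour: lines are glued together with nothing in between,
    // so the last word of one line runs into the first word of the next.
    static String joinWithoutSeparator(List<String> lines) {
        return String.join("", lines);
    }

    // Expected behaviour: each removed line break leaves a single space.
    static String joinWithSpace(List<String> lines) {
        return String.join(" ", lines);
    }

    public static void main(String[] args) {
        // First two header lines from the Tika 0.7 output in this report.
        List<String> header = Arrays.asList(
                "IN THE COURT OF APPEALS OF THE STATE OF WASHINGTON",
                "DIVISION ONE");
        System.out.println(joinWithoutSeparator(header));
        System.out.println(joinWithSpace(header));
    }
}
```

The first printed line contains the run-on word "WASHINGTONDIVISION" seen in the Tika 0.8 output, while the second keeps the words apart. Note that the 0.8 output also loses spaces inside a line (e.g. "SERGEYSAVCHUK"), which this sketch does not attempt to model.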

      Attachments

        1. Savchuk v. Jerde.pdf
          72 kB
          Dennis Adler

            People

              Assignee: Jukka Zitting (jukkaz)
              Reporter: Dennis Adler (dennisad)