  1. Tika
  2. TIKA-2758

Possible error charset detection


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.18
    • Fix Version/s: 2.0.0-BETA
    • Component/s: core
    • Labels: None

    Description

      I started to upgrade our SAX parser's Tika dependency from 1.17 to 1.19, ran all 995 unit tests, and observed three failures: two encoding issues and one other oddity. The tests use real HTML.

      Where we previously extracted text such as 'Spokane, Wash. [— The solar', we now get 'Spokane, Wash. [â€" The solar' in one test. In the other, we previously extracted 'could take ["weeks, or' but now get 'could take [“weeks, or'. Our tests pass with 1.17 but fail with 1.18 and 1.19.1.
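The 'â€' prefix in the first failure is the pattern that typically appears when UTF-8 bytes are decoded with a single-byte Windows charset. The following is a minimal sketch of that effect using only the JDK, not Tika itself. It assumes the wrongly chosen charset is windows-1252, which the report does not state; the class name `MojibakeDemo` and its methods are hypothetical names used for illustration only.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Assumption: the wrongly detected charset is windows-1252 (not named in the report).
    private static final Charset ASSUMED_WRONG = Charset.forName("windows-1252");

    /** Encodes text as UTF-8, then decodes those bytes with the assumed wrong charset. */
    static String misdecode(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), ASSUMED_WRONG);
    }

    /** Reverses misdecode for text whose bytes all have a windows-1252 mapping. */
    static String repair(String garbled) {
        return new String(garbled.getBytes(ASSUMED_WRONG), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String emDash = "\u2014"; // U+2014, UTF-8 bytes E2 80 94
        String garbled = misdecode(emDash);
        // Print code points so the result does not depend on the console encoding.
        garbled.chars().forEach(c -> System.out.printf("U+%04X ", c));
        System.out.println();
        System.out.println(repair(garbled).equals(emDash));
    }
}
```

If the assumption holds, a single U+2014 character turns into three characters starting with 'â' and '€', which matches the shape of the garbled output in the failing test. The `repair` method is only a diagnostic aid for confirming the hypothesis on a sample string; it is not a substitute for correct charset detection.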

      Attached are the two HTML files.

      Reading our tests again, I see an old note beside the independent test complaining about the character encoding being incorrect. It seems that somewhere before 1.17 it was faulty, just as it is now with 1.18 and higher.

      Attachments

        1. independent.html
          216 kB
          Markus Jelsma
        2. grep_charsets.csv
          24 kB
          Tim Allison
        3. detroidnews.html
          127 kB
          Markus Jelsma

          People

            Assignee: Unassigned
            Reporter: Markus Jelsma (markus17)