Tika / TIKA-2737

regression in charset detection


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.17, 1.18, 1.19
    • Fix Version/s: None
    • Component/s: detector
    • Labels: None

    Description

      The attached text file (cbp12pr_ia_st.txt) is a test CSV file that I use for testing a CSV parser. The test passed from version 1.13 through 1.16. I'm trying to upgrade to the latest version, 1.19, and the test started failing with version 1.17 (see the attachments for the charset matches in versions 1.16 and 1.17). The attached test file (CharsetDetectorTest.java) contains the method testFailure (the last one), which shows the wrong detection: the expected charset is UTF-8, but IBM500 is detected.
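      As an independent sanity check on the expectation above, the following is a minimal sketch that verifies whether a file's bytes decode as strict UTF-8 using only the JDK, with no Tika involved. The class name Utf8Check and the helper isStrictUtf8 are hypothetical names chosen for illustration; they are not part of Tika or of the attached test. When no path is given, the sketch falls back to a made-up two-line CSV sample rather than the attached file.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class Utf8Check {

    // Hypothetical helper (not a Tika API): returns true only if the bytes
    // decode as UTF-8 without any malformed or unmappable sequence.
    static boolean isStrictUtf8(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Pass the path to the file under test, e.g. cbp12pr_ia_st.txt.
        // Without an argument, an illustrative in-memory sample is used.
        byte[] data = args.length > 0
                ? Files.readAllBytes(Paths.get(args[0]))
                : "id,name\n1,Iowa\n".getBytes(StandardCharsets.UTF_8);
        System.out.println("strict UTF-8: " + isStrictUtf8(data));
    }
}
```

      A true result only shows that UTF-8 is a valid reading of the bytes. It does not say which charset a statistical detector will rank first, so it complements the testFailure case rather than replacing it.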

      Attachments

        1. charset- match-tike1.16.png
          65 kB
          rdamir
        2. charset- match-tike1.17.png
          86 kB
          rdamir
        3. cbp12pr_ia_st.txt
          7 kB
          rdamir
        4. CharsetDetectorTest.java
          5 kB
          rdamir


            People

              Assignee: Unassigned
              Reporter: rdamir
              Votes: 0
              Watchers: 1
