TIKA-2737: Regression in charset detection

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.17, 1.18, 1.19
    • Fix Version/s: None
    • Component/s: detector
    • Labels: None

      Description

      The attached text file (cbp12pr_ia_st.txt) is a test CSV file that I use to test a CSV parser. The test passed with versions 1.13 through 1.16. I am trying to upgrade to the latest version, 1.19, but the test has been failing since version 1.17 (see the attached screenshots of the charset matches in versions 1.16 and 1.17). The attached test file (CharsetDetectorTest.java) contains the method testFailure (the last one), which shows the wrong detection: the expected charset is UTF-8, but IBM500 is detected.
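
As a sanity check on the expected result, the file's bytes can be confirmed to be well-formed UTF-8 using only the JDK, independently of Tika. The sketch below is not taken from the attached test; the class name `Utf8Check` and the method `isValidUtf8` are hypothetical, and the file path is assumed to be passed as the first command-line argument.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper: not part of Tika and not part of CharsetDetectorTest.java.
final class Utf8Check {

    // Returns true only if the whole byte array decodes as well-formed UTF-8.
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    // Fail on bad input instead of silently substituting U+FFFD.
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Assumption: args[0] is the local path to cbp12pr_ia_st.txt.
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.println("valid UTF-8: " + isValidUtf8(data));
    }
}
```

A `true` result only shows that the bytes are consistent with UTF-8 (pure ASCII content passes as well). It does not show which charset a detector ought to rank first, so it complements the comparison of matches in the attached screenshots rather than replacing it.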

        Attachments

        1. charset- match-tike1.17.png
          86 kB
          rdamir
        2. charset- match-tike1.16.png
          65 kB
          rdamir
        3. CharsetDetectorTest.java
          5 kB
          rdamir
        4. cbp12pr_ia_st.txt
          7 kB
          rdamir

              People

              • Assignee: Unassigned
              • Reporter: rdamir
              • Votes: 0
              • Watchers: 1
