[TIKA-2475] discrepancy between CharsetDetector APIs - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 1.14, 1.15, 1.16
Fix Version/s: None
Component/s: None
Labels: None
Environment: Mac OSX 10.12.6, Java 1.8.0_111

Description

Problem

I ran into this while using CharsetDetector to detect the charset of email attachments when the mail client doesn't specify one. This worked for us in Tika 1.10, but after a recent upgrade to 1.14 the behavior changed. The attached sample file, a mix of English, Portuguese, and Spanish text, is encoded as ISO-8859-1 and was detected as such with Tika 1.10. After we updated our Tika dependency, the same data came out as a lot of junk Chinese characters. On inspection, this was because our usage of the newer Tika version detected the file as UTF-16LE instead of ISO-8859-1.

The sample file is attached as multi-language.txt.

Below is a Spock test that demonstrates the issue:

    def "test charset detection on multilingual file"(){
        setup:
        def file = new File("src/test/resources/data/multi-language.txt")

        when: "using the InputStream api"
        def detector = new CharsetDetector()
        detector.setText(file.newInputStream())
        def fileCharSet = detector.detect()

        then: "successfully detects the charset"
        fileCharSet.name.startsWith("ISO")

        when: "using the byte[] api, and munging the input"
        detector = new CharsetDetector()
        detector.setText(file.newInputStream().bytes)
        detector.MungeInput()
        fileCharSet = detector.detect()

        then: "successfully detects the charset"
        fileCharSet.name.startsWith("ISO")

        when: "using the byte[] api alone"
        detector = new CharsetDetector()
        detector.setText(file.newInputStream().bytes)
        fileCharSet = detector.detect()

        then: "this will fail - detects UTF-16LE instead"
        fileCharSet.name.startsWith("ISO")
    }

As the test above shows, I believe the issue is that CharsetDetector's setText() overloads do not delegate to one another: the InputStream overload calls MungeInput(), while the byte[] overload does not.
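To illustrate the delegation idea, here is a minimal Java sketch. It is not the actual Tika code or the patch from the linked pull request. `DelegatingDetectorSketch`, `preprocess()`, `preparedInput()`, and `SAMPLE_SIZE` are all hypothetical names, and the preprocessing step (dropping CR bytes) is an arbitrary placeholder, since this report does not describe what MungeInput() does. The only point is that when one overload delegates to the other, both entry points run the same preprocessing.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical stand-in for illustration; this is NOT Tika's CharsetDetector. */
final class DelegatingDetectorSketch {
    // Assumed upper bound on how many bytes are sampled from a stream.
    private static final int SAMPLE_SIZE = 8 * 1024;

    private byte[] prepared = new byte[0];

    /** Stream overload: reads a bounded sample, then delegates to the byte[] overload. */
    DelegatingDetectorSketch setText(InputStream in) {
        byte[] buf = new byte[SAMPLE_SIZE];
        int filled = 0;
        try {
            int n;
            while (filled < buf.length
                    && (n = in.read(buf, filled, buf.length - filled)) != -1) {
                filled += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return setText(Arrays.copyOf(buf, filled));
    }

    /** byte[] overload: the single place where preprocessing is applied. */
    DelegatingDetectorSketch setText(byte[] raw) {
        prepared = preprocess(raw);
        return this;
    }

    /** Placeholder preprocessing (drops CR bytes); stands in for a MungeInput()-style step. */
    private static byte[] preprocess(byte[] raw) {
        byte[] out = new byte[raw.length];
        int len = 0;
        for (byte b : raw) {
            if (b != '\r') {
                out[len++] = b;
            }
        }
        return Arrays.copyOf(out, len);
    }

    /** Returns a copy of the preprocessed input that detection would run on. */
    byte[] preparedInput() {
        return prepared.clone();
    }

    public static void main(String[] args) {
        byte[] raw = "linha 1\r\nlinha 2".getBytes(StandardCharsets.ISO_8859_1);
        byte[] viaBytes = new DelegatingDetectorSketch().setText(raw).preparedInput();
        byte[] viaStream = new DelegatingDetectorSketch()
                .setText(new ByteArrayInputStream(raw)).preparedInput();
        // Both entry points should yield identical preprocessed input.
        System.out.println(Arrays.equals(viaBytes, viaStream));
    }
}
```

With this shape, a caller cannot skip preprocessing by choosing the byte[] entry point, which is the divergence the third `when:` block of the test exposes.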

Attachments

multi-language.txt
10/Oct/17 20:32
2 kB
Sean Story

Issue Links

links to

GitHub Pull Request #210

People

Assignee: Unassigned

Reporter: Sean Story

Votes: 0

Watchers: 4

Dates

Created: 10/Oct/17 20:31

Updated: 11/Oct/17 15:10

Resolved: 11/Oct/17 13:21