[TIKA-729] TIKA CharsetDetector not detecting UTF-16BE/UTF-16LE encodings - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Duplicate
Affects Version/s: 0.9
Fix Version/s: None
Component/s: parser
Labels: None

Description

I came across this bug while trying to convert Unicode files to UTF-16. For files written in UTF-16BE or UTF-16LE, CharsetDetector reports the encoding as "ISO-8859-1". The following program reproduces the problem:

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;

import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.txt.CharsetDetector;
import org.apache.tika.parser.txt.CharsetMatch;

public class TikaTextConverter {
  public static void main(String[] args) throws IOException {
    String inputPath = "/tmp/input.csv";

    // Write a small sample file as UTF-16LE. Java's "UTF-16LE" encoder emits no byte order mark.
    try (Writer writer = new OutputStreamWriter(new FileOutputStream(inputPath), "UTF-16LE")) {
      writer.write("Line1, Some text, Some more text");
    }

    // Run the detector over the file and print every candidate charset it reports.
    try (InputStream inputStream = TikaInputStream.get(new File(inputPath).toURI().toURL(), new Metadata())) {
      CharsetDetector detector = new CharsetDetector();
      detector.setText(inputStream);

      CharsetMatch[] matches = detector.detectAll();
      for (CharsetMatch match : matches) {
        System.out.println(match.getName());
      }
    }
  }
}
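
One detail of the reproducer may matter here: Java's "UTF-16LE" and "UTF-16BE" encoders write no byte order mark, so a detector has only the byte pattern to go on. As a rough illustration of what that pattern looks like, here is a minimal sketch that uses only the JDK. The class Utf16Heuristic, its guess method, and the 90% threshold are hypothetical names and values chosen for this example; this is not how Tika's CharsetDetector works, and it only handles mostly-ASCII text (it also ignores UTF-32).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika: guesses the UTF-16 byte order of mostly-ASCII text.
final class Utf16Heuristic {

    /** Returns "UTF-16LE", "UTF-16BE", or null when neither pattern is evident. */
    static String guess(byte[] data) {
        // A byte order mark, when present, settles the question (UTF-32 is ignored in this sketch).
        if (data.length >= 2) {
            int b0 = data[0] & 0xFF;
            int b1 = data[1] & 0xFF;
            if (b0 == 0xFF && b1 == 0xFE) return "UTF-16LE";
            if (b0 == 0xFE && b1 == 0xFF) return "UTF-16BE";
        }

        // Without a BOM, count zero bytes at even and odd offsets.
        int pairs = data.length / 2;
        if (pairs == 0) return null;
        int zeroEven = 0;
        int zeroOdd = 0;
        for (int i = 0; i < pairs * 2; i += 2) {
            if (data[i] == 0) zeroEven++;
            if (data[i + 1] == 0) zeroOdd++;
        }

        // ASCII characters in UTF-16 have a zero high byte: at odd offsets for LE, even offsets for BE.
        // Assumed threshold: 90% of code units show the pattern and the other position is never zero.
        if (zeroOdd * 10 >= pairs * 9 && zeroEven == 0) return "UTF-16LE";
        if (zeroEven * 10 >= pairs * 9 && zeroOdd == 0) return "UTF-16BE";
        return null;
    }

    public static void main(String[] args) {
        String text = "Line1, Some text, Some more text";
        System.out.println(guess(text.getBytes(StandardCharsets.UTF_16LE)));   // UTF-16LE
        System.out.println(guess(text.getBytes(StandardCharsets.UTF_16BE)));   // UTF-16BE
        System.out.println(guess(text.getBytes(StandardCharsets.ISO_8859_1))); // null
    }
}
```

A check like this could serve as a stopgap in front of the detector for ASCII-heavy input such as the CSV line above. Text dominated by non-Latin scripts would not show the zero-byte pattern and would need a different approach.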

Issue Links

duplicates: TIKA-721 UTF16-LE not detected (Open)

People

Assignee: Unassigned
Reporter: Abhishek Jain
Votes: 0
Watchers: 1

Dates

Created: 24/Sep/11 07:42
Updated: 24/Sep/11 09:48
Resolved: 24/Sep/11 09:48