Tika / TIKA-729

TIKA CharsetDetector not detecting UTF-16BE/UTF-16LE encodings


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 0.9
    • Fix Version/s: None
    • Component/s: parser
    • Labels: None

    Description

      I came across this bug when trying to convert Unicode files to UTF-16. For files written in UTF-16BE or UTF-16LE, CharsetDetector detects them as "ISO-8859-1". The following program reproduces the problem:

      import java.io.File;
      import java.io.FileOutputStream;
      import java.io.IOException;
      import java.io.InputStream;
      import java.io.OutputStreamWriter;
      import java.io.Writer;
      
      import org.apache.tika.exception.TikaException;
      import org.apache.tika.io.TikaInputStream;
      import org.apache.tika.metadata.Metadata;
      import org.apache.tika.parser.txt.CharsetDetector;
      import org.apache.tika.parser.txt.CharsetMatch;
      import org.xml.sax.SAXException;
      
      public class TikaTextConverter {
        public static void main(String[] args) throws IOException, SAXException, TikaException {
          String inputPath = "/tmp/input.csv";

          // Write a sample file as UTF-16LE. This Java encoder emits no byte order mark (BOM).
          Writer writer = new OutputStreamWriter(new FileOutputStream(inputPath), "UTF-16LE");
          try {
            writer.write("Line1, Some text, Some more text");
          } finally {
            writer.close();
          }

          InputStream inputStream = TikaInputStream.get(new File(inputPath).toURI().toURL(), new Metadata());
          try {
            CharsetDetector detector = new CharsetDetector();
            detector.setText(inputStream);

            // Print every candidate charset. As reported above, "ISO-8859-1" is detected here.
            CharsetMatch[] matches = detector.detectAll();
            for (CharsetMatch match : matches) {
              System.out.println(match.getName());
            }
          } finally {
            inputStream.close();
          }
        }
      }
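
A possible workaround, sketched here as an illustration only: Java's "UTF-16LE" and "UTF-16BE" encoders write no byte order mark, so a detector that looks only for a BOM has nothing to go on with the file produced above. Whether that is the actual cause inside CharsetDetector is an assumption, not something this report confirms. The class `Utf16Sniffer` and its `guess` method are hypothetical names, not part of Tika or ICU. The heuristic assumes mostly ASCII/Latin text, where every second byte of UTF-16 data is zero.

```java
import java.nio.charset.StandardCharsets;

public class Utf16Sniffer {

    /**
     * Hypothetical helper: guesses "UTF-16LE" or "UTF-16BE" from raw bytes,
     * or returns null when the data does not look like UTF-16.
     */
    public static String guess(byte[] data) {
        // A BOM, when present, settles the question directly.
        if (data.length >= 2) {
            int b0 = data[0] & 0xFF;
            int b1 = data[1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) return "UTF-16BE";
            if (b0 == 0xFF && b1 == 0xFE) return "UTF-16LE";
        }

        // No BOM: count zero bytes at even and odd offsets of each 2-byte unit.
        int pairs = data.length / 2;
        if (pairs == 0) return null;
        int evenZeros = 0;
        int oddZeros = 0;
        for (int i = 0; i + 1 < data.length; i += 2) {
            if (data[i] == 0) evenZeros++;
            if (data[i + 1] == 0) oddZeros++;
        }

        // ASCII-range text in UTF-16LE has zero high bytes at odd offsets,
        // and in UTF-16BE at even offsets. The 50% threshold is arbitrary.
        if (oddZeros * 2 > pairs && evenZeros == 0) return "UTF-16LE";
        if (evenZeros * 2 > pairs && oddZeros == 0) return "UTF-16BE";
        return null;
    }

    public static void main(String[] args) {
        String text = "Line1, Some text, Some more text";
        System.out.println(guess(text.getBytes(StandardCharsets.UTF_16LE))); // UTF-16LE
        System.out.println(guess(text.getBytes(StandardCharsets.UTF_16BE))); // UTF-16BE
        System.out.println(guess(text.getBytes(StandardCharsets.UTF_8)));    // null
    }
}
```

A check like this could run before falling back to CharsetDetector. It will miss UTF-16 text dominated by non-Latin scripts, where the high bytes are rarely zero.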
      

            People

              Assignee: Unassigned
              Reporter: Abhishek Jain (abhishekjain)