Details
- Bug
- Status: Resolved
- Major
- Resolution: Duplicate
- 0.9
- None
- None
Description
I came across this bug while trying to convert Unicode files to UTF-16. For files written in UTF-16BE or UTF-16LE, CharsetDetector reports the charset as "ISO-8859-1". The following program reproduces it:
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;

import org.apache.tika.exception.TikaException;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.txt.CharsetDetector;
import org.apache.tika.parser.txt.CharsetMatch;
import org.xml.sax.SAXException;

public class TikaTextConverter {
    public static void main(String args[]) throws IOException, SAXException, TikaException {
        String inputPath = "/tmp/input.csv";

        Writer writer = new OutputStreamWriter(new FileOutputStream(inputPath), "UTF-16LE");
        writer.write("Line1, Some text, Some more text");
        writer.close();

        InputStream inputStream = TikaInputStream.get(new File(inputPath).toURI().toURL(), new Metadata());
        CharsetDetector detector = new CharsetDetector();
        detector.setText(inputStream);

        CharsetMatch[] matches = detector.detectAll();
        for (CharsetMatch match : matches) {
            System.out.println(match.getName());
        }
    }
}
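Note that Java's "UTF-16LE" and "UTF-16BE" encoders write no byte order mark, so the file produced above starts directly with text bytes and a detector has no BOM to rely on. As a possible stopgap until detection is fixed, the sketch below guesses the byte order of BOM-less, mostly-ASCII UTF-16 by counting zero bytes at even and odd offsets. This is not Tika's detection logic: the class name `Utf16Sniffer`, the method `guessUtf16`, and the 70% threshold are all assumptions made for illustration, and the heuristic will not work for text that is mostly non-Latin.

```java
import java.nio.charset.StandardCharsets;

public class Utf16Sniffer {

    /**
     * Hypothetical helper (not part of Tika): guesses the byte order of
     * BOM-less UTF-16 data that consists mostly of ASCII characters.
     * Returns "UTF-16LE", "UTF-16BE", or null if neither pattern is found.
     */
    public static String guessUtf16(byte[] data) {
        int pairs = data.length / 2;
        if (pairs == 0) {
            return null;
        }
        int zeroEven = 0; // zero bytes at offsets 0, 2, 4, ...
        int zeroOdd = 0;  // zero bytes at offsets 1, 3, 5, ...
        for (int i = 0; i + 1 < data.length; i += 2) {
            if (data[i] == 0) {
                zeroEven++;
            }
            if (data[i + 1] == 0) {
                zeroOdd++;
            }
        }
        // Assumed threshold: at least 70% of the code units have a zero high
        // byte, and the other byte of every pair is never zero.
        if (zeroOdd * 10 >= pairs * 7 && zeroEven == 0) {
            return "UTF-16LE"; // high (zero) byte comes second
        }
        if (zeroEven * 10 >= pairs * 7 && zeroOdd == 0) {
            return "UTF-16BE"; // high (zero) byte comes first
        }
        return null;
    }

    public static void main(String[] args) {
        String text = "Line1, Some text, Some more text";
        // getBytes with these charsets encodes without a byte order mark.
        System.out.println(guessUtf16(text.getBytes(StandardCharsets.UTF_16LE)));
        System.out.println(guessUtf16(text.getBytes(StandardCharsets.UTF_16BE)));
        System.out.println(guessUtf16(text.getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```

A caller could read the first few kilobytes of the file, try `guessUtf16` on them, and fall back to CharsetDetector only when it returns null.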
Issue Links
- duplicates TIKA-721 UTF16-LE not detected (Open)