Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Version: 2.3
OS: Win 7 x64
JDK: 1.7.03
Description
Hi,
BOMInputStream works great for most UTF-encoded files when detecting Byte Order Marks. However, if a file is UTF-32LE encoded with a BOM, the class reports it as UTF-16LE instead. This is not the expected behavior.
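The problem comes from the method getBOM(). The first two bytes of the UTF-16LE BOM (FF FE) and the UTF-32LE BOM (FF FE 00 00) are the same, so the shorter BOM can match before the longer one is considered. That might be the root cause of the problem.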
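To illustrate the suspected cause, here is a minimal sketch using only the JDK. It is not the Commons IO code: `BomPrefixDemo`, `boms` and `firstMatch` are hypothetical names, and the detector is assumed to return the first candidate whose bytes prefix the data. With UTF-16LE listed before UTF-32LE, a UTF-32LE file is misreported. Listing the longer BOM first avoids that.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BomPrefixDemo {
    // Hypothetical BOM table for illustration only; iteration order is insertion order.
    static Map<String, byte[]> boms(boolean longestFirst) {
        byte[] utf16le = {(byte) 0xFF, (byte) 0xFE};
        byte[] utf32le = {(byte) 0xFF, (byte) 0xFE, 0x00, 0x00};
        Map<String, byte[]> m = new LinkedHashMap<>();
        if (longestFirst) {
            m.put("UTF-32LE", utf32le);
            m.put("UTF-16LE", utf16le);
        } else {
            m.put("UTF-16LE", utf16le);
            m.put("UTF-32LE", utf32le);
        }
        return m;
    }

    // True if data begins with every byte of prefix.
    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    // Returns the name of the first candidate BOM that prefixes data, or null.
    static String firstMatch(Map<String, byte[]> candidates, byte[] data) {
        for (Map.Entry<String, byte[]> e : candidates.entrySet()) {
            if (startsWith(data, e.getValue())) {
                return e.getKey();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // UTF-32LE BOM followed by the letter 't'.
        byte[] data = {(byte) 0xFF, (byte) 0xFE, 0, 0, 0x74, 0, 0, 0};
        System.out.println(firstMatch(boms(false), data)); // UTF-16LE (wrong)
        System.out.println(firstMatch(boms(true), data));  // UTF-32LE
    }
}
```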
The following lists the bytes for each UTF encoding for reference. The content is a BOM followed by the letter 't'.
Encoding | Byte 1 | Byte 2 | Byte 3 | Byte 4 | Byte 5 | Byte 6 | Byte 7 | Byte 8 |
---|---|---|---|---|---|---|---|---|
UTF-8 | EF | BB | BF | 74 | | | | |
UTF-16LE | FF | FE | 74 | 00 | | | | |
UTF-16BE | FE | FF | 00 | 74 | | | | |
UTF-32LE | FF | FE | 00 | 00 | 74 | 00 | 00 | 00 |
UTF-32BE | 00 | 00 | FE | FF | 00 | 00 | 00 | 74 |
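The byte sequences in the table can be reproduced with a short sketch that encodes U+FEFF followed by 't' in each charset. `BomBytesDemo` and `hex` are hypothetical names. The sketch also assumes the JDK in use provides the UTF-32LE and UTF-32BE charsets; common JDKs do, but the Java specification does not require them.

```java
import java.nio.charset.Charset;

public class BomBytesDemo {
    // Encodes text in the named charset and returns the bytes as space-separated hex.
    static String hex(String text, String charsetName) {
        byte[] bytes = text.getBytes(Charset.forName(charsetName));
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+FEFF is the BOM code point; these charsets do not add one on their own.
        String text = "\uFEFFt";
        String[] names = {"UTF-8", "UTF-16LE", "UTF-16BE", "UTF-32LE", "UTF-32BE"};
        for (String name : names) {
            System.out.println(name + ": " + hex(text, name));
        }
    }
}
```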
For now I am using the following code to work around the problem. Hope it helps.
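    import java.io.IOException;
    import java.io.InputStream;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import org.apache.commons.io.ByteOrderMark;

    // Prints the first bytes of the stream in hex, then the longest matching BOM.
    private void detectBOM(InputStream in) throws IOException {
        List<ByteOrderMark> all = availableBOMs();
        int max = 0;
        for (ByteOrderMark bom : all) {
            max = Math.max(max, bom.length());
        }
        byte[] firstBytes = new byte[max];
        int count = 0;
        while (count < max) {
            int b = in.read();
            if (b < 0) {
                break; // end of stream: do not store -1 as a 0xFF data byte
            }
            firstBytes[count++] = (byte) b;
            System.out.print(Integer.toHexString(b).toUpperCase() + " ");
        }
        // Compare the longest prefix first so that UTF-32LE (FF FE 00 00)
        // is tried before UTF-16LE (FF FE).
        boolean found = false;
        for (int j = count; j > 1 && !found; j--) {
            byte[] prefix = Arrays.copyOf(firstBytes, j);
            for (ByteOrderMark mark : all) {
                found = Arrays.equals(prefix, mark.getBytes());
                if (found) {
                    System.out.println("\nBOM is: " + mark.getCharsetName());
                    break;
                }
            }
        }
    }

    // All BOMs the detector should recognize.
    private static List<ByteOrderMark> availableBOMs() {
        List<ByteOrderMark> all = new ArrayList<ByteOrderMark>();
        all.add(ByteOrderMark.UTF_8);
        all.add(ByteOrderMark.UTF_16BE);
        all.add(ByteOrderMark.UTF_16LE);
        all.add(ByteOrderMark.UTF_32BE);
        all.add(ByteOrderMark.UTF_32LE);
        return all;
    }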