[IO-331] BOMInputStream wrongly detects UTF-32LE_BOM files as UTF-16LE_BOM files in method getBOM() - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 2.3
Fix Version/s: 2.4
Component/s: Streams/Writers
Labels:
- BOMInputStream
- UTF-32LE
Environment:

OS: Win 7 x64
JDK: 1.7.03

Description

Hi,

BOMInputStream detects the byte order mark correctly for most UTF-encoded files. However, when a file is UTF-32LE encoded with a BOM, the class reports it as UTF-16LE instead. This is not the expected behavior.

The problem is in the getBOM() method. The UTF-16LE BOM (FF FE) is identical to the first two bytes of the UTF-32LE BOM (FF FE 00 00), which is probably the root cause.
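To illustrate the suspected cause, here is a minimal standalone sketch. It does not reproduce the actual getBOM() source. It assumes a detector that returns the first BOM whose bytes match the start of the stream, and compares that with a variant that tries the longest BOMs first. The class and method names (BomPrefixDemo, detectFirstMatch, detectLongestFirst) are hypothetical and not part of Commons IO.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class BomPrefixDemo {
    // BOM byte sequences, listed with UTF-16LE ahead of UTF-32LE (assumed order).
    static final Map<String, byte[]> BOMS = new LinkedHashMap<>();
    static {
        BOMS.put("UTF-8", new byte[] {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF});
        BOMS.put("UTF-16BE", new byte[] {(byte) 0xFE, (byte) 0xFF});
        BOMS.put("UTF-16LE", new byte[] {(byte) 0xFF, (byte) 0xFE});
        BOMS.put("UTF-32BE", new byte[] {0, 0, (byte) 0xFE, (byte) 0xFF});
        BOMS.put("UTF-32LE", new byte[] {(byte) 0xFF, (byte) 0xFE, 0, 0});
    }

    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    /** Returns the first BOM, in declaration order, that prefixes the data. */
    static String detectFirstMatch(byte[] head) {
        for (Map.Entry<String, byte[]> e : BOMS.entrySet()) {
            if (startsWith(head, e.getValue())) {
                return e.getKey();
            }
        }
        return null;
    }

    /** Same scan, but with candidates sorted so longer BOMs are tried first. */
    static String detectLongestFirst(byte[] head) {
        List<Map.Entry<String, byte[]>> sorted = new ArrayList<>(BOMS.entrySet());
        sorted.sort((a, b) -> Integer.compare(b.getValue().length, a.getValue().length));
        for (Map.Entry<String, byte[]> e : sorted) {
            if (startsWith(head, e.getValue())) {
                return e.getKey();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // UTF-32LE BOM followed by the letter 't'.
        byte[] utf32le = {(byte) 0xFF, (byte) 0xFE, 0, 0, 0x74, 0, 0, 0};
        System.out.println("first match:   " + detectFirstMatch(utf32le));
        System.out.println("longest first: " + detectLongestFirst(utf32le));
    }
}
```

Under these assumptions the first-match scan stops at the two-byte UTF-16LE BOM, while the longest-first scan reaches the four-byte UTF-32LE BOM. Note that longest-first matching would still treat a UTF-16LE file whose first character is U+0000 as UTF-32LE, since the bytes are identical.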

The table below lists the leading bytes of each UTF encoding for reference. In each case the content is a BOM followed by the letter 't'.

Encoding	Byte 1	Byte 2	Byte 3	Byte 4	Byte 5	Byte 6	Byte 7	Byte 8
UTF-8	EF	BB	BF	74
UTF-16LE	FF	FE	74	00
UTF-16BE	FE	FF	00	74
UTF-32LE	FF	FE	00	00	74	00	00	00
UTF-32BE	00	00	FE	FF	00	00	00	74
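The byte sequences in the table can be reproduced with a short standalone sketch like the one below. It writes each BOM by hand and then appends the encoded letter 't'. The class and helper names (BomTableDemo, utf32, hexWithBom) are hypothetical, and UTF-32 is encoded manually with a ByteBuffer rather than relying on a named charset being available.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

class BomTableDemo {
    /** Encodes text as UTF-32 without a BOM, four bytes per code point. */
    static byte[] utf32(String text, ByteOrder order) {
        int codePoints = text.codePointCount(0, text.length());
        ByteBuffer buf = ByteBuffer.allocate(4 * codePoints).order(order);
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            buf.putInt(cp);
            i += Character.charCount(cp);
        }
        return buf.array();
    }

    /** Prepends the BOM bytes to the payload and formats everything as hex. */
    static String hexWithBom(int[] bom, byte[] payload) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int b : bom) {
            out.write(b);
        }
        out.write(payload, 0, payload.length);
        StringBuilder sb = new StringBuilder();
        for (byte b : out.toByteArray()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String t = "t";
        System.out.println("UTF-8    " + hexWithBom(new int[] {0xEF, 0xBB, 0xBF},
                t.getBytes(StandardCharsets.UTF_8)));
        System.out.println("UTF-16LE " + hexWithBom(new int[] {0xFF, 0xFE},
                t.getBytes(StandardCharsets.UTF_16LE)));
        System.out.println("UTF-16BE " + hexWithBom(new int[] {0xFE, 0xFF},
                t.getBytes(StandardCharsets.UTF_16BE)));
        System.out.println("UTF-32LE " + hexWithBom(new int[] {0xFF, 0xFE, 0x00, 0x00},
                utf32(t, ByteOrder.LITTLE_ENDIAN)));
        System.out.println("UTF-32BE " + hexWithBom(new int[] {0x00, 0x00, 0xFE, 0xFF},
                utf32(t, ByteOrder.BIG_ENDIAN)));
    }
}
```

Writing the output of withBom-style helpers to a file is one way to produce small test inputs for each encoding.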

I am currently using the following code to work around this problem. I hope it helps.

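import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.apache.commons.io.ByteOrderMark;

	// Reads up to the longest BOM length from the stream and prints the matching BOM, if any.
	private void detectBOM(InputStream in) throws IOException {
		List<ByteOrderMark> all = availableBOMs();
		int max = 0;
		for (ByteOrderMark bom : all) {
			max = Math.max(max, bom.length());
		}
		byte[] firstBytes = new byte[max];
		int count = 0;
		while (count < max) {
			int b = in.read();
			if (b < 0) {
				break; // end of stream: keep only the bytes actually read
			}
			firstBytes[count++] = (byte) b;
			System.out.print(Integer.toHexString(b).toUpperCase() + " ");
		}

		// Compare the longest prefix first so that UTF-32LE is matched before UTF-16LE.
		boolean found = false;
		for (int j = count; j > 1 && !found; j--) {
			byte[] prefix = Arrays.copyOf(firstBytes, j);
			for (ByteOrderMark mark : all) {
				if (Arrays.equals(prefix, mark.getBytes())) {
					System.out.println("\nBOM is: " + mark.getCharsetName());
					found = true;
					break;
				}
			}
		}
	}

	// The BOMs to check; the order does not matter because detectBOM tries longer prefixes first.
	private static List<ByteOrderMark> availableBOMs() {
		List<ByteOrderMark> all = new ArrayList<ByteOrderMark>();
		all.add(ByteOrderMark.UTF_8);
		all.add(ByteOrderMark.UTF_16BE);
		all.add(ByteOrderMark.UTF_16LE);
		all.add(ByteOrderMark.UTF_32BE);
		all.add(ByteOrderMark.UTF_32LE);
		return all;
	}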

Attachments

UTF-32LE_Y.txt
01/Jun/12 04:21
0.6 kB
David Gao

People

Assignee: Unassigned

Reporter: David Gao

Votes: 0

Watchers: 2

Dates

Created: 01/Jun/12 04:20

Updated: 08/Nov/16 17:57

Resolved: 05/Jun/12 14:48