Uploaded image for project: 'Commons IO'
  1. Commons IO
  2. IO-331

BOMInputStream wrongly detects UTF-32LE_BOM files as UTF-16LE_BOM files in method getBOM()

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Fixed
    • 2.3
    • 2.4
    • Streams/Writers
    • OS: Win 7 x64
      JDK: 1.7.03

    Description

      Hi,

      The BOMInputStream works great for most UTF encoded files when detecting Byte Order Marks. However, if a file is UTF-32LE encoded with BOM the class takes it as UTF-16LE instead. This is not expected behavior.

      The problem comes from method getBOM(). And the first two bytes for UTF-16LE and UTF-32LE are the same, which might be the root cause of the problem.

      The following lists the bytes for UTF encodings for reference. The content is a BOM followed by letter 't'.

      Encoding Byte 1 Byte 2 Byte 3 Byte 4        
      UTF8 EF BB BF 74        
      UTF16-LE FF FE 74 00        
      UTF16-BE FE FF 00 74        
      UTF32-LE FF FE 00 00 74 00 00 00
      UTF32-BE 00 00 FE FF 00 00 00 74

      I personally used the following code to work around this problem at the moment. Hope it helps.

      	private void detectBOM(InputStream in) throws IOException{
      		List<ByteOrderMark> all=availableBOMs();
      		int max=0;
              for (ByteOrderMark bom : all) {
                  max = Math.max(max, bom.length());
              }
      		byte[] firstBytes=new byte[max];
      		for (int i = 0; i < max; i++) {
      			firstBytes[i]=(byte) in.read();
      			System.out.print(Integer.toHexString(firstBytes[i] & 0xff).toUpperCase()+" ");
      		}
      		
      		boolean found=false;
      		for (int j = max; j >1; j--) {
      			byte[] _copy=Arrays.copyOf(firstBytes, j);
      			for (ByteOrderMark mark : all) {
      				found=Arrays.equals(_copy, mark.getBytes());
      				if (found) {
      					System.out.println("\nBOM is: "+mark.getCharsetName());
      					break;
      				}
      			}
      			if (found) break;
      		}
      	}
      	
      	private static List<ByteOrderMark> availableBOMs(){
      		List<ByteOrderMark> all=new ArrayList<ByteOrderMark>();
      		all.add(ByteOrderMark.UTF_8);
      		all.add(ByteOrderMark.UTF_16BE);
      		all.add(ByteOrderMark.UTF_16LE);
      		all.add(ByteOrderMark.UTF_32BE);
      		all.add(ByteOrderMark.UTF_32LE);
      		return all;
      	}
      

      Attachments

        1. UTF-32LE_Y.txt
          0.6 kB
          David Gao

        Activity

          People

            Unassigned Unassigned
            davidgjm David Gao
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: