Commons IO / IO-178

BOMInputStream - an InputStream for detecting and optionally excluding an initial byte order mark

Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.4
    • Fix Version/s: 2.0
    • Component/s: Streams/Writers
    • Labels: None

    Description

      Microsoft tools have the unpleasant habit of writing a byte order mark (the three-byte sequence 0xEF 0xBB 0xBF) at the start of a UTF-8 encoded file.

      The CharsetDecoder supplied with the JDK does not simply discard these bytes, but instead returns the BOM character (0xFEFF); see http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6378911 for discussion on this.
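
A minimal sketch of the behaviour described above, using only the JDK. The class and method names (BomDecodeDemo, decode, stripLeadingBom) are hypothetical and chosen for illustration; they are not part of the attached patch.

```java
import java.nio.charset.StandardCharsets;

public class BomDecodeDemo {

    /** Decodes UTF-8 bytes with the JDK decoder; a leading BOM survives as U+FEFF. */
    static String decode(byte[] utf8) {
        return new String(utf8, StandardCharsets.UTF_8);
    }

    /** The manual workaround: drop a leading U+FEFF if one is present. */
    static String stripLeadingBom(String text) {
        boolean hasBom = !text.isEmpty() && text.charAt(0) == '\uFEFF';
        return hasBom ? text.substring(1) : text;
    }

    public static void main(String[] args) {
        // UTF-8 BOM followed by the ASCII text "hi"
        byte[] data = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};

        String decoded = decode(data);
        System.out.println(decoded.length());             // 3: the BOM character plus "hi"
        System.out.println(decoded.charAt(0) == '\uFEFF'); // true
        System.out.println(stripLeadingBom(decoded));      // hi
    }
}
```

Every consumer of decoded text would need a check like stripLeadingBom, which is the inconvenience this issue aims to remove.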

      This makes life unpleasant for anyone who is processing text data, as the program must look for this character and ignore it.

      The BOMExclusionInputStream class is a work-around: it recognizes the BOM at the start of the stream, and skips over it.
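
The recognise-and-skip idea can be sketched roughly as follows. This is not the attached BOMExclusionInputStream, nor the BOMInputStream that shipped in Commons IO; it is an illustrative alternative built on the JDK's PushbackInputStream, and the names BomSkipSketch, skipUtf8Bom and stripUtf8Bom are hypothetical. It handles only the UTF-8 BOM.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class BomSkipSketch {

    private static final int[] UTF8_BOM = {0xEF, 0xBB, 0xBF};

    /** Wraps a stream so that a leading UTF-8 BOM, if present, is consumed. */
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, UTF8_BOM.length);
        byte[] head = new byte[UTF8_BOM.length];

        // Read up to three bytes, tolerating short reads and early end of stream.
        int n = 0;
        while (n < head.length) {
            int r = pushback.read(head, n, head.length - n);
            if (r < 0) {
                break;
            }
            n += r;
        }

        // It is a BOM only if all three bytes arrived and all of them match.
        boolean isBom = n == head.length;
        for (int i = 0; isBom && i < head.length; i++) {
            isBom = (head[i] & 0xFF) == UTF8_BOM[i];
        }

        // Not a BOM: put back whatever was read so the caller sees the stream unchanged.
        if (!isBom && n > 0) {
            pushback.unread(head, 0, n);
        }
        return pushback;
    }

    /** Convenience wrapper for in-memory data; returns the bytes without a leading BOM. */
    static byte[] stripUtf8Bom(byte[] data) {
        try (InputStream in = skipUtf8Bom(new ByteArrayInputStream(data))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            int b;
            while ((b = in.read()) != -1) {
                out.write(b);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        byte[] withoutBom = {'h', 'i'};
        System.out.println(new String(stripUtf8Bom(withBom), StandardCharsets.UTF_8));    // hi
        System.out.println(new String(stripUtf8Bom(withoutBom), StandardCharsets.UTF_8)); // hi
    }
}
```

Working at the byte level means the BOM is gone before any Reader or CharsetDecoder sees the data, so downstream text-processing code needs no special case.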

      Attachments

        1. TestBOMExclusionInputStream.java
          8 kB
          Keith D Gregory
        2. BOMExclusionInputStream.patch
          13 kB
          Keith D Gregory
        3. BOMExclusionInputStream.java
          4 kB
          Keith D Gregory

            People

              Assignee: Niall Pemberton (niallp)
              Reporter: Keith D Gregory (kdgregory)