Details
Type: Bug
Status: Triage Needed
Priority: P3
Resolution: Fixed
2.15.0
None
Description
TextSource in the org.apache.beam.sdk.io package reads UTF-8 encoded files, but when a file starts with a byte order mark (BOM), the BOM is preserved in the output. According to the Unicode Standard (http://www.unicode.org/versions/Unicode5.0.0/ch02.pdf): "Use of a BOM is neither required nor recommended for UTF-8". A UTF-8 BOM can also cause problems in some Java implementations (e.g., https://bugs.java.com/bugdatabase/view_bug.do?bug_id=4508058). As a general practice, UTF-8 without a BOM is preferred.
Proposal: strip the UTF-8 BOM bytes (EF BB BF) from the start of the file so they do not appear in the output of TextSource.
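A minimal sketch of what the proposal amounts to, shown outside Beam. This is not the actual TextSource change: the class name BomStripper and the method stripUtf8Bom are hypothetical, and the example works on an in-memory byte array rather than on the source's read buffer. It only illustrates the check itself: if the data begins with the three bytes EF BB BF, drop them; otherwise leave the data untouched.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of the Beam SDK.
public class BomStripper {
    // UTF-8 encoding of U+FEFF, the byte order mark.
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    /** Returns the data without a leading UTF-8 BOM; other input is returned unchanged. */
    public static byte[] stripUtf8Bom(byte[] data) {
        if (data.length >= UTF8_BOM.length
                && data[0] == UTF8_BOM[0]
                && data[1] == UTF8_BOM[1]
                && data[2] == UTF8_BOM[2]) {
            // Copy everything after the three BOM bytes.
            return Arrays.copyOfRange(data, UTF8_BOM.length, data.length);
        }
        return data;
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, (byte) 'h', (byte) 'i'};
        String text = new String(stripUtf8Bom(withBom), StandardCharsets.UTF_8);
        System.out.println(text.equals("hi"));
    }
}
```

In a file-based source the check would presumably apply only to the bytes at offset 0 of the file, not to the start of every line or every split, since the same three bytes appearing later in the file are ordinary content.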