[BEAM-8579] Strip UTF-8 BOM bytes (if present) in TextSource. - ASF JIRA

Details

Type: Bug
Status: Triage Needed
Priority: P3
Resolution: Fixed
Affects Version/s: 2.15.0
Fix Version/s: 2.18.0
Component/s: io-java-text
Labels: None

Description

TextSource in the org.apache.beam.sdk.io package can handle UTF-8 encoded files, but when a file contains a byte order mark (BOM), TextSource preserves it in the output. According to the Unicode Standard (http://www.unicode.org/versions/Unicode5.0.0/ch02.pdf): "Use of a BOM is neither required nor recommended for UTF-8". A UTF-8 BOM can also be a problem for some Java implementations (e.g., https://bugs.java.com/bugdatabase/view_bug.do?bug_id=4508058). As a general practice, UTF-8 without a BOM is preferred.
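
To illustrate the symptom, here is a minimal sketch in plain Java, independent of Beam. The class name BomDecodeDemo is hypothetical and is not part of the SDK. It relies only on the fact that Java's UTF-8 decoder does not drop a leading BOM, so the three bytes EF BB BF come out as the character U+FEFF at the start of the string.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical demo class; not part of Apache Beam. */
final class BomDecodeDemo {
  private BomDecodeDemo() {}

  /** Decodes the bytes as UTF-8 exactly as given, with no BOM handling. */
  static String decode(byte[] bytes) {
    return new String(bytes, StandardCharsets.UTF_8);
  }

  /** Returns true if the decoded text starts with U+FEFF, i.e. a BOM survived decoding. */
  static boolean hasLeadingBomChar(String text) {
    return !text.isEmpty() && text.charAt(0) == '\uFEFF';
  }
}
```

A first line read this way does not compare equal to the same text without the BOM, which is how the preserved BOM tends to surface downstream.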

Proposal: remove BOM bytes in the output from TextSource.
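
The proposal could be sketched roughly as follows. This is not the code from the linked pull request: Utf8BomStripper and its methods are hypothetical names, and the sketch works on a plain byte array rather than on TextSource's internal read buffer. It also assumes the check is applied only to bytes read from the very beginning of a file, since a BOM is meaningful only there.

```java
import java.util.Arrays;

/** Hypothetical helper illustrating the proposal; not Beam's actual implementation. */
final class Utf8BomStripper {
  // UTF-8 encoding of U+FEFF (the byte order mark): EF BB BF.
  private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

  private Utf8BomStripper() {}

  /** Returns true if data begins with the three UTF-8 BOM bytes. */
  static boolean startsWithBom(byte[] data) {
    if (data.length < UTF8_BOM.length) {
      return false;
    }
    for (int i = 0; i < UTF8_BOM.length; i++) {
      if (data[i] != UTF8_BOM[i]) {
        return false;
      }
    }
    return true;
  }

  /**
   * Returns a copy of data without a leading BOM, or the same array unchanged
   * when no BOM is present. Assumes data holds the first bytes of the file.
   */
  static byte[] stripBom(byte[] data) {
    return startsWithBom(data)
        ? Arrays.copyOfRange(data, UTF8_BOM.length, data.length)
        : data;
  }
}
```

For example, passing the bytes EF BB BF 68 69 to stripBom yields the two bytes of "hi", while input without a BOM is returned as-is. In a splittable source, only the reader that starts at offset 0 would need this check; that placement is an assumption of this sketch.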

Issue Links

links to

GitHub Pull Request #10046

People

Assignee: Changming Ma

Reporter: Changming Ma

Votes: 0

Watchers: 1

Dates

Created: 07/Nov/19 19:33

Updated: 13/Apr/23 10:58

Resolved: 11/Nov/19 22:19

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 1h 20m