Details
Improvement
Status: Open
P3
Resolution: Unresolved
Google Compute Platform DataFlow
Description
From customer:
We've updated our DataFlow templates to read and write with gzip compression. I noticed that when a .gz file is written, the object's metadata defaults to "application/octet-stream" for Content-Type because it doesn't know what the file is. I would like each file to have text/plain for Content-Type and gzip for Content-Encoding. We may also add other metadata key/value pairs. I can't find a way to programmatically set these and other metadata values per object within DataFlow. I'm using TextIO right now and just calling .withCompression. I didn't see any other functions to achieve this, or any DataFlow documentation on it. Am I missing something?
The MIME type of the output file can be set by supplying your own WritableByteChannelFactory to TextIO, one that reports the MIME type you want[0].
The default WritableByteChannelFactory for TextIO reports a MIME type of "text/plain", but when "withCompression" is used, the reported MIME type becomes "application/octet-stream"[1][2].
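As a rough illustration of the custom-factory approach, here is a minimal sketch using only the JDK. The interface ChannelFactorySketch is a hypothetical stand-in: it assumes Beam's WritableByteChannelFactory has roughly this three-method shape (wrap the channel, report a MIME type, suggest a filename suffix), so check the actual interface in your Beam version before adapting it. The class name GzipTextPlainFactory and the roundTrip helper are likewise made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.WritableByteChannel;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical stand-in for the shape assumed of Beam's
// WritableByteChannelFactory; this is NOT the real Beam interface.
interface ChannelFactorySketch {
    WritableByteChannel create(WritableByteChannel channel) throws IOException;
    String getMimeType();
    String getSuggestedFilenameSuffix();
}

// Gzips the written bytes but reports "text/plain" rather than
// "application/octet-stream" as the MIME type.
class GzipTextPlainFactory implements ChannelFactorySketch {
    @Override
    public WritableByteChannel create(WritableByteChannel channel) throws IOException {
        // Closing the returned channel finishes the gzip stream and
        // closes the underlying channel.
        return Channels.newChannel(new GZIPOutputStream(Channels.newOutputStream(channel)));
    }

    @Override
    public String getMimeType() {
        return "text/plain";
    }

    @Override
    public String getSuggestedFilenameSuffix() {
        return ".gz";
    }

    // Demo helper: write text through the factory, then gunzip it again.
    static String roundTrip(String text) {
        try {
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            try (WritableByteChannel out =
                    new GzipTextPlainFactory().create(Channels.newChannel(sink))) {
                out.write(ByteBuffer.wrap(text.getBytes(StandardCharsets.UTF_8)));
            }
            ByteArrayOutputStream plain = new ByteArrayOutputStream();
            try (GZIPInputStream in =
                    new GZIPInputStream(new ByteArrayInputStream(sink.toByteArray()))) {
                byte[] buf = new byte[4096];
                int n;
                while ((n = in.read(buf)) != -1) {
                    plain.write(buf, 0, n);
                }
            }
            return new String(plain.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In a real pipeline, the equivalent class would implement Beam's own interface and be handed to TextIO in place of the .withCompression call, assuming your Beam version exposes a withWritableByteChannelFactory method on the TextIO write transform. Note that this only changes the reported Content-Type; it does not set Content-Encoding.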
Unfortunately, FileSystems.create does not support setting a content-encoding on the output channel. I will ensure that this specific point is captured in the feature request, though at this point it becomes an upstream change to Beam rather than a change to Dataflow.