Beam / BEAM-8180

Files managed by Beam should have associated attribute-value pairs (AVPs) such as content-type and content-encoding instead of merely mimeType

    Details

    • Type: Improvement
    • Status: Open
    • Priority: P3
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: io-java-text
    • Labels: None
    • Environment: Google Cloud Platform Dataflow

      Description

      From customer:

       

      We've updated our Dataflow templates to read and write with gzip compression. I noticed that when a .gz file is written, the object's Content-Type metadata defaults to "application/octet-stream" because the writer doesn't know what the content is. I would like each file to have "text/plain" as its content-type and "gzip" as its content-encoding. We may also add other metadata key/value pairs. I can't find a way to set these and other metadata values programmatically, per object, within Dataflow. I'm using TextIO right now and just calling .withCompression. I didn't see any other functions to achieve this, or any Dataflow documentation on it. Am I missing something?

       

      The MIME type of the output file can be set by supplying TextIO with your own WritableByteChannelFactory that reports the desired MIME type[0].

      The default WritableByteChannelFactory for TextIO reports a MIME type of "text/plain", but when "withCompression" is used, the MIME type becomes "application/octet-stream"[1][2].
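
The workaround described above can be sketched roughly as follows. To keep the example runnable without Beam on the classpath, `ChannelFactory` is a hypothetical local stand-in that mirrors the three methods of Beam's `FileBasedSink.WritableByteChannelFactory`; the class names `GzipTextChannelDemo` and `GzipTextPlainFactory` and the helper methods are likewise assumptions made for illustration. The sketch only covers the MIME type; it does nothing about content-encoding.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.Serializable;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.WritableByteChannel;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipTextChannelDemo {

  /** Hypothetical stand-in for Beam's FileBasedSink.WritableByteChannelFactory. */
  interface ChannelFactory extends Serializable {
    WritableByteChannel create(WritableByteChannel channel) throws IOException;

    String getMimeType();

    String getSuggestedFilenameSuffix();
  }

  /** Gzip-compresses the written bytes but still reports "text/plain" as the MIME type. */
  static final class GzipTextPlainFactory implements ChannelFactory {
    @Override
    public WritableByteChannel create(WritableByteChannel channel) throws IOException {
      // Closing the returned channel finishes the gzip stream and closes the wrapped channel.
      return Channels.newChannel(new GZIPOutputStream(Channels.newOutputStream(channel)));
    }

    @Override
    public String getMimeType() {
      return "text/plain";
    }

    @Override
    public String getSuggestedFilenameSuffix() {
      return ".gz";
    }
  }

  /** Writes the text through the factory into memory and returns the compressed bytes. */
  static byte[] compress(ChannelFactory factory, String text) throws IOException {
    ByteArrayOutputStream sink = new ByteArrayOutputStream();
    try (WritableByteChannel out = factory.create(Channels.newChannel(sink))) {
      ByteBuffer buffer = ByteBuffer.wrap(text.getBytes(StandardCharsets.UTF_8));
      while (buffer.hasRemaining()) {
        out.write(buffer);
      }
    }
    return sink.toByteArray();
  }

  /** Decompresses gzip bytes back into a UTF-8 string, to check the round trip. */
  static String decompress(byte[] gzipped) throws IOException {
    ByteArrayOutputStream plain = new ByteArrayOutputStream();
    try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(gzipped))) {
      byte[] chunk = new byte[4096];
      int read;
      while ((read = in.read(chunk)) != -1) {
        plain.write(chunk, 0, read);
      }
    }
    return new String(plain.toByteArray(), StandardCharsets.UTF_8);
  }

  public static void main(String[] args) throws IOException {
    ChannelFactory factory = new GzipTextPlainFactory();
    byte[] gzipped = compress(factory, "first line\nsecond line\n");
    System.out.println("MIME type: " + factory.getMimeType());
    System.out.println("Round trip ok: "
        + "first line\nsecond line\n".equals(decompress(gzipped)));
  }
}
```

In an actual pipeline, a class implementing the real Beam interface would presumably be handed to TextIO through `withWritableByteChannelFactory` in place of `withCompression`, so that the factory, rather than the compression setting, decides the reported MIME type.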

      Unfortunately, FileSystems.create does not support setting a content-encoding on the output channel[3]. I will make sure this specific point is captured in the feature request, although at this point it becomes an upstream change to Beam rather than a change to Dataflow.

      [0] https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java#L1175

      [1] https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileBasedSink.java#L874

      [2] https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/util/MimeTypes.java

      [3] https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileSystems.java#L224

            People

            • Assignee: Unassigned
            • Reporter: cjac C.J. Collier
            • Votes: 0
            • Watchers: 1
