Hi,
I am using a PTransform class that retrieves Google Cloud Storage files with FileIO. It worked very well before version 2.20.0.
Last week I upgraded my Beam library to 2.20.0 and 2.21.0, and now I get an unexpected exception when I retrieve files whose path contains a space:
Error message from worker: java.lang.RuntimeException: org.apache.beam.sdk.util.UserCodeException: java.io.FileNotFoundException: Item not found: 'gs://[MY_BUCKET]/2017/09/12/3d9d7cc8-e970-42f8-9f24-7d9b70989033/31/a9/ba/<1710RH600@optimashipbroking.com /body.txt'. If you enabled STRICT generation consistency, it is possible that the live version is still available but the intended generation is deleted.
    org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowsParDoFn$1.output(GroupAlsoByWindowsParDoFn.java:184)
Please note that the following gsutil command works:
gsutil ls "gs://[MY_BUCKET]/2017/09/12/3d9d7cc8-e970-42f8-9f24-7d9b70989033/31/a9/ba/<1710RH600@optimashipbroking.com /body.txt"
Here is my code:
public PCollection<KV<String, byte[]>> expand(PBegin begin) {
    PCollection<KV<String, byte[]>> files = begin
        .apply(FileIO.match()
            .filepattern("gs://[MY_BUCKET]/**/body.txt")
            .withEmptyMatchTreatment(EmptyMatchTreatment.ALLOW))
        .apply(FileIO.readMatches())
        .apply("Extract key", ParDo.of(
            new DoFn<ReadableFile, KV<String, byte[]>>() {
                @ProcessElement
                public void processElement(ProcessContext c) throws IOException {
                    ReadableFile f = c.element();
                    c.output(KV.of(f.getMetadata().resourceId().toString(), f.readFullyAsBytes()));
                }
            }
        ));
    return files;
}
Maybe I just need to find a way to escape the file path, but I don't know how.
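To narrow down which files are affected, a small check like the one below could flag paths that contain whitespace before they are read. This is only a minimal sketch: the class PathWhitespaceCheck and its methods are hypothetical names, not part of Beam, and it only detects such paths; it does not escape them or work around the exception.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (not a Beam API): flags object paths that contain whitespace. */
final class PathWhitespaceCheck {

    private PathWhitespaceCheck() {}

    /** Returns true if the path contains at least one whitespace character. */
    static boolean containsWhitespace(String path) {
        for (int i = 0; i < path.length(); i++) {
            if (Character.isWhitespace(path.charAt(i))) {
                return true;
            }
        }
        return false;
    }

    /** Returns only those paths that contain whitespace, in their original order. */
    static List<String> pathsWithWhitespace(List<String> paths) {
        List<String> flagged = new ArrayList<>();
        for (String path : paths) {
            if (containsWhitespace(path)) {
                flagged.add(path);
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        // Example paths; the bucket placeholder is kept as in the report.
        List<String> paths = Arrays.asList(
            "gs://[MY_BUCKET]/a/b/body.txt",
            "gs://[MY_BUCKET]/a/name with space /body.txt");
        for (String path : pathsWithWhitespace(paths)) {
            // Quote the path so that leading or trailing spaces stay visible.
            System.out.println("Path contains whitespace: '" + path + "'");
        }
    }
}
```

Inside the DoFn above, containsWhitespace(f.getMetadata().resourceId().toString()) could be called before f.readFullyAsBytes() to log the affected files.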
I hope you can help me.
Xavier