- Type: Bug
- Status: Triage Needed
- Priority: P2
- Resolution: Unresolved
- Affects Version/s: 2.25.0
- Fix Version/s: None
- Component/s: io-java-azure, runner-dataflow
- Labels: None
- Environment: Beam v2.25, Google Dataflow runner v2.25
- Flags: Important
I am attempting to parse a very large CSV file (65 million lines) with Beam (version 2.25) from an Azure Blob and have created a pipeline for this. I am running the pipeline on Dataflow and testing with a smaller version of the file (10,000 lines).
I am using FileIO with the filesystem prefix "azfs" to read from Azure Blob Storage.
The pipeline works with the small test file, but when I run it on the larger file I get a "Stream Mark Expired" exception (pasted below). Reading the same file from a GCS bucket works just fine, including when running on Dataflow.
The CSV file I am attempting to ingest is 54.2 GB and can be obtained here: https://obis.org/manual/access/
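For context, a minimal sketch of the reading side of such a pipeline, assuming the beam-sdks-java-io-azure module is on the classpath. The account, container, and file names below are placeholders, not the reporter's actual setup:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

public class AzfsCsvRead {
  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline p = Pipeline.create(options);

    // "azfs" paths have the form azfs://<account>/<container>/<blob>;
    // the names here are illustrative only.
    PCollection<String> lines =
        p.apply("MatchCsv",
                FileIO.match().filepattern("azfs://myaccount/mycontainer/occurrence.csv"))
         .apply("ReadMatches", FileIO.readMatches())
         .apply("ReadLines", TextIO.readFiles());

    // ... parsing transforms would follow here ...

    p.run().waitUntilFinish();
  }
}
```

With a small test file this runs to completion; on the full 54.2 GB file the read reportedly fails with the "Stream Mark Expired" exception, which suggests the failure is tied to how the Azure filesystem handles long-running or re-positioned reads rather than to the pipeline shape itself.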