Beam / BEAM-11313

FileIO azfs Stream mark expired

    Details

    • Type: Bug
    • Status: Triage Needed
    • Priority: P2
    • Resolution: Unresolved
    • Affects Version/s: 2.25.0
    • Fix Version/s: None
    • Labels:
      None
    • Environment:
      Beam v2.25
      Google Dataflow runner v2.25
    • Flags:
      Important

      Description

      I am attempting to parse a very large CSV file (65 million lines) from an Azure blob with Apache Beam (version 2.25) and have created a pipeline for this. I am running the pipeline on Dataflow and testing with a smaller version of the file (10,000 lines).

      I am using FileIO with the filesystem prefix "azfs" to read from Azure blobs.
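For readers unfamiliar with the "azfs" prefix, the sketch below shows how such a path breaks down, assuming the layout azfs://<storage-account>/<container>/<blob>. The helper `split_azfs_path` and the account, container, and blob names are hypothetical and used only for illustration; this is not Beam's own parsing code and not the pipeline from this report.

```python
from typing import NamedTuple

AZFS_PREFIX = "azfs://"


class AzfsPath(NamedTuple):
    """Components of an azfs path (illustrative type, not a Beam class)."""
    account: str
    container: str
    blob: str


def split_azfs_path(path: str) -> AzfsPath:
    """Split 'azfs://<account>/<container>/<blob>' into its three parts.

    Hypothetical helper: it only checks the prefix and that the account
    and container segments are non-empty; it does not validate Azure
    naming rules.
    """
    if not path.startswith(AZFS_PREFIX):
        raise ValueError(f"not an azfs path: {path!r}")
    rest = path[len(AZFS_PREFIX):]
    # First segment is the storage account, second is the container,
    # everything after that (slashes included) is the blob name.
    account, _, remainder = rest.partition("/")
    container, _, blob = remainder.partition("/")
    if not account or not container:
        raise ValueError(f"missing account or container in: {path!r}")
    return AzfsPath(account, container, blob)


# Example with made-up names:
example = split_azfs_path("azfs://myaccount/mycontainer/data/occurrence.csv")
```

A path in this form would be what gets passed as the file pattern to FileIO; the same pipeline pointed at a gs:// path is the configuration that works for the reporter.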

      The pipeline works with the small test file, but when I run it on the larger file I get a "Stream Mark Expired" exception (pasted below). Reading the same file from a GCP bucket works fine, including when running on Dataflow.

      The CSV file I am attempting to ingest is 54.2 GB and can be obtained here: https://obis.org/manual/access/

            People

            • Assignee: Unassigned
            • Reporter: Thomas Li Fredriksen (thomafred90@gmail.com)
            • Votes: 0
            • Watchers: 3
