Beam / BEAM-12555

Revisit process of dependency staging in Beam Python

Details

    • Type: Wish
    • Status: Open
    • Priority: P3
    • Resolution: Unresolved
    • Component/s: sdk-py-core

    Description

      There are a few issues:

      1) Including Beam itself in requirements.txt causes unnecessary friction and is suboptimal: Beam already stages itself to the workers, and Beam workers already include Beam's dependencies. This is not clear from https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/, yet from a user's perspective, including Beam in requirements.txt seems natural.

      2) Staging the sources of all dependencies listed in requirements.txt, along with their transitive dependencies, in some cases triggers a hidden package build initiated by pip. The reason is that in certain cases pip cannot reliably identify a package's dependencies without building the package; see [1-3] for pointers. This increases the time it takes to launch a Beam job and may require additional software to be available (such as Linux packages with header files, or gcc and its dependencies). This causes friction and confusion, is not obvious to users, and is beyond Beam's control.

      [1] https://github.com/pypa/pip/issues/8387
      [2] https://github.com/pypa/pip/issues/7995
      [3] https://discuss.python.org/t/pip-download-just-the-source-packages-no-building-no-metadata-etc/4651
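
      To make the two issues concrete, here is a minimal Python sketch. Both helper names (drop_beam_requirement, source_download_command) are hypothetical and are not part of the Beam SDK. The first shows one way a staging step could skip an apache-beam entry in requirements.txt. The second only builds, without running, a pip invocation of the kind that downloads source distributions; it is assumed here for illustration and is not copied from Beam's stager. With --no-binary :all:, pip may need to build a package to discover its dependencies, which is the hidden cost described in 2).

```python
import re
import sys

# Matches a requirement line naming apache-beam, optionally with extras,
# a version specifier, an environment marker, or a direct URL reference.
_BEAM_REQ = re.compile(
    r"^\s*apache[-_.]beam(\[[^\]]*\])?\s*([=<>!~;@].*)?$", re.IGNORECASE
)


def drop_beam_requirement(lines):
    """Return the requirement lines with any apache-beam entry removed.

    Hypothetical helper: text after '#' is ignored when matching, so
    comment lines that merely mention apache-beam are kept.
    """
    return [ln for ln in lines if not _BEAM_REQ.match(ln.split("#", 1)[0])]


def source_download_command(requirements_file, dest_dir):
    """Build (but do not run) a pip command that fetches source distributions.

    Assumption: this approximates a source-only staging step; the exact
    flags Beam uses may differ.
    """
    return [
        sys.executable, "-m", "pip", "download",
        "--dest", dest_dir,
        "-r", requirements_file,
        "--no-binary", ":all:",  # sdists only; may trigger local builds
    ]
```

      For example, drop_beam_requirement(["apache-beam[gcp]==2.30.0", "numpy>=1.19"]) keeps only the numpy line, while a differently named package such as a hypothetical apache-beam-extras is left untouched.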

          People

            Assignee: Unassigned
            Reporter: Valentyn Tymofieiev (tvalentyn)