Details
-
New Feature
-
Status: Closed
-
Major
-
Resolution: Fixed
-
None
Description
python-archives currently accepts only zip files.
In our use case, we want to package a whole conda environment into python-archives, similar to what the docs suggest for venv (Python virtual environment). As we use PyFlink for ML, there are inevitably a few large dependencies (tensorflow, torch, pyarrow), as well as a lot of small dependencies.
This pattern is not friendly to zip. According to the post, zip compresses each file independently, so it does not perform well when dealing with a lot of small files. On the other hand, tar simply bundles all files into a tarball, and gzip can then be applied to the whole tarball to achieve a smaller size. This may explain why the official packaging tool, conda pack, produces tar.gz by default, even though zip is available as an option.
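To illustrate the point about many small files, here is a minimal Python sketch that packs the same set of small, similar files into a zip and into a tar.gz, both in memory, and compares the sizes. The file names and contents are hypothetical stand-ins for package metadata, not files from a real conda environment, so the ratio it prints will not match the measurement reported in this issue.

```python
import io
import tarfile
import zipfile


def make_files(count=500):
    """Generate many small, similar files (hypothetical stand-ins for package metadata)."""
    return {
        f"pkg_{i:04d}/METADATA": (
            f"Name: pkg-{i}\nVersion: 1.0.{i}\nLicense: Apache-2.0\n"
            "Summary: a small file typical of a Python environment\n"
        ).encode("utf-8")
        for i in range(count)
    }


def zip_size(files):
    """Size of a zip archive: each entry is deflated on its own."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    return len(buf.getvalue())


def tar_gz_size(files):
    """Size of a tar.gz archive: gzip runs over the whole tarball at once."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tf:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tf.addfile(info, io.BytesIO(data))
    return len(buf.getvalue())


if __name__ == "__main__":
    files = make_files()
    print("zip bytes:   ", zip_size(files))
    print("tar.gz bytes:", tar_gz_size(files))
```

Because gzip sees the whole tarball, redundancy shared across files (repeated headers, similar metadata) is compressed away, which per-entry zip compression cannot do.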
To test the idea, I ran an experiment on my laptop with a conda env. My OS: macOS 10.15.7
- Create an environment.yaml as well as a requirements.txt
- Run `conda env create -f environment.yaml` to create the conda env
- Run conda pack to produce a tar.gz
- Run conda pack featflow-ml-env.zip to produce a zip
More details:
environment.yaml
    name: featflow-ml-env
    channels:
      - pytorch
      - conda-forge
      - defaults
    dependencies:
      - python=3.7
      - pytorch=1.8.0
      - scikit-learn=0.23.2
      - pip
      - pip:
        - -r file:requirements.txt
requirements.txt
    apache-flink==1.12.0
    deepctr-torch==0.2.6
    black==20.8b1
    confluent-kafka==1.6.0
    pytest==6.2.2
    testcontainers==3.4.0
    kafka-python==2.0.2
End result: the tar.gz is 854M, the zip is 1.6G
So, long story short, python-archives only supports zip, while zip is not a good choice for packaging ML libs. Let's change this by adding tar.gz support to python-archives.
Proposed change: in ProcessPythonEnvironmentManager.java, check the archive's suffix. If it is tar.gz, unarchive it using a gzip decompressor.
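The suffix check described above can be sketched as follows. This is a Python illustration of the dispatch logic only, not the Java change to ProcessPythonEnvironmentManager.java; the function name `extract_python_archive` and its return value are hypothetical, and treating every non-tar.gz file as zip is an assumption meant to mirror the current zip-only behavior.

```python
import os
import tarfile
import zipfile


def extract_python_archive(archive_path, target_dir):
    """Unpack archive_path into target_dir, choosing the format by file suffix.

    Returns the detected format name ("tar.gz" or "zip").
    """
    os.makedirs(target_dir, exist_ok=True)
    if archive_path.endswith(".tar.gz"):
        # gzip-compressed tarball: decompress and unpack in one pass
        with tarfile.open(archive_path, mode="r:gz") as tf:
            tf.extractall(target_dir)
        return "tar.gz"
    # Assumption: anything else is handled as zip, as before
    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall(target_dir)
    return "zip"
```

A real implementation would also need to decide how to treat related suffixes such as .tgz and whether to validate member paths before extraction; neither is specified in this issue.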